We address the problem of understanding human upper-body actions from video sequences. Time-sequential images depicting human actions are transformed into sequences of feature vectors that encode the configuration of the human body. The human is modeled as a collection of body parts linked in a kinematic structure, and the relations between the joints are used to estimate the pose. The proposed layered HMM framework decomposes the action recognition problem into two layers: the first layer models the actions of each arm individually from low-level features, and the second layer models the interrelationship of the two arms as an action. Experiments on a set of six types of human actions demonstrate the effectiveness of the proposed scheme, and comparisons with other HMM systems show its robustness.
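
The two-layer decomposition described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: it assumes discrete observation symbols, maximum-likelihood selection among a bank of HMMs at each layer, and a joint symbol formed from the pair of per-arm labels. The names `log_likelihood`, `classify`, `layered_classify`, and `toy_hmm`, as well as all model parameters, are hypothetical and chosen only for illustration.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | discrete HMM (pi, A, B))."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()
        if scale == 0.0:
            return -np.inf
        log_p += np.log(scale)
        alpha = alpha / scale  # rescale to avoid numerical underflow
    return log_p

def classify(obs, models):
    """Index of the HMM in `models` that gives `obs` the highest likelihood."""
    return int(np.argmax([log_likelihood(obs, *m) for m in models]))

def layered_classify(left_windows, right_windows, arm_models, action_models):
    n_arm = len(arm_models)
    # Layer 1: label each window of low-level symbols, one arm at a time.
    left = [classify(w, arm_models) for w in left_windows]
    right = [classify(w, arm_models) for w in right_windows]
    # Layer 2: encode each (left, right) label pair as one joint symbol
    # (an assumed coupling) and classify the resulting sequence.
    joint = [l * n_arm + r for l, r in zip(left, right)]
    return classify(joint, action_models), joint

def toy_hmm(first, second, n_symbols, peak=0.8):
    """Hypothetical 2-state left-to-right HMM; each state favors one symbol."""
    B = np.full((2, n_symbols), (1.0 - peak) / (n_symbols - 1))
    B[0, first] = peak
    B[1, second] = peak
    pi = np.array([1.0, 0.0])
    A = np.array([[0.7, 0.3], [0.0, 1.0]])
    return pi, A, B

# Layer-1 models over 3 low-level symbols: arm action 0 and arm action 1.
arm_models = [toy_hmm(0, 1, 3), toy_hmm(2, 2, 3)]
# Layer-2 models over the 4 joint symbols (2 arm actions x 2 arm actions).
action_models = [toy_hmm(0, 0, 4), toy_hmm(1, 1, 4), toy_hmm(3, 3, 4)]

left_windows = [[0, 0, 1, 1], [0, 1, 1, 1]]
right_windows = [[2, 2, 2, 2], [2, 2, 2, 2]]
action, joint = layered_classify(left_windows, right_windows,
                                 arm_models, action_models)
```

In this sketch the second layer never sees the low-level features; it operates only on the label sequences produced by the first layer, which is the separation the layered framework relies on.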