A Rotation Invariant Latent Factor Model for Moveme Discovery from Static Poses

We tackle the problem of learning a rotation invariant latent factor model when the training data is comprised of lower-dimensional projections of the original feature space. The main goal is the discovery of a set of 3-D bases poses that can characterize the manifold of primitive human motions, or movemes, from a training set of 2-D projected poses obtained from still images taken at various camera angles. The proposed technique for basis discovery is data-driven rather than hand-designed. The learned representation is rotation invariant, and can reconstruct any training instance from multiple viewing angles. We apply our method to modeling human poses in sports (via the Leeds Sports Dataset), and demonstrate the effectiveness of the learned bases in a range of applications such as activity classification, inference of dynamics from a single frame, and synthetic representation of movements.


I. Introduction
What are the typical ranges of motion for human arms? What types of leg movements tend to correlate with specific shoulder positions? How can we expect the arms to move given the current body pose? Our goal is to address these questions by recovering a set of "bases poses" that summarize the variability of movements in a given collection of static poses captured from images at various viewing angles.
One of the main difficulties of studying human movement is that it is a priori unrestricted, except for physically imposed joint angle limits, which have been studied in medical textbooks, typically for a limited number of configurations [1], [2]. Furthermore, human movement may be divided into movemes, actions, and activities [3], [4] depending on structure, complexity, and duration. Movemes refer to the simplest meaningful pattern of motion: a short, target-oriented trajectory that cannot be further decomposed, e.g. "reach", "grasp", "step", "kick". A complex gesture should be composed out of simple movemes: we define an action as a predefined and ordered sequence of movemes, such as "drink from a glass" or "open a door". An activity is a (possibly stochastic) combination of actions taking place over a stretch of time with a typical and yet variable structure, e.g. "dine", "read". Extensive studies have been carried out on human action and activity recognition [5], [6]; however, little attention has been paid to movemes, since human behaviour is difficult to analyze at such a fine scale of dynamics. In this paper, our primary goal is to learn a basis space to smoothly capture movemes from a collection of two-dimensional images, although our learned representation can also aid in higher level reasoning. Static poses extracted from two-dimensional images are the most abundant source of pose information. Thus, finding a basis representation using such training data can prove extremely valuable, given the number of image datasets (as opposed to video or mo-cap data) that are currently being collected with a focus on common activities [8]- [10]. However, such images are typically taken from a wide range of viewing angles, and can yield only two-dimensional projections of the underlying three-dimensional pose.
Any method that does not directly address these issues will learn a naive representation that fails to provide a set of global three-dimensional bases poses that can capture pose changes due to the true human motion while disregarding those due to a change of the angle of view.
In this paper, we propose a simple but effective rotation invariant latent factor model that can recover a set of three-dimensional bases poses from a training set of two-dimensional projections. Our approach is distinguished from previous latent factor modeling approaches by directly incorporating geometric operations in an integrated way, and yields interpretable three-dimensional bases poses that can be easily visualized as well as manipulated to express a natural range of human poses (as depicted in Fig. 1). We applied our approach in a case study on modeling poses that arise in sports activities, since they have very characteristic and recognizable motions and typically share trajectories of parts of the body (e.g., tennis serve and volleyball strike), which makes it easier to interpret and qualitatively evaluate the learned movemes.
Our study is not purely academic: we have four applications in mind. In this paper we carry out a quantitative and qualitative analysis for the first two, and leave the study of the other two to future work.
Activity recognition: a compact representation such as the proposed one can be used in addition to the feature representation of state-of-the-art methods for activity recognition, benefiting both performance [11] and the interpretability of results.
Action dynamics inference: modifying the weights of the learned bases poses is analogous to moving along a line in the high-dimensional space of human poses (either 2-D or 3-D). This makes it possible to predict the future dynamics of an action [12], or to morph a pose into another from a single frame, by observing the dynamics of the movemes that best describe the captured pose.
Computer graphics animation: many animation systems are still based on key-framing and in-betweening [13]: master animators draw the key frames of a sequence to be animated and assistant animators complete the intermediate frames by inferring the movements occurring between the keys. Knowing the movemes underlying human actions would provide an automated method for interpolating between key frames, resulting in a faster and simplified animation pipeline.
3-D pose estimation: a sparse overcomplete dictionary of human poses has been used effectively for the reconstruction of 3-D human pose given its 2-D joint locations from a single frame image [14]- [16]. Our technique would make it possible to identify the most suitable pose bases for a given collection of images without any experimenter bias, or the need to curate the angle of view of the images in the training set.
In summary, the main contributions of our paper are:
1. An unsupervised method for learning a rotation-invariant set of bases poses. We propose a solution to the intrinsically ill-posed problem of going from static poses to movements, without being affected by the angle of view.
2. A demonstration of how the learned bases poses can be used in various applications, including manifold traversal, discriminative classification, and synthesis of movements.

II. Related Work
Human Pose Analysis: There are two main directions of research for human pose analysis. The first one is estimation: given a picture containing a person, the goal is to predict the location of a predefined set of joints of its body, either in the 2-D image [17], [18] or in the 3-D space [14]- [16]. Methods for 3-D pose reconstruction build upon the results of 2-D pose estimators by using mechanisms based on physical constraints and domain knowledge to infer the true underlying human pose observed in an image, and are more of interest in this study since they implicitly learn an overcomplete basis for modeling human movement. However, such methods typically predefine the dictionary of actions, use additional data in the training phase (such as mo-cap), and do not treat explicitly the problem of varying angles of view. In contrast, our goal is to learn a low-rank manifold of 3-D poses consistent across multiple viewing angles, given only two-dimensional data.
The second line of investigation uses pose as a form of contextual information that can be combined with objects' category and location in an image to obtain higher performance for activity recognition through a joint learning procedure [19]- [21]. Our approach can as well be used as a feature representation for improved activity recognition.
From the perspective of pose analysis, the goal of this work is to learn a semantically meaningful representation of human pose that can model human motion. This representation should be independent of the application domain, and flexible enough to be combined with other representations. Others have investigated this problem: it is known that dynamic information can be recovered from static images of humans engaged in activities [22], and similar representations for action recognition have been learned using video data [23], [24]. We are the first to propose a representation that directly treats the problem of rotation-invariance and can be learned only from static poses, which we believe is important since they are the most abundant form of data.
Latent Factor Models and Representation Learning: We build upon a long line of research in latent factor models, first popularized for collaborative filtering problems in content recommendation [25]. Applications include modeling variations of faces [26], document and text analysis [27], and behavior patterns in sports [28], amongst many others. Latent factor models are variants of matrix and tensor factorization, which can easily incorporate missing values or other types of constraints. In this regard, our work introduces an approach for learning a latent factor model in a high-dimensional space, when the observed training data are lower-dimensional projections. Our method is complementary to and can be integrated with other latent factor modeling approaches.
Our approach can be viewed as a form of representation learning, which includes methods such as deep neural networks and dictionary learning [29], [30]. One of the benefits of representation learning is the ability to smoothly traverse the representation space [31], which in our setting translates to learning movemes as transitions between poses.

III. Models
We develop our approach by building from the classical singular value decomposition. We characterize the challenge of learning only from lower-dimensional projections of the underlying feature space, and present a rotation-invariant latent factor model for dealing with such training data.

A. Basic Notation and Framework
In this paper, we focus on learning from two-dimensional projections of three-dimensional human poses; however, it is straightforward to generalize to other settings. We are given a training set $S = \{(x_j, y_j)\}_{j=1}^{n}$ of n two-dimensional poses, where x and y correspond to the image coordinates of the pose joints from the observed viewing angle, see Fig. 2. Let $S \in \mathbb{R}^{2d \times n}$ denote the dataset matrix, where 2d is the dimensionality of the projected space (twice the number of joints d for two-dimensional projections). Our goal is to learn a bases poses matrix $U \in \mathbb{R}^{2d \times k}$ composed of k latent factors, and a coefficient matrix $V \in \mathbb{R}^{k \times n}$, so that every training example can be represented as a linear combination:

$$s_j = U v_j + \bar{s}, \tag{1}$$

where $s_j$ and $v_j$ are the j-th columns of S and V, and $\bar{s}$ denotes the "mean" pose. Of course, (1) does not deal with rotation invariance and treats the x and y coordinates as having the same semantics across training examples. We present in Sec. III-C a rotation-invariant latent factor model to address this issue and recover a three-dimensional $U \in \mathbb{R}^{3d \times k}$.

B. Baselines
To the best of our knowledge, no existing approach tackles the problem of learning rotation-invariant bases for modeling human movement. Previous work has focused on learning bases poses either only from frontal viewing angles or through extensive manual crafting of a predefined set of poses [14], [16]. As such, we develop our approach by building upon classical baselines such as the SVD, which we briefly describe here.

Singular Value Decomposition: Eq. (1) is the most basic form of a latent factor model. When the training objective is to minimize the squared reconstruction error of the training data, the solution can be recovered via the SVD, as also used for eigenfaces [26]. The bases matrix U and the coefficient matrix V respectively correspond (up to a scaling) to the left and right singular vectors of the mean-centered data matrix $S_c = (S - \bar{s})$. However, naively applying the SVD to our setting will result in the bases matrix U conflating viewing angle rotations with true pose deformations.
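The SVD baseline can be illustrated with a short NumPy sketch. The function name `svd_bases` and the choice to fold the singular values into V are assumptions made for illustration; the text only states that U and V match the singular vectors up to a scaling.

```python
import numpy as np

def svd_bases(S, k):
    """Rank-k latent factor model of a (2d, n) pose matrix via the SVD.

    Returns U (2d, k), V (k, n) and the mean pose s_bar (2d, 1) such that
    S is approximated by U @ V + s_bar.
    """
    S = np.asarray(S, dtype=float)
    s_bar = S.mean(axis=1, keepdims=True)            # "mean" pose
    U_full, sing, Vt = np.linalg.svd(S - s_bar, full_matrices=False)
    U = U_full[:, :k]                                # left singular vectors
    V = sing[:k, None] * Vt[:k]                      # scaled right singular vectors
    return U, V, s_bar
```

For example, `U, V, s_bar = svd_bases(S, k=10)` yields a reconstruction `U @ V + s_bar` that minimizes the squared error among rank-k models of the mean-centered data.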
Clustered Singular Value Decomposition: If the viewing angle of the training data is available, or a quantized approximation of it, then the basic latent factor model (1) can be instantiated separately for different viewing angles, via:

$$s_j = U_{a_j} v_j + \bar{s}_{a_j}, \tag{2}$$

where $s_j$ and $v_j$ are the j-th columns of S and V, and $a_j$ denotes the viewing angle cluster that example j belongs to. In other words, given p clusters, we learn p separate latent factor models, one per cluster. Intuitively, we expect this method to suffer less conflation between changes in pose due to a viewing angle rotation and true pose deformation, and the more clusters there are, the less susceptible it is. The main drawbacks are that: (i) the learned bases representation is not global, and will not be consistent across the clusters since they are learned independently, and (ii) the amount of training data per model is reduced, which can yield a worse representation.

[Figure panels: Reference Image, 2-D joint annotations, 2-D Moveme.]
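A minimal sketch of the clustered baseline follows. It assumes that clusters are equal-width bins of the pan angle in [0, 2π) and that each cluster gets its own mean pose; the name `clustered_svd` and the returned dictionary layout are hypothetical.

```python
import numpy as np

def clustered_svd(S, theta, p, k):
    """Fit one rank-k SVD model per viewing-angle cluster.

    S: (2d, n) poses, theta: (n,) viewing angles in radians,
    p: number of clusters (assumed equal-width angle bins), k: rank.
    Returns the cluster assignment a (n,) and a dict of per-cluster models.
    """
    S = np.asarray(S, dtype=float)
    width = 2.0 * np.pi / p
    a = np.floor(np.mod(theta, 2.0 * np.pi) / width).astype(int)
    models = {}
    for c in range(p):
        idx = np.flatnonzero(a == c)
        if idx.size == 0:
            continue                                  # no training data for this view
        S_c = S[:, idx]
        mean = S_c.mean(axis=1, keepdims=True)
        U_full, sing, Vt = np.linalg.svd(S_c - mean, full_matrices=False)
        # Note: a cluster with fewer than k poses yields fewer than k bases.
        models[c] = {"U": U_full[:, :k], "V": sing[:k, None] * Vt[:k],
                     "mean": mean, "index": idx}
    return a, models
```

Because each `models[c]["U"]` is fit independently, the coefficients in different clusters live in different coordinate systems, which is the lack of global consistency noted in drawback (i).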

C. Rotation-Invariant Latent Factor Model
Our goal is to develop a latent factor model that can learn a global representation of bases poses across different angles. For simplicity, we restrict ourselves to settings where there are only differences in the pan angle, and assume no variation in the tilt angle (i.e., all horizontal views). To that end, we propose both a 2-D and a 3-D model which can be used depending on the quality and quantity of additional information available at training time. For some applications it may suffice to use the 2-D model, however the 3-D model is generally better able to intrinsically capture rotation-invariance.
We first motivate some of the desirable properties:
• Unsupervised: the bases discovery should not be limited to or dependent on images of specific classes of actions.
• Rotation Invariant: the learned bases should be composed of movements from a given canonical view (e.g., frontal) and be able to reconstruct poses oriented at any angle. The exact same pose may look different when observed from different camera angles; as such, it is important to disambiguate pose from viewing angle.
• Sparse: to encourage interpretability, the learned bases should be sparsely activated for any training instance.
• Complementary: our method should be easy to integrate with other modeling approaches, and thus should implement an orthogonal extension of the basic latent factor modeling framework.

General Framework: Our general framework aims to learn a latent factor matrix U, containing the bases poses instantiated globally across all the training data; a coefficient matrix V, whose columns correspond to the weights given to the bases poses to reconstruct each training instance; and a vector θ, containing the angle of view of each training pose. We can thus model every training example as:

$$s_j = f(U, \theta_j)\, v_j + \bar{s}, \tag{3}$$

where f(·, ·) is a projection operator of the higher-dimensional model into the two-dimensional space. We train our model via:

$$\operatorname*{argmin}_{U, V, \theta} \; E(U, V, \theta) + \Omega(U, V), \tag{4}$$

where E is the squared reconstruction error over the training instances, and Ω is a model-specific regularizer. The projection operator f and the regularizer Ω are specified separately for the 2-D and 3-D approach. This optimization problem is non-convex, and requires a reasonable initialization in order to converge to a good local optimum.

1) 2-D approach:
The 2-D approach follows the clustered SVD baseline and, given a set of p angle clusters, instantiates the projection operator as:

$$f(U, \theta_j) = U_{a_j}, \tag{7}$$

where $a_j$ denotes the cluster that $\theta_j$ belongs to, and a separate rank-k U is learned for each viewing angle cluster. At this point, (7) looks identical to (2). However, we encourage global consistency between the per-cluster models via the regularization terms in (8). The first term in (8) is a standard regularizer used to prevent overfitting, (9); since we wish to have sparse activations, we regularize V using the L1 norm. Depending on the application (Sec. IV-B), we sometimes enforce that V be non-negative for added interpretability. The second term in (8) is the spatial regularizer $R_{spat}$, which encourages (or in some cases enforces) consistency across the per-cluster models. $U^{(x)}$ and $U^{(y)}$ represent the x and y coordinate portions of the bases poses, e.g.

$$U^{(x)} = U_{X,:},$$

where X is the set of indices corresponding to x coordinates in the pose representation. Since we are only modeling variations in the pan angle, the x coordinates can vary across different viewing angles, while the y coordinates should remain constant. As such, the first term in $R_{spat}$, (10), encourages the $U^{(x)}_{(a)}$ and $U^{(x)}_{(a')}$ of different clusters to be similar to each other (with $\kappa_{a,a'}$ controlling the degree of similarity), and the second term, (11), is a {0, ∞} indicator function on $U^{(y)}_{(a)}$ and $U^{(y)}_{(a')}$ that takes value 0 if the two arguments are identical, and value ∞ if they are not (i.e., it is a hard constraint).
In summary, the spatial regularization term is the main difference between the 2-D latent factor model and the clustered SVD baseline. Global consistency of the per-cluster models is obtained by encouraging similar values in the x coordinates, and enforcing identical y coordinates. In a sense, one can view spatial regularization as a form of multi-task regularization, which enables sharing statistical strength across the clusters. The main limitation of the 2-D model is that the spatial regularization does not incorporate more sophisticated geometric constraints, so the notion of consistency achieved may not align with the true underlying three-dimensional data.
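One plausible way to write the regularizer described in this subsection is sketched below. The names $R_{reg}$, $\lambda_U$, $\lambda_V$ and $\iota$, the use of squared Frobenius norms, and the sums over all cluster pairs are assumptions made for illustration; only the L1 penalty on V, the weights $\kappa_{a,a'}$ on the x coordinates, and the {0, ∞} hard constraint on the y coordinates come from the description above.

```latex
\begin{aligned}
\Omega(U, V) &= R_{reg}(U, V) + R_{spat}(U), \\
R_{reg}(U, V) &= \lambda_U \sum_{a} \lVert U_{(a)} \rVert_F^2 + \lambda_V \lVert V \rVert_1, \\
R_{spat}(U) &= \sum_{a \neq a'} \kappa_{a,a'} \,\lVert U^{(x)}_{(a)} - U^{(x)}_{(a')} \rVert_F^2
             + \sum_{a \neq a'} \iota\!\left(U^{(y)}_{(a)},\, U^{(y)}_{(a')}\right), \\
\iota(A, B) &= \begin{cases} 0 & \text{if } A = B, \\ \infty & \text{otherwise.} \end{cases}
\end{aligned}
```

Under this reading, the hard constraint simply ties the y-coordinate rows of all per-cluster bases to a single shared block, while a large $\kappa_{a,a'}$ (for instance between adjacent angle clusters) pulls the corresponding x-coordinate rows together.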
2) 3-D approach: The 3-D model directly learns a three-dimensional representation of the underlying pose space, through a single and global $U \in \mathbb{R}^{3d \times k}$ that is inherently three-dimensional, and captures k bases poses.
The projection operator is now defined as:

$$f(U, \theta_j) = \left(Q(\theta_j)\, U\right)^{(x,y)}, \tag{12}$$

where Q(·) is the 3-D rotation matrix around the vertical axis, applied to the (x, y, z) coordinates of each joint:

$$Q(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}, \tag{13}$$

and the superscript (x,y) denotes the projection from the 3-D space of U to the 2-D space of the dataset annotations, obtained by indexing only the x and y coordinates (the underlying model provides x, y, and z coordinates). The projection operator in (12) makes it possible to compute the two-dimensional projection of any underlying three-dimensional pose at any viewing angle $\theta_j$ using standard geometric rules. Spatial regularization is no longer needed, because the rotation operator Q relates all the viewing angles to a common model; thus the regularizer assumes the standard form, (14), without a spatial term.
In summary, the 3-D latent factor model improves upon the 2-D version by learning a global representation that is intrinsically three-dimensional and integrates domain knowledge of how the viewing angle affects pose via geometric projection rules. This results in a more robust method that does not learn a separate model per viewing angle or rely on the spatial regularization to obtain consistency. The main drawback is that a more complex initialization is required.
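The projection described above (rotate every joint about the vertical axis, then keep only x and y) can be sketched as follows. The joint-major row layout `[x1, y1, z1, x2, ...]`, the sign convention of the rotation, and the function names are assumptions for illustration.

```python
import numpy as np

def rotation_about_vertical(theta):
    """3x3 rotation about the vertical (y) axis; the sign convention is assumed."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def project_bases(U, theta):
    """Rotate 3-D bases U (3d, k) by theta and keep the x, y rows -> (2d, k).

    Rows of U are assumed to be ordered joint by joint: x1, y1, z1, x2, ...
    """
    U = np.asarray(U, dtype=float)
    d, k = U.shape[0] // 3, U.shape[1]
    joints = U.reshape(d, 3, k)                       # (joint, coordinate, basis)
    rotated = np.einsum("ab,jbk->jak", rotation_about_vertical(theta), joints)
    return rotated[:, :2, :].reshape(2 * d, k)        # drop z, flatten to x1, y1, x2, ...
```

With this layout, the reconstruction of a pose at angle `theta_j` is `project_bases(U, theta_j) @ v_j` plus the mean pose; the y rows are unaffected by the pan angle, which matches the observation that only x coordinates vary across views.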

D. Training Details
Initialization: Our approaches require an initial guess of the viewing angle for each training instance, and the bases poses U. For angle initialization, we show in our experiments (Sec. IV-B4) that we only need a fairly coarse prediction of the viewing angle (e.g., into quadrants). The 2-D latent factor model bases poses U are initialized uniformly between -1 and 1, while for the 3-D model we use an off-the-shelf pose estimator [16] and initialize U as the left singular vectors of the mean centered 3-D pose data, obtained through SVD.
Optimization: For both models, we optimize Eq. (4) using alternating stochastic gradient descent, divided in two phases:
• Representation Update: we employ standard stochastic gradient descent to update U and V while keeping θ fixed. For the 3-D model, this involves computing how the training data (which are two-dimensional projections) induce a gradient on U and V through the rotation Q. Because we employ an L1 regularization penalty, we use the standard soft-thresholding technique [33].
• Angle Update: Once the optimal U and V are fixed, we employ standard stochastic gradient descent to update θ.

Convergence and Learning Rates: Three training epochs of 10000 iterations are usually sufficient for convergence to a good local minimum. Typical values of the learning rate are $1 \times 10^{-4}$ for U and V and $1 \times 10^{-6}$ for θ. We use a smaller step size in the update of θ, since the curvature of the objective function (4) w.r.t. θ is higher than w.r.t. U and V.
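The two-phase procedure can be sketched for the 3-D model as below. This is an illustrative simplification, not the exact training code: it assumes mean-centered data, one randomly sampled pose per step, an L1 penalty on V only (weight `lam`, a hypothetical name), and a joint-major coordinate layout; default step sizes and iteration counts follow the typical values quoted above.

```python
import numpy as np

def proj(theta, d):
    """(2d, 3d) map: rotate each joint about the vertical axis, keep x and y."""
    c, s = np.cos(theta), np.sin(theta)
    return np.kron(np.eye(d), np.array([[c, 0.0, s], [0.0, 1.0, 0.0]]))

def dproj(theta, d):
    """Derivative of proj with respect to theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.kron(np.eye(d), np.array([[-s, 0.0, c], [0.0, 0.0, 0.0]]))

def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def train(S, U, V, theta, lam=0.01, lr_uv=1e-4, lr_theta=1e-6,
          epochs=3, iters=10000, seed=0):
    """Alternating SGD on 0.5 * ||proj(theta_j) U v_j - s_j||^2 + lam * ||V||_1."""
    rng = np.random.default_rng(seed)
    S = np.asarray(S, dtype=float)
    U, V = np.array(U, dtype=float), np.array(V, dtype=float)
    theta = np.array(theta, dtype=float)
    d, n = S.shape[0] // 2, S.shape[1]
    for _ in range(epochs):
        for _ in range(iters):                        # representation update
            j = rng.integers(n)
            P = proj(theta[j], d)
            back = P.T @ (P @ U @ V[:, j] - S[:, j])  # residual pulled back to 3-D
            grad_U = np.outer(back, V[:, j])
            grad_v = U.T @ back
            U -= lr_uv * grad_U
            V[:, j] = soft_threshold(V[:, j] - lr_uv * grad_v, lr_uv * lam)
        for _ in range(iters):                        # angle update, U and V fixed
            j = rng.integers(n)
            z = U @ V[:, j]
            r = proj(theta[j], d) @ z - S[:, j]
            theta[j] -= lr_theta * (r @ (dproj(theta[j], d) @ z))
    return U, V, theta
```

The soft-thresholding step is what produces exact zeros in V, which plain subgradient steps on the L1 term would not. Suitable step sizes depend on the scale of the pose coordinates, so the defaults should be treated as starting points.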

IV. Experiments

A. Dataset and Additional Annotations
We use the Leeds Sports Dataset (LSP) [32] for our experiments. LSP is composed of 2000 images containing a single person performing one of eight sports (Athletics, Badminton, Baseball, Gymnastics, Parkour, Soccer, Tennis, Volleyball) annotated with the x,y location and a visibility flag for 14 joints of the human body. Example images and annotations are shown in Figs. 1, 2 and 8. Sports activities are particularly well suited for this study, as they present characteristic motions that share trajectories of parts of the body, which allows investigating basis pose sharing across sports. As part of preprocessing, we normalize all the poses in the dataset by modifying each bone to have the average bone length computed over all the training instances [15]. We discard "Gymnastics" and "Parkour" from our analysis because they have few examples and the class poses do not vary exclusively along the pan angle (but appear in very unconventional views, i.e. upside-down and horizontal), violating the assumption in Sec. III-C. Generalizing the framework to incorporate a wider variability of the viewing angles is an interesting future direction.
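The bone-length normalization step can be sketched as follows, assuming the skeleton is a tree described by a hypothetical `parents` array in which every joint's parent has a smaller index. The actual joint ordering and skeleton of LSP are not specified here, so the example skeleton in the usage note is made up.

```python
import numpy as np

def normalize_bone_lengths(poses, parents):
    """Rescale every bone to its average length over the dataset.

    poses: (n, d, 2) joint coordinates; parents[i] is the parent of joint i
    (-1 for the root), with parents[i] < i so parents are processed first.
    Bone directions are preserved; only the lengths change.
    """
    poses = np.asarray(poses, dtype=float)
    out = poses.copy()
    child = [i for i, p in enumerate(parents) if p >= 0]
    par = [parents[i] for i in child]
    bones = poses[:, child] - poses[:, par]           # (n, bones, 2)
    lengths = np.linalg.norm(bones, axis=-1)          # (n, bones)
    mean_len = lengths.mean(axis=0)                   # average length per bone
    unit = bones / np.maximum(lengths, 1e-12)[..., None]
    for b, (i, p) in enumerate(zip(child, par)):
        out[:, i] = out[:, p] + unit[:, b] * mean_len[b]
    return out
```

For a three-joint chain, `normalize_bone_lengths(poses, parents=[-1, 0, 1])` keeps the root fixed and rebuilds the remaining joints outward from it.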
We collected high-quality viewing angle annotations for each pose in LSP. Although these annotations are not necessary for training, we use them to demonstrate the robustness of our model to poor angle initialization, and that it can in fact recover the ground truth value, see Sec. IV-B4. Three annotators evaluated each image and were instructed to provide the direction in which the torso was facing. The standard deviation in the reported angle of view averaged over the whole dataset is 12 degrees, and more than half of the images have a deviation of less than 10 degrees, showing a very high annotator agreement for the task.

B. Empirical Results
We analyze the flexibility and usefulness of the proposed model in a variety of application domains and experiments. In particular, we evaluate (i) the performance of the learned representation for supervised learning tasks such as activity classification; (ii) whether the learned representation captures enough semantics for meaningful manifold traversal and visualization; and (iii) the robustness to initialization and the generalization error. Collectively, results suggest that our approach is effective at capturing rotation invariant semantics of the underlying data.

1) Activity Recognition:
The matrix V describes each pose in the dataset as a linear combination of the learned latent factors, Sec. III-A. Thus, $v_j$ can be interpreted as a semantically more meaningful feature representation for the j-th data point. For instance, if a lower body basis pose (e.g. Fig. 6 top row) has a high weight, the reconstructed pose is very likely to represent a movement from an activity related to running or kicking.
A natural way to test the effectiveness of the learned representation is to use it for supervised learning tasks. To that end, we used the coefficients in V as input features for classifying the sport categories in LSP. Fig. 4 shows the results obtained from five-fold cross validation. The proposed 3-D latent factor model ("lfa3d") outperforms all other methods by an average accuracy of about 11%. The 2-D model ("lfa2d") performs slightly worse than the clustered SVD baseline ("svd+rot"), but both show more than a 5% average improvement over the "svd" baseline. The two most challenging activities are "athletics", which does not possess characteristic movements, and "tennis", whose movemes are shared and thus confused with multiple other sports, "badminton" and "baseball" above all. We also report the full classification confusion tables in Fig. 4. Note that only the weights of the latent factors reconstructing a pose are being used to discriminate between the activities, without the aid of visual cues from the image. It is thus surprising that "lfa3d" achieves an average 39% accuracy, when a random guess would merely give 16.7%. Finally, the obtained feature representation is complementary to other representations, such as the hidden layer activations of a convolutional neural network [35], and we wish to investigate in future work the performance obtained by their combination.
2) Action Dynamics Inference & Manifold Traversal: Every pose in the training set belongs to a movement of the body corresponding to a complex trajectory in the manifold of human motion. If the latent factor model captures the semantics of the data, then poses that occur in chronological order within a given action should lie in a monotonic sequence within the learned space. A quantitative measure of the quality of the representation can be obtained by observing how well the order of poses belonging to the same action is preserved. One straightforward way to find the sequence in which a set of poses lies in the manifold is to look at the coefficient of their projection along the "total least squares" line fit [36] of the corresponding columns in the matrix V. In other words, we are computing a linear traversal through the representation space. Furthermore, this ordering should hold regardless of the angle of view of the input instances.
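The linear traversal described above can be sketched with NumPy: the total least squares line through a set of points passes through their centroid along the first principal direction, and each pose is ordered by its coordinate along that line. The function name is hypothetical, and the sign of the direction is arbitrary, so the order is recovered only up to reversal unless it is anchored by other means.

```python
import numpy as np

def order_along_tls_line(V_cols):
    """Order poses by their projection on the total least squares line.

    V_cols: (k, m) coefficient vectors (columns of V) for m poses.
    Returns (order, t): indices sorted by position along the line, and the
    signed position t of every pose on that line.
    """
    X = np.asarray(V_cols, dtype=float).T             # (m, k), one pose per row
    centered = X - X.mean(axis=0)                     # TLS line passes through the centroid
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]                                 # first right singular vector
    t = centered @ direction
    return np.argsort(t), t
```

For a shuffled four-frame sequence, `order, _ = order_along_tls_line(V[:, frame_ids])` gives the recovered ordering of `frame_ids`, where `frame_ids` stands for the column indices of the four poses.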
In this experiment, we shuffled 1000 sequences of four images for four sport actions ("baseball pitch", "tennis forehand", "tennis serve", "baseball swing"), and verified how precisely the underlying chronological sequence could be recovered. The analysis is repeated five times to obtain standard deviations, and performance is measured in terms of three metrics: (1) what percentage of the 1000 sequences is exactly reordered; (2) how many poses are wrongly positioned; and (3) how bad the reordering mistakes are, computed as the number of swaps necessary to correct a sequence. Fig. 5 shows the results for the latent factor models "lfa2d", "lfa3d" and for the "svd" baseline. It is not possible to study the performance of the clustered baseline "svd+rot" since it does not learn a global matrix U, thus the coefficients in V are not comparable across different viewing angles.

Fig. 5. Action Dynamics Inference Performance. We compare the methods "svd", "lfa2d", and "lfa3d" in the task of reordering shuffled sequences of images sampled from four different sport actions. The color scheme represents actions; the methods are plotted with a different transparency value. The performance is described in terms of: (1) number of sequences exactly reordered; (2) average number of errors contained in a sequence; (3) average number of swaps needed to obtain the correct sequence; (4) accuracy per position in the sequence, shown only for the best two methods ("lfa3d": dark marker, "lfa2d": light marker). Example sequences in Tab. I. Full details in Sec. IV-B2.
The "lfa3d" model has significantly better outcomes compared to "lfa2d" and "svd", which perform similarly. Specifically, "lfa3d" correctly reorders more than twice as many sequences overall (1314 against 555 for "lfa2d"), averages 1.6 errors, and is the only algorithm to require an average number of swaps smaller than 1. Fig. 5-(4) shows the per-position accuracy.
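The three reordering metrics can be computed per sequence as sketched below. The swap count is taken to be the minimum number of arbitrary pairwise swaps (sequence length minus the number of permutation cycles); whether swaps are restricted to adjacent elements is not stated in the text, so that choice is an assumption, as is the function name.

```python
def reordering_metrics(predicted, truth):
    """Compare a recovered frame order against the chronological one.

    predicted, truth: sequences containing the same distinct frame ids.
    Returns (exact, misplaced, swaps): whether the order is exactly right,
    how many positions hold the wrong frame, and the minimum number of
    pairwise swaps needed to turn `predicted` into `truth`.
    """
    predicted, truth = list(predicted), list(truth)
    exact = predicted == truth
    misplaced = sum(p != t for p, t in zip(predicted, truth))
    position = {frame: i for i, frame in enumerate(truth)}
    perm = [position[frame] for frame in predicted]   # where each frame should go
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycles += 1
        k = start
        while not seen[k]:                            # walk one permutation cycle
            seen[k] = True
            k = perm[k]
    return exact, misplaced, len(perm) - cycles
```

Averaging `exact`, `misplaced`, and `swaps` over all shuffled sequences of an action gives per-action summaries of the kind plotted in panels (1) to (3).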
An example sequence for "tennis serve" is shown in Tab. I. Only the "lfa3d" method recovers the order correctly; note how the images are all taken from different viewing angles.
3) Moveme Visualization: The "lfa3d" method can be used to recover and synthesize realistic human motions from static joint locations in images. The underlying idea is that models of human motion can be successfully learned from observations of poses of people performing various actions, as opposed to deriving mathematical principles which define control laws (e.g. inverse kinematics).
The most significant movemes contained in the training set are captured by the bases poses matrix U and encoded in the form of a displacement from the mean pose. Each column of U corresponds to a latent factor that describes some of the movement variability present in the data. Fig. 6 reports the motion described by three latent factors: the rows show the pose obtained by adding an increasing portion of the learned moveme (from 30% -second column, to 100% -last column) to the mean pose of the data (first column). Two are easily interpretable, "soccer kick" and "tennis forehand", while one is not as well defined, "volleyball strike / tennis serve". The movemes differentiate very quickly, as early as 30% of the final movement is added.
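Synthesizing a moveme as a displacement from the mean pose can be sketched in a few lines. The intermediate fractions and the `amplitude` that defines what counts as 100% of the moveme are assumptions; only the 30% and 100% endpoints are mentioned in the text.

```python
import numpy as np

def synthesize_moveme(mean_pose, U, factor, fractions=(0.0, 0.3, 0.65, 1.0),
                      amplitude=1.0):
    """Frames obtained by adding growing portions of one basis pose to the mean.

    mean_pose: (D,) mean pose, U: (D, k) bases poses, factor: column index.
    Returns an array of shape (D, len(fractions)), one frame per column.
    """
    mean_pose = np.asarray(mean_pose, dtype=float)
    basis = amplitude * np.asarray(U, dtype=float)[:, factor]
    return np.stack([mean_pose + a * basis for a in fractions], axis=1)
```

For the 3-D model each returned column is a 3-D pose, which can then be rendered from any viewing angle by rotating it about the vertical axis before plotting.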
We verify empirically that two parameters mainly affect the correspondence between an action and a latent factor (moveme purity): the number of latent factors, and the constraints put on the coefficients of V. We obtain the best visualizations by approximately matching the number of latent factors with the number of recognizable actions contained in the dataset (10 for this experiment), and constraining the coefficients of V to be between 0 and 1.

4) Angle Recovery:
The "lfa3d" method learns a rotation invariant representation by treating the angle of view of each pose as a variable which is optimized through gradient descent (Sec. III-C2 and Fig. 3), and requires an initial guess for each training instance. We investigate how sensitive the model is to initialization, and how close the recovered angle of view is to the ground truth. Fig. 7(a) shows the Root Mean Squared Error (RMSE) and cosine similarity with ground truth, for three initialization methods: (1) "random", between 0 and 2π; (2) "coarse", coarsening into discrete buckets (e.g., 4 clusters indicates that we only know the viewing angle quadrant during initialization); and (3) "ground truth". As the number of clusters increases, we see that performance remains constant for "random" and "ground truth", while both evaluation metrics improve significantly for "coarse" initialization. For instance, using just four clusters, "coarse" initialization obtains almost minimal RMSE and perfect cosine similarity. These results suggest that using very simple heuristics to predict the viewing angle quadrant of a pose is sufficient to obtain optimal performance.
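The "coarse" initialization and the two evaluation metrics can be sketched as follows. Several details are assumptions: angles are snapped to the center of equal-width buckets, the RMSE is computed on differences wrapped to (-π, π], and cosine similarity is taken as the mean cosine of the angular difference. The evaluation in Fig. 7(a) may define these quantities differently.

```python
import numpy as np

def coarsen_angles(theta, n_clusters):
    """Replace each angle by the center of its equal-width bucket in [0, 2*pi)."""
    width = 2.0 * np.pi / n_clusters
    bucket = np.floor(np.mod(theta, 2.0 * np.pi) / width)
    return (bucket + 0.5) * width

def angle_rmse(theta_hat, theta_true):
    """RMSE between angles, with differences wrapped to (-pi, pi]."""
    diff = np.angle(np.exp(1j * (np.asarray(theta_hat) - np.asarray(theta_true))))
    return float(np.sqrt(np.mean(diff ** 2)))

def angle_cosine_similarity(theta_hat, theta_true):
    """Mean cosine of the angular error; 1.0 means perfect agreement."""
    return float(np.mean(np.cos(np.asarray(theta_hat) - np.asarray(theta_true))))
```

With `n_clusters=4`, `coarsen_angles` keeps only the quadrant of each annotated angle, which corresponds to the four-cluster setting discussed above.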

5) Generalization Behaviour:
A desirable property of the obtained model is to be able to reconstruct with low error poses that are not contained in the training set, so the representation is not tied uniquely to the specific image collection it was learned from. To verify the generalization quality of the learned bases poses we trained the "lfa3d" model on a subset of the dataset and measured the RMSE on the remaining part, for an increasingly larger portion of the data. We repeated the experiment five times to obtain standard deviations.
As reported in Fig. 7(b), the RMSE over the training set is approximately constant, while the test set RMSE decreases significantly when going from 10% to 80% of the data used in training. This indicates that the learned latent factors can successfully reconstruct poses of unseen data.
6) Manifold Visualization: Fig. 8 visualizes an embedding of the manifold of human motion learned with the "lfa3d" method. Each pose in LSP is mapped into the human motion space through the coefficients of the corresponding column of V and then projected into two dimensions using t-SNE [37].
Poses describing similar movements are mapped to nearby positions and form consistent clusters, whose relative distance depends on which latent factors are used to reconstruct the contained poses. Upper body movements are mapped closely in the lower right corner, while lower body movements appear at the opposite end of the embedding. The mapping in the manifold is not affected by the direction each pose is facing, as nearby elements may have very different angles of view, confirming that the learned representation is rotation invariant. In Fig. 9, we show the heatmaps obtained from the activations of two latent factors from Fig. 6, overlaid on top of the t-SNE mapping of Fig. 8. To compute the heatmaps, we extract the coefficients for the "soccer kick" and "volleyball strike" latent factors from each column of V corresponding to a location in the embedding, and plot their value after normalization. Clearly, the epicenter of the "volleyball strike" basis pose is located where volleyball-like poses appear in the t-SNE plot (lower-right corner). Noticeable upward arm movements are not as present in many other sports, hence the low intensity of the activation in the rest of the map. Conversely, the "soccer kick" basis pose is mostly dominant in the top-left area and the heatmap is diffused, consistent with the observation that most poses contain some movement of the legs.

V. Conclusion and Future Directions
In this paper, we proposed a model for learning the primitive movements underlying human actions (movemes) from a set of static 2-D poses obtained from images taken at various angles of view. The bases poses are rotation-invariant and learned through a modified latent matrix factorization that intrinsically accounts for geometric properties inherent to viewing angle variability. The approach can be trained efficiently, requires modest effort to identify a reasonable initialization, and yields very good generalization on unseen data.
We investigated the practical use of the learned representation for applications such as activity recognition and inference of action dynamics, observing significantly better performance compared to conventional baselines that do not account for variability of viewing angles. We used the bases poses for synthetic generation of movements, and explored how specific poses are mapped to different parts of the high-dimensional manifold of human motion.
One desirable property of our algorithm is that it is complementary to existing latent factor, pose estimation and feature extraction approaches, and may be used in combination with them to yield a better overall rotation-invariant representation.
An interesting future direction of investigation would be to use the proposed model in a semi-supervised setting where there is some availability of true three-dimensional data along with a large collection of two-dimensional joint locations.
Other possible extensions of our work are: learning to morph actions and synthesize unseen actions from the set of extracted movemes; inferring the location of occluded or missing joints based on the position of the visible ones; applying these techniques to large-scale datasets [38] in conjunction with fine grained annotations of the performed actions [9], [10] to gain new insights on the structure, complexity, and duration of human behaviour.