Overview
Hand gesture is a potentially useful modality for visual-based interaction (VBI) in human-computer interaction (HCI), but estimating 3D hand posture and pose under camera viewpoint variations using 2D image is a very challenging problem due to the high intrinsic degrees of freedom (DOF) and the self-occlusion problem. The work mainly focuses on appearance-based 3D hand posture and pose estimation. We formulate the posture and pose problem to a nonlinear manifold learning, representation, and retrieval problem. With the 3D realistic hand model with commonly used hand postures, a large synthetic hand images database is built under different camera viewpoints. Each image with the shape features and the hand posture and pose ground-truth is a point on a highly nonlinear, complex hand pose and posture manifold. We apply the H-ISOSOM algorithm to learn the manifold with a concise, organized structure. Given the hand image, the features such as shape context based on 2D intensity edge and 2.5D depth edge are extracted. The relevant representatives with the hand posture and pose ground-truth on the manifold are retrieved by the learned map. Thus, the hand pose and posture are estimated.
Details
1. Introduction
Despite the rapid advances in computing communication and display technologies, the development of Human Computer Interaction (HCI) still lags behind. Gesture is a good candidate for the next generation input devices. It has the potential ability to relieve the interaction bottleneck between users and the computer. Vision-based gesture interpretation is a promising research area for this problem due to its passive and non-intrusive sensing properties.
Many approaches of 3D hand pose estimation to support gesture recognition may be classified into two categories: model-based approaches with 3D data and appearance-based approaches with 2D data. In this research, we take an image retrieval approach based on analysis by synthesis method. It utilizes a 3D realistic hand model and renders it from different viewpoints to generate synthetic hand images. Each image with the shape features and the hand posture and pose ground-truth is a point on a highly nonlinear, complex hand pose and posture manifold. We apply the H-ISOSOM algorithm to learn the manifold with a concise, organized structure. Given the hand image, the features such as shape context based on 2D intensity edge and 2.5D depth edge are extracted. The relevant representatives with the hand posture and pose ground-truth on the manifold are retrieved by the learned map. Thus, a set of possible hand images with the ground-truth posture and pose candidates is found. Because hand pose estimation is such a complex and high-dimensional problem, the pose estimate representing the best match may not be correct. Thus, the retrieval is considered successful if at least one of the candidates in top N matches is sufficiently close to the ground-truth. If N is small enough, with additional distinguishable contextual information, it may be adequate for automatic initialization and re-initialization problems in hand tracking systems or sign language recognition systems, where the correct estimation could be found and the incorrect ones could be eliminated in the later tracking.
2. Method
2.1 Problem formulation
Given the joint angles and the camera pose, the synthetic hand images are generated, each image is associated with hand configuration ground truth. The features of the synthetic hand image are extracted and represented by feature vectors. The hand configuration (the joint angle parameters), the camera viewpoint corresponding to the image, and the feature vector together is formed to big vectors in high dimensional space. We use those data samples to learn the manifold. In the estimation step, given an input hand image, the feature vector are extracted and compared with the neuron vectors with the corresponding components. The corresponding hand configuration part and the camera view point of the best match units are retrieved to be the possible candidates. The critical problem is how to learn and represent the manifold.

2.2 ISOSOM
The intuitive depiction of ISOSOM is illustrated in the figure. The learning process of ISOSOM algorithm contains three main steps: First, we construct a distance graph G of the manifold over all data points in input space X. The graph takes all points as nodes and adds an edge between two nodes i and j if i is one of the k nearest neighbors of j. The cost of the edge is measured by the Euclidean distance between I and j. The distance of any two nodes on the graph is defined by the cost of the shortest path between them. The low dimensional embedding Y and fxy-1 is constructed by the classic MDS algorithm based on the distance graph as in the Isomap algorithm. This step focuses on the global relationships of the data samples and encodes it to the dimensional embedding.
The second step is the exemplar learning and organization by the SOM algorithm. The data samples in the intermediate space Y are used as the training samples to train the organized structure. Similar to the map structure of SOM, the ISOSOM representative map is a lower dimensional lattice, A, formed by a set of organized processing units, called representatives. The representatives are connected with their neighbors on the lattice. Each representative ai is labeled by an index
and has reference vectors Yai attached. In each ISOSOM training step, we randomly choose a sample vector y from the training samples, and the response of a representative to the vector y is determined by the comparisons between y and the reference vector Yai of each representative with the geometric distance defined by the distance graph. The Best Matching Unit (BMU) is defined as the winner representative ai which has its reference vector Yai closest to the given input y. After obtaining the BMU, its prototype vectors Yai and its topological neighbors are updated and moved closer to the input vector y. Isomap maps a set of points from the high dimensional space X to a set of points in the low dimension space Y . For each sample point in X, there is a point in Y corresponding to it. After SOM training, the associate vector Yai of each representative is the new point in Y space generated by SOM iterative training. We need its corresponding point Xai in the high dimension space X. Since Isomap is a nonlinear dimension reduction techniques, it is hard to find the exact inverse mapping. Thus, we approximate its corresponding points in the original space X by the local linear interpolation (LLI) algorithm in the third step.
The LLI algorithm is motivated by the idea of Locally Linear Embedding (LLE), but the objective of LLI is to project the points from the low dimensional space Y to the high dimensional space X rather than reducing the dimensionality.
Given a query vector with full components or partial components with the mask w, the best match representatives are retrieved by the following similarity measurement:

where w(xa) is the mask function defined by W representing the existing components.
2.3 HISOSOM
Due to the computational complexity of ISOMAP, ISOSOM cannot handle large scale data sets (for example, more than one million training samples). In order to solve this problem, we present a hierarchical version of ISOSOM. The intuitive depiction of the Hierarchical-ISOSOM is illustrated in the figure. H-ISOSOM aims to defuse the computation complexity problem for large scale data sets, to improve the accuracy, and to construct a hierarchical structure for the fast retrieval or indexing. Hierarchical algorithms can be either divisive or agglomerative, i.e., top-down or bottom-up. Divisive hierarchical algorithms begin with the coarsest possible partition and split groups apart step by step. Alternatively, agglomerative hierarchical algorithms, which are widely used in clustering, start from the finest possible structure (each data point forms a cluster) and merge together at different levels by a certain criterion. We adopt the divisive strategy.
The complexity of the Isomap is O(N3) (where N is the size of the training sample)
4. The complexity for SOM is O(NMD) 5. Both of them have difficulties if the training sample size is more than 20,000. In order to make the training of large scale data sets tractable, H-ISOSOM follows a coarse-to-fine strategy. Let Nmax be the upper limit of the sample size that the algorithm of O(N3) complexity can handle in practice. HISOSOM first randomly samples the whole data set and obtains the subset with m samples, where m < Nmax, to train the first layer of the H-ISOSOM map using the ISOSOM algorithm. After that, for each representative on the map, it collects the training samples from the original data set and sub-samples them to train the next layer. The algorithm iteratively trains the sub-maps with ISOSOM until a certain criterion is reached. The criteria used in here are the mean quantization error (MQE) and signal-noise-ratio (SNR) between the map representative data set Smap and the training sample data set Strn. MQE is the average distance of each sample in the data set Strn to its nearest points in the map data set Smap (the quantization error). In the retrieval stage, given the input data, H-ISOSOM retrieves the top k BMUs, and continues to find the refined best match results in the next layer until the last layer is reached. Then it reorders all the retrieved nodes from the hierarchical structure according to the distance metric to obtain the final BMU set.

2.4 Depth Edges
A key limitation of hand feature extraction is hand shape acquisition. Most techniques are based on intensity edge detectors (using, for example, the Canny operator), which fail to capture important shape boundaries along the hand due to low intensity variation across skin-color regions. In addition, intensity edges include many unwanted edges due to background clutter and texture (such as wrinkles and nails). Rather than using intensity edge detectors for hand shape acquisition, we adopt a 2.5D hand representation, detecting edges only at depth discontinuities. A natural way to detect depth edges is to first reconstruct the scene (using stereo algorithms) and then look for discontinuities. However, 3D estimation algorithms are limited to produce accurate results exactly at depth discontinuities, due to occlusions and violation of smoothness constraints.
We use a Non-photorealistic Camera to bypass scene reconstruction and detect depth edges directly. We rely on a multi-flash camera with flashes strategically positioned to cast shadows along depth discontinuities. This allows us to reliably acquire hand shape (including internal finger edges), while considerably reducing background clutter. It shows that that depth discontinuities may be used as a signature to reliably discriminate among complex hand configurations. The figure compares hand shape acquisition based on depth edges and intensity edges. We show the comparison of our detected hand depth edges and the rendered 3D model edges, which are very similar.
Shape context, a translation invariant shape descriptor, gives the one-to-one correspondence for the sample points in two shapes, provides the transformations from one shape to another, and measures the similarity between shapes. We adopt the variant version of shape context, which is modified for scale invariance, and uses a 256-dimension feature vector to represent the shape.
3. Experiment results
Our synthesis database contains 15 commonly used hand gestures. These 15 gestures are defined at the semantic level. For each hand gesture, 16 hand configurations, which are created by adding a small turbulence to the basic hand configuration, are rended in 81 viewpoints sampled uniformly from the surface of the 3D view sphere, which represents the global rotation of the hand with respect to the camera. Overall, we generate a database containing 19440 synthetic images. For each hand configuration, 48 joint angle angle parameters are saved as the joint angle vectors, which are 3 rotation parameters for the hand, 9 parameters (3 rotation parameters for 3 joints respectively) for each finger and thumb. In addition to the 3 global rotation parameters of the camera, the hand pose vector is composed of these 51 parameters.

We collect a real hand image dataset with the similar 15 hand gestures in 10 different viewpoints, which are approximately sampled uniformly from the surface of the 3D view sphere. We capture 1 ∼ 2 sets of images for each pose. The real hand image database totally contains 226 cases. We label the pseudo-ground truth for each hand image by manually identifying the similar images in the synthesis database and assigning its hand pose parameters to the real image. The real hand images are captured with several kinds of backgrounds, some of them contains similar skin color and clutter.
In our experiments, we use depth edges and the variant of shape context descriptor to represent the given image. Each hand shape is represented by a 256 component vector. We define the correct match if the hand gesture of the real image is the same as one of the hand gestures in the top N retrieved images (with small changes in the hand configuration). The ISOSOM retrieval results are shown in the figures. The first image in the first row is the query image. The second image in the first row is the model generated by its pseudo-ground truth. The remaining 20 images are the retrieval results from the ISOSOM neurons. Generally speaking, the ISOSOM/H-ISOSOM is much faster than the exhausted search (nearest neighbor) and also increases the retrieval accuracy.
4. Conclusion
We have proposed the ISOSOM and H-ISOSOM algorithm and applied them to the problem of 3D hand pose estimation from a single 2D image. Instead of representing each synthetic image by an isolated item in the database, the idea is to construct a hand manifold with a small set of representatives organized by a structure. Thus, we reduce the information redundancy in the database by clustering data in the low dimensional manifold. The retrieval is done by searching in a small set of representatives instead of exhaustedly searching in the whole database. The experiments shows that our ISOSOM and HISOSOM are outperform the nearest neighbor and SOM algorithm.
References
[1] Rosales, R., Athitsos, V., Sclaroff, S.: 3d hand pose reconstruction using specialized mappings. In: International Conference on Computer Vision (ICCV). (2001) 378-385.
[2] Vassilis Athitsos and Stan Sclaroff. An appearance-based framework for 3D hand shape classification and camera viewpoint estimation. In Proceedings of the Fifth IEEE International Conf. on Automatic Face and Gesture Recognition, 2002.
[3] Vassilis Athitsos and Stan Sclaroff. Estimating 3D hand pose from a cluttered image. In Computer Vision and Pattern Recognition (CVPR), pages 432–439, June 2003.
[4] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, (5500):2319 –2323, Dec 2000.
[5] S. Roweis and L. Sault. Nonlinear dimensionality reduction by locally linear embedding. Science, (5500):2323 –2326, Dec 2000.
[6] Ramesh Raskar, Karhan Tan, Rogerio Feris, Jingyi Yu, and Matthew Turk. A nonphotorealistic camera: Depth edge detection and stylized rendering with multiflash
imaging. In ACM SIGGRAPH, 2004.
Publications
[1] Haiying Guan and Matthew Turk, “The Hierarchical Isometric Self-Organizing Map for Manifold Representation”, In IEEE Workshop on Component Analysis Methods for Classification, Clustering, Modeling, and Estimation Problems in Computer Vision, Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), pp. 1-8, 2007. (PDF)(Slides)
[2] Haiying Guan, Rogerio S. Feris, and Matthew Turk, “The Isometric Self-Organizing Map for 3D Hand Pose Estimation”, In Proc. of the 7th International Conference Automatic Face and Gesture Recognition (FG'06), pp. 263-268, 2006. (PDF) (Poster)
[3] Haiying Guan and Matthew Turk, “3D Hand Pose Reconstruction with ISOSOM”, In Proc. of the First International Symposium on Advances in Visual Computing (ISVC'05), pp. 630-635, 2005.