Multi-view Appearance-based 3D Hand Pose Estimation
Haiying Guan, Jae Sik Chang, Longbin Chen, Rogerio S. Feris, and Matthew Turk
Jan. 2005 - July 2007
CS Department, University of California, Santa Barbara
Overview
We describe a novel approach to appearance-based hand pose estimation which relies on multiple cameras to improve accuracy and resolve ambiguities caused by self-occlusions. Rather than estimating 3D geometry, our approach uses multiple views to extend current exemplar-based methods for estimating hand pose by matching a probe image with a large discrete set of labeled hand pose images. We formulate the problem in a MAP framework, where the information from multiple cameras is fused to provide reliable hand pose estimation. Our quantitative experimental results show that the correct estimation rate is much higher using our multi-view approach than using a single-view approach.
Details
1.1 Motivation
Occlusion is a major problem for appearance-based systems: when parts of the hand hide one another, a single side-view image may not contain enough information to resolve the ambiguity. Fig. 1 shows an extreme example. Given only the side view of each gesture (second row), it is hard to tell the actual postures apart. We therefore propose a multi-view hand pose estimation algorithm.

Figure 1. The hand occlusion problem in appearance-based posture estimation
1.2 MAP framework using multi-view images
To make effective use of the information from multiple cameras, we propose a MAP framework based on Bayesian theory. Assuming the conditional independence shown in Fig. 2, the posterior probability of the hand state given the hand images is:


Figure 2. Conditional independence of the hand images
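As a sketch of the structure this independence assumption implies, Bayes' rule gives the factorization below. The symbols $s$ for the hand state and $I_1, \dots, I_N$ for the images from the $N$ cameras are notation assumed here for illustration and may differ from the authors' own.

```latex
P(s \mid I_1, \dots, I_N)
  = \frac{P(s)\,\prod_{k=1}^{N} P(I_k \mid s)}{P(I_1, \dots, I_N)}
  \;\propto\; P(s)\,\prod_{k=1}^{N} P(I_k \mid s)
```

The denominator does not depend on $s$, so the MAP estimate is the state that maximizes the product of the prior and the per-view likelihoods.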
The probability of a hand image given the hand state is

The likelihood of the real hand image given the hand model image can be estimated by

Maximizing the posterior probability then reduces to minimizing the following energy function:
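To illustrate why a product of per-view likelihoods turns into a sum to be minimized, here is a minimal sketch. It assumes that each view contributes a non-negative distance between the observed image and the model image of a candidate state, that the likelihood decays exponentially with that distance, and that the prior over states is uniform unless a log prior is supplied. The function names and the toy distances are hypothetical and are not taken from the authors' implementation.

```python
def fused_energy(view_distances, log_prior=0.0):
    """Energy of one candidate state.

    Under the assumed exponential likelihood, -log of the posterior is
    (up to a constant) the sum of per-view distances minus the log prior.
    """
    return sum(view_distances) - log_prior


def rank_candidates(distances_by_state, k):
    """Return the k candidate states with the lowest fused energy.

    distances_by_state maps a state label to a list of per-view distances.
    """
    ranked = sorted(
        distances_by_state,
        key=lambda state: fused_energy(distances_by_state[state]),
    )
    return ranked[:k]


# Hypothetical toy data: pose_a and pose_b look alike from camera 1
# (first entry) but are clearly separated by camera 2 (second entry).
distances = {
    "pose_a": [0.20, 0.90],
    "pose_b": [0.25, 0.10],
    "pose_c": [0.80, 0.70],
}

best_single = min(distances, key=lambda state: distances[state][0])
best_fused = rank_candidates(distances, k=1)[0]
```

In this toy case camera 1 alone prefers `pose_a`, while the fused energy prefers `pose_b`, because the second view penalizes `pose_a` heavily.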
1.3 Homogeneous transformations
Fig. 3 illustrates the system setup. The cameras can be placed at any locations around the hand, as long as they point toward it.

Figure 3. The system setup
The transformation from camera 2 to camera 1 is given by
The transformation from hand frame to camera frame is given by
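To illustrate how such transformations compose, here is a minimal sketch in plain Python. It assumes each transformation is a rigid 4x4 homogeneous matrix built from a 3x3 rotation and a translation. The helper names, the frame names, and the numeric values are hypothetical; they do not reflect the calibration of the actual system.

```python
import math


def make_transform(rotation, translation):
    """Build the 4x4 homogeneous matrix [[R, t], [0, 1]]."""
    rows = [list(rotation[i]) + [translation[i]] for i in range(3)]
    rows.append([0.0, 0.0, 0.0, 1.0])
    return rows


def compose(a, b):
    """Matrix product a * b: apply b first, then a."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def invert(t):
    """Invert a rigid transform: rotation R^T, translation -R^T t."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    new_t = [-sum(rt[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return make_transform(rt, new_t)


def transform_point(t, point):
    """Map a 3D point through the homogeneous matrix t."""
    p = list(point) + [1.0]
    return [sum(t[i][k] * p[k] for k in range(4)) for i in range(3)]


def rot_y(angle):
    """Rotation about the y axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


# Hypothetical setup: camera 2 is rotated 90 degrees about y relative to camera 1.
cam1_from_cam2 = make_transform(rot_y(math.pi / 2), [0.5, 0.0, 0.5])
cam1_from_hand = make_transform(rot_y(0.0), [0.0, 0.0, 0.5])

# Hand pose as seen from camera 2: hand frame -> camera 1 -> camera 2.
cam2_from_hand = compose(invert(cam1_from_cam2), cam1_from_hand)

fingertip_in_hand = [0.0, 0.1, 0.0]
fingertip_in_cam2 = transform_point(cam2_from_hand, fingertip_in_hand)
```

Knowing the camera-to-camera transformation means a single hand pose determines what every camera should see, which is what lets one candidate hand state be compared against all views at once.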

1.4 Hand Contour Extraction
We build an active contour model based on level set methods.
In our model, the motion of the common boundary between regions is governed by the following region competition equation (illustrated in Fig. 5):

The level set equation is given by


Figure 5. Active contour model
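For orientation, the standard textbook forms of these two kinds of equation are sketched below. They are an assumption about the general shape of the model, not a transcription of the authors' equations. In the region competition formulation of Zhu and Yuille, a boundary point $\vec{v}$ shared by regions $R_i$ and $R_j$ moves along the normal $\vec{n}$ under a curvature-based smoothing term (weight $\mu$, curvature $\kappa$) and a competition between the log-likelihoods of the two region models $\alpha_i$ and $\alpha_j$. In the level set formulation, a contour moving with normal speed $F$ is represented as the zero level set of a function $\phi$.

```latex
\frac{d\vec{v}}{dt}
  = \Bigl[ -\mu\,\kappa
           + \log P\bigl(I(\vec{v}) \mid \alpha_i\bigr)
           - \log P\bigl(I(\vec{v}) \mid \alpha_j\bigr) \Bigr]\,\vec{n},
\qquad
\frac{\partial \phi}{\partial t} + F\,\lvert \nabla \phi \rvert = 0
```

The signs depend on the chosen orientation of the normal. Intuitively, the boundary moves so that each pixel ends up in the region whose model explains its intensity better.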
Fig. 6 shows the experimental results for hand contour extraction and hand region detection.

Figure 6. Hand contour extraction and hand region detection results
1.5 Experimental Results
1.5.1 Synthetic dataset
Our synthetic dataset contains 15 gestures, and each gesture is captured from 448 viewpoints evenly sampled from a 3D view sphere. The synthetic dataset contains 20160 images in total.
1.5.2 Real image test dataset
Our test dataset contains 7 hand states for each gesture, for a total of 254 cases and 508 hand images (two images per case, one from each camera).
Table 1 compares the retrieval rates obtained using the hand images captured by Camera 1 only, by Camera 2 only, and by both cameras.
Camera \ K | 50 | 100 | 150 | 200 | 250 | 300 |
Cam 1 only | 19.69 | 30.71 | 36.61 | 39.76 | 44.88 | 49.61 |
Cam 2 only | 14.57 | 23.23 | 29.13 | 32.68 | 37.40 | 40.55 |
Both | 48.03 | 61.42 | 69.69 | 74.80 | 78.74 | 81.50 |
Table 1. Comparison of retrieval rates (%)
Figure 7 shows the comparison curves. Retrieval performance using two viewpoints is much better than using a single viewpoint at every K: at K = 300, for example, the two-view rate is 81.50%, compared with 49.61% for Camera 1 alone and 40.55% for Camera 2 alone.
Figure 7. Performance comparison of the single-view and two-view algorithms
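A retrieval rate at K can be computed as sketched below, assuming that a test case counts as correct when its ground-truth hand state appears among the K best-ranked candidates. This definition, the function name, and the toy ranks are assumptions made for illustration.

```python
def retrieval_rate(true_ranks, k):
    """Percentage of test cases whose ground-truth state is ranked within the top k.

    true_ranks holds, for each test case, the 1-based rank of the correct state
    in the list of retrieved candidates.
    """
    if not true_ranks:
        raise ValueError("true_ranks must not be empty")
    hits = sum(1 for rank in true_ranks if rank <= k)
    return 100.0 * hits / len(true_ranks)


# Hypothetical ranks of the correct state for five test cases.
ranks = [3, 120, 47, 260, 51]
rates = {k: retrieval_rate(ranks, k) for k in (50, 100, 150)}
```

With 254 test cases, the reported rate of 19.69% at K = 50 for Camera 1 corresponds to 50 cases out of 254, and the two-view rate of 81.50% at K = 300 corresponds to 207 out of 254.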
1.6 Conclusions
In this project, we proposed a MAP framework to resolve the ambiguities caused by occlusion, which is the main problem of appearance-based pose estimation algorithms. The experimental results show that the multi-view algorithm based on the MAP framework performs much better than the single-view algorithm. An active contour model based on the level set algorithm is implemented for hand contour extraction and hand region detection, and it works well against complex backgrounds.
Publications
[1] Haiying Guan, Longbin Chen, and Matthew Turk, “Multi-view Hand Pose and Posture Manifold Representation using the Hierarchical Isometric Self-Organizing Map”, submitted to Image and Vision Computing, Aug. 2, 2008.
[2] Haiying Guan, Jae Sik Chang, Longbin Chen, Rogerio S. Feris, and Matthew Turk, “Multi-view appearance-based 3D hand pose estimation”, In IEEE Workshop on Vision for Human Computer Interaction, Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'06), pp. 154-159, 2006.