One way to use HMMs for gesture recognition would be to adopt an architecture similar to the one commonly used for speech recognition. The HMM would operate not over space but over time, with each video frame (or a set of features extracted from the frame) representing an emission from an HMM state.
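As a rough illustration of that idea (not part of the original recipe), here is a minimal sketch using hmmlearn's GaussianHMM, with random numbers standing in for real per-frame features; the steps below quantize the features first, but the frames-as-emissions structure is the same:

```python
# Minimal sketch: a video becomes a sequence of per-frame feature vectors,
# and an HMM models how those vectors evolve over time.
# The 64-dim features and 5 states are placeholder choices, not values
# prescribed by this answer.
import numpy as np
from hmmlearn import hmm

n_frames, n_feature_dims = 120, 64
video_features = np.random.rand(n_frames, n_feature_dims)  # stand-in for real frame features

model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
model.fit(video_features)                     # each row is one frame's "emission"
log_likelihood = model.score(video_features)  # how well this model explains the video
```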
Unfortunately, HMM-based speech recognition is a fairly large area; many books and theses have been written describing different architectures. I recommend starting with Jelinek's "Statistical Methods for Speech Recognition" ( http://books.google.ca/books?id=1C9dzcJTWowC&pg=PR5#v=onepage&q&f=false ) and following the references from there. Another resource is the CMU Sphinx webpage ( http://cmusphinx.sourceforge.net ).
Another thing to keep in mind is that HMM-based systems are probably less accurate than discriminative approaches such as conditional random fields or max-margin recognizers (e.g., SVM-struct).
For an HMM-based recognizer, the overall training process usually looks like this:
1) Do some signal processing on the raw data
- For speech this means converting the raw audio to mel-cepstral features (MFCCs); for gestures it might mean extracting image features (SIFT, GIST, etc.).
2) Apply vector quantization (VQ) to the processed data (other dimensionality-reduction methods can also be used); steps 2-4 are sketched in code after this list
- Each cluster centroid is usually associated with a basic unit of the task. In speech recognition, for instance, each centroid could be associated with a phoneme. For gesture recognition, each VQ centroid could be associated with a pose or arm configuration.
3) Hand-design an HMM whose state transitions capture the sequence of different poses within a gesture.
The emission distributions of these HMM states will be centered on the VQ vectors from step 2.
In speech recognition, these HMMs are built from pronunciation dictionaries that give the sequence of phonemes for each word.
4) Combine the individual gesture HMMs (or, for speech recognition, the individual phoneme HMMs) into a single composite HMM with transitions between them, then train the composite HMM on gesture videos.
- At this point you can also train each gesture HMM individually before the joint training phase. This extra training step may yield a better recognizer.
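Since the answer doesn't name any particular toolkit, here is a minimal sketch of steps 2-4 assuming scikit-learn's KMeans as the vector quantizer and hmmlearn's CategoricalHMM (the discrete-emission model in recent hmmlearn releases) for left-to-right gesture HMMs. extract_features, training_videos, and videos_by_gesture are hypothetical placeholders for the step-1 signal processing and a labeled training set:

```python
import numpy as np
from sklearn.cluster import KMeans
from hmmlearn import hmm

N_CODEWORDS = 32  # size of the VQ codebook (placeholder choice)
N_STATES = 4      # poses per gesture HMM (placeholder choice)

# --- Step 2: vector quantization over all training frames ---
all_frames = np.vstack([extract_features(v) for v in training_videos])  # hypothetical
codebook = KMeans(n_clusters=N_CODEWORDS, n_init=10).fit(all_frames)

def quantize(video):
    """Map each frame's feature vector to the index of its nearest centroid."""
    return codebook.predict(extract_features(video)).reshape(-1, 1)

# --- Step 3: a left-to-right HMM per gesture ---
def make_gesture_hmm():
    model = hmm.CategoricalHMM(n_components=N_STATES, n_features=N_CODEWORDS,
                               init_params="e")  # start/transitions set by hand below
    model.startprob_ = np.r_[1.0, np.zeros(N_STATES - 1)]  # always begin in state 0
    # Left-to-right transitions: stay in the current pose or advance to the next.
    trans = np.zeros((N_STATES, N_STATES))
    for i in range(N_STATES - 1):
        trans[i, i], trans[i, i + 1] = 0.7, 0.3
    trans[-1, -1] = 1.0
    model.transmat_ = trans
    return model

# --- Step 4 (with the optional per-gesture pre-training) ---
gesture_models = {}
for name, videos in videos_by_gesture.items():  # hypothetical dict of labeled videos
    seqs = [quantize(v) for v in videos]
    X, lengths = np.vstack(seqs), [len(s) for s in seqs]
    m = make_gesture_hmm()
    m.fit(X, lengths)  # EM keeps forbidden (zero) transitions at zero
    gesture_models[name] = m
```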
For recognition, apply the signal-processing step, map each frame to its nearest VQ codeword, and then find a high-scoring path through the composite HMM (either the Viterbi path or one of the set of paths from an A* search) given the quantized vectors. This path gives the predicted gestures in the video.
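Continuing the sketch above, one hedged way to build the composite HMM and run the Viterbi step: the redistribution of each gesture's final-state self-loop mass to gesture entry states is an assumption standing in for a learned gesture-to-gesture transition model, and new_video is again a placeholder:

```python
import numpy as np
from hmmlearn import hmm

def build_composite(gesture_models, n_codewords):
    """Stack the per-gesture HMMs into one big HMM; from each gesture's last
    state, allow a jump to the entry state of any gesture."""
    names = list(gesture_models)
    offsets, total = {}, 0
    for name in names:
        offsets[name] = total
        total += gesture_models[name].n_components

    start = np.zeros(total)
    trans = np.zeros((total, total))
    emis = np.zeros((total, n_codewords))
    for name in names:
        m, off = gesture_models[name], offsets[name]
        k = m.n_components
        start[off] = 1.0 / len(names)  # any gesture may start the video
        trans[off:off + k, off:off + k] = m.transmat_
        emis[off:off + k, :] = m.emissionprob_
        # Redirect half of the final state's self-loop mass to gesture entries.
        jump = 0.5 * trans[off + k - 1, off + k - 1]
        trans[off + k - 1, off + k - 1] -= jump
        for other in names:
            trans[off + k - 1, offsets[other]] += jump / len(names)

    composite = hmm.CategoricalHMM(n_components=total, n_features=n_codewords,
                                   init_params="", params="")
    composite.startprob_, composite.transmat_, composite.emissionprob_ = start, trans, emis
    state_to_gesture = [n for n in names for _ in range(gesture_models[n].n_components)]
    return composite, state_to_gesture

# Recognition: quantize the frames, run Viterbi, read gestures off the path.
composite, state_to_gesture = build_composite(gesture_models, N_CODEWORDS)
obs = quantize(new_video)                        # hypothetical, as in the sketch above
_, path = composite.decode(obs, algorithm="viterbi")
predicted = [state_to_gesture[s] for s in path]  # per-frame gesture labels
```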