HMM gesture recognition algorithm

I want to develop a gesture recognition application using Kinect and hidden Markov models. I watched the tutorial here: HMM lecture

But I don't know where to start. What is a state here, and how do I normalize the data for HMM training? I know (more or less) how this is done for signals and for simple left-to-right cases, but the 3D space is a little confusing to me. Can someone describe how to get started?

In particular, I need to know how to build the model and what the steps of the HMM algorithm should be.

2 answers

One way to do HMM-based gesture recognition is to use an architecture similar to the one commonly used for speech recognition.

The HMM will not be over space but over time: each video frame (or each set of features extracted from a frame) is an emission from an HMM state.
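
As a concrete illustration, here is a minimal Python sketch (the joint layout, torso index, and function names are my assumptions, not something from the original answer) of turning a sequence of Kinect skeleton frames into the per-frame feature vectors that would serve as the HMM's observation sequence:

```python
import numpy as np

def frames_to_observations(skeleton_frames):
    """Convert a gesture clip into an HMM observation sequence.

    skeleton_frames: array of shape (T, J, 3) -- T video frames,
    J tracked joints, 3D coordinates per joint (layout assumed here).
    Returns an array of shape (T, J*3): one feature vector per frame,
    normalized relative to the torso so the gesture is position-invariant.
    """
    frames = np.asarray(skeleton_frames, dtype=float)
    torso = frames[:, 0:1, :]              # assume joint 0 is the torso/hip center
    centered = frames - torso              # translate each frame to the torso
    scale = np.linalg.norm(centered, axis=2).max(axis=1, keepdims=True)
    centered /= scale[:, :, None] + 1e-8   # rough scale normalization
    return centered.reshape(frames.shape[0], -1)

# Example: a 90-frame clip with 20 joints becomes a (90, 60) observation sequence.
clip = np.random.rand(90, 20, 3)
obs = frames_to_observations(clip)
print(obs.shape)   # (90, 60)
```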

Unfortunately, HMM-based speech recognition is a fairly large area. Many books and theses have been written describing different architectures. I recommend starting with Jelinek's "Statistical Methods for Speech Recognition" ( http://books.google.ca/books?id=1C9dzcJTWowC&pg=PR5#v=onepage&q&f=false ), then following the references from there. Another resource is the CMU Sphinx webpage ( http://cmusphinx.sourceforge.net ).

Another thing to keep in mind is that HMM-based systems are probably less accurate than discriminative approaches such as conditional random fields or max-margin recognizers (e.g. SVM-struct).

For an HMM-based recognizer, the overall training process usually looks like this:

1) Do some signal processing on raw data

  • For speech, this means converting the raw audio into mel-cepstral features (MFCCs); for gestures, it may mean extracting image features (SIFT, GIST, etc.).

2) Apply vector quantization (VQ) to the processed data (other dimensionality reduction methods can also be used); a k-means sketch follows this list.

  • Each cluster center is usually associated with a basic unit of the task. In speech recognition, for example, each centroid might be associated with a phoneme. For gesture recognition, each VQ centroid might be associated with a pose or arm configuration.

3) Manually create an HMM whose state transitions capture the sequence of different poses within a gesture (see the left-to-right HMM sketch after this list).

  • The emission distributions of these HMM states will be centered on the VQ vectors from step 2.

  • In speech recognition, these HMMs are built from pronunciation dictionaries that give the sequence of phonemes for each word.

4) Build a single HMM that contains transitions between each individual gesture HMM (or, in the case of speech recognition, each phoneme HMM). Then train the composite HMM on gesture videos (a training sketch also follows the list).

  • At this point, it is also possible to train each gesture HMM individually before the joint training phase. This additional training step may result in better recognizers.
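
For step 2, a hedged sketch of vector quantization using k-means (scikit-learn is just one convenient choice, and the cluster count is an arbitrary assumption). Each frame's feature vector is mapped to the index of its nearest centroid, so every clip becomes a discrete symbol sequence:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(all_frame_features, n_poses=32, seed=0):
    """Cluster every frame from every training clip into n_poses centroids.
    Each centroid plays the role a phoneme plays in speech recognition."""
    kmeans = KMeans(n_clusters=n_poses, random_state=seed, n_init=10)
    kmeans.fit(np.vstack(all_frame_features))
    return kmeans

def quantize(kmeans, clip_features):
    """Map a (T, D) clip to a length-T sequence of VQ symbols (centroid indices)."""
    return kmeans.predict(clip_features)
```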
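
For step 3, a minimal sketch of a hand-built left-to-right gesture HMM with discrete emissions over the VQ symbols; the number of states and the self-loop probability are illustrative choices, not prescribed values:

```python
import numpy as np

def left_to_right_hmm(n_states, n_symbols, stay_prob=0.6):
    """Build a left-to-right HMM: each state either stays put or moves to the
    next state, matching a gesture that passes through an ordered set of poses."""
    A = np.zeros((n_states, n_states))
    for s in range(n_states):
        if s < n_states - 1:
            A[s, s] = stay_prob
            A[s, s + 1] = 1.0 - stay_prob
        else:
            A[s, s] = 1.0                      # final state absorbs
    pi = np.zeros(n_states)
    pi[0] = 1.0                                # gestures start in the first pose
    # Start emissions near-uniform; training (step 4) will sharpen them so each
    # state concentrates probability on the VQ centroids for "its" pose.
    B = np.full((n_states, n_symbols), 1.0 / n_symbols)
    return pi, A, B
```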
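
For step 4, one possible way to train a discrete-emission HMM per gesture before stitching them into the composite model, using hmmlearn as an assumed library (recent versions call the discrete model CategoricalHMM; older releases used MultinomialHMM for the same thing, so adjust to your version):

```python
import numpy as np
from hmmlearn import hmm   # assumed library; any Baum-Welch implementation works

def train_gesture_hmm(symbol_sequences, n_states=5, n_symbols=32):
    """Fit a left-to-right discrete HMM to the VQ symbol sequences of one gesture."""
    pi, A, B = left_to_right_hmm(n_states, n_symbols)   # from the previous sketch
    model = hmm.CategoricalHMM(n_components=n_states, init_params="")
    model.startprob_ = pi
    model.transmat_ = A
    model.emissionprob_ = B
    X = np.concatenate(symbol_sequences).reshape(-1, 1)  # stack all clips
    lengths = [len(s) for s in symbol_sequences]          # clip boundaries
    model.fit(X, lengths)                                 # Baum-Welch re-estimation
    return model
```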

For the recognition process, apply the same signal processing, find the nearest VQ entry for each frame, then find a high-scoring path through the HMM (either the Viterbi path or one of a set of paths from an A* search) given the quantized vectors. This path gives the predicted gestures in the video.
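
A numpy sketch of the decoding step, under the assumption that the composite HMM is represented by (pi, A, B) over VQ symbols (all names here are illustrative). The Viterbi recursion below finds the most likely state path; mapping each state back to the gesture it belongs to then gives the predicted gesture segmentation of the video:

```python
import numpy as np

def viterbi(pi, A, B, symbols):
    """Most likely state path through a discrete-emission HMM.

    pi: (S,) initial state probabilities
    A:  (S, S) transition matrix
    B:  (S, K) emission matrix over VQ symbols
    symbols: length-T sequence of VQ symbol indices
    Returns the best state path as a length-T array of state indices.
    """
    S, T = len(pi), len(symbols)
    logA = np.log(A + 1e-12)
    logB = np.log(B + 1e-12)
    delta = np.log(pi + 1e-12) + logB[:, symbols[0]]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA          # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, symbols[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# state_to_gesture[path[t]] then gives the predicted gesture for frame t.
```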


I implemented a 2-dimensional version of this for the PGM Coursera class, which has Kinect gestures as its final unit.

https://www.coursera.org/course/pgm

Basically, the idea is that an HMM alone cannot classify poses very well. In our unit, I used a variation of k-means to segment poses into probabilistic categories; the HMM was then used to decide which sequences of poses were actually viable as gestures. Any clustering algorithm run over a set of poses is a good candidate, even if you don't know exactly what kind of pose each cluster corresponds to.

From there, you can build a model that learns from the cumulative probabilities of each possible pose for each Kinect data point.
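
One way to read that (my interpretation, not the poster's actual assignment code): instead of a hard nearest-centroid assignment, keep a full probability distribution over pose clusters for every frame, for example via a softmax over negative distances to the k-means centroids, and feed those per-frame pose probabilities to the HMM:

```python
import numpy as np

def pose_probabilities(frame_features, centroids, temperature=1.0):
    """Soft-assign each frame to the learned pose clusters.

    frame_features: (T, D) per-frame features; centroids: (K, D) pose centers.
    Returns a (T, K) matrix where row t is a probability distribution over
    poses for frame t (softmax of negative squared distances -- one simple
    choice, not necessarily what the course assignment used).
    """
    d2 = ((frame_features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```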

I know this is a bit of a sparse answer. The class provides an excellent overview of the state of the art, but the problem as a whole is too complex to compress into an easy answer. (I would recommend taking the class in April if you are interested in this field.)

