It seems that you do not know about the Point Cloud Library (PCL). It is an open-source library designed to process point clouds and RGB-D data, which relies on OpenNI for low-level operations and provides many high-level algorithms, such as registration, segmentation and recognition.
A very interesting shape/object recognition algorithm is the implicit shape model. To detect a global object (for example, a car or an open hand), you first detect its possible parts (wheels, body, etc., or fingers, palm, wrist, etc.) with a local feature detector, and then infer the position of the global object by considering the density and relative positions of its parts. For example, if I can detect five fingers, a palm and a wrist in a certain area, there is a good chance that I am in fact looking at a hand; however, if I only find one finger and a wrist somewhere, it could be a pair of false detections. A research article on this implicit shape model algorithm can be found here.
PCL has a couple of tutorials dedicated to shape recognition and, luckily, one of them covers the implicit shape model, which has been implemented in PCL. I have never tested that implementation, but from what I could read in the tutorial, you can feed it your own point clouds to train the classifier.
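To give you an idea of what the training step looks like, here is a rough sketch based on my reading of that tutorial. I have not run it myself, and the class and method names (`pcl::ism::ImplicitShapeModelEstimation`, `trainISM`, etc.) are taken from the tutorial as I remember it, so they may differ between PCL versions:

```cpp
// Sketch of training an implicit shape model on your own labelled clouds,
// following the PCL implicit shape model tutorial (untested, names may vary
// between PCL versions).
#include <pcl/point_types.h>
#include <pcl/features/fpfh.h>
#include <pcl/recognition/implicit_shape_model.h>

// Assume these have been filled with your own training data:
// one cloud + normals + class id (e.g. 0 = open hand, 1 = fist) per example.
std::vector<pcl::PointCloud<pcl::PointXYZ>::Ptr> training_clouds;
std::vector<pcl::PointCloud<pcl::Normal>::Ptr>   training_normals;
std::vector<unsigned int>                        training_classes;

// Local feature descriptor used to describe the object parts (FPFH in the tutorial).
pcl::FPFHEstimation<pcl::PointXYZ, pcl::Normal, pcl::Histogram<153> >::Ptr fpfh
  (new pcl::FPFHEstimation<pcl::PointXYZ, pcl::Normal, pcl::Histogram<153> >);
fpfh->setRadiusSearch(30.0);
pcl::Feature<pcl::PointXYZ, pcl::Histogram<153> >::Ptr feature_estimator(fpfh);

// Train the implicit shape model on the labelled clouds.
pcl::ism::ImplicitShapeModelEstimation<153, pcl::PointXYZ, pcl::Normal> ism;
ism.setFeatureEstimator(feature_estimator);
ism.setTrainingClouds(training_clouds);
ism.setTrainingNormals(training_normals);
ism.setTrainingClasses(training_classes);
ism.setSamplingSize(2.0f);

pcl::ism::ImplicitShapeModelEstimation<153, pcl::PointXYZ, pcl::Normal>::ISMModelPtr
  model(new pcl::features::ISMModel);
ism.trainISM(model);

// At detection time the tutorial calls findObjects() on a test cloud to cast
// votes for the object centre, then findStrongestPeaks() on the vote list.
```

Again, treat this as a starting point for reading the tutorial rather than as working code.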
You did not mention it explicitly in your question, but since your goal is to program a hand-controlled application, you might actually be interested in a real-time shape detection algorithm. You would have to test how fast the implicit shape model provided in PCL is, but I think this approach is better suited to offline shape recognition.
If you do need real-time shape recognition, I think you should first use a hand tracking algorithm (which is usually faster than full detection) to know where to look, rather than trying to perform full shape detection on every frame of your RGB-D stream. You could, for example, track the hand location by segmenting the depth map (e.g. by thresholding at an appropriate depth) and then detecting its extremities.
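As a minimal sketch of that idea, assuming you can get each depth frame as an OpenCV `cv::Mat` in millimetres (the depth band and the helper name `segmentHand` are just illustrative), the segmentation step could look like this:

```cpp
// Minimal sketch: segment a hand candidate by thresholding the depth map.
// Assumes `depth` is a CV_16UC1 depth image in millimetres from the RGB-D stream;
// the 400-900 mm band is an arbitrary example and depends on your setup.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Point> segmentHand(const cv::Mat& depth)
{
    // Keep only points within a plausible "hand" depth band.
    cv::Mat mask;
    cv::inRange(depth, cv::Scalar(400), cv::Scalar(900), mask);

    // Remove small speckles before extracting contours.
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN,
                     cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5)));

    // Take the largest blob in the band as the hand candidate.
    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Point> hand;
    double best_area = 0.0;
    for (const auto& c : contours)
    {
        const double area = cv::contourArea(c);
        if (area > best_area) { best_area = area; hand = c; }
    }
    return hand;  // empty if nothing was found in the depth band
}
```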
Then, once you know roughly where the hand is, it should be easier to decide whether it is making one of the gestures relevant to your application. I am not sure exactly what you mean by fist/grab gestures, but I suggest you define and use gestures that are both easy to perform and easy to distinguish from one another.
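One simple heuristic you could try (just an illustration, not something from the PCL tutorial): count the deep convexity defects of the hand contour, i.e. the valleys between fingers. Several deep defects suggest an open hand, almost none suggests a fist. The thresholds below are guesses and would need tuning:

```cpp
// Illustrative heuristic: distinguish an open hand from a fist by counting
// deep convexity defects of the contour returned by the segmentation step.
#include <opencv2/opencv.hpp>
#include <vector>

bool looksLikeOpenHand(const std::vector<cv::Point>& hand_contour)
{
    if (hand_contour.size() < 5)
        return false;

    // Convex hull as point indices, as required by convexityDefects().
    std::vector<int> hull_indices;
    cv::convexHull(hand_contour, hull_indices, false, false);
    if (hull_indices.size() < 3)
        return false;

    std::vector<cv::Vec4i> defects;
    cv::convexityDefects(hand_contour, hull_indices, defects);

    // A defect's depth is stored in fixed point (1/256 pixel); count only the
    // defects deep enough to correspond to the gap between two fingers.
    int deep_defects = 0;
    for (const auto& d : defects)
        if (d[3] / 256.0 > 20.0)   // ~20 px, tune for your camera and hand size
            ++deep_defects;

    // An open hand typically shows 3-4 deep defects, a fist close to none.
    return deep_defects >= 3;
}
```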
Hope this helps.