1) This is the right approach: incorporate voice recognition into a service, using something like the Google API, where callback methods are used to get the results. For it to work continuously, the service must hold a WakeLock, which keeps the device from going into sleep mode. For more information, see the Android documentation on wake locks. This has one big drawback: high battery consumption, caused by the processor running continuously and constantly processing incoming audio data. (This can be reduced using filters, thresholds, etc.)
2) Voice recognition is not an easy task. It requires a huge amount of computation and reference data. If the input sound is not clear (noise, several human voices, etc.), it is harder to get the right output. What can be done to improve accuracy is to filter the input sound: noise reduction, a low-pass filter, etc. You cannot expect 100% accuracy, but you can reach 80-95%.
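As a rough illustration of the low-pass filtering mentioned above, here is a minimal sketch of a single-pole low-pass filter in plain Python. The function name `low_pass` and the cutoff and sample-rate values are my own assumptions for the example, not something taken from a particular recognition library; a real pipeline would more likely use a proper DSP library and a carefully chosen cutoff.

```python
import math


def low_pass(samples, cutoff_hz, sample_rate_hz):
    """Single-pole IIR low-pass filter (exponential smoothing).

    `samples` is a sequence of floats; the cutoff and sample rate are
    illustrative parameters chosen by the caller.
    """
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)  # filter time constant
    dt = 1.0 / sample_rate_hz               # time between samples
    alpha = dt / (rc + dt)                  # smoothing factor in (0, 1)

    out = []
    prev = 0.0
    for x in samples:
        # Move the output a fraction `alpha` of the way towards the input.
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out


# Hypothetical usage: a signal flipping sign on every sample (the highest
# frequency representable at this sample rate) is strongly attenuated.
noisy = [1.0 if i % 2 == 0 else -1.0 for i in range(200)]
smoothed = low_pass(noisy, cutoff_hz=300.0, sample_rate_hz=8000.0)
```

Slowly changing parts of the signal pass through almost unchanged, while fast changes are damped. A lower cutoff removes more high-frequency noise but also more of the speech itself, so the value has to be tuned.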
It's more difficult to filter out other human voices. But some simple amplitude (sound power level) algorithms can be used with an adaptive threshold that decides when a word begins and ends. The idea is that the correct voice is the loudest, i.e. the one closest to the phone / device. Thus, regarding 4), accuracy is better when the user speaks close to the microphone, because theirs is the loudest voice.
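The amplitude idea can be sketched roughly like this in plain Python. The names `frame_energies` and `detect_segments`, the `ratio` and `adapt` values, and the assumption that the first frame contains only background noise are all mine, for illustration only.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_segments(energies, ratio=4.0, adapt=0.05):
    """Return (start_frame, end_frame) pairs, end exclusive, for runs of
    frames whose energy exceeds `ratio` times the running noise floor.

    Assumption: the first frame is background noise and seeds the floor.
    """
    if not energies:
        return []
    noise = max(energies[0], 1e-12)  # avoid a zero threshold on pure silence
    segments = []
    start = None
    for i, energy in enumerate(energies):
        if energy > ratio * noise:
            if start is None:
                start = i  # a loud run (word) begins
        else:
            if start is not None:
                segments.append((start, i))  # the loud run ends
                start = None
            # Adapt the noise floor on quiet frames only, so that speech
            # does not drag the threshold upwards.
            noise = max((1.0 - adapt) * noise + adapt * energy, 1e-12)
    if start is not None:
        segments.append((start, len(energies)))
    return segments


# Hypothetical usage: quiet background, a short loud burst, quiet again.
samples = [0.01] * 50 + [0.5] * 30 + [0.01] * 50
segments = detect_segments(frame_energies(samples, frame_len=10))
```

Because the threshold follows the background level, a quieter voice further from the microphone tends to stay below it, while the closest (loudest) speaker crosses it. This only finds where a word starts and stops; it does not recognize anything.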
3) I do not know what you mean by a sensor, but there are algorithms that simply detect the presence of a human voice rather than decode words. These algorithms are called voice activity detection (VAD). Some code can be found in the Speex project documentation: http://www.speex.org/
The easiest way to do speech recognition is to use the Google Speech API, which is very good and recognizes many languages, but it requires an Internet connection and takes some time to return the result. CMU Sphinx is faster, but it has only a few language models, and it needs more RAM and processor time, since all decoding is done on the device. In my opinion, it works very well when the dictionary (the set of words to be recognized) is small, like commands (left, right, back, stop, start, etc.).