You could try calculating two simple "statistics". The first is the spread (max - min); for silence it will be very low. The second is variation: divide the range of possible values into 16 brackets (sub-ranges of values) and, as you go through the elements, determine which bracket each element falls into and count it. Noise will have roughly the same count in every bracket, while music or speech should favor some brackets and neglect others.
This can be done in just one pass through the array, and you do not need complicated arithmetic, just adding and comparing values.
You could also consider some approximation, for example taking only every fourth value, which reduces the number of tested elements to 80. For an audio signal, that should be good enough.
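As a rough sketch of the idea above: the function below computes both statistics in a single pass. It assumes integer samples in the signed 16-bit range; the name `spread_and_histogram` and the parameters `lo`, `hi`, `brackets`, and `step` are made up for illustration, so adjust them to your actual sample format.

```python
def spread_and_histogram(samples, lo=-32768, hi=32767, brackets=16, step=1):
    """Return (spread, counts) for samples[::step] in one pass.

    Assumes integer samples within [lo, hi] (signed 16-bit by default).
    spread is max - min of the visited samples; counts[i] is how many
    visited samples fell into bracket i.
    """
    counts = [0] * brackets
    total_values = hi - lo + 1          # number of possible sample values
    smallest, largest = None, None

    for i in range(0, len(samples), step):
        value = samples[i]

        # Track the running minimum and maximum (comparisons only).
        if smallest is None or value < smallest:
            smallest = value
        if largest is None or value > largest:
            largest = value

        # Map the value to a bracket index in 0..brackets-1.
        index = (value - lo) * brackets // total_values
        index = min(max(index, 0), brackets - 1)  # guard against out-of-range input
        counts[index] += 1

    if smallest is None:                # no samples visited
        return 0, counts
    return largest - smallest, counts


# Example: 320 samples, every fourth one tested (80 elements).
spread, counts = spread_and_histogram([0] * 320, step=4)
```

A small spread suggests silence. Roughly equal counts across brackets suggest noise, and counts concentrated in a few brackets suggest music or speech. The actual thresholds are not given here and would need tuning on real data. With a power-of-two range like this one, the bracket index could also be computed with a bit shift instead of a multiply and divide.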
PeterK