How to detect silence and cut an MP3 file without re-encoding using NAudio and .NET

I have looked everywhere for an answer and could find only bits and pieces. What I want to do is load several MP3 files (in effect, temporarily merging them) and then cut them into pieces using silence detection.

I understand that I can use Mp3FileReader for this, but I have some questions:

1. How do I read, say, 20 seconds of audio from an MP3 file? Do I need to read reader.WaveFormat.AverageBytesPerSecond bytes 20 times? Or should I keep reading frames until the sum of Mp3Frame.SampleCount / Mp3Frame.SampleRate exceeds 20 seconds?
2. How do I actually detect silence? I would look at an appropriate number of consecutive samples and check whether they are all below some threshold. But how can I access the samples regardless of whether they are 8-bit or 16-bit, mono or stereo, and so on? Can I decode an MP3 frame directly?
3. After I have found silence at, say, sample 10465, how do I map it back to the MP3 frame index so that I can cut without re-encoding?

3 answers

BEFORE READING BELOW: Mark's answer (the re-encoding approach) is much easier to implement, and you will almost certainly be happy with the results. This answer is for those who are willing to spend an inordinate amount of time on the problem.

So, with that said, cutting an MP3 file at silence without re-encoding or fully decoding it is actually possible. Basically, you can look at each frame's side information, and at each granule's gain and Huffman data, to estimate where the silence is.

  • Find the silence.
  • Copy all frames up to the silence to a new file.

Now it gets complicated:

  • Pull the audio data out of the frames after the silence, keeping track of which frame header goes with which audio data.
  • Start writing a second new file, but as you write the frames, update the main_data_begin field so that the bit reservoir stays in sync with where the actual audio data is.
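
As a rough illustration of the last step, here is a minimal sketch that reads main_data_begin from the raw bytes of one Layer III frame. It assumes the standard frame layout: the field sits right after the 4-byte header (and the 2-byte CRC, if present) and is 9 bits wide in MPEG-1 and 8 bits wide in MPEG-2/2.5. The class and method names are made up for this example and are not part of NAudio; with NAudio, the byte array would typically come from Mp3Frame.RawData. Rewriting the field and moving the main data around is not shown.

```csharp
using System;

// Hypothetical helper for illustration; not part of NAudio.
public static class Mp3SideInfo
{
    // Returns main_data_begin: how many bytes before this frame's own
    // main-data area the audio data for this frame starts (bit reservoir).
    public static int ReadMainDataBegin(byte[] frame)
    {
        if (frame == null || frame.Length < 8)
            throw new ArgumentException("Frame is too short.", nameof(frame));
        if (frame[0] != 0xFF || (frame[1] & 0xE0) != 0xE0)
            throw new ArgumentException("Missing frame sync.", nameof(frame));

        int versionBits = (frame[1] >> 3) & 0x3; // 3 = MPEG-1, 2 = MPEG-2, 0 = MPEG-2.5
        int layerBits = (frame[1] >> 1) & 0x3;   // 1 = Layer III
        bool hasCrc = (frame[1] & 0x1) == 0;     // protection bit 0 means a CRC follows

        if (layerBits != 1)
            throw new NotSupportedException("Only Layer III frames carry main_data_begin.");
        if (versionBits == 1)
            throw new NotSupportedException("Reserved MPEG version.");

        // Side information starts after the header and the optional CRC.
        int offset = 4 + (hasCrc ? 2 : 0);

        if (versionBits == 3)
        {
            // MPEG-1: 9 bits, the first byte plus the top bit of the next one.
            return (frame[offset] << 1) | (frame[offset + 1] >> 7);
        }

        // MPEG-2 and MPEG-2.5: 8 bits.
        return frame[offset];
    }
}
```

A value of 0 means the frame's audio data starts inside the frame itself, which is the easy case for cutting; any other value means the frame depends on bytes stored in earlier frames.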

Here is the approach I would recommend (it requires re-encoding):

  • Use AudioFileReader to get the MP3 as floating-point samples directly from its Read method.
  • Find an open-source noise gate algorithm, port it to C#, and use it to detect silence (that is, when the noise gate is closed, you have silence; you will need to tune the threshold and the attack/release times).
  • Create a derived ISampleProvider that uses the noise gate and, in its Read method, does not return the samples that fall in silence.
  • Either pass the output to WaveFileWriter to create a WAV file and then encode that WAV file to MP3, or use NAudio.Lame to encode directly without the WAV step. You will probably need to convert from the sample provider to a 16-bit wave provider first.
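
The gate part of these steps could be sketched roughly as follows. This is a deliberately simplified gate (a threshold plus a hold time, with no attack/release smoothing), not the open-source algorithm mentioned above, and the class name and parameters are invented for this example. It works on a plain float buffer of interleaved samples, such as the one AudioFileReader fills in Read, so a derived ISampleProvider could call it from its own Read method.

```csharp
using System;

// Hypothetical, simplified noise gate; not an NAudio class.
public sealed class SimpleNoiseGate
{
    private readonly float threshold;
    private readonly int channels;
    private readonly int holdFrames;
    private int quietFrames; // consecutive sample frames below the threshold

    public SimpleNoiseGate(float threshold, int holdFrames, int channels)
    {
        if (channels < 1) throw new ArgumentOutOfRangeException(nameof(channels));
        this.threshold = threshold;
        this.holdFrames = holdFrames;
        this.channels = channels;
        quietFrames = holdFrames; // start closed so leading silence is dropped
    }

    // The gate is open while fewer than holdFrames quiet frames have passed.
    public bool IsOpen => quietFrames < holdFrames;

    // Compacts the buffer in place, keeping only the frames that pass the
    // gate, and returns the number of samples kept.
    public int RemoveSilence(float[] buffer, int offset, int count)
    {
        int write = offset;
        int end = offset + count - (count % channels);

        for (int i = offset; i < end; i += channels)
        {
            // Use the loudest channel so that a frame is kept or dropped as
            // a whole and the channels stay aligned.
            float peak = 0f;
            for (int c = 0; c < channels; c++)
                peak = Math.Max(peak, Math.Abs(buffer[i + c]));

            if (peak >= threshold) quietFrames = 0;
            else if (quietFrames < holdFrames) quietFrames++;

            if (quietFrames < holdFrames)
            {
                for (int c = 0; c < channels; c++)
                    buffer[write + c] = buffer[i + c];
                write += channels;
            }
        }

        return write - offset;
    }
}
```

Because the gate keeps its state between calls, it can be fed one Read buffer after another. To split the audio rather than just skip the silence, the same state can be used to start a new output file whenever IsOpen changes from false to true.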

MP3 is a compressed audio format. You can't just cut the bits and expect the rest to remain a valid MP3 file. In fact, since this is a DCT-based transform, the bits are in the frequency domain instead of the time domain. There are simply no bits for sample 10465. There is a frame that contains sample 10465, and there is a set of bits describing all frequencies in this frame.

Simply cutting the audio at sample 10465 and continuing with some arbitrary other sample is likely to cause a discontinuity, which makes the number of frequencies present in the resulting frame skyrocket. So that definitely means a full re-encode. A better way is to smooth the transition, but that is not a trivial operation. And the result is, of course, slightly different from the input, so it still means re-encoding.
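
One simple way to smooth the transition on decoded audio is a short linear fade on each side of the cut. The sketch below is only that, a plain linear fade on mono floating-point samples; the class and method names are invented for this example, and better results would need a proper crossfade or a cut at a zero crossing. As said above, the result still has to be re-encoded.

```csharp
using System;

// Hypothetical helper for illustration; assumes mono float samples.
public static class Fades
{
    // Scales the last fadeLength samples linearly down to zero.
    public static void FadeOut(float[] samples, int fadeLength)
    {
        int n = Math.Min(fadeLength, samples.Length);
        int start = samples.Length - n;
        for (int k = 0; k < n; k++)
            samples[start + k] *= 1f - (k + 1) / (float)n;
    }

    // Scales the first fadeLength samples linearly up from zero.
    public static void FadeIn(float[] samples, int fadeLength)
    {
        int n = Math.Min(fadeLength, samples.Length);
        for (int k = 0; k < n; k++)
            samples[k] *= k / (float)n;
    }
}
```

Applying FadeOut to the piece before the cut and FadeIn to the piece after it makes both sides meet at zero, which avoids the jump in the waveform.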

  • I don’t understand why you want to read 20 seconds of sound anyway. Where does this number come from? Usually you want to read everything.

  • Sound is a wave, so it is fully expected to cross zero. Being close to zero is therefore nothing special. For a 20 Hz wave (the lower limit of hearing), zero crossings occur 40 times per second, and each time you get several samples near zero. So you need many consecutive samples that are close to zero, on both sides of it. Values such as 5, 6, 7 are small for 16-bit audio, but they could well be part of a wave that peaks at 10,000. You really need to check at least 0.05 seconds (one full period at 20 Hz) to catch those 20 Hz sounds.

  • Once you have found silence over a 50-millisecond range, you have a “position” that is a few hundred samples wide. With some luck, there is a frame boundary inside it. Cut there. Otherwise, you are back to re-encoding.
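
The last two points can be sketched as follows, assuming decoded 16-bit mono samples. The first method looks for a run of quiet samples lasting at least a given time, so isolated zero crossings do not count; the second checks whether a frame boundary falls inside that run. All names are invented for this example. The frame size is an input (1152 samples per frame for MPEG-1 Layer III, which NAudio reports as Mp3Frame.SampleCount), and the mapping assumes that decoded sample n belongs to frame n / samplesPerFrame, which ignores decoder delay and the bit reservoir.

```csharp
using System;

// Hypothetical helpers for illustration; not part of NAudio.
public static class SilenceFinder
{
    // Returns the index of the first sample of the first run in which every
    // sample is below the threshold for at least minSeconds, or -1 if none.
    public static int FindFirstSilence(short[] samples, int sampleRate,
                                       int threshold, double minSeconds)
    {
        int minRun = Math.Max(1, (int)Math.Ceiling(minSeconds * sampleRate));
        int run = 0;

        for (int i = 0; i < samples.Length; i++)
        {
            // Cast to int first: Math.Abs(short.MinValue) would overflow.
            if (Math.Abs((int)samples[i]) < threshold)
            {
                run++;
                if (run >= minRun) return i - run + 1;
            }
            else
            {
                run = 0;
            }
        }

        return -1;
    }

    // Returns the index of the first frame that starts inside the silent
    // range [silenceStart, silenceStart + silenceLength), or -1 if no frame
    // boundary falls inside it.
    public static long FindFrameBoundary(long silenceStart, long silenceLength,
                                         int samplesPerFrame)
    {
        long frameIndex = (silenceStart + samplesPerFrame - 1) / samplesPerFrame;
        long boundary = frameIndex * samplesPerFrame;
        return boundary < silenceStart + silenceLength ? frameIndex : -1;
    }
}
```

For example, silence that starts at sample 10465 and lasts 2205 samples (50 ms at 44.1 kHz) contains the start of frame 10 at sample 11520, so the file could be cut between frames 9 and 10. If the method returns -1, no frame starts inside the silence, and re-encoding is the remaining option.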
