I am trying to correlate some audio files with some written passages of text.
I started with a single audio file of someone reading the typed passage. I then split the audio at each period of silence with sox, and similarly split the typed text so that each sentence is on its own line.
The splits did not fall exactly at every period, but wherever the speaker paused. I need to create a list of the audio files matched to the typed sentences, for example:
0001.wav This is a sentence.
0002.wav This is another sentence.
Please note that sometimes two or more audio files correspond to a single sentence, for example:
- 0001.wav ("This is a") + 0002.wav ("sentence") = "This is a sentence."
To help with text matching, I used software for counting syllables in audio and counting syllables in typed text.
I have two files with this data. The first, "sentences.txt", lists every sentence from the text, one per line, preceded by its number of syllables, for example:
5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.
I can strip out the sentence text with awk -F" " '{ print $1 }' sentences.txt > syllables_in_text.txt to get this syllables_in_text.txt:
5
7
8
9
The second, "syllables_in_audio.txt", lists the audio files, one per line, each followed by its number of syllables:
0001.wav 3
0002.wav 2
0003.wav 4
0004.wav 5
0005.wav 7
0006.wav 3
0007.wav 2
0008.wav 3
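I need to produce a file ("output.txt") in which each line lists the audio files that match the corresponding line of "sentences.txt", like this: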
0001.wav 0002.wav
0003.wav 0004.wav
0005.wav
0006.wav 0007.wav 0009.wav
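Files that belong to the same sentence, such as "0001.wav" and "0002.wav", go on the same line, separated by " ". Line 1 of "output.txt" then corresponds to line 1 of "sentences.txt", and so on: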
Contents of "output.txt":  | Contents of "sentences.txt":
0001.wav 0002.wav          | 5 This is a sentence.
0003.wav 0004.wav          | 7 This is another sentence.
0005.wav                   | 8 This is yet another sentence.
0006.wav 0007.wav 0009.wav | 9 This is still yet another sentence.
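
One way this grouping could be framed, as a rough sketch rather than a finished solution: treat it as splitting the ordered list of audio files into consecutive groups, one group per sentence, so that the total difference between each group's syllable sum and the sentence's syllable count is as small as possible. The function name `align`, the absolute-difference cost, and the rule that every sentence gets at least one audio file are all assumptions made for illustration. The data is the sample from the two files above, typed in directly rather than read from disk.

```python
def align(audio, sentence_counts):
    """Split `audio`, a list of (name, syllables) pairs in playback order,
    into len(sentence_counts) consecutive non-empty groups, minimising the
    sum of |group syllables - sentence syllables| (an assumed cost)."""
    n, m = len(audio), len(sentence_counts)
    if m > n:
        raise ValueError("fewer audio files than sentences")

    # prefix[i] = total syllables in the first i audio files
    prefix = [0]
    for _, syllables in audio:
        prefix.append(prefix[-1] + syllables)

    inf = float("inf")
    # cost[j][i] = best cost of matching the first i files to the first j sentences
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0

    for j in range(1, m + 1):
        for i in range(j, n + 1):
            # sentence j takes files k..i-1; k is where the previous group ended
            for k in range(j - 1, i):
                if cost[j - 1][k] == inf:
                    continue
                diff = abs(prefix[i] - prefix[k] - sentence_counts[j - 1])
                candidate = cost[j - 1][k] + diff
                if candidate < cost[j][i]:
                    cost[j][i] = candidate
                    back[j][i] = k

    # walk the back-pointers from the end to recover the groups
    groups = []
    i = n
    for j in range(m, 0, -1):
        k = back[j][i]
        groups.append([name for name, _ in audio[k:i]])
        i = k
    groups.reverse()
    return groups


# Sample data copied from syllables_in_audio.txt and syllables_in_text.txt
audio = [
    ("0001.wav", 3), ("0002.wav", 2), ("0003.wav", 4), ("0004.wav", 5),
    ("0005.wav", 7), ("0006.wav", 3), ("0007.wav", 2), ("0008.wav", 3),
]
sentence_counts = [5, 7, 8, 9]

# One line per sentence, file names separated by a space
for group in align(audio, sentence_counts):
    print(" ".join(group))
```

Because the syllable counts are only approximate, the sketch minimises the overall mismatch instead of requiring exact sums. A different cost, such as squared difference, may suit noisier counts better.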