Streaming microphone audio to the IBM Watson SpeechToText web service using the Java SDK

I am trying to send continuous audio from a microphone directly to the IBM Watson SpeechToText web service using the Java SDK. One of the examples shipped with the distribution (RecognizeUsingWebSocketsExample) shows how to stream a .WAV file to the service. However, .WAV files require that their length be known in advance, so the naive approach of simply appending one buffer after another to the file is not possible.
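
To make the length constraint concrete, here is a minimal sketch (the class name is mine, not part of the SDK or the original example) showing that an AudioInputStream wrapped around a live microphone line has no known length, which is exactly what the WAVE writer needs before it can emit a header to a plain output stream:

import javax.sound.sampled.*;
import java.io.ByteArrayOutputStream;

public class WavLengthProblem {
    public static void main(String[] args) throws Exception {
        AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
        TargetDataLine line = (TargetDataLine) AudioSystem.getLine(
                new DataLine.Info(TargetDataLine.class, format));
        line.open(format);
        line.start();

        AudioInputStream audio = new AudioInputStream(line);
        System.out.println(audio.getFrameLength());   // AudioSystem.NOT_SPECIFIED (-1)

        // The WAV header must carry the data length up front, so streaming an
        // unbounded capture straight out as WAVE fails with an IOException:
        AudioSystem.write(audio, AudioFileFormat.Type.WAVE, new ByteArrayOutputStream());
    }
}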

It looks like SpeechToText.recognizeUsingWebSocket can take a stream, but passing it an instance of AudioInputStream does not appear to work: the connection is established, yet no transcripts are returned, even with RecognizeOptions.interimResults(true).

public class RecognizeUsingWebSocketsExample {
  private static CountDownLatch lock = new CountDownLatch(1);

  public static void main(String[] args) throws FileNotFoundException, InterruptedException {
    SpeechToText service = new SpeechToText();
    service.setUsernameAndPassword("<username>", "<password>");

    AudioInputStream audio = null;
    try {
      final AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
      DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
      TargetDataLine line;
      line = (TargetDataLine) AudioSystem.getLine(info);
      line.open(format);
      line.start();
      audio = new AudioInputStream(line);
    } catch (LineUnavailableException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }

    RecognizeOptions options = new RecognizeOptions.Builder()
        .continuous(true)
        .interimResults(true)
        .contentType(HttpMediaType.AUDIO_WAV)
        .build();

    service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
      @Override
      public void onTranscription(SpeechResults speechResults) {
        System.out.println(speechResults);
        if (speechResults.isFinal())
          lock.countDown();
      }
    });

    lock.await(1, TimeUnit.MINUTES);
  }
}

Any help would be greatly appreciated.

-rg

Here's an update based on German's comment below (thanks for that).

I was able to use javaFlacEncode to convert the WAV stream coming from the microphone into a FLAC stream and save it to a temporary file. Unlike a WAV audio file, whose size is fixed at creation, a FLAC file can easily be appended to.

WAV_audioInputStream = new AudioInputStream(line);
FileInputStream FLAC_audioInputStream = new FileInputStream(tempFile);

StreamConfiguration streamConfiguration = new StreamConfiguration();
streamConfiguration.setSampleRate(16000);
streamConfiguration.setBitsPerSample(8);
streamConfiguration.setChannelCount(1);

flacEncoder = new FLACEncoder();
flacOutputStream = new FLACFileOutputStream(tempFile);  // write to temp disk file

flacEncoder.setStreamConfiguration(streamConfiguration);
flacEncoder.setOutputStream(flacOutputStream);
flacEncoder.openFLACStream();

...
// convert data
int frameLength = 16000;
int[] intBuffer = new int[frameLength];
byte[] byteBuffer = new byte[frameLength];

while (true) {
    int count = WAV_audioInputStream.read(byteBuffer, 0, frameLength);
    for (int j1 = 0; j1 < count; j1++)
        intBuffer[j1] = byteBuffer[j1];
    flacEncoder.addSamples(intBuffer, count);
    flacEncoder.encodeSamples(count, false);  // 'false' means non-final frame
}

flacEncoder.encodeSamples(flacEncoder.samplesAvailableToEncode(), true);  // final frame
WAV_audioInputStream.close();
flacOutputStream.close();
FLAC_audioInputStream.close();

The resulting file can be recognized without any problems (using curl or recognizeUsingWebSocket()) after appending an arbitrary number of frames. However, recognizeUsingWebSocket() returns the final result as soon as it reaches the end of the FLAC file, even though the last frame written may not be final (i.e., it was written with encodeSamples(count, false)).

I would expect recognizeUsingWebSocket() to block until the final frame is written to the file. In practice, this means that the analysis stops after the first frame, since it takes less time to analyze the first frame than to collect the second, so by the time results come back the end of the file has already been reached.

Is this the right way to implement streaming microphone audio in Java? It seems like a common use case.


Here's a modification of RecognizeUsingWebSocketsExample, incorporating some of Daniel's suggestions below. It uses the PCM content type (passed as a String along with the sample rate) and an attempt to signal the end of the audio stream, albeit not a very successful one.

As before, the connection is established, but the recognition callback is never called. Closing the stream does not seem to be interpreted as end-of-audio. I must be misunderstanding something here ...

public static void main(String[] args) throws IOException, LineUnavailableException, InterruptedException {

    final PipedOutputStream output = new PipedOutputStream();
    final PipedInputStream input = new PipedInputStream(output);

    final AudioFormat format = new AudioFormat(16000, 8, 1, true, false);
    DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
    final TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
    line.open(format);
    line.start();

    Thread thread1 = new Thread(new Runnable() {
        @Override
        public void run() {
            try {
                final int MAX_FRAMES = 2;
                byte buffer[] = new byte[16000];
                for (int j1 = 0; j1 < MAX_FRAMES; j1++) {  // read two frames from microphone
                    int count = line.read(buffer, 0, buffer.length);
                    System.out.println("Read audio frame from line: " + count);
                    output.write(buffer, 0, buffer.length);
                    System.out.println("Written audio frame to pipe: " + count);
                }
                /** no need to fake end-of-audio; StopMessage will be sent
                 * automatically by SDK once the pipe is drained (see WebSocketManager)
                // signal end of audio; based on WebSocketUploader.stop() source
                byte[] stopData = new byte[0];
                output.write(stopData);
                **/
            } catch (IOException e) {
            }
        }
    });
    thread1.start();

    final CountDownLatch lock = new CountDownLatch(1);

    SpeechToText service = new SpeechToText();
    service.setUsernameAndPassword("<username>", "<password>");

    RecognizeOptions options = new RecognizeOptions.Builder()
        .continuous(true)
        .interimResults(false)
        .contentType("audio/pcm; rate=16000")
        .build();

    service.recognizeUsingWebSocket(input, options, new BaseRecognizeCallback() {
        @Override
        public void onConnected() {
            System.out.println("Connected.");
        }

        @Override
        public void onTranscription(SpeechResults speechResults) {
            System.out.println("Received results.");
            System.out.println(speechResults);
            if (speechResults.isFinal())
                lock.countDown();
        }
    });

    System.out.println("Waiting for STT callback ... ");

    lock.await(5, TimeUnit.SECONDS);

    line.stop();

    System.out.println("Done waiting for STT callback.");
}

Dani, I examined the source of WebSocketManager (supplied with the SDK) and supplemented the sendMessage() call with an explicit StopMessage, as follows:

/**
 * Send input steam.
 *
 * @param inputStream the input stream
 * @throws IOException Signals that an I/O exception has occurred.
 */
private void sendInputSteam(InputStream inputStream) throws IOException {
    int cumulative = 0;
    byte[] buffer = new byte[FOUR_KB];
    int read;
    while ((read = inputStream.read(buffer)) > 0) {
        cumulative += read;
        if (read == FOUR_KB) {
            socket.sendMessage(RequestBody.create(WebSocket.BINARY, buffer));
        } else {
            System.out.println("completed sending " + cumulative / 16000 + " frames over socket");
            socket.sendMessage(RequestBody.create(WebSocket.BINARY,
                    Arrays.copyOfRange(buffer, 0, read)));  // partial buffer write
            System.out.println("signaling end of audio");
            socket.sendMessage(RequestBody.create(WebSocket.TEXT,
                    buildStopMessage().toString()));  // end of audio signal
        }
    }
    inputStream.close();
}

Neither sendMessage() variant (sending zero-length binary content, or sending the stop text message) seems to work. The caller code is unchanged from the above. The result:

Waiting for STT callback ...
Connected.
Read audio frame from line: 16000
Written audio frame to pipe: 16000
Read audio frame from line: 16000
Written audio frame to pipe: 16000
completed sending 2 frames over socket
onFailure: java.net.SocketException: Software caused connection abort: socket write error

REVISED: in fact, the end-of-audio signaling is never reached. The exception is thrown while writing the last (partial) buffer to the socket.

Why is the connection being aborted? That usually happens when the peer closes the connection.

Regarding point 2): does that matter at this stage? It seems the recognition process is not being started at all ... The audio is valid (I wrote the stream to disk and was able to recognize it by passing it in from a file, as indicated above).

In addition, on further review of the WebSocketManager source code, onMessage() already sends a StopMessage immediately after sendInputSteam() returns (i.e., when the audio stream, or the pipe in the example above, is drained), so there is no need to send it explicitly. The problem clearly occurs before the audio transmission completes. The behavior is the same whether a PipedInputStream or an AudioInputStream is used as input; the exception is thrown while sending binary data in both cases.

+5
2 answers

This is supported by the Java SDK, and there is an example of it.

Update your pom.xml with

<dependency>
  <groupId>com.ibm.watson.developer_cloud</groupId>
  <artifactId>java-sdk</artifactId>
  <version>3.3.1</version>
</dependency>

Here is an example of how to listen to your microphone.

SpeechToText service = new SpeechToText();
service.setUsernameAndPassword("<username>", "<password>");

// Signed PCM AudioFormat with 16kHz, 16 bit sample size, mono
int sampleRate = 16000;
AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);

if (!AudioSystem.isLineSupported(info)) {
    System.out.println("Line not supported");
    System.exit(0);
}

TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
line.open(format);
line.start();

AudioInputStream audio = new AudioInputStream(line);

RecognizeOptions options = new RecognizeOptions.Builder()
    .continuous(true)
    .interimResults(true)
    .timestamps(true)
    .wordConfidence(true)
    //.inactivityTimeout(5) // use this to stop listening when the speaker pauses, ie for 5s
    .contentType(HttpMediaType.AUDIO_RAW + "; rate=" + sampleRate)
    .build();

service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
    @Override
    public void onTranscription(SpeechResults speechResults) {
        System.out.println(speechResults);
    }
});

System.out.println("Listening to your voice for the next 30s...");
Thread.sleep(30 * 1000);

// closing the WebSockets underlying InputStream will close the WebSocket itself.
line.stop();
line.close();

System.out.println("Fin.");
+6

What you need to do is send the audio to the STT service not as a file, but as a stream with no audio headers at all. You simply push the samples you capture from the microphone over the WebSocket. The content type to set is "audio/pcm; rate=16000", where 16000 is the sampling rate in Hz. If your sampling rate is different, depending on how the microphone encodes the audio, you replace 16000 with your value, e.g. 44100, 48000, etc.
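
For instance, a short sketch of deriving that content type from the capture format (this simply parameterizes the RecognizeOptions calls shown earlier; the 44100 Hz rate is only an illustration):

// Derive the rate from whatever AudioFormat the microphone line is opened with,
// so the content type always matches the captured audio.
AudioFormat format = new AudioFormat(44100, 16, 1, true, false);   // illustrative rate
String contentType = "audio/pcm; rate=" + (int) format.getSampleRate();

RecognizeOptions options = new RecognizeOptions.Builder()
    .continuous(true)
    .interimResults(true)
    .contentType(contentType)   // e.g. "audio/pcm; rate=44100"
    .build();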

When the audio is PCM, the STT service keeps recognizing until you signal the end of the audio by sending an empty binary message over the websocket.
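
As a rough illustration only (not part of the SDK's public API), that empty binary message could look like the following, reusing the same okhttp socket and RequestBody calls that appear in the sendInputSteam() listing in the question above:

// Sketch: send a zero-length binary frame to tell the service the audio is finished.
// 'socket' is assumed to be the same okhttp WebSocket field used by WebSocketManager
// in the sendInputSteam() listing above - a hypothetical helper, not SDK API.
private void signalEndOfAudio() throws IOException {
    socket.sendMessage(RequestBody.create(WebSocket.BINARY, new byte[0]));
}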

Dani


Looking at the new version of your code, I see some problems:

1) Signaling the end of the audio is done by sending an empty binary message over the websocket, which is not what you are doing. The lines

// signal end of audio; based on WebSocketUploader.stop() source
byte[] stopData = new byte[0];
output.write(stopData);

do nothing, because they do not send an empty websocket message. Could you call the method WebSocketUploader.stop()?

2) You are capturing audio with 8 bits per sample; you should use 16 bits to get decent quality. Also, you are only feeding a couple of seconds of audio, which is not ideal for testing. Could you write the audio that you push to STT to a file and then open it in Audacity (using the import feature)? That way you can make sure you are feeding good audio to STT (a minimal capture sketch follows below).
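
In case it is useful, here is a minimal, self-contained sketch of that check (the class and output file names are made up): it captures roughly ten seconds from the microphone at 16 kHz, 16-bit mono and writes a WAV file that can be opened directly in Audacity.

import javax.sound.sampled.*;
import java.io.File;

public class MicCaptureCheck {
    public static void main(String[] args) throws Exception {
        // 16 kHz, 16-bit, mono, signed PCM, little-endian - the format recommended above
        AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
        final TargetDataLine line = (TargetDataLine) AudioSystem.getLine(
                new DataLine.Info(TargetDataLine.class, format));
        line.open(format);
        line.start();

        // Writing to a File lets AudioSystem patch the WAV header once the stream ends
        final AudioInputStream audio = new AudioInputStream(line);
        Thread writer = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    AudioSystem.write(audio, AudioFileFormat.Type.WAVE, new File("mic-check.wav"));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        writer.start();

        Thread.sleep(10 * 1000);   // capture roughly 10 seconds
        line.stop();
        line.close();              // ends the stream, so the writer thread can finish
        writer.join();
    }
}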
0
