I'm attempting to send continuous audio from a microphone directly to the IBM Watson SpeechToText web service using the Java SDK. One of the examples provided with the distribution (RecognizeUsingWebSocketsExample) shows how to stream a .WAV file to the service. However, .WAV files require that their length be declared in advance, so the naive approach of simply appending one buffer at a time to the file won't work.
It appears that SpeechToText.recognizeUsingWebSocket can take a stream, but handing it an instance of AudioInputStream doesn't do the trick: the connection looks like it gets established, but no transcripts are returned, even with RecognizeOptions.interimResults(true) set.
public class RecognizeUsingWebSocketsExample {
    private static CountDownLatch lock = new CountDownLatch(1);

    public static void main(String[] args) throws FileNotFoundException, InterruptedException {
        SpeechToText service = new SpeechToText();
        service.setUsernameAndPassword("<username>", "<password>");

        AudioInputStream audio = null;
        try {
            // capture from the default microphone: 16 kHz, 16-bit, mono, signed, little-endian PCM
            final AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
            DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
            TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
            line.open(format);
            line.start();
            audio = new AudioInputStream(line);
        } catch (LineUnavailableException e) {
            e.printStackTrace();
        }
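The rest of main follows the pattern of the shipped example: build RecognizeOptions and hand the stream plus a callback to recognizeUsingWebSocket. A sketch of that tail end, using the SDK names of that era (HttpMediaType.AUDIO_WAV, BaseRecognizeCallback); the exact builder and callback methods may differ in other SDK versions:

        RecognizeOptions options = new RecognizeOptions.Builder()
            .continuous(true)
            .interimResults(true)                  // ask for partial transcripts as audio arrives
            .contentType(HttpMediaType.AUDIO_WAV)  // content type used by the shipped example
            .build();

        service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
            @Override
            public void onTranscription(SpeechResults speechResults) {
                System.out.println(speechResults);
            }
        });

        lock.await(1, TimeUnit.MINUTES);  // keep main alive while results arrive
    }
}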
Any help would be greatly appreciated.
-rg
Here's an update based on German's comment below (thanks for that).
I was able to use javaFlacEncode to convert the WAV stream coming from the microphone into a FLAC stream and save it in a temporary file. Unlike a WAV audio file, whose size is fixed at creation, a FLAC file can easily be appended to.
WAV_audioInputStream = new AudioInputStream(line);
FileInputStream FLAC_audioInputStream = new FileInputStream(tempFile);

StreamConfiguration streamConfiguration = new StreamConfiguration();
streamConfiguration.setSampleRate(16000);
streamConfiguration.setBitsPerSample(8);
streamConfiguration.setChannelCount(1);

flacEncoder = new FLACEncoder();
flacOutputStream = new FLACFileOutputStream(tempFile);  // write to temp disk file
flacEncoder.setStreamConfiguration(streamConfiguration);
flacEncoder.setOutputStream(flacOutputStream);
flacEncoder.openFLACStream();

...
// convert data
int frameLength = 16000;
int[] intBuffer = new int[frameLength];
byte[] byteBuffer = new byte[frameLength];
while (true) {
    int count = WAV_audioInputStream.read(byteBuffer, 0, frameLength);
    for (int j1 = 0; j1 < count; j1++)
        intBuffer[j1] = byteBuffer[j1];
    flacEncoder.addSamples(intBuffer, count);
    flacEncoder.encodeSamples(count, false);  // 'false' means non-final frame
}

flacEncoder.encodeSamples(flacEncoder.samplesAvailableToEncode(), true);  // final frame
WAV_audioInputStream.close();
flacOutputStream.close();
FLAC_audioInputStream.close();
The resulting file can be recognized (using curl or recognizeUsingWebSocket()) without any problems after appending an arbitrary number of frames. However, recognizeUsingWebSocket() returns its final result as soon as it reaches the end of the FLAC file, even if the last frame written was not marked as final (i.e., it was written with encodeSamples(count, false)).
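For the curl check, I posted the file to the v1 HTTP endpoint roughly as follows (endpoint and headers per the Watson docs of that era; temp.flac stands in for my temporary file):

curl -X POST -u "<username>:<password>" \
     --header "Content-Type: audio/flac" \
     --data-binary @temp.flac \
     "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"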
I had expected recognizeUsingWebSocket() to block until a final frame was written to the file. In practical terms, this means the analysis stops after the first frame, since it takes less time to analyze the first frame than to collect the second, so by the time results are returned the end of the file has already been reached.
Is this the right way to implement streaming audio from a microphone in Java? It seems like a common use case.
Here's a modification of RecognizeUsingWebSocketsExample, incorporating some of Daniel's suggestions below. It uses the PCM content type (passed as a String, together with the sample rate) and attempts to signal the end of the audio stream, though not very successfully.
As before, the connection is established, but the recognize callback is never invoked. Closing the stream doesn't seem to be interpreted as end-of-audio. I must be misunderstanding something here ...
public static void main(String[] args) throws IOException, LineUnavailableException, InterruptedException {
    final PipedOutputStream output = new PipedOutputStream();
    final PipedInputStream input = new PipedInputStream(output);

    final AudioFormat format = new AudioFormat(16000, 8, 1, true, false);
    DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
    final TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
    line.open(format);
    line.start();

    Thread thread1 = new Thread(new Runnable() {
        @Override
        public void run() {
            try {
                final int MAX_FRAMES = 2;
                byte buffer[] = new byte[16000];
                // pump a fixed number of one-second frames from the microphone into the pipe
                for (int j1 = 0; j1 < MAX_FRAMES; j1++) {
                    int count = line.read(buffer, 0, buffer.length);
                    System.out.println("Read audio frame from line: " + count);
                    output.write(buffer, 0, count);
                    System.out.println("Written audio frame to pipe: " + count);
                }
                // closing the pipe is meant to signal end-of-audio
                output.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    });
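The recognize call then consumes the read end of the pipe, and the audio thread is started once the WebSocket is up. A sketch of that remainder (callback method names per the RecognizeCallback interface of the 3.x-era SDK, so treat the details as approximate; lock is the CountDownLatch from the original example):

    SpeechToText service = new SpeechToText();
    service.setUsernameAndPassword("<username>", "<password>");

    RecognizeOptions options = new RecognizeOptions.Builder()
        .continuous(true)
        .interimResults(true)
        .contentType("audio/l16; rate=16000")  // PCM content type passed as a String
        .build();

    System.out.println("Waiting for STT callback ...");
    service.recognizeUsingWebSocket(input, options, new BaseRecognizeCallback() {
        @Override
        public void onConnected() {
            System.out.println("Connected.");
        }

        @Override
        public void onTranscription(SpeechResults speechResults) {
            System.out.println(speechResults);
        }

        @Override
        public void onError(Exception e) {
            System.out.println("onFailure: " + e);
        }
    });

    thread1.start();                  // begin feeding microphone audio into the pipe
    lock.await(1, TimeUnit.MINUTES);  // wait for transcription results
}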
Dani, I dug into the source for WebSocketManager (supplied with the SDK) and replaced the sendMessage() call with one sending an explicit stop message, as follows:
private void sendInputSteam(InputStream inputStream) throws IOException {
    int cumulative = 0;
    byte[] buffer = new byte[FOUR_KB];
    int read;
    while ((read = inputStream.read(buffer)) > 0) {
        cumulative += read;
        if (read == FOUR_KB) {
            socket.sendMessage(RequestBody.create(WebSocket.BINARY, buffer));
        } else {
            System.out.println("completed sending " + cumulative / 16000 + " frames over socket");
            // send the final (partial) buffer, followed by the explicit stop message (see below)
            socket.sendMessage(RequestBody.create(WebSocket.BINARY, Arrays.copyOfRange(buffer, 0, read)));
        }
    }
}
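The stop message itself, per the Watson WebSocket API documentation, is just a small JSON text frame; I built it by hand rather than through any SDK helper:

socket.sendMessage(RequestBody.create(WebSocket.TEXT, "{\"action\": \"stop\"}"));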
Neither variant of sendMessage() (sending zero-length binary content, or sending the stop text message) seems to work. The caller code is unchanged from above. The resulting output:
Waiting for STT callback ...
Connected.
Read audio frame from line: 16000
Written audio frame to pipe: 16000
Read audio frame from line: 16000
Written audio frame to pipe: 16000
completed sending 2 frames over socket
onFailure: java.net.SocketException: Software caused connection abort: socket write error
REVISED: in fact, the end-of-audio signalling is never reached. The exception is thrown while writing the last (partial) buffer to the socket.
Why is the connection being aborted? That usually happens when the peer closes the connection.
Regarding point 2): does it even get that far? It seems that the recognition process never starts at all ... The audio is valid (I wrote the stream to disk and was able to recognize it by submitting it from a file, as noted above).
In addition, on further review of the source code, WebSocketManager's onMessage() already sends the stop message immediately after sendInputSteam() returns (i.e., when the audio stream, or the pipe in the example above, is exhausted), so there is no need to send it explicitly. The problem clearly occurs before the audio transmission completes. The behavior is the same whether a PipedInputStream or an AudioInputStream is used as input; in both cases the exception is thrown while sending binary data.