Overview
I send Strings to a Text-to-Speech server that accepts a maximum length of 300 characters. Due to network latency, there may be a delay between each section of speech, so I would like to break the speech at the most “natural pauses” wherever possible.
Each request to the server costs me money, so ideally I would send the longest possible string each time, up to the maximum allowed number of characters.
Here is my current implementation:
    import java.util.ArrayList;
    import java.util.concurrent.TimeUnit;

    private static final boolean DEBUG = true;
    private static final int MAX_UTTERANCE_LENGTH = 298;
    private static final int MIN_UTTERANCE_LENGTH = 200;

    private static final String FULL_STOP_SPACE = ". ";
    private static final String QUESTION_MARK_SPACE = "? ";
    private static final String EXCLAMATION_MARK_SPACE = "! ";
    private static final String LINE_SEPARATOR = System.getProperty("line.separator");
    private static final String COMMA_SPACE = ", ";
    private static final String JUST_A_SPACE = " ";

    public static ArrayList<String> splitUtteranceNaturalBreaks(String utterance) {
        final long then = System.nanoTime();
        final ArrayList<String> speakableUtterances = new ArrayList<String>();
        int splitLocation = 0;
        String success = null;

        while (utterance.length() > MAX_UTTERANCE_LENGTH) {
            // Try each delimiter in order of preference; accept the first one
            // whose last occurrence falls at or after MIN_UTTERANCE_LENGTH.
            splitLocation = utterance.lastIndexOf(FULL_STOP_SPACE, MAX_UTTERANCE_LENGTH);
            if (DEBUG) { System.out.println("(0 FULL STOP) - last index at: " + splitLocation); }

            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                if (DEBUG) { System.out.println("(1 FULL STOP) - NOT_OK"); }
                splitLocation = utterance.lastIndexOf(QUESTION_MARK_SPACE, MAX_UTTERANCE_LENGTH);
                if (DEBUG) { System.out.println("(1 QUESTION MARK) - last index at: " + splitLocation); }

                if (splitLocation < MIN_UTTERANCE_LENGTH) {
                    if (DEBUG) { System.out.println("(2 QUESTION MARK) - NOT_OK"); }
                    splitLocation = utterance.lastIndexOf(EXCLAMATION_MARK_SPACE, MAX_UTTERANCE_LENGTH);
                    if (DEBUG) { System.out.println("(2 EXCLAMATION MARK) - last index at: " + splitLocation); }

                    if (splitLocation < MIN_UTTERANCE_LENGTH) {
                        if (DEBUG) { System.out.println("(3 EXCLAMATION MARK) - NOT_OK"); }
                        splitLocation = utterance.lastIndexOf(LINE_SEPARATOR, MAX_UTTERANCE_LENGTH);
                        if (DEBUG) { System.out.println("(3 SEPARATOR) - last index at: " + splitLocation); }

                        if (splitLocation < MIN_UTTERANCE_LENGTH) {
                            if (DEBUG) { System.out.println("(4 SEPARATOR) - NOT_OK"); }
                            splitLocation = utterance.lastIndexOf(COMMA_SPACE, MAX_UTTERANCE_LENGTH);
                            if (DEBUG) { System.out.println("(4 COMMA) - last index at: " + splitLocation); }

                            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                                if (DEBUG) { System.out.println("(5 COMMA) - NOT_OK"); }
                                splitLocation = utterance.lastIndexOf(JUST_A_SPACE, MAX_UTTERANCE_LENGTH);
                                if (DEBUG) { System.out.println("(5 SPACE) - last index at: " + splitLocation); }

                                if (splitLocation < MIN_UTTERANCE_LENGTH) {
                                    // No usable delimiter at all: hard cut.
                                    if (DEBUG) { System.out.println("(6 SPACE) - NOT_OK"); }
                                    splitLocation = MAX_UTTERANCE_LENGTH;
                                    if (DEBUG) { System.out.println("(6 MAX_UTTERANCE_LENGTH) - last index at: " + splitLocation); }
                                } else {
                                    if (DEBUG) { System.out.println("Accepted"); }
                                    splitLocation -= 1; // compensates for the "+ 2" applied below
                                }
                            }
                        } else {
                            if (DEBUG) { System.out.println("Accepted"); }
                            splitLocation -= 1; // compensates for the "+ 2" applied below
                        }
                    } else {
                        if (DEBUG) { System.out.println("Accepted"); }
                    }
                } else {
                    if (DEBUG) { System.out.println("Accepted"); }
                }
            } else {
                if (DEBUG) { System.out.println("Accepted"); }
            }

            // Cut just past the chosen delimiter and continue with the remainder.
            success = utterance.substring(0, (splitLocation + 2));
            speakableUtterances.add(success.trim());
            if (DEBUG) {
                System.out.println("Split - Length: " + success.length() + " -:- " + success);
                System.out.println("------------------------------");
            }
            utterance = utterance.substring((splitLocation + 2)).trim();
        }

        speakableUtterances.add(utterance);
        if (DEBUG) {
            System.out.println("Split - Length: " + utterance.length() + " -:- " + utterance);
            final long now = System.nanoTime();
            final long elapsed = now - then;
            System.out.println("ELAPSED: " + TimeUnit.MILLISECONDS.convert(elapsed, TimeUnit.NANOSECONDS));
        }
        return speakableUtterances;
    }
This is ugly because lastIndexOf cannot take a regular expression. Ugliness aside, it's actually pretty fast.
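For comparison, the same priority order can be written as a loop over a delimiter list, which removes the nesting while still relying on lastIndexOf. This is only a minimal sketch: the class name PrioritySplitter, the DELIMITERS array and the findCut helper are hypothetical names, the debug output is omitted, and edge cases such as the hard cut may differ slightly from the implementation above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of the original implementation.
final class PrioritySplitter {
    static final int MAX_UTTERANCE_LENGTH = 298;
    static final int MIN_UTTERANCE_LENGTH = 200;

    // Delimiters in order of preference, strongest pause first.
    static final String[] DELIMITERS = {
        ". ", "? ", "! ", System.lineSeparator(), ", ", " "
    };

    /** Returns the index just past the preferred delimiter, or MAX_UTTERANCE_LENGTH if none fits. */
    static int findCut(String text) {
        for (String delimiter : DELIMITERS) {
            int index = text.lastIndexOf(delimiter, MAX_UTTERANCE_LENGTH);
            if (index >= MIN_UTTERANCE_LENGTH) {
                return index + delimiter.length();
            }
        }
        return MAX_UTTERANCE_LENGTH; // no delimiter in range: hard cut
    }

    /** Splits the utterance into chunks, cutting at the preferred delimiter each time. */
    static List<String> split(String utterance) {
        List<String> parts = new ArrayList<>();
        String rest = utterance;
        while (rest.length() > MAX_UTTERANCE_LENGTH) {
            int cut = findCut(rest);
            parts.add(rest.substring(0, cut).trim());
            rest = rest.substring(cut).trim();
        }
        parts.add(rest);
        return parts;
    }
}
```

Calling PrioritySplitter.split(text) returns the chunks in speaking order. Adding or reordering delimiters then only means editing the array.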
Problems
Ideally, I would like to use a regex that allows a match on one of my first choice delimiters:
    private static final String firstChoice = "[.!?" + LINE_SEPARATOR + "]\\s+";
    private static final Pattern pFirstChoice = Pattern.compile(firstChoice);
And then use the match to resolve the position:
    Matcher matcher = pFirstChoice.matcher(input);
    if (matcher.find()) {
        splitLocation = matcher.start();
    }
My alternative, within my current implementation, would be to store the location of each delimiter and then select the one closest to MAX_UTTERANCE_LENGTH.
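As a rough illustration of that alternative, the sketch below restricts the Matcher to the acceptable window with region() and remembers the last match it sees, since find() on its own returns the first match rather than the last. The class name LastMatchInWindow and the method lastDelimiterInWindow are hypothetical, and the pattern is simply the first-choice pattern from above rebuilt with System.lineSeparator().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are for illustration only.
final class LastMatchInWindow {
    static final int MAX_UTTERANCE_LENGTH = 298;
    static final int MIN_UTTERANCE_LENGTH = 200;

    // First-choice delimiters followed by whitespace, as in the question.
    static final Pattern FIRST_CHOICE =
            Pattern.compile("[.!?" + System.lineSeparator() + "]\\s+");

    /** Index of the last first-choice delimiter in [MIN, MAX], or -1 if there is none. */
    static int lastDelimiterInWindow(String input) {
        if (input.length() <= MIN_UTTERANCE_LENGTH) {
            return -1; // window is empty, nothing to scan
        }
        // Allow the delimiter at MAX plus one trailing whitespace character.
        int end = Math.min(input.length(), MAX_UTTERANCE_LENGTH + 2);
        Matcher matcher = FIRST_CHOICE.matcher(input);
        matcher.region(MIN_UTTERANCE_LENGTH, end);

        int last = -1;
        while (matcher.find()) {
            last = matcher.start(); // keep overwriting: the final value is the last match
        }
        return last;
    }
}
```

A non-negative result could then be used as the split location, cutting at result + 1 so that the punctuation stays with the first chunk. Whether this is faster than the lastIndexOf chain would need measuring.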
I have tried various ways of applying MIN_UTTERANCE_LENGTH and MAX_UTTERANCE_LENGTH to the Pattern, so that it only matches between these values, and of using lookarounds (?<=) to iterate in reverse, but this is where my knowledge starts to fail me:
    private static final String poorEffort = "([.!?]{200, 298})\\s+";
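One reason this attempt cannot work is that the quantifier {200,298} applies to the character class in front of it, so the pattern asks for 200 to 298 consecutive punctuation characters; as far as I know, java.util.regex also rejects a space inside the braces. A possible direction, sketched below under those assumptions, is to put the bounded quantifier on "any character" instead and let greedy matching backtrack from MAX down to MIN until a delimiter is found. The class name BoundedQuantifierSplit and the method leadingChunk are hypothetical, and the line separator is left out of the character class to keep the example short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; a sketch, not a tested replacement.
final class BoundedQuantifierSplit {
    static final int MAX_UTTERANCE_LENGTH = 298;
    static final int MIN_UTTERANCE_LENGTH = 200;

    // (?s) lets '.' match line breaks too. The greedy .{MIN,MAX} tries the
    // longest prefix first, so the first successful match ends at the LAST
    // delimiter whose index lies between MIN and MAX.
    static final Pattern WINDOWED = Pattern.compile(
            "(?s)(.{" + MIN_UTTERANCE_LENGTH + "," + MAX_UTTERANCE_LENGTH + "}[.!?])\\s+");

    /** Returns the leading chunk up to and including the delimiter, or null if none fits. */
    static String leadingChunk(String input) {
        Matcher matcher = WINDOWED.matcher(input);
        // lookingAt() anchors the match at the start of the input.
        return matcher.lookingAt() ? matcher.group(1) : null;
    }
}
```

When leadingChunk returns null, the caller would still need a fallback to the second-choice delimiters or a hard cut. I have not benchmarked this against the lastIndexOf version, so whether it is more efficient remains an open question.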
Finally
I wonder if any of you regular expression masters can achieve what I am after, and confirm whether it would actually prove more efficient?
I thank you in advance.