Split line for natural language breaks

Question

Overview

I send Strings to a Text-to-Speech server that accepts a maximum length of 300 characters. Due to network latency, there may be a delay between sections of speech, so I would like to break my speech at the most "natural pauses" wherever possible.

Each request to the server costs me money, so ideally I would send the longest line, up to the maximum allowed characters.

Here is my current implementation:

    import java.util.ArrayList;
    import java.util.concurrent.TimeUnit;

    private static final boolean DEBUG = true;
    private static final int MAX_UTTERANCE_LENGTH = 298;
    private static final int MIN_UTTERANCE_LENGTH = 200;

    private static final String FULL_STOP_SPACE = ". ";
    private static final String QUESTION_MARK_SPACE = "? ";
    private static final String EXCLAMATION_MARK_SPACE = "! ";
    private static final String LINE_SEPARATOR = System.getProperty("line.separator");
    private static final String COMMA_SPACE = ", ";
    private static final String JUST_A_SPACE = " ";

    public static ArrayList<String> splitUtteranceNaturalBreaks(String utterance) {

        final long then = System.nanoTime();
        final ArrayList<String> speakableUtterances = new ArrayList<String>();

        int splitLocation = 0;
        String success = null;

        while (utterance.length() > MAX_UTTERANCE_LENGTH) {

            // Try each delimiter in order of preference; accept the first one
            // found at or after MIN_UTTERANCE_LENGTH.
            splitLocation = utterance.lastIndexOf(FULL_STOP_SPACE, MAX_UTTERANCE_LENGTH);
            if (DEBUG) {
                System.out.println("(0 FULL STOP) - last index at: " + splitLocation);
            }

            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                if (DEBUG) {
                    System.out.println("(1 FULL STOP) - NOT_OK");
                }

                splitLocation = utterance.lastIndexOf(QUESTION_MARK_SPACE, MAX_UTTERANCE_LENGTH);
                if (DEBUG) {
                    System.out.println("(1 QUESTION MARK) - last index at: " + splitLocation);
                }

                if (splitLocation < MIN_UTTERANCE_LENGTH) {
                    if (DEBUG) {
                        System.out.println("(2 QUESTION MARK) - NOT_OK");
                    }

                    splitLocation = utterance.lastIndexOf(EXCLAMATION_MARK_SPACE, MAX_UTTERANCE_LENGTH);
                    if (DEBUG) {
                        System.out.println("(2 EXCLAMATION MARK) - last index at: " + splitLocation);
                    }

                    if (splitLocation < MIN_UTTERANCE_LENGTH) {
                        if (DEBUG) {
                            System.out.println("(3 EXCLAMATION MARK) - NOT_OK");
                        }

                        splitLocation = utterance.lastIndexOf(LINE_SEPARATOR, MAX_UTTERANCE_LENGTH);
                        if (DEBUG) {
                            System.out.println("(3 SEPARATOR) - last index at: " + splitLocation);
                        }

                        if (splitLocation < MIN_UTTERANCE_LENGTH) {
                            if (DEBUG) {
                                System.out.println("(4 SEPARATOR) - NOT_OK");
                            }

                            splitLocation = utterance.lastIndexOf(COMMA_SPACE, MAX_UTTERANCE_LENGTH);
                            if (DEBUG) {
                                System.out.println("(4 COMMA) - last index at: " + splitLocation);
                            }

                            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                                if (DEBUG) {
                                    System.out.println("(5 COMMA) - NOT_OK");
                                }

                                splitLocation = utterance.lastIndexOf(JUST_A_SPACE, MAX_UTTERANCE_LENGTH);
                                if (DEBUG) {
                                    System.out.println("(5 SPACE) - last index at: " + splitLocation);
                                }

                                if (splitLocation < MIN_UTTERANCE_LENGTH) {
                                    if (DEBUG) {
                                        System.out.println("(6 SPACE) - NOT_OK");
                                    }

                                    // No usable delimiter: fall back to a hard cut.
                                    splitLocation = MAX_UTTERANCE_LENGTH;
                                    if (DEBUG) {
                                        System.out.println("(6 MAX_UTTERANCE_LENGTH) - last index at: " + splitLocation);
                                    }
                                } else {
                                    if (DEBUG) {
                                        System.out.println("Accepted");
                                    }
                                    splitLocation -= 1;
                                }
                            }
                        } else {
                            if (DEBUG) {
                                System.out.println("Accepted");
                            }
                            splitLocation -= 1;
                        }
                    } else {
                        if (DEBUG) {
                            System.out.println("Accepted");
                        }
                    }
                } else {
                    if (DEBUG) {
                        System.out.println("Accepted");
                    }
                }
            } else {
                if (DEBUG) {
                    System.out.println("Accepted");
                }
            }

            success = utterance.substring(0, (splitLocation + 2));
            speakableUtterances.add(success.trim());

            if (DEBUG) {
                System.out.println("Split - Length: " + success.length() + " -:- " + success);
                System.out.println("------------------------------");
            }

            utterance = utterance.substring((splitLocation + 2)).trim();
        }

        speakableUtterances.add(utterance);

        if (DEBUG) {
            System.out.println("Split - Length: " + utterance.length() + " -:- " + utterance);
            final long now = System.nanoTime();
            final long elapsed = now - then;
            System.out.println("ELAPSED: " + TimeUnit.MILLISECONDS.convert(elapsed, TimeUnit.NANOSECONDS));
        }

        return speakableUtterances;
    }

This is ugly because lastIndexOf cannot take a regular expression. Ugliness aside, it's actually pretty fast.

Problems

Ideally, I would like to use a regex that allows a match on one of my first choice delimiters:

    private static final String firstChoice = "[.!?" + LINE_SEPARATOR + "]\\s+";
    private static final Pattern pFirstChoice = Pattern.compile(firstChoice);

And then use the match to resolve the position:

    Matcher matcher = pFirstChoice.matcher(input);
    if (matcher.find()) {
        splitLocation = matcher.start();
    }

The alternative within my current implementation is to save the location of each separator and then select the one closest to MAX_UTTERANCE_LENGTH.
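One way to sketch the "closest separator to MAX_UTTERANCE_LENGTH" idea with a regex is to confine the matcher to a window with Matcher.region and keep the last match found. The class name LastBreakFinder, the method lastBreak, and the delimiter pattern below are assumptions made for illustration; they are not part of the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastBreakFinder {

    // Assumed delimiter set: sentence punctuation followed by whitespace, or a line break.
    private static final Pattern BREAK = Pattern.compile("[.!?]\\s+|\\R");

    /**
     * Returns the end index of the last delimiter lying entirely inside
     * [min, max) of the text, or -1 if the window contains none.
     */
    static int lastBreak(String text, int min, int max) {
        int limit = Math.min(max, text.length());
        if (min < 0 || min >= limit) {
            return -1; // empty or invalid window
        }
        // region() restricts matching to the window, so nothing outside it is scanned.
        Matcher m = BREAK.matcher(text).region(min, limit);
        int last = -1;
        while (m.find()) {
            last = m.end(); // later matches overwrite earlier ones
        }
        return last;
    }
}
```

A caller could try lastBreak(utterance, MIN_UTTERANCE_LENGTH, MAX_UTTERANCE_LENGTH) first and fall back to a comma or space pattern when it returns -1, which mirrors the tiered checks in the current implementation.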

I've tried various ways to apply MIN_UTTERANCE_LENGTH and MAX_UTTERANCE_LENGTH to the pattern, so that it only matches between these values, and to use a lookbehind (?<=) to search in reverse, but this is where my knowledge starts to fail me:

    private static final String poorEffort = "([.!?]{200, 298})\\s+";

Finally

I wonder if any of you regex masters can achieve what I'm after and confirm whether it would actually prove more efficient?

I thank you in advance.

java string regex

brandall Apr 29 '14 at 0:50

2 answers

morja · Answer 1 · 2014-04-29T03:55:36+0000

I would do something like this:

    Pattern p = Pattern.compile(".{1,299}(?:[.!?]\\s+|\\n|$)", Pattern.DOTALL);
    Matcher matcher = p.matcher(text);
    while (matcher.find()) {
        speakableUtterances.add(matcher.group().trim());
    }

Regular expression explanation:

    .{1,299}               any character, between 1 and 299 times (matching as many as possible)
    (?:[.!?]\\s+|\\n|$)    followed by either one of .!? and whitespace, a newline, or the end of the string

You might consider using \p{Punct} to match the punctuation; see the Javadoc for Pattern.

You can see a working sample on ideone.
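One caveat with the pattern above: if a stretch of more than 299 characters contains none of the delimiters, find() cannot match from its start and skips ahead, so the leading characters of that stretch are lost. A minimal sketch of one possible workaround follows, adding two fallback alternatives (last whitespace, then a hard cut). The class name RegexChunker, the method split, and the max parameter are hypothetical names introduced here, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexChunker {

    /** Splits text into trimmed chunks of at most max characters each. */
    static List<String> split(String text, int max) {
        if (max < 2) {
            throw new IllegalArgumentException("max must be at least 2");
        }
        int body = max - 1;
        Pattern p = Pattern.compile(
                ".{1," + body + "}(?:[.!?]\\s+|\\n|$)" // preferred: end of sentence, newline, or end of text
                + "|.{1," + body + "}\\s+"             // fallback: break at the last whitespace
                + "|.{1," + max + "}",                 // last resort: hard cut at max characters
                Pattern.DOTALL);
        List<String> chunks = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            String chunk = m.group().trim();
            if (!chunk.isEmpty()) { // skip whitespace-only matches
                chunks.add(chunk);
            }
        }
        return chunks;
    }
}
```

Because the alternatives are tried in order at each position, a sentence break anywhere in the window still wins over a plain space, even if that produces a shorter chunk. Calling RegexChunker.split(text, 300) would correspond to the 300-character limit in the question.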

Bonzaithepenguin · Answer 2 · 2014-04-29T04:21:51+0000

The Unicode standard defines how you should break text into sentences and other logical components. Here's some working pseudocode:

    // tests two consecutive code points within the text to detect the end of sentences
    boolean continueSentence(Text text, Range range1, Range range2) {
        Code code1 = text.code(range1), code2 = text.code(range2);

        // 0.2 sot ÷
        if (code1.isStartOfText()) return false;

        // 0.3 ÷ eot
        if (code2.isEndOfText()) return false;

        // 3.0 CR × LF
        if (code1.isCR() && code2.isLF()) return true;

        // 4.0 (Sep | CR | LF) ÷
        if (code1.isSep() || code1.isCR() || code1.isLF()) return false;

        // 5.0 × [Format Extend]
        if (code2.isFormat() || code2.isExtend()) return true;

        // 6.0 ATerm × Numeric
        if (code1.isATerm() && (code2.isDigit() || code2.isDecimal() || code2.isNumeric())) return true;

        // 7.0 Upper ATerm × Upper
        if (code2.isUppercase() && code1.isATerm()) {
            Range range = text.previousCode(range1);
            if (range.isValid() && text.code(range).isUppercase()) return true;
        }

        boolean allow_STerm = true, return_value = true;

        // 8.0 ATerm Close* Sp* × [^ OLetter Upper Lower Sep CR LF STerm ATerm]* Lower
        Range range = range2;
        Code code = code2;
        while (!code.isOLetter() && !code.isUppercase() && !code.isLowercase()
                && !code.isSep() && !code.isCR() && !code.isLF()
                && !code.isSTerm() && !code.isATerm()) {
            if (!(range = text.nextCode(range)).isValid()) break;
            code = text.code(range);
        }
        range = range1;
        if (code.isLowercase()) {
            code = code1;
            allow_STerm = false; // rule 8.0 applies to ATerm only, not STerm
            goto Sp_Close_ATerm;
        }
        code = code1;

        // 8.1 (STerm | ATerm) Close* Sp* × (SContinue | STerm | ATerm)
        if (code2.isSContinue() || code2.isSTerm() || code2.isATerm()) goto Sp_Close_ATerm;

        // 9.0 ( STerm | ATerm ) Close* × ( Close | Sp | Sep | CR | LF )
        if (code2.isClose()) goto Close_ATerm;

        // 10.0 ( STerm | ATerm ) Close* Sp* × ( Sp | Sep | CR | LF )
        if (code2.isSp() || code2.isSep() || code2.isCR() || code2.isLF()) goto Sp_Close_ATerm;

        // 11.0 ( STerm | ATerm ) Close* Sp* (Sep | CR | LF)? ÷
        return_value = false;

        // allow Sep, CR, or LF zero or one times
        for (int iteration = 1; iteration != 0; iteration--) {
            if (!code.isSep() && !code.isCR() && !code.isLF()) goto Sp_Close_ATerm;
            if (!(range = text.previousCode(range)).isValid()) goto Sp_Close_ATerm;
            code = text.code(range);
        }

    Sp_Close_ATerm:
        // allow zero or more Sp
        while (code.isSp() && (range = text.previousCode(range)).isValid())
            code = text.code(range);

    Close_ATerm:
        // allow zero or more Close
        while (code.isClose() && (range = text.previousCode(range)).isValid())
            code = text.code(range);

        // require STerm or ATerm
        if (code.isATerm() || (allow_STerm && code.isSTerm()))
            return return_value;

        // 12.0 × Any
        return true;
    }

Then you can iterate over sentences like this:

    // pass in a range of (0, 0) to get the range of the first sentence
    // returns a range with a length of 0 if there are no more sentences
    Range nextSentence(Text text, Range range) {
    try_again:
        range = text.nextCode(new Range(range.start + range.length, 0));
        if (!range.isValid()) return range;

        Range next = text.nextCode(range);
        long start = range.start;

        // extend the sentence until continueSentence reports a break
        while (next.isValid() && continueSentence(text, range, next))
            next = text.nextCode(range = next);

        range = new Range(start, range.start + range.length - start);
        Range range2 = text.trimRange(range);
        if (!range2.isValid()) goto try_again;
        return range2;
    }

Where:

A Range covers the positions that are >= start and < start + length.
text.trimRange removes surrounding whitespace (optional).
All Code.is[Type] functions are lookups in the Unicode character database. For example, you will see in some of its files that certain code points are defined as "CR", "Sep", "StartOfText", etc.
Text.code(range) decodes the code point in the text at range.start. The length is not used.
Text.nextCode and Text.previousCode return the range of the next or previous code point within the string, based on the range of the current code point. If there is no code point in that direction, they return an invalid range, which is a range with a length of 0.

The standard also defines how to iterate over words, lines, and characters.
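In Java specifically, the standard library already ships a sentence boundary iterator, java.text.BreakIterator, so the rules need not be hand-coded for a first attempt. Its rules are not guaranteed to match the pseudocode above exactly. The sketch below packs whole sentences into chunks of at most max characters; the class name SentencePacker, the method pack, and the choice of Locale.ENGLISH are assumptions for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentencePacker {

    /**
     * Greedily joins consecutive sentences with a single space until adding
     * the next one would exceed max characters. A single sentence longer
     * than max is emitted as-is and would still need a secondary split.
     */
    static List<String> pack(String text, int max) {
        List<String> chunks = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);

        StringBuilder current = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (sentence.isEmpty()) {
                continue;
            }
            // Flush the current chunk if this sentence (plus a joining space) would not fit.
            if (current.length() > 0 && current.length() + 1 + sentence.length() > max) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(sentence);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Packing whole sentences keeps every break on a sentence boundary while still sending as few requests as the limit allows, at the cost of never breaking at commas or spaces inside an overly long sentence.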
