I am reading a Unicode stream and would rather not run each entire line through a regex. Is there a simple, reliable character I can use to split words across different languages?
My byte array is likely to be encoded as UTF-16 or UTF-8.
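If you are using Java, you can use BreakIterator (java.text.BreakIterator). Its word instance finds word boundaries using locale-aware rules, so you do not have to rely on a single delimiter character, which would not work for languages that are written without spaces between words.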
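
As a rough illustration, here is a minimal sketch of word splitting with BreakIterator. It assumes the bytes have already been decoded into a String (for example with new String(bytes, StandardCharsets.UTF_8), or UTF_16 if that is what you have). The class name WordSplitter and the method words are made up for this example, and treating "contains a letter or digit" as the test for a word is an assumption you may want to change.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the JDK.
final class WordSplitter {

    // Returns the words found in text, using the word-boundary rules for the given locale.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // The iterator also returns the spaces and punctuation between words.
            // Assumption: keep only pieces that contain at least one letter or digit.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }
}
```

For example, WordSplitter.words("Hello, world! 123", Locale.ENGLISH) yields the three pieces "Hello", "world" and "123". How well this segments text in languages without spaces depends on the rules shipped with your Java runtime, so it is worth checking against samples of the languages you actually expect.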