This is exactly the problem you face when trying to programmatically parse languages such as Chinese, where there are no spaces between words. One method that works with these languages is to start by splitting the text on punctuation. That gives you phrases. Then you go through each phrase and try to break it into words, starting with the length of the longest word in your dictionary. Say that length is 13 characters. Take the first 13 characters of the phrase and see if they form a word in your dictionary. If so, accept it as the correct word for now, move forward in the phrase, and repeat. Otherwise, shorten the substring to 12 characters, then 11 characters, and so on.
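The forward pass described above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the function name `forward_max_match`, the tiny sample dictionary, and the choice to emit an unmatched character as a one-character token are all assumptions made for the example.

```python
def forward_max_match(phrase, dictionary):
    """Greedy longest-match segmentation, scanning left to right.

    Assumes `dictionary` is a non-empty set of words.
    """
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(phrase):
        # Try the longest possible substring first, then shrink it.
        for size in range(min(max_len, len(phrase) - i), 0, -1):
            # Assumption: a single unmatched character is kept as its own
            # token so that the scan always makes progress.
            if size == 1 or phrase[i:i + size] in dictionary:
                break
        tokens.append(phrase[i:i + size])
        i += size
    return tokens


# Hypothetical dictionary for the sample phrase.
words = {"this", "is", "isis", "a", "string", "with", "words"}
print(forward_max_match("thisisastringwithwords", words))
# ['this', 'is', 'a', 'string', 'with', 'words']
```

Storing the dictionary as a set keeps each lookup cheap, so the cost is dominated by the number of substrings tried at each position.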
This works very well, but not perfectly, because it accidentally introduces a bias toward the words that come first. One way to remove this bias and double-check your result is to repeat the process starting from the end of the phrase. If you get the same word breaks, you can probably call it good. If not, you have an overlapping word segment. For example, when you parse your sample phrase starting from the end, you might get (listed in reverse order for emphasis)
words with string a Isis th
At first, Isis (the Egyptian goddess) looks like a correct word. However, once you find that "th" is not in your dictionary, you know there is a word segmentation problem nearby. Resolve it by going with the forward segmentation result, "this is", for the unaligned sequence "thisis", since both of those words are in the dictionary.
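As a rough sketch of the double check, the snippet below runs the same greedy match from both ends and compares the results. All names here (`backward_max_match`, `merge_unknown`, `segment`) and the sample dictionary are hypothetical. Note one simplification: when the two passes disagree, this sketch keeps whichever whole-phrase result has fewer out-of-dictionary tokens, rather than resolving only the unaligned region as described above.

```python
def forward_max_match(phrase, dictionary):
    """Greedy longest match from the left; unmatched chars become 1-char tokens."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(phrase):
        for size in range(min(max_len, len(phrase) - i), 0, -1):
            if size == 1 or phrase[i:i + size] in dictionary:
                break
        tokens.append(phrase[i:i + size])
        i += size
    return tokens


def backward_max_match(phrase, dictionary):
    """Greedy longest match from the right, done by reversing everything."""
    reversed_dict = {word[::-1] for word in dictionary}
    tokens = forward_max_match(phrase[::-1], reversed_dict)
    return [token[::-1] for token in reversed(tokens)]


def merge_unknown(tokens, dictionary):
    """Join adjacent out-of-dictionary tokens into one chunk, e.g. 't','h' -> 'th'."""
    merged, prev_unknown = [], False
    for token in tokens:
        unknown = token not in dictionary
        if unknown and prev_unknown:
            merged[-1] += token
        else:
            merged.append(token)
        prev_unknown = unknown
    return merged


def segment(phrase, dictionary):
    """Return (tokens, agreed); `agreed` is False when the two passes differ."""
    forward = merge_unknown(forward_max_match(phrase, dictionary), dictionary)
    backward = merge_unknown(backward_max_match(phrase, dictionary), dictionary)
    if forward == backward:
        return forward, True

    def unknown_count(tokens):
        return sum(token not in dictionary for token in tokens)

    # Simplification: prefer the pass with fewer unknown chunks (forward on ties).
    best = backward if unknown_count(backward) < unknown_count(forward) else forward
    return best, False


words = {"this", "is", "isis", "a", "string", "with", "words"}
print(merge_unknown(backward_max_match("thisisastringwithwords", words), words))
# ['th', 'isis', 'a', 'string', 'with', 'words']
print(segment("thisisastringwithwords", words))
# (['this', 'is', 'a', 'string', 'with', 'words'], False)
```

The `False` flag is the useful part: it marks phrases where the word breaks are not trustworthy and deserve a closer look.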
A less common variant of this problem is when adjacent words share a sequence that could go either way. If you had a sequence like "archand" (to make something up), should it be "arc hand" or "arch and"? The way to decide is to apply grammar checking to the results. That should be done for the entire text anyway.
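To see when a sequence really is ambiguous, one option is to list every way it can be split into dictionary words and hand the candidates to whatever grammar check you use. The sketch below only does the listing; `all_segmentations` and the four-word dictionary are made up for illustration, and the grammar check itself is not shown.

```python
def all_segmentations(phrase, dictionary):
    """Return every split of `phrase` into dictionary words (may be empty)."""
    if not phrase:
        return [[]]  # one way to segment nothing: no words
    results = []
    for size in range(1, len(phrase) + 1):
        head = phrase[:size]
        if head in dictionary:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(phrase[size:], dictionary):
                results.append([head] + rest)
    return results


# Hypothetical dictionary for the made-up sequence.
words = {"arc", "arch", "hand", "and"}
print(all_segmentations("archand", words))
# [['arc', 'hand'], ['arch', 'and']]
```

This brute-force listing can grow very quickly on long inputs, so it is best applied only to the short regions where the forward and backward passes disagree.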
Handcraftsman