Parsing words in a continuous line

Question

Parsing words in a continuous line

If a has a line with words and without spaces, how should I analyze these words, given that I have a dictionary / list containing these words?

For example, if my string is "thisisastringwordswords", how can I use the dictionary to generate the output "is this string with words"?

I heard that using the Tries data structure could help, but maybe if someone could help with the pseudo-code? For example, I thought that perhaps you could index the dictionary into a trie structure, and then follow each char down the trie; the problem is that I am not familiar with how to do this in (pseudo) code.

+8

c # algorithm data-structures text-segmentation

locoboy Jun 20 '11 at 7:54

source share

7 answers

Zruty · Answer 1 · 2011-06-20T08:02:39+0000

I assume that you want an effective solution, not an obvious one, when you repeatedly check if your text begins with a dictionary.

If the dictionary is small enough, I think you could try changing the standard KMP algorithm. Basically, create a state machine on your dictionary that consumes a text character by character and displays the constructed words.

EDIT: It turned out that I'm inventing trying .

goldengel · Answer 2 · 2011-06-20T08:09:25+0000

I already did something similar. You cannot use a simple dictionary. The result will be erratic. It depends if you need to do this only once or as an entire program.

My solution was:

Connect to a database with working words from a list of dictionaries (for an example of an online dictionary)
Filter long and short words in the dictionary and check if you want to trim things (for example, do not use words with only one character, for example 'I' )
Start with short words and compare your bigString file with the database dictionary.

Now you need to create an “opportunity table”. Because many words can fit into 100%, but make mistakes. The longer the word, the more confident that the word is correct.

It is an intensive processor, but it can work exactly as a result. So, let's say you use a small dictionary of 10,000 words, and 3,000 of them are 8 characters long, you need to compare your large line at the beginning with all 3,000 words, and only if the result is found, it is allowed to go to the next word. If you have 200 characters in your large string, you need (2000chars / 8 medium characters) = 250 full minimum cycles when comparing.

For me, I also did a little check on misspelled words in comparison.

sample procedure (do not copy paste)

Dim bigString As String = "helloworld.thisisastackoverflowtest!" Dim dictionary As New List(Of String) 'contains the original words. lets make it case insentitive dictionary.Add("Hello") dictionary.Add("World") dictionary.Add("this") dictionary.Add("is") dictionary.Add("a") dictionary.Add("stack") dictionary.Add("over") dictionary.Add("flow") dictionary.Add("stackoverflow") dictionary.Add("test") dictionary.Add("!") For Each word As String In dictionary If word.Length < 1 Then dictionary.Remove(word) 'remove short words (will not work with for each in real) word = word.ToLower 'make it case insentitive Next Dim ResultComparer As New Dictionary(Of String, Double) 'String is the dictionary word. Double is a value as percent for a own function to weight result Dim i As Integer = 0 'start at the beginning Dim Found As Boolean = False Do For Each word In dictionary If bigString.IndexOf(word, i) > 0 Then ResultComparer.Add(word, MyWeightOfWord) 'add the word if found, long words are better and will increase the weight value Found = True End If Next If Found = True Then i += ResultComparer(BestWordWithBestWeight).Length Else i += 1 End If Loop

Ramonboza · Answer 3 · 2011-06-20T08:00:15+0000

If you are sure that you have all the phrase words in the dictionary, you can use this algorithm:

 String phrase = "thisisastringwithwords"; String fullPhrase = ""; Set<String> myDictionary; do { foreach(item in myDictionary){ if(phrase.startsWith(item){ fullPhrase += item + " "; phrase.remove(item); break; } } } while(phrase.length != 0);

There are so many complications, for example, some elements starting the same way, so the code will be modified to use some kind of tree search, BST or so.

amod · Answer 4 · 2011-06-20T08:00:27+0000

I told you that this seems like an impossible task. But you can take a look at this related SO question - this may help you.

Handcraftsman · Answer 5 · 2011-06-20T14:15:40+0000

This is the exact problem when trying to programmatically analyze languages such as Chinese, where there are no spaces between words. One method that works with these languages is to start by splitting the text into punctuation. It gives you phrases. Then you go through the phrases and try to break them into words starting with the length of the longest word in the dictionary. Say the length is 13 characters. Take the first 13 characters from the phrases and see if it is in your dictionary. If so, take it as the right word at the moment, move forward in the phrase and repeat. Otherwise, reduce the substring to 12 characters, then 11 characters, etc.

This works very well, but not perfect, because we accidentally biased the words that come first. One way to eliminate this bias and double-check your result is to repeat the process starting at the end of the phrase. If you get the same word breaks, you can probably call it good. If not, you have an overlapping segment of words. For example, when you analyze your sample phrase, starting from the end, you can get (back for emphasis)

 words with string a Isis th

First, the word Isis (Egyptian goddess) seems to be the right word. However, when you find that “th” is not in your dictionary, you know that there is a problem with word segmentation nearby. Solve this by moving the "this" forward with the segmentation result for the unaligned "thisis" sequence, since both words are in the dictionary.

A less common version of this problem is that related words share a sequence that can go anyway. If you had a sequence like "archand" (to do something), should it be an "arc of a hand" or an "arch"? The way to determine this is to apply grammar checking to the results. This should still be done for the entire text.

kyun · Answer 6 · 2011-06-23T00:52:24+0000

OK, I will make a thrilling manual attempt. The ideal (ish) data structure for your problem (as you said, trie) consists of words in a dictionary. Three are best visualized as DFA - a good state machine in which you switch from one state to another on each new character. This is really easy to do in code, the Java (ish) style class for this would be:

 Class State { String matchedWord; Map<char,State> mapChildren; }

From this point of view, creating a trie is easy. This is similar to a root tree structure with each node having multiple children. Each child is visited with the transition of one character. Using the HashMap structure reduces the search time to the next State character. Alternatively, if all you have is 26 characters for the alphabet, a fixed size array of 26 will do the trick too.

Now, assuming all of this makes sense, you have the trick, your problem is still not completely resolved. Here you start to do something like regex engines, go down the trie, track the states that correspond to the whole word in the dictionary (this is what I had matchedWord for in the State structure), use some reverse tracking logic to go to the previous one a match condition if the current trace is stalled. I know his general, but given the structure of the trie, the rest is pretty simple.

Paul dixon · Answer 7 · 2012-01-08T09:58:43+0000

If you have a dictionary of words and need a quick implication, this can be effectively resolved using dynamic programming in O (n ^ 2), assuming the dictionary search is O (1). Below is the C # code, it is possible to extract a substring and search for dictionaries.

 public static String[] StringToWords(String str, HashSet<string> words) { //Index of char - length of last valid word int[] bps = new int[str.Length + 1]; for (int i = 0; i < bps.Length; i++) bps[i] = -1; for (int i = 0; i < str.Length; i++) { for (int j = i + 1; j <= str.Length ; j++) { if (bps[j] == -1) { //Destination cell doesn't have valid backpointer yet //Try with the current substring String s = str.Substring(i, j - i); if (words.Contains(s)) bps[j] = i; } } } //Backtrack to recovery sequence and then reverse List<String> seg = new List<string>(); for (int bp = str.Length; bps[bp] != -1 ;bp = bps[bp]) seg.Add(str.Substring(bps[bp], bp - bps[bp])); seg.Reverse(); return seg.ToArray(); }

Creating a hastset with a list of words from / usr / share / dict / words and testing with

 foreach (var s in StringSplitter.StringToWords("thisisastringwithwords", dict)) Console.WriteLine(s);

I get the output "t hi sis string with words". Since, as others have pointed out, this algorithm will return the actual segmentation (if one exists), however this may not be the expected segmentation. Having short words reduces the quality of segmentation; you can add heuristics in favor of longer words if two valid sub-segments are included in the element.

There are more sophisticated methods that state machines and language models can generate multiple segments and apply probabilistic rating.

Parsing words in a continuous line

More articles: