How to split text without spaces into a word list?

Input: "tableapplechairtablecupboard..." (many words)

What would be an efficient algorithm to split such text into a list of words and get:

Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

The first thing that comes to mind is to go through all possible words (starting from the first letter), find the longest word that matches, and continue from position=word_position+len(word).

PS
We have a list of all possible words.
The word "cupboard" can be "cup" and "board"; select the longest. Language: Python, but the main thing is the algorithm itself.

+89
python split algorithm text
Jan 15 '12 at 14:10
14 answers

A naive algorithm will not give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-word text.

(If you want an answer to your original question, which does not use word frequency, you need to clarify what exactly is meant by the "longest word": is it better to have one 20-letter word and ten 3-letter words, or five 10-letter words? Once you settle on a precise definition, you only have to change the line defining wordcost to reflect the intended meaning.)

Idea

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has a probability of roughly 1/(n log N), where N is the number of words in the dictionary.

Once you have fixed the model, you can use dynamic programming to infer the positions of the spaces. The most likely sentence is the one that maximizes the product of the probabilities of the individual words, and it is easy to compute with dynamic programming. Instead of using the probability directly, we use a cost defined as the logarithm of the reciprocal of the probability, in order to avoid overflows.

The code

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i > 0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

which you can use with

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))



results

I am using this quick-and-dirty dictionary of 125k words, which I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor
After: thumb green apple active assignment weekly metaphor

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot

After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so whats the fastest way of extraction thx a lot

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness

After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness

As you can see, it is essentially flawless. The most important part is to make sure your word list was trained on a corpus similar to what you will actually encounter, otherwise the results will be very bad.




Optimization

The implementation consumes a linear amount of time and memory, so it is quite efficient. If you need extra speedups, you can build a suffix tree from a list of words to reduce the size of the candidate set.

If you need to process a very large consecutive string, it would be wise to split the string to avoid excessive memory usage. For example, you could process the text in blocks of 10,000 characters plus a 1,000-character margin on each side to avoid boundary effects. This will keep memory usage to a minimum and will almost certainly have no effect on quality.
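For illustration, here is a minimal sketch of such chunked processing built on top of infer_spaces above (the helper name is made up, it keeps only a trailing margin rather than one on each side, and the block and margin sizes are arbitrary):

def infer_spaces_chunked(s, block=10000, margin=1000):
    """Segment a very long string block by block to keep memory bounded.

    Each chunk is segmented with a trailing margin; the last few words are
    then discarded because they may be distorted by the cut, and the next
    chunk resumes exactly where the kept words end.
    """
    out = []
    start = 0
    while start < len(s):
        end = min(len(s), start + block + margin)
        words = infer_spaces(s[start:end]).split()
        if end == len(s):
            out.extend(words)
            break
        # Drop a few trailing words that may straddle the chunk boundary.
        keep = words[:-3] if len(words) > 3 else words
        out.extend(keep)
        start += len("".join(keep))  # the kept words cover exactly these characters
    return " ".join(out)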

+159
Jul 25 '12 at 4:23

Based on the excellent work in the top answer, I have created a pip package for easy use.

>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']

To install, run pip install wordninja.

The differences are minor: it returns a list rather than a str, it works in Python 3, it includes the word list, and it splits correctly even if there are non-alpha characters (e.g. underscores, dashes, etc.).
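A hypothetical usage sketch of that last point (the input string is made up, and the exact output depends on the bundled word list):

>>> wordninja.split('derek_anderson-likespython')   # separators act as break points
['derek', 'anderson', 'likes', 'python']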

Thanks again Generic Human!

https://github.com/keredson/wordninja

+35
Apr 20 '17 at 23:06

Here is a solution using recursive search:

def find_words(instring, prefix = '', words = None):
    if not instring:
        return []
    if words is None:
        words = set()
        with open('/usr/share/dict/words') as f:
            for line in f:
                words.add(line.strip())
    if (not prefix) and (instring in words):
        return [instring]
    prefix, suffix = prefix + instring[0], instring[1:]
    solutions = []
    # Case 1: prefix in solution
    if prefix in words:
        try:
            solutions.append([prefix] + find_words(suffix, '', words))
        except ValueError:
            pass
    # Case 2: prefix not in solution
    try:
        solutions.append(find_words(suffix, prefix, words))
    except ValueError:
        pass
    if solutions:
        return sorted(solutions,
                      key = lambda solution: [len(word) for word in solution],
                      reverse = True)[0]
    else:
        raise ValueError('no solution')

print(find_words('tableapplechairtablecupboard'))
print(find_words('tableprechaun', words = set(['tab', 'table', 'leprechaun'])))

gives

['table', 'apple', 'chair', 'table', 'cupboard']
['tab', 'leprechaun']
+16
Jan 15 '12

Using a trie data structure, which holds the list of possible words, it would not be too complicated to do the following (a code sketch follows the list):

  1. Advance the pointer (in the concatenated string).
  2. Look up and store the corresponding node in the trie.
  3. If the trie node has children (i.e. there are longer words), go to 1.
  4. If the node reached has no children, the longest word match has occurred; add the word (stored in the node, or just concatenated during the trie traversal) to the result list, reset the pointer in the trie (or reset the reference), and start over.
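A minimal sketch of that procedure (the helper names are made up), using a plain dict-of-dicts trie and a greedy longest match, which can fail on inputs that need backtracking:

def build_trie(words):
    """Build a nested-dict trie; the key '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def greedy_trie_split(s, trie):
    """Repeatedly take the longest word the trie can match at the current position."""
    out, i = [], 0
    while i < len(s):
        node, last_end = trie, None
        j = i
        while j < len(s) and s[j] in node:
            node = node[s[j]]
            j += 1
            if '$' in node:
                last_end = j          # longest match seen so far
        if last_end is None:
            raise ValueError('no word matches at position %d' % i)
        out.append(s[i:last_end])
        i = last_end
    return out

print(greedy_trie_split('tableapplechairtablecupboard',
                        build_trie(['table', 'apple', 'chair', 'cup', 'board', 'cupboard'])))
# ['table', 'apple', 'chair', 'table', 'cupboard']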
+10
Jan 15 '12 at 14:25

Unutbu's solution was quite close, but I find the code hard to read, and it did not give the expected result. Generic Human's solution has the drawback that it needs word frequencies, which is not appropriate for all use cases.

Here's a simple solution using a Divide and Conquer algorithm.

  • It tries to minimize the number of words. find_words('cupboard') will return ['cupboard'], not ['cup', 'board'] (assuming cupboard, cup and board are in the dictionary).
  • The optimal solution is not unique; the implementation below returns a solution. find_words('charactersin') could return ['characters', 'in'] or ['character', 'sin'] (as seen below). You could quite easily modify the algorithm to return all optimal solutions.
  • The implementation is memoized so that it runs in a reasonable amount of time.

The code:

words = set()
with open('/usr/share/dict/words') as f:
    for line in f:
        words.add(line.strip())

solutions = {}

def find_words(instring):
    # First check if instring is in the dictionary
    if instring in words:
        return [instring]
    # No... But maybe it's a result we already computed
    if instring in solutions:
        return solutions[instring]
    # Nope. Try to split the string at every position to recursively search for results
    best_solution = None
    for i in range(1, len(instring) - 1):
        part1 = find_words(instring[:i])
        part2 = find_words(instring[i:])
        # Both parts MUST have a solution
        if part1 is None or part2 is None:
            continue
        solution = part1 + part2
        # Is the solution found "better" than the previous one?
        if best_solution is None or len(solution) < len(best_solution):
            best_solution = solution
    # Remember (memoize) this solution to avoid having to recompute it
    solutions[instring] = best_solution
    return best_solution

It will take about 5 seconds on my 3GHz machine:

result = find_words("thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot")
assert(result is not None)
print ' '.join(result)

there is masses of text information of peoples comments which is parsed from html but there are no delimited character sin them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so whats the fastest way of extraction thx a lot

+9
Jan 23 '14 at 12:33

The answer by https://stackoverflow.com/users/1515832/generic-human is great. But the best implementation of this I have ever seen was written by Peter Norvig himself in his book Beautiful Data.

Before I paste his code, let me expand on why Norvig's method is more accurate (although a bit slower and longer in terms of code).

1) The data is a bit better, both in terms of size and in terms of precision (he uses a word count rather than a simple ranking). 2) More importantly, it is the logic behind n-grams that really makes this approach so accurate.

The example he gives in his book is the problem of splitting the string 'sitdown'. A non-bigram method of splitting would consider p('sit') * p('down'), and if this is less than p('sitdown'), which will be the case quite often, it will NOT split it, but we would want it to (most of the time).

However, when you have a bigram model, you can weigh p('sit down') as a bigram against p('sitdown'), and the former wins. Basically, if you do not use bigrams, the model treats the probabilities of the words you are splitting as independent, which is not the case; some words are more likely to appear one after another. Unfortunately, those are also the words that are often stuck together and confuse the splitter.
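To make that concrete, here is a tiny sketch with made-up illustrative counts (not Norvig's actual numbers) showing how a unigram model prefers 'sitdown' while a bigram model prefers 'sit down':

# Hypothetical corpus counts, purely for illustration.
N = 1e6                                   # total number of tokens
unigrams = {'sit': 300, 'down': 900, 'sitdown': 40}
bigrams = {('sit', 'down'): 150}          # "sit down" co-occurs often

def p1(w):
    "Unigram probability."
    return unigrams.get(w, 0) / N

def p2(w, prev):
    "P(w | prev) from bigram counts, falling back to the unigram estimate."
    if (prev, w) in bigrams and unigrams.get(prev):
        return float(bigrams[(prev, w)]) / unigrams[prev]
    return p1(w)

# Unigram model: the independence assumption undervalues "sit down".
print(p1('sit') * p1('down'), 'vs', p1('sitdown'))         # 2.7e-07 vs 4e-05 -> keeps "sitdown"
# Bigram model: conditioning on "sit" makes "down" far more likely.
print(p1('sit') * p2('down', 'sit'), 'vs', p1('sitdown'))  # 1.5e-04 vs 4e-05 -> splits it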

Here is the link to the data (it covers three separate problems; segmentation is only one of them, please read the chapter for details): http://norvig.com/ngrams/

and here is the link to the code: http://norvig.com/ngrams/ngrams.py

These links have been up for a while, but I will copy-paste the segmentation part of the code here anyway.

import re, string, random, glob, operator, heapq

from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo
def segment2(text, prev='<S>'):
    "Return (log P(words), words), where words is the best segmentation."
    if not text: return 0.0, []
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first))
                  for first,rem in splits(text)]
    return max(candidates)

def combine(Pfirst, first, (Prem, rem)):
    "Combine first and rem results into one (probability, words) pair."
    return Pfirst+Prem, [first]+rem
+6
May 27 '16 at 15:16

If you precompile the word list into a DFA (which will be very slow), then the time it takes to match the input will be proportional to the length of the string (in fact, only a little slower than just iterating over the string).

This is effectively a more general version of the trie algorithm mentioned earlier. I only mention it for completeness: at the moment there is no DFA implementation you can just use. RE2 would work, but I don't know whether the Python bindings let you tune how large a DFA you allow before it throws away the compiled DFA data and falls back to NFA search.

+2
Jan 15 '12

Here is the accepted answer translated into JavaScript (requires node.js and the file "wordninja_words.txt" from https://github.com/keredson/wordninja ):

var fs = require("fs");

var splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
var maxWordLen = 0;
var wordCost = {};

fs.readFile("./wordninja_words.txt", 'utf8', function(err, data) {
  if (err) {
    throw err;
  }
  var words = data.split('\n');
  words.forEach(function(word, index) {
    wordCost[word] = Math.log((index + 1) * Math.log(words.length));
  })
  words.forEach(function(word) {
    if (word.length > maxWordLen)
      maxWordLen = word.length;
  });
  console.log(maxWordLen)
  splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
  console.log(split(process.argv[2]));
});

function split(s) {
  var list = [];
  s.split(splitRegex).forEach(function(sub) {
    _split(sub).forEach(function(word) {
      list.push(word);
    })
  })
  return list;
}

module.exports = split;

function _split(s) {
  var cost = [0];

  function best_match(i) {
    var candidates = cost.slice(Math.max(0, i - maxWordLen), i).reverse();
    var minPair = [Number.MAX_SAFE_INTEGER, 0];
    candidates.forEach(function(c, k) {
      if (wordCost[s.substring(i - k - 1, i).toLowerCase()]) {
        var ccost = c + wordCost[s.substring(i - k - 1, i).toLowerCase()];
      } else {
        var ccost = Number.MAX_SAFE_INTEGER;
      }
      if (ccost < minPair[0]) {
        minPair = [ccost, k + 1];
      }
    })
    return minPair;
  }

  for (var i = 1; i < s.length + 1; i++) {
    cost.push(best_match(i)[0]);
  }

  var out = [];
  i = s.length;
  while (i > 0) {
    var c = best_match(i)[0];
    var k = best_match(i)[1];
    if (c == cost[i])
      console.log("Alert: " + c);

    var newToken = true;
    if (s.slice(i - k, i) != "'") {
      if (out.length > 0) {
        if (out[-1] == "'s" ||
            (Number.isInteger(s[i - 1]) && Number.isInteger(out[-1][0]))) {
          out[-1] = s.slice(i - k, i) + out[-1];
          newToken = false;
        }
      }
    }

    if (newToken) {
      out.push(s.slice(i - k, i))
    }

    i -= k
  }
  return out.reverse();
}
+1
Nov 07 '18 at 4:50

It looks like fairly ordinary backtracking will do. Start at the beginning of the string. Scan right until you have a word. Then call the function on the rest of the string. The function returns false if it scans all the way to the right without recognizing a word. Otherwise, it returns the word it found and the list of words from the recursive call.

Example: "tableapple". Finds "tab", then "leap", but no word in "ple". No other word in "leapple". Finds "table", then "app". "le" is not a word, so it tries "apple", recognizes it, and returns.

To get the longest result, keep going, only emitting (rather than returning) correct solutions; then choose the optimal one by whatever criterion you like (maxmax, minmax, average, etc.).
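A minimal sketch of that idea as a generator that emits every valid segmentation (the function name and the example dictionary are made up for illustration):

def segmentations(s, words):
    """Yield every way to split s into dictionary words (backtracking)."""
    if not s:
        yield []
        return
    for end in range(1, len(s) + 1):
        head = s[:end]
        if head in words:
            for rest in segmentations(s[end:], words):
                yield [head] + rest

words = {'tab', 'table', 'apple', 'leap', 'app', 'le', 'cup', 'board', 'cupboard', 'chair'}
all_splits = list(segmentations('tableapple', words))
# [['tab', 'le', 'apple'], ['table', 'app', 'le'], ['table', 'apple']]
print(min(all_splits, key=len))   # pick the best by any criterion, e.g. fewest words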

0
Jan 15 '12 at 14:20

Based on unutbu's solution, I implemented a Java version:

private static List<String> splitWordWithoutSpaces(String instring, String suffix) {
    if (isAWord(instring)) {
        if (suffix.length() > 0) {
            List<String> rest = splitWordWithoutSpaces(suffix, "");
            if (rest.size() > 0) {
                List<String> solutions = new LinkedList<>();
                solutions.add(instring);
                solutions.addAll(rest);
                return solutions;
            }
        } else {
            List<String> solutions = new LinkedList<>();
            solutions.add(instring);
            return solutions;
        }
    }
    if (instring.length() > 1) {
        String newString = instring.substring(0, instring.length() - 1);
        suffix = instring.charAt(instring.length() - 1) + suffix;
        List<String> rest = splitWordWithoutSpaces(newString, suffix);
        return rest;
    }
    return Collections.EMPTY_LIST;
}

Input: "tableapplechairtablecupboard"

Output: [table, apple, chair, table, cupboard]

Input: "tableprechaun"

Output: [tab, leprechaun]

0
Mar 13 '14 at 5:38

For German, there is CharSplit, which uses machine learning and works pretty well for multi-word strings.

https://github.com/dtuggener/CharSplit

0
Jun 28 '18 at 9:10

Expanding on @miku's suggestion to use a Trie: an append-only Trie is relatively straightforward to implement in Python:

class Node:
    def __init__(self, is_word=False):
        self.children = {}
        self.is_word = is_word

class TrieDictionary:
    def __init__(self, words=tuple()):
        self.root = Node()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for c in word:
            node = node.children.setdefault(c, Node())
        node.is_word = True

    def lookup(self, word, from_node=None):
        node = self.root if from_node is None else from_node
        for c in word:
            try:
                node = node.children[c]
            except KeyError:
                return None
        return node

Then we can build a Trie-based dictionary from a set of words:

dictionary = {"a", "pea", "nut", "peanut", "but", "butt", "butte", "butter"}
trie_dictionary = TrieDictionary(words=dictionary)

Which will create a tree that looks like this ( * indicates the beginning or end of a word):

* -> a*
 \-> p -> e -> a*
 |              \-> n -> u -> t*
 \-> b -> u -> t*
 |              \-> t*
 |                   \-> e*
 |                        \-> r*
 \-> n -> u -> t*

We can incorporate this into the solution by combining it with a heuristic on how to choose words. For example, we may prefer longer words to shorter ones:

def using_trie_longest_word_heuristic(s):
    node = None
    possible_indexes = []

    # O(1) short-circuit if whole string is a word, does not go against longest-word wins
    if s in dictionary:
        return [s]

    for i in range(len(s)):
        # traverse the trie, char-wise to determine intermediate words
        node = trie_dictionary.lookup(s[i], from_node=node)

        # no more words start this way
        if node is None:
            # iterate words we have encountered from biggest to smallest
            for possible in possible_indexes[::-1]:
                # recurse to attempt to solve the remaining sub-string
                end_of_phrase = using_trie_longest_word_heuristic(s[possible+1:])

                # if we have a solution, return this word + our solution
                if end_of_phrase:
                    return [s[:possible+1]] + end_of_phrase

            # unsolvable
            break

        # if this is a leaf, append the index to the possible words list
        elif node.is_word:
            possible_indexes.append(i)

    # empty string OR unsolvable case
    return []

We can use this function as follows:

>>> using_trie_longest_word_heuristic("peanutbutter")
["peanut", "butter"]

Because we maintain our position in the Trie as we search for longer and longer words, we traverse the trie at most once per possible solution (rather than twice for peanut: pea, peanut). The final short-circuit saves us from walking char-wise through the string in the worst case.

The final result is just a few checks:

'peanutbutter' - not a word, go charwise
'p' - in trie, use this node
'e' - in trie, use this node
'a' - in trie and edge, store potential word and use this node
'n' - in trie, use this node
'u' - in trie, use this node
't' - in trie and edge, store potential word and use this node
'b' - not in trie from 'peanut' vector
'butter' - remainder of longest is a word

A benefit of this solution is that you find out very quickly whether longer words exist with a given prefix, which spares the need to exhaustively test sequence combinations against the dictionary. It also makes reaching an unsolvable answer comparatively cheap relative to other implementations.

The drawbacks of this solution are a large memory footprint for the trie and the cost of building the trie up front.

0
Aug 23 '18 at 18:57

If you have an exhaustive list of words contained in a string:

word_list = ["table", "apple", "chair", "cupboard"]

Use a list comprehension to iterate over the list, finding each word and how many times it appears.

string = "tableapplechairtablecupboard"

def split_string(string, word_list):
    return ("".join([(item + " ") * string.count(item.lower())
                     for item in word_list
                     if item.lower() in string])).strip()

The function returns a string of words in the order of the list: table table apple chair cupboard

0
Aug 06 '19 at 19:38

You need to identify your vocabulary; perhaps any free word list will do.

Once you have that, use the vocabulary to build a suffix tree and match your input stream against it: http://en.wikipedia.org/wiki/Suffix_tree

-1
Jan 15 '12 at 14:25


