A naive algorithm will not give good results when applied to real data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-world text.
(If you want an answer to your original question that does not use word frequency, you need to pin down what exactly is meant by the “longest word”: is it better to have one 20-letter word and ten three-letter words, or five 10-letter words? Once you have settled on a precise definition, you only need to change the line defining wordcost to reflect the intended meaning.)
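For instance, here is one hypothetical alternative (my own illustration, not what the rest of this answer uses): giving every dictionary word the same cost makes the dynamic program below minimize the number of words in the segmentation, which on average favours longer words.

words = open("words-by-frequency.txt").read().split()
# Constant cost per word: the segmentation with the fewest (and hence, on
# average, longest) words is now the cheapest one.
wordcost = dict((k, 1.0) for k in words)
maxword = max(len(x) for x in words)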
Idea
The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.
Once you have fixed such a model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probabilities of the individual words, and it is easy to compute with dynamic programming. Instead of using the probability directly, we use a cost defined as the logarithm of the inverse of the probability, to avoid overflows.
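To make the link between the model and the code explicit, here is a small sanity check (my own addition; the only thing taken from the code below is the cost formula):

from math import log, isclose

N = 125000                      # dictionary size, as in the results below
def zipf_probability(n):        # probability of the word with rank n
    return 1.0 / (n * log(N))
def cost(n):                    # same formula as the wordcost line in the code
    return log(n * log(N))

# cost(n) really is -log(probability), so minimizing a sum of costs is the same
# as maximizing a product of probabilities (log turns products into sums).
assert isclose(cost(42), -log(zipf_probability(42)))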
The code
from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))
which you can use with
s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))
Results
I am using this quick-and-dirty dictionary of 125,000 words, which I put together from a small subset of Wikipedia.
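If you want to build a similar file yourself, something along these lines should do (a rough sketch; corpus.txt is a placeholder for whatever plain-text dump you use, not the actual Wikipedia extract used here):

import re
from collections import Counter

# Count word occurrences and write the words out most-frequent first, one per
# line, which is the format words-by-frequency.txt is read in above.
text = open("corpus.txt", encoding="utf-8").read().lower()
counts = Counter(re.findall(r"[a-z]+", text))
with open("words-by-frequency.txt", "w") as f:
    for word, _ in counts.most_common():
        f.write(word + "\n")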
Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.
Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.
After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.
Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness
After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.
As you can see, it is essentially flawless. The most important part is to make sure your word list was trained on a corpus similar to what you will actually encounter, otherwise the results will be very bad.
Optimization
The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need extra speed, you can build a suffix tree from the word list to reduce the size of the set of candidates.
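As a rough illustration of that idea (my own sketch, using a plain prefix trie over the reversed words rather than a full suffix tree), the inner scan can stop as soon as no dictionary word can end at the current position:

from math import log

words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i, k in enumerate(words))

# Trie over reversed words: each node maps a character to a child node, and the
# special key '$' stores the cost of the word ending at that node.
trie = {}
for w, c in wordcost.items():
    node = trie
    for ch in reversed(w):
        node = node.setdefault(ch, {})
    node['$'] = c

def infer_spaces_trie(s):
    """Same dynamic program as infer_spaces, but candidates ending at position i
    are enumerated by walking the reversed-word trie, so the backward scan stops
    as soon as no dictionary word can match."""
    cost = [0.0]
    back = [1]                       # back[i] = length of the best last word
    for i in range(1, len(s)+1):
        best = (9e999, 1)            # fall back to one unknown character
        node = trie
        for k in range(1, i+1):      # grow the candidate word backwards
            ch = s[i-k]
            if ch not in node:       # no dictionary word has this suffix
                break
            node = node[ch]
            if '$' in node:          # s[i-k:i] is a dictionary word
                best = min(best, (cost[i-k] + node['$'], k))
        cost.append(best[0])
        back.append(best[1])
    out, i = [], len(s)
    while i > 0:                     # backtrack using the stored word lengths
        out.append(s[i-back[i]:i])
        i -= back[i]
    return " ".join(reversed(out))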
If you need to process a very large consecutive string, it would be reasonable to split the string to avoid excessive memory usage. For example, you could process the text in blocks of 10,000 characters plus a margin of 1,000 characters on each side to avoid boundary effects. This will keep memory usage to a minimum and will almost certainly have no effect on quality.
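A minimal sketch of such a chunking scheme (my own, assuming the infer_spaces function above; it only needs a trailing margin, because each block starts right after a word the previous block has already committed to):

def infer_spaces_chunked(s, block=10000, margin=1000):
    out = []
    start = 0
    while start < len(s):
        end = min(len(s), start + block + margin)
        words_here = infer_spaces(s[start:end]).split()
        pos = start
        for w in words_here:
            # Words that start inside the trailing margin are redone in the
            # next block, where they get proper right-hand context.
            if pos >= start + block and end < len(s):
                break
            out.append(w)
            pos += len(w)
        start = pos                  # resume right after the last kept word
    return " ".join(out)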