Checking if a string contains an English sentence

Right now, I take the dictionary and iterate through all of it. Each time I reach a newline, I build a string running from that newline to the next one, then call string.find() to see whether that English word appears somewhere in the input. It takes a VERY long time: each word takes about 1/4 to 1/2 of a second to check.

It works fine, but I need to check thousands of strings per second. I can run several instances at once without hurting the speed (multithreading), but that still only checks around 10 per second. (I need thousands.)

I'm currently writing code to precompute a large array containing every word in English, which should speed things up, but still not to the speed I need. There must be a better way to do this.

The strings I'm checking will look like this:

"hithisisastringthatmustbechecked" 

but most of them contain complete garbage, just random letters.

I can't simply filter on impossible letter combinations, because that string would be rejected for the "tm" inside "thatmust".

+7
c++ string linguistics
5 answers

You can speed up your search using the Knuth-Morris-Pratt (KMP) algorithm.

Go through each dictionary word and build a lookup table for it. You only need to do this once. Searches for individual words will then run faster, because the "false starts" are eliminated.
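As a rough illustration (not the asker's code), here is a minimal KMP sketch in Python; the function names are my own:

```python
def kmp_table(pattern):
    """Build the KMP failure table: table[i] is the length of the longest
    proper prefix of pattern[:i+1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back; never re-scan text characters
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_find(text, pattern, table):
    """Return the index of the first occurrence of pattern in text, or -1."""
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - len(pattern) + 1
    return -1
```

You would build the table once per dictionary word and reuse it for every string you scan.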

+5

What about a Bloom filter?

A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; that is, a query returns either "possibly in the set" or "definitely not in the set". Elements can be added to the set but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.

The approach could work as follows: you create the set of words you want to check against (done only once), and then you can quickly run an "in/not-in" check for each substring. If the result is "not-in", you can move on (Bloom filters give no false negatives). If the result is "in", you then do a more thorough check to confirm (Bloom filters can give false positives).
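A minimal sketch of that flow, with salted hashes standing in for proper hash functions (the class name and parameters are illustrative, not from any particular library):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash positions per item,
    each derived from a salted MD5 digest."""
    def __init__(self, m=1 << 20, k=7):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # k pseudo-independent bit positions for this item
        for salt in range(self.k):
            h = hashlib.md5(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set -> "possibly in"; any bit clear -> "definitely not in"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

You add every dictionary word once, then a `word in bloom` check rules out most garbage substrings before the expensive confirmation step.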

As I understand it, some spell checkers rely on Bloom filters to quickly test whether your latest word belongs to a dictionary of known words.

+1

There are many strategies for this.

Idea 1

Take the string you are searching and make a copy of each possible substring, beginning at each column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter occurs twice, store the longer substring.)

So the array looks like this:

 a - substr[0] = "astringthatmustbechecked"
 b - substr[1] = "bechecked"
 c - substr[2] = "checked"
 d - substr[3] = "d"
 e - substr[4] = "echecked"
 f - substr[5] = null // since there is no 'f' in it
 ... and so forth

Then, for each word in the dictionary, look only at the array entry for its first letter. That limits the amount of searching. Besides, a word starting with, say, "r" can never match anywhere before the first "r" in the string, and some words won't need any search at all if their first letter isn't present.
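A small Python sketch of this indexing scheme, with hypothetical helper names (it stores one suffix per occurrence of a letter, rather than only the longest):

```python
from collections import defaultdict

def build_suffix_index(line):
    """Map each letter to the suffixes of `line` that start with it."""
    index = defaultdict(list)
    for i in range(len(line)):
        index[line[i]].append(line[i:])
    return index

def find_word(index, word):
    """A word can only start where its first letter occurs, so only
    those suffixes need to be examined."""
    return any(suffix.startswith(word) for suffix in index[word[0]])
```

The index is built once per input string; each dictionary lookup then touches only the positions where it could possibly match.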

Idea 2

Extend this idea by noting the length of the longest word in the dictionary and trimming the stored strings so none is longer than that.

So you have this in an array:

 a - substr[0] = "astringthatmustbechecked" 

But if the longest word in the list is 5 letters, there is no need to store more than:

 a - substr[0] = "astri" 

If a letter occurs several times, you have to store more. Here you end up keeping the whole remainder, because "e" keeps reappearing within 5 letters of the end:

 e - substr[4] = "echecked" 

You can tighten this further by using the longest word beginning with each particular letter when trimming the strings.
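A sketch of the trimmed variant, assuming `maxword` is the length of the longest dictionary word; instead of one merged string per letter, it stores one clipped entry per occurrence:

```python
def build_trimmed_index(line, maxword):
    """Per-letter index of suffixes, each clipped to maxword characters,
    since no dictionary word can match past that length."""
    index = {}
    for i, ch in enumerate(line):
        index.setdefault(ch, []).append(line[i:i + maxword])
    return index
```

With `maxword = 5` on the example string, the "a" entry becomes "astri", and the repeating "e" yields several short entries instead of the full tail.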

Idea 3

This has nothing to do with ideas 1 and 2. It's an idea you could use instead.

You can turn the dictionary into a sort of regular expression stored in a linked data structure. You could also literally write it as a regular expression and apply that.

Suppose these are words in a dictionary:

 arun
 bob
 bill
 billy
 body
 jose

Create a structure like the one below. (It's really a binary tree, drawn this way so I can explain how to use it.)

 a -> r -> u -> n -> *
 |
 b -> i -> l -> l -> *
 |    |         |
 |    o -> b -> *   y -> *
 |    |
 |    d -> y -> *
 |
 j -> o -> s -> e -> *

The arrows indicate which letter must follow another letter. So "r" must come after "a" or there is no match.

Lines going down indicate alternatives. You have "a or b or j" as possible first letters, and then "i or o or d" as possible letters after "b".

The regular expression would look something like /(arun)|(b(ill(y+))|(o(b|dy)))|(jose)/ (though I may have slipped up somewhere). It can make sense to build it as an actual regular expression.

Once you've created this structure, you apply it to your string starting at the first column. Try to make the match by checking the alternatives; if one matches, move forward and try the letter after the arrow along with its alternatives. If you reach a star, the match succeeded. If you run out of alternatives, including backtracking, you move on to the next column.
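Here is one way the structure and matching loop might look in Python, using a nested-dict trie instead of a literal binary tree (the names are mine, not from the answer):

```python
def build_trie(words):
    """Nested-dict trie; a '*' key marks the end of a complete word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["*"] = True
    return root

def words_starting_at(trie, s, col):
    """Walk the trie letter by letter from column `col` of s, yielding
    every dictionary word that begins there."""
    node, i = trie, col
    while i < len(s) and s[i] in node:
        node = node[s[i]]
        i += 1
        if "*" in node:  # reached a star: a word is matched
            yield s[col:i]
```

Running this at every column of the input finds all embedded dictionary words in a single pass per column, with no false starts.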

This is a lot of work, but sometimes it can be convenient.

Side note: I built one of these a while back by writing a program that generated code that ran the algorithm directly, instead of having code that walked the binary-tree data structure.

Consider each set of vertical-bar alternatives to be a switch on a particular character column, and each arrow becomes a nested if. If there is only one alternative, you don't need the full switch, just an if.

It matched characters fast and was really useful for some purpose that escapes me today.

0

In theory, I think you should be able to train a Markov model and use it to decide whether a string is likely a sentence or probably garbage. There is another question about doing this to recognize words rather than sentences: How to determine if a random string sounds like English?

The only difference when training on sentences is that your probability tables will be a little larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary; even 5 books by different authors should be enough for very accurate classification).

Since your sentences are mashed together without clear word boundaries, it's a bit trickier, but the good news is that the Markov model doesn't care about words, only about what follows what. So you can make it ignore spaces by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as the training text, the first paragraph would look like this:

alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation

It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.

I see that you're worried about time: training may take a few minutes (assuming you have already assembled gold-standard "sentences" and "random scrambled strings"). You only need to train once; you can easily save the "trained" model to disk and reuse it for subsequent runs by loading it from disk, which may take a few seconds. Scoring a string requires only a trivially small number of floating-point multiplications to get the probability, so after training it should be very fast.
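A minimal sketch of such a model, using character bigrams with add-one smoothing (a simplification of the full Markov matrix described above; the names and parameters are illustrative):

```python
from collections import defaultdict
from math import log

def train_bigram_model(text):
    """Character-bigram log-probability table with add-one smoothing.
    Assumes lowercase a-z text with spaces already stripped."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    model = {}
    for a in alphabet:
        total = sum(counts[a].values()) + len(alphabet)
        for b in alphabet:
            model[a + b] = log((counts[a][b] + 1) / total)
    return model

def avg_log_prob(model, s):
    """Average per-bigram log probability; higher means more English-like."""
    pairs = list(zip(s, s[1:]))
    return sum(model[a + b] for a, b in pairs) / len(pairs)
```

After training, you would pick a threshold on the average log probability; strings scoring below it get classified as garbage.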

0

This code has been adapted from How to split text without spaces into a word list?:

from math import log

words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a
    string without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal cost.
    costsum = 0
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k
    return costsum

Using the same dictionary as that answer and checking your string outputs:

 >>> infer_spaces("hithisisastringthatmustbechecked")
 294.99768817854056

From this you can work out what threshold to use, bearing in mind that shorter words drive the cost up (if the algorithm can't find any usable word, it returns inf, since it would have to split everything into single-letter words).

0
