"Big" spellcheckers in Python

Surprisingly, I could not find anyone who has already written this up, but surely someone has done it. I am currently working on a Python project that involves spell checking roughly 16 thousand words. Unfortunately, that number is only going to grow. Right now I am pulling words out of Mongo, iterating through them, and then spell checking them with pyenchant. I have ruled out Mongo as a potential bottleneck by grabbing all my items from it up front. That leaves me at about 20 minutes to process 16k words, which is obviously longer than I want to spend. This leaves me with a couple of ideas / questions:

  • Obviously, I could use threads or some other form of parallelism. Even if I split the work into 4 chunks, I am still looking at roughly 5 minutes assuming peak performance. (A rough sketch of that chunking appears further below, after these questions.)

  • Is there any way to tell which spelling library Enchant is using under the hood of pyenchant? Enchant's website seems to imply that it will use all available spelling libraries / dictionaries when checking spelling. If so, then I am potentially running every word through three or four spelling dictionaries. That may be my problem right here, but I am having a hard time proving it. Even if it is, is my only real option to uninstall the other libraries? That sounds unfortunate. (A sketch of how to inspect the providers is just below.)
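For reference, here is a minimal sketch (not from the original post) of how to inspect which provider backs a dictionary in pyenchant, assuming a recent pyenchant and an en_US dictionary:

    import enchant

    broker = enchant.Broker()
    print(broker.describe())                  # providers pyenchant can see (aspell, myspell, ...)

    # Optional: prefer one backend without uninstalling the others.
    broker.set_ordering("en_US", "aspell,myspell")

    d = broker.request_dict("en_US")
    print(d.provider)                         # the provider that actually serves this dictionary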

So, any ideas on how I can squeeze at least a little more performance out of this? I am fine with chopping this into parallel tasks, but I would still like to get the core piece a little faster before I do that.
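A rough sketch of the chunking idea from the first bullet, using multiprocessing (the load_words_from_mongo helper is hypothetical and stands in for however the words actually come out of Mongo):

    from multiprocessing import Pool

    import enchant

    def check_chunk(words):
        d = enchant.Dict("en_US")                  # one Dict per worker process
        return [(w, d.suggest(w)) for w in words if not d.check(w)]

    if __name__ == "__main__":
        words = load_words_from_mongo()            # hypothetical loader
        chunks = [words[i::4] for i in range(4)]   # four roughly equal slices
        with Pool(processes=4) as pool:
            misspelled = pool.map(check_chunk, chunks)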

Edit: Sorry, posting before the morning coffee... Enchant generates a list of suggestions for me if a word is spelled incorrectly. That appears to be where I spend most of my time in this part of the processing.
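One quick way to confirm that suggest() rather than check() dominates is to time both on a sample misspelling (a throwaway sketch; numbers will vary by machine and backend):

    import time

    import enchant

    d = enchant.Dict("en_US")
    word = "speling"                  # sample misspelling

    start = time.perf_counter()
    d.check(word)
    mid = time.perf_counter()
    d.suggest(word)
    end = time.perf_counter()

    print("check():   %.6f s" % (mid - start))
    print("suggest(): %.6f s" % (end - mid))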

+6
python spell-checking pyenchant
3 answers

I think we agree that the performance bottleneck here is Enchant; for a data set of this size, a Boolean isSpeltCorrectly check is nearly instantaneous. So why not:

  • Build an in-memory set of correctly spelled words, using the dictionaries that Enchant uses, or fetch your own (for example, OpenOffice's).

    Optionally, uniquify the words of the document, say by putting them in a set. This probably will not save you very much.

  • Check whether each word is in that set or not. This is fast, because it is just a set lookup. (Python sets are hash tables, so membership testing is O(1) on average.)

  • If it is not, ask Enchant to recommend a word for it. This is necessarily slow.

This assumes that most of your words are spelled correctly; if they are not, you will have to be cleverer.
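A minimal sketch of this approach, assuming a plain one-word-per-line word list (the path and function names are assumptions, not from this answer):

    import enchant

    # Any plain word list works here, e.g. one exported from OpenOffice.
    with open("/usr/share/dict/words") as f:
        known = {line.strip().lower() for line in f}

    d = enchant.Dict("en_US")

    def spellcheck(words):
        corrections = {}
        for word in set(w.lower() for w in words):   # uniquify: each word checked once
            if word not in known:                    # fast in-memory set lookup
                corrections[word] = d.suggest(word)  # slow path, only for misses
        return corrections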

+5

I would use a Peter Norvig style spell checker. I have written a complete post about this:

http://blog.mattalcock.com/2012/12/5/python-spell-checker/

Here is a snippet of the code that considers the possible edits of the word being checked.

    alphabet = 'abcdefghijklmnopqrstuvwxyz'   # used by edits1(); not shown in the original snippet

    def edits1(word):
        # All ways to split the word into a (prefix, suffix) pair.
        s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in s if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
        inserts = [a + c + b for a, b in s for c in alphabet]
        return set(deletes + transposes + replaces + inserts)
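For context, a short sketch of how edits1() is typically wired up, following Norvig's approach (the big.txt corpus path and the NWORDS / known / correct names are assumptions here, not quoted from the linked post):

    import collections
    import re

    # Build a word-frequency model from a large training corpus.
    NWORDS = collections.Counter(
        re.findall('[a-z]+', open('big.txt').read().lower()))

    def known(candidates):
        return {w for w in candidates if w in NWORDS}

    def correct(word):
        # Prefer the word itself, then any single-edit candidate seen in the corpus.
        candidates = known([word]) or known(edits1(word)) or {word}
        return max(candidates, key=NWORDS.get)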

You should be able to iterate through your growing data file of words very quickly and check them with this code. See the full post for more information:

http://blog.mattalcock.com/2012/12/5/python-spell-checker/

+2

Perhaps the best way to do this would be to compress the document, since that would remove any repeated copies of words, which you really only need to spell check once. I only suggest this because it would probably be faster than writing your own unique-word finder.

The compressed version should have references to the unique words somewhere within its file; you may have to look up how such files are structured.

Then you can spell check all of the unique words. I hope you are not checking them with individual SQL queries or something like that; you should load a dictionary into memory in the form of a tree and then check the words against that.
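A rough sketch of that in-memory idea, using a sorted word list with binary search standing in for the tree (the word list path is an assumption, not from this answer):

    import bisect

    with open("/usr/share/dict/words") as f:
        dictionary = sorted(set(line.strip().lower() for line in f))

    def in_dictionary(word):
        i = bisect.bisect_left(dictionary, word.lower())
        return i < len(dictionary) and dictionary[i] == word.lower()

    def find_misspelled(document_words):
        unique = set(w.lower() for w in document_words)   # each distinct word checked once
        return [w for w in unique if not in_dictionary(w)]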

Once this is done, simply decompress it and, hey presto, the whole thing is spell checked. This should be a fairly quick solution.

Or maybe you do not have to go through the whole zipping process at all if spell checking really is as fast as the comments suggest, which would point to an incorrect implementation instead.

+1
