How can I improve spell check time in a C program?

As one of the assignments in Harvard's CS50 course, students are tasked with creating a spell checker program. The main goal of the assignment is speed - pure speed - and I have reached the point where I am beating the staff's execution time, but I feel that I can do better, so I am looking for a push in the right direction.

Here is my pseudo code:

    // read the dictionary word list
    Read entire dictionary in one fread into memory
    rawmemchr through and pick out the words
    send each word through the hash function
    create chain links for any index where collisions occur

    // accept the incoming test words
    Run the test word through the hash function
    compare to the existing table / linked list
    return the result of the comparison
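For context, here is a minimal sketch of what the lookup half of that pseudocode could look like in C. The names (HASH_SIZE, node, hash_word) and the stand-in hash are illustrative, not the code from the repository; the real program uses MurmurHash3.

    #include <ctype.h>
    #include <stdbool.h>
    #include <strings.h>

    #define HASH_SIZE 65536              /* illustrative table size */
    #define MAX_WORD  45                 /* longest word in the CS50 speller spec */

    typedef struct node                  /* one chained entry per dictionary word */
    {
        char word[MAX_WORD + 1];
        struct node *next;
    } node;

    node *table[HASH_SIZE];              /* filled in while loading the dictionary */

    /* stand-in hash reduced to the table size; the real program uses MurmurHash3 */
    unsigned int hash_word(const char *word)
    {
        unsigned long h = 5381;
        for (const char *p = word; *p != '\0'; p++)
            h = h * 33 + (unsigned char)tolower((unsigned char)*p);
        return h % HASH_SIZE;
    }

    /* run one incoming word through the hash and walk that bucket's chain */
    bool check(const char *word)
    {
        for (node *n = table[hash_word(word)]; n != NULL; n = n->next)
            if (strcasecmp(word, n->word) == 0)
                return true;
        return false;
    }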

With a dictionary of 150K words and input text up to 6 MB, I can accurately check the spelling in about half a second.

However, when I look at the words coming from the input text, it is pretty clear that a large percentage of them are common words (for example, "the", "and", "for"), and that most of the misspelled words are also checked several times.

My intuition tells me that I should "cache" the "good hits" and "bad hits" so that I am not running the same words through the table lookup over and over. Even though the current result is very close to O(1), I feel like I could shave off a few microseconds by revisiting my approach.

For example, once the dictionary is loaded, the text input could be 8 MB of nothing but "missspeling". So instead of hashing/checking the same word over and over (at computational cost), I would like to know whether there is a way to programmatically reject words that have already been hashed and rejected, in a way that is cheaper than the hash/check itself. (I am using MurmurHash3, FWIW.)
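As a sketch of that caching idea - reusing the hypothetical check() and hash_word() from the earlier sketch, and noting that the cache lookup still has to hash the word, so the only saving is the chain walk and the string compares:

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    #define CACHE_SIZE 65536                       /* illustrative */

    typedef struct seen                            /* one cached verdict per distinct word */
    {
        char word[46];
        bool correct;
        struct seen *next;
    } seen;

    static seen *cache[CACHE_SIZE];

    bool check(const char *word);                  /* the existing dictionary lookup */
    unsigned int hash_word(const char *word);      /* the same hash can key the cache */

    /* consult the cache first; on a miss, do the normal lookup once and remember it */
    bool check_cached(const char *word)
    {
        unsigned int i = hash_word(word) % CACHE_SIZE;
        for (seen *e = cache[i]; e != NULL; e = e->next)
            if (strcmp(word, e->word) == 0)
                return e->correct;                 /* hit: skip the dictionary chain */

        bool correct = check(word);                /* miss: normal hash-table check */

        seen *e = malloc(sizeof *e);
        if (e != NULL)
        {
            strncpy(e->word, word, sizeof e->word - 1);
            e->word[sizeof e->word - 1] = '\0';
            e->correct = correct;
            e->next = cache[i];
            cache[i] = e;
        }
        return correct;
    }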

I understand that the theoretical performance improvement will be limited to situations where the input text is long and contains a large number of repeated misspellings. Based on some source texts I evaluated, here are some of the results:

    First trial:
        Unique Misspellings: 6960
        Total Misspellings:  17845
        Words in dictionary: 143091
        Words in input text: 1150970
        Total Time:          0.56 seconds

    Second trial:
        Unique Misspellings: 8348
        Total Misspellings:  45691
        Words in dictionary: 143091
        Words in input text: 904612
        Total Time:          0.83 seconds

In the second trial, you can see that I have to go back to the hash table about 5.5 times for every misspelled word! It seems to me that there should be a more efficient way to handle this, since most of my program's time is spent in the hash function.

I could implement POSIX threads (this runs on an 8-core system) to improve the program's run time, but I am more interested in improving my approach and thought process around the problem.

Sorry this is so long, but it is my first post and I am trying to be thorough. I searched before posting, but most of the other spell check posts are about "how to do it" rather than "how to improve it". I am grateful for any suggestions that point me in the right direction.

http://github.com/Ganellon/spell_check

3 answers

This is a pretty well-solved problem. ;-) You should look into a data structure called a trie. A trie is a tree built from individual characters, so that the path itself carries the information. Each node holds the letters that can legally extend the current prefix. When a letter completes a valid word, that is also marked.

For four words:

    root-> [a]-> [a]-> [r]-> [d]-> [v]-> [a]-> [r]-> [k*]-> [s*]
                 [b]-> [a]-> [c]-> [i*]
                                   [u]-> [s*]

This represents "aardvark", "aardvarks", "abaci" and "abacus". Vertically adjacent letters belong to the same node, so the second letter [a b] is one node, and the fifth letter [i* u] is one node.

Walk the trie character by character and check for a complete word when you hit a space. If you can't make a transition with the character you have, it's a bad word. If you haven't landed on a marked word when you hit the space, it's a bad word.

This is O(n) processing (n = word length), and it is very, very fast. Building the trie will consume a bunch of RAM, but I don't think you care about that.
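A minimal sketch of the lookup in C, assuming 27 children per node (a-z plus apostrophe, as in the CS50 dictionary); building the trie is omitted and the names are illustrative:

    #include <ctype.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define ALPHABET 27                          /* 'a'-'z' plus apostrophe */

    typedef struct trie_node
    {
        bool is_word;                            /* true if the path to here spells a word */
        struct trie_node *children[ALPHABET];
    } trie_node;

    /* map a character to a child slot; -1 for characters outside the alphabet */
    static int slot(char c)
    {
        if (c == '\'')
            return 26;
        if (isalpha((unsigned char)c))
            return tolower((unsigned char)c) - 'a';
        return -1;
    }

    /* walk the trie one character at a time: O(word length), regardless of dictionary size */
    bool trie_check(const trie_node *root, const char *word)
    {
        const trie_node *n = root;
        for (const char *p = word; *p != '\0'; p++)
        {
            int i = slot(*p);
            if (i < 0 || n->children[i] == NULL)
                return false;                    /* no transition: cannot be in the dictionary */
            n = n->children[i];
        }
        return n->is_word;                       /* end of word: valid only if marked */
    }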


The noticeable thing about both of your trials is that most of the words are spelled correctly. That means you should focus on optimizing lookups of words that are in the dictionary.

In your first trial, for example, only 1.5% of all words are misspelled. Suppose a word that is not in the dictionary takes, on average, twice as much lookup work (because every word in its bucket has to be checked). Even if you reduced that to zero (the theoretical minimum :)), you would speed up your program by less than 3% - 1.5% of the words doing twice the work accounts for roughly 3% of the total lookup effort.

A common hash table optimization is to move the key you just found to the head of its bucket chain, if it is not already there. That tends to reduce the number of entries checked for common words. It is not a huge speedup, but in cases where some keys are looked up far more often than others, it can definitely be noticeable.
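A sketch of that move-to-front heuristic on a singly linked bucket chain; the node layout mirrors the hypothetical one sketched for the question and the names are illustrative:

    #include <stdbool.h>
    #include <stddef.h>
    #include <strings.h>

    typedef struct node                /* same hypothetical chained layout as before */
    {
        char word[46];
        struct node *next;
    } node;

    /* look up `word` in the chain starting at *head, moving a hit to the front
       so frequently checked words are found on the first comparison next time */
    bool check_move_to_front(node **head, const char *word)
    {
        node *prev = NULL;
        for (node *n = *head; n != NULL; prev = n, n = n->next)
        {
            if (strcasecmp(word, n->word) == 0)
            {
                if (prev != NULL)           /* already at the front? nothing to move */
                {
                    prev->next = n->next;   /* unlink from its current position */
                    n->next = *head;        /* relink at the head of the chain */
                    *head = n;
                }
                return true;
            }
        }
        return false;
    }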

Reducing chain length by lowering the hash table's load factor may help more, at the cost of more memory.

Another possibility, since you don't change the dictionary once it is built, is to store each bucket's chain in contiguous memory, without pointers. That not only reduces memory consumption but also improves cache behaviour, because most words are short, so most of a bucket will fit within a single cache line.
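One possible contiguous layout, sketched under the assumption that all words are packed NUL-terminated into a single arena; the names and fields are illustrative:

    #include <stdbool.h>
    #include <string.h>

    /* each bucket is just an offset into the arena and a count of how many
       words are stored back to back from that offset */
    typedef struct
    {
        unsigned int offset;
        unsigned int count;
    } bucket;

    bool check_flat(const bucket *b, const char *arena, const char *word)
    {
        const char *p = arena + b->offset;
        for (unsigned int i = 0; i < b->count; i++)
        {
            if (strcmp(p, word) == 0)
                return true;
            p += strlen(p) + 1;            /* step over the NUL to the next word */
        }
        return false;
    }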

And since words are usually quite short, you may be able to optimize the comparison itself. strcmp() is well optimized, but it is generally tuned for longer strings. If you are allowed to use it, the SSE4.2 PCMPESTRI instruction is amazingly powerful (though figuring out what it does and how to apply it to your problem could take a lot of time). More simply, you should be able to compare four eight-byte prefixes at once with 256-bit comparison operations (and you may even have access to 512-bit ones), so with clever data layout you could compare against an entire bucket in parallel.
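A sketch of the simpler end of that idea - comparing NUL-padded eight-byte prefixes as single 64-bit integers; the helper name is illustrative:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* compare two words of at most 7 characters (NUL-padded to 8 bytes) with one
       64-bit compare; memcpy avoids alignment and aliasing issues and compiles
       to a single load with optimization enabled */
    static bool equal8(const char a[8], const char b[8])
    {
        uint64_t x, y;
        memcpy(&x, a, sizeof x);
        memcpy(&y, b, sizeof y);
        return x == y;
    }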

That is not to say hash tables are necessarily the optimal data structure for this problem. But remember that the more you can do within a single cache line, the faster your program will run. Pointer-chasing-heavy data structures can be suboptimal even when they look good on paper.


After thinking about this problem for a couple of days and actually writing some code, I have come to the conclusion that optimizing for the speed of successful hash table lookups is probably not the right target for real-world spell checking. It is true that most of the words in the text being checked will usually be spelled correctly - although that depends on the user - but an algorithm that tries to suggest correct spellings is likely to make many unsuccessful lookups as it cycles through candidate misspellings. I know that is probably outside the scope of this problem, but it matters for optimization, because you end up with two quite different strategies.

If you are optimizing for fast rejection, you want lots of empty bucket chains, or a Bloom filter or its moral equivalent, so that you can reject most misspellings on the first probe.

For example, if you have a good hashing algorithm that produces more bits than you need - and you almost certainly do, because spelling dictionaries aren't that big - then you can simply use some otherwise unused bits of the hash as a secondary hash. Without even going to the trouble of implementing a full Bloom filter, you can just add, say, a 32-bit mask to each bucket header representing the possible values of five secondary hash bits among the entries stored in that bucket. Combined with a sparse table - I used 30% occupancy for the experiment, which is not even that sparse - you should be able to reject 80-90% of lookup failures without going past the bucket header.
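A sketch of that bucket-header mask; which five hash bits to reuse and the field names are illustrative choices:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct
    {
        uint32_t present;   /* bit i set if some entry here has secondary hash i */
        /* ... bucket entries follow ... */
    } bucket_hdr;

    /* reuse five otherwise-unused bits of the full hash as a secondary hash (0..31) */
    static inline unsigned secondary(uint64_t full_hash)
    {
        return (full_hash >> 32) & 0x1f;
    }

    /* on insert:  hdr->present |= UINT32_C(1) << secondary(full_hash);
       on lookup:  if the bit is clear, nothing in the bucket can match, so most
       failed lookups are rejected without touching the entries at all */
    static inline bool may_contain(const bucket_hdr *hdr, uint64_t full_hash)
    {
        return (hdr->present >> secondary(full_hash)) & 1u;
    }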

On the other hand, if you are optimizing for successful lookups, it may turn out that fairly large buckets are better, because they reduce the number of bucket headers and improve cache utilization. As long as the entire bucket fits in a cache line, multiple comparisons are so fast that you will not notice the difference. (And since words tend to be short, it is reasonable to expect five or six to fit in a 64-byte cache line.)

In any case, without putting too much work into it, I was able to do a million lookups in about 70 milliseconds of CPU time. Multithreading could reduce the elapsed time quite a bit, especially since no locking is required, given that the hash table is immutable.

The moral I draw from this:

To optimize:

  • you need to understand your data

  • you need to understand your intended use pattern

  • you need to adapt your algorithms based on the above

  • you need to experiment a lot.


A few ideas / insights you can explore:

  • where values are of similar length to, or only slightly longer than, pointers, closed hashing (open addressing) will give you better performance than any open hashing / separate chaining approach

  • knowing the length of the word being checked is cheap (perhaps free if you already track it), and it lets you dispatch the check to whichever routine is optimal for that word length

    • to pack more words into fewer pages of memory (and thus be more cache friendly), you can try having several hash tables, each with buckets sized for the longest word length stored in it

    • 4-byte and 8-byte buckets conveniently allow single-instruction 32-bit and 64-bit comparisons if you NUL-pad the strings (that is, you can union a uint32_t with a char[4], or a uint64_t with a char[8], and compare the integer values); see the sketch after this list

    • your choice of hash function is important: try a few good ones

    • your collision-handling strategy is also important: profile linear probing, quadratic probing, and perhaps a list of prime offsets (1, 3, 7, 11, ...)

    • the number of buckets is a balancing act: too few and you have too many collisions, too many and you get more memory cache misses, so test a range of values to find the optimal setting

  • you can profile a prime number of buckets (mapping hash values into the bucket index range with %) against a power-of-two bucket count, where you can use faster bit-masking with &

  • many of the above interact; for example, if you use a strong hash function there is less need for a prime number of buckets, and if you have fewer collisions there is less need for an elaborate post-collision probe order across alternative buckets

  • spell checking is very easy to scale with threads, because you are only doing reads of a read-only hash table; the up-front insertion of the dictionary into the table is less so, although using several tables, as described above, offers one way to parallelize it
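A sketch of the NUL-padded union comparison and power-of-two bit-masking mentioned in the list above; the table size and the names are illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_BUCKETS 131072u                  /* a power of two, so & can replace % */

    /* a slot for words of up to 7 characters, NUL-padded to exactly 8 bytes */
    typedef union
    {
        char     text[8];
        uint64_t bits;
    } word8;

    static word8 table8[NUM_BUCKETS];            /* closed-hashing table for short words */

    /* reduce a hash value to a bucket index with a bit-mask instead of % */
    static inline unsigned bucket_index(uint64_t hash)
    {
        return hash & (NUM_BUCKETS - 1);
    }

    /* one 64-bit comparison instead of a character-by-character strcmp */
    static inline bool same_word(const word8 *a, const word8 *b)
    {
        return a->bits == b->bits;
    }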

