Rabin-Carp Algorithm

I am interested in implementing the Rabin-Karp algorithm for finding substrings, as stated in the wiki: http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm . Not for homework, but for personal interests. I implemented the Rabin-Karp algorithm (shown below) and it works. However, the performance is really, really bad !!! I understand that my hash function is basic. However, it seems that a simple call to strstr () always surpasses my rabin_karp () function. I can understand why - a hash function does more work than a simple char -by-char compares each cycle. What am I missing here? Should the Rabin-Karp algorithm be faster than calling strstr ()? When is the best time to use the Rabin-Karp algorithm? Hence my personal interests. Have I really implemented the algorithm?

size_t hash(char* str, size_t i) { size_t h = 0; size_t magic_exp = 1; // if (str != NULL) { while (i-- != 0) { magic_exp *= 101; h += magic_exp + *str; ++str; } } return h; } char* rabin_karp(char* s, char* find) { char* p = NULL; if (s != NULL && find != NULL) { size_t n = strlen(s); size_t m = strlen(find); if (n > m) { size_t hfind = hash(find, m); char* end = s + (n - m + 1); for (char* i = s; i < end; ++i) { size_t hs = hash(i, m); if (hs == hfind) { if (strncmp(i, find, m) == 0) { p = i; break; } } } } } return p; } 
+8
c ++ performance string algorithm rabin-karp
source share
3 answers

You have performed the hash incorrectly. The key to Rabin-Karp is to gradually update the hash as a potential match moves along the search string. As you determined, if you recalculate the entire hash for each potential match position, everything will be very slow.

For each case, except for the first comparison, your hash function should take an existing hash, one new character and one old character and return an updated hash.

+12
source share

Rabin-Karp is a sliding hash algorithm - the idea is to be able to move the substring one position in any direction (left or right) and be able to recalculate the hash with a constant number of operations. As this search progresses, the search has complexity O (N * L), where N is the length of the large string and L is the length of the string you are looking for. This is the complexity of the most naive approach and, in my opinion, actually a little poetry.

To improve the algorithm, recompute the exponents of magic_exp and use them to β€œpitch” your hash - basically the same as with polynomials, you need to subtract the highest degree of multiplication by magic_exp and add the hash of the character to the right (to move the hash to the right).

Hope this helps.

+4
source share

strstr uses the KMP algorithm, which is also linear in nature. This means that the complexity of the two algorithms is approximately the same. From now on, the constant is an important factor. Especially when you have bad hash functions with a lot of collisions, KMP will be much faster.

EDIT : One more thing. For the Rabin Karp algorithm, it is very important that all the hash codes of the prefixes are pre-calculated. Now you are not introducing the right Rabin Karp, because the calls to your function will be linear, and not constant in complexity. (Which, incidentally, means that Wikipedia is not a very good source for studying Rabin Karp).

+1
source share

All Articles