Find a line for anagram of another line?

I am trying to find a substring from a string text , which is an anagram for the string pattern .

My question is: Can the Rabin-Karp algorithm be adjusted for this purpose? Or are there better algorithms?

I tried the brute force algorithm, which did not work in my case, because each text and template can be up to a million characters.

Update: I heard that there is the worst O (n 2 ) algorithm that uses O (1) space. Does anyone know what this algorithm is?

Update 2: For reference, here is the pseudocode for the Rabin-Karp algorithm:

 function RabinKarp(string s[1..n], string sub[1..m]) hsub := hash(sub[1..m]); hs := hash(s[1..m]) for i from 1 to n-m+1 if hs = hsub if s[i..i+m-1] = sub return i hs := hash(s[i+1..i+m]) return not found 

In this case, the hash movie function is used, which allows to calculate the new hash in O (1), so the general search is O (nm) in the worst case, but with a good hash function - O (m + n) in the best case. Is there a hash movie function that would create few collisions when searching for string anagrams?

+4
source share
4 answers

Calculate a hash of a pattern that does not depend on the order of letters in the pattern (for example, use the sum of the character codes for each letter). Then apply the same hash function in rolling mode to the text, as in Rabin-Karp. If the hashes match, you need to perform a full test of the template in the current window in the text, because the hash may collide with other values.


By associating each character in the alphabet with a prime, then calculating the product of these primes as your hash code, you will have fewer collisions.

However, there is a bit of mathematical trickery that will help you if you want to calculate the running product as follows: each time you step into the window, multiply the current hash code by the multiplicative reverse code for the character exiting the window, and then multiply by the code for character that goes into the window.

As an example, suppose you compute a hash of the letters 'a' - 'z' as an unsigned 64-bit value. Use the table as follows:

  symbol |  code |  code -1
 ------- + ------ + ---------------------
    a |  3 |  12297829382473034411
    b |  5 |  14757395258967641293
    c |  7 |  7905747460161236407
    d |  11 |  3353953467947191203
    e |  13 |  5675921253449092805
   ...
    z |  103 |  15760325033848937303

The multiplicative inverse of n is a number that gives 1 when multiplied by n modulo some number. The module here is 2 64 since you are using 64-bit numbers. So 5 * 14757395258967641293 should be 1, for example. This works because you just multiply in GF (2 64 ).

Computing a list of first primes is easy, and your platform should have a library to efficiently calculate the multiplicative inverse of these numbers.

Start coding with number 3, because 2 is compatible with the size of an integer (power 2 on any processor you are working on) and cannot be inverted.

+9
source

One option would be to maintain a sliding window containing a histogram of the letters contained in the window. If this histogram ever ends up equal to the character's histogram for the row whose anagram is to be found, then you know that what you are looking for is a match and can output it. If not, you know that what you have cannot be a coincidence.

More specifically, create an associative array. Display characters with their frequencies. If you want to find an anagram of string P, read first | P | characters from the text string T in and accordingly build a histogram. You can move the window one step forward and update A in the associative array O (1) operations, decreasing the frequency associated with the first character in the window, and then increasing the frequency associated with the new character that slid into the window.

If the histograms of the current window and the template window are very different from each other, you should be able to compare them pretty quickly. In particular, let's say that your alphabet is & Sigma ;. In the worst case, comparing two histograms will take O (| & Sigma; |) time, since you will need to check each pair of symbols / frequencies on histogram A with a reference histogram. In the best case, though, you would immediately find a symbol that causes a mismatch between A and the reference histogram, so you won’t need to look at many characters in general.

Theoretically, the worst execution time for this approach is O (| T || & Sigma; | + | P |), since you have to do O (n) work on building the initial histogram, then you have to do the worst -case & Sigma; working on a character in T. However, I would expect that this is probably much faster in practice.

Hope this helps!

+9
source
  • Create an array of 26 ints letter_counts (set to zero) and a missing_count variable to hold the number of missing letters.

  • For each letter in the substring, decrease the associated int letter_counts by 1 and increase the value of missing_count by 1 (so that missing_count will be equal to the size of the substring).

  • Let's say that the substring has size k. Look at the first k letters of the string. increase the binding of int to alphabetic characters by one. If after the increment the value is <= 0, the reduction is absent by the computer by 1.

  • Now we move forward on this line. a. delete the letter closest to the top of the window, reduce the associated letter_counts element. If, after decrementing, we have int <0, then increment missing_count by 1. b. add the first line of the line outside the window. increase the associated letter_counts element. If after the increment we have int <= 0, then the decrement is missing_count by 1.

If at any point missing_count == 0 we have an anagram of the search bar in our window.

The invariant that we support is that missing_count contains the number of letters in our substring that are not in our window. When this value is zero, the letters in our window exactly match the letters in our substring.

This is Theta (n) - linear time, since we look at each letter exactly once.

--- change ---

letter_counts need to store only individual letters of a substring and store only integers the size of up to a substring (signature). Thus, the memory usage is linear in the size of the substring, but constant in the size of the string.

+2
source

It might be foolish for me to suggest this, but one alternative would be to split two strings into arrays and then search them recursively for each character.

To avoid duplicate matches, if a character is found in the text array, its corresponding array index is deleted, which effectively reduces the time to complete the array with each match, while ensuring that the text containing 2x 'B' does not match pattern with 3x 'B'.

To improve performance, you can scan both lines before executing a bitwise character and compile a list of letters in each line, and then compare these lists to see if there are any inconsistencies (for example, trying to find the letter "z" in "apple") , and if there is a line mark as "Anagram not possible".

0
source

All Articles