How to gracefully calculate the signature of a word anagram in ruby?

Arising from this question, I am looking for an elegant (ruby) way to calculate the signature of a word suggested in this answer.

The idea is to sort the letters in the word, as well as run through the length, encoding the repeated letters. For example, Mississippi first becomes iiiimppssss, and then can be further shortened by encoding as 4impp4s.

I am relatively new to the ruby, and although I could hack something, I am sure that this is one liner for someone who has more experience with ruby. I would be interested to see how people fit and improve my ruby ​​knowledge.

edit: to clarify, signature calculation performance is not a big deal for my application. I am looking to calculate a signature, so I can store it with every word in a large database of words (450 thousand words), then request words that have the same signature (i.e. all anagrams of a given word that are actual English words). Hence the focus is on space. The "elegant" part is only to satisfy my curiosity.

+3
algorithm ruby
source share
3 answers

I am also not a very Ruby person, but as I noted in another comment, this seems to work for the described algorithm.

s = "mississippi" s.split('').sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] } 

Of course, you need to make sure that the word has lowercase letters, does not contain numbers, etc.

As requested, I will try to explain the code. Please forgive me if I do not get all the correct terminology of Ruby or reg ex, but here.

I think the split / sort / join section is pretty simple. The interesting part for me begins when gsub is called. This will replace the substring that matches the regular expression with the return value from the subsequent block. Reg ex finds any character and creates a back link. This is the "(.)" Part. Then we continue the matching process using the "\ 1" backlink, which evaluates any character found in the first part of the match. We want this character to be found at least two more times for the total minimum number of occurrences of three. This is done using the quantifier "{2,}".

If a match is found, the corresponding substring is then passed to the next block of code as an argument thanks to "| s |" part. Finally, we use the string equivalent of the corresponding substring length and add to it any character that makes up this substring (they should all be the same) and return the concatenated value. The return value replaces the original substring. The whole process continues until nothing is found, because it is a global replacement for the original string.

I apologize if this is embarrassing. As often happens, it’s easier for me to visualize the solution than to clearly explain it.

+5
source share

The fastest way to create a sorted list of letters:

 "mississippi".unpack("c*").sort.pack("c*") 

This is slightly faster than split ('') and join (). For comparison, it is also better to assemble the array back to String, so you do not need to compare arrays.

+6
source share

I do not see an elegant solution. You can use the split message to get the characters into an array, but then, once you sort the list, I don't see the primitive concatenate linear time primitive to return to the string. I am surprised.

By the way, encoding long strings is almost certainly a waste of time . I needed to see very impressive measurements before I think it is worth considering. If you avoid coding at run time, you can anagrammatize any string , not just a string of letters. And if you know that you only have letters and you are trying to save space, you can pack them 5 bits per letter.

--- Irma Vep


EDIT : Another poster found join , which I missed. Nice.

+2
source share

All Articles