Am I misunderstanding String#hash in Ruby?

I am processing a bunch of data and have not yet coded the duplicate check into the data processor, so I was expecting duplicates to appear. I ran the following SQL query:

SELECT body, COUNT(body) AS dup_count FROM comments GROUP BY body HAVING (COUNT(body) > 1) 

That returned a list of duplicates. Looking through it, I found that the same duplicated comment can have several different hashes. The shortest comment string is "[deleted]", so let me use it as an example. There are nine instances of the "[deleted]" comment in my database, and they are stored with two different hashes: 1169143752200809218 and 1738115474508091027. The hash starting with 116 occurs 6 times, and the one starting with 173 occurs 3 times. But when I run it in IRB, I get the following:

 a = '[deleted]'.hash # => 811866697208321010 

Here is the code I use to create the hash:

    def comment_and_hash(chunk)
      comment = chunk.at_xpath('*/span[@class="comment"]').text # Get comment
      hash = comment.hash
      return comment, hash
    end

I have confirmed that I do not touch the comment anywhere else in my code. Here is my DataMapper class:

    class Comment
      include DataMapper::Resource

      property :uid,    Serial
      property :author, String
      property :date,   Date
      property :body,   Text
      property :arank,  Float
      property :srank,  Float
      property :parent, Integer # Should be UID of another comment, or blank if parent
      property :value,  Integer # Hash to prevent duplicates from occurring
    end

Am I right to believe that .hash on a string will return the same value every time it is called on the same string?

Which value is the correct one if my string is "[deleted]"?

Is there a way to have different strings inside Ruby that SQL will see as the same string? This seems like the most plausible explanation for what is happening, but I'm really shooting in the dark.

+7
3 answers

If you run

ruby -e "puts '[deleted]'.hash"

several times, you will notice that the value is different on each run. The hash value stays constant only for as long as a given Ruby process is alive. The reason is that String#hash is seeded with a random value: rb_str_hash (the C implementation function) uses rb_hash_start, which uses this random seed, and the seed is initialized every time a Ruby process starts.
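
The per-process behavior can be checked with a short script. This is only a sketch: `hash_in_new_process` is a hypothetical helper name, and it assumes the interpreter reported by `RbConfig.ruby` can be launched as a child process.

```ruby
require 'rbconfig'

# Hypothetical helper: start a fresh Ruby interpreter, compute String#hash
# for the given string there, and return the result as an Integer.
def hash_in_new_process(str)
  IO.popen([RbConfig.ruby, '-e', 'print ARGV[0].hash', str], &:read).to_i
end

first  = hash_in_new_process('[deleted]')
second = hash_in_new_process('[deleted]')

# Within one process, equal strings always hash to the same value.
same_in_process = '[deleted]'.hash == '[deleted]'.hash

puts "same process, equal hashes:       #{same_in_process}"
# Each child process gets its own random seed, so these two values
# should differ in practice.
puts "separate processes, equal hashes: #{first == second}"
```

This would explain the values in the question: each run of the data processor stored hashes computed with that run's seed, so the same comment body ended up with a different hash per run.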

You can use a CRC, for example Zlib.crc32, for your own purposes, or you can use one of the OpenSSL::Digest message digests, although the latter is overkill, since you probably don't need cryptographic security just to detect duplicates.
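
A minimal sketch of both options, assuming the only goal is to spot duplicate comment bodies. The helper names `stable_crc` and `stable_digest` and the sample `comments` array are made up for illustration; `Zlib.crc32` and `OpenSSL::Digest::SHA256.hexdigest` come from Ruby's standard library.

```ruby
require 'zlib'
require 'openssl'

# Hypothetical helper: 32-bit checksum that is identical in every Ruby process.
def stable_crc(text)
  Zlib.crc32(text)
end

# Hypothetical helper: SHA-256 hex digest, also stable across processes,
# but much longer than duplicate detection needs.
def stable_digest(text)
  OpenSSL::Digest::SHA256.hexdigest(text)
end

# Made-up sample data standing in for scraped comment bodies.
comments = ['[deleted]', 'first post', '[deleted]']

# Group bodies by checksum; a group with more than one member is a
# duplicate candidate.
duplicates = comments.group_by { |body| stable_crc(body) }
                     .select { |_crc, group| group.size > 1 }
                     .values

p duplicates
```

Because a 32-bit CRC can collide, it is probably worth comparing the bodies themselves before treating two rows with the same checksum as duplicates.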

+9

I use the following as an alternative to String#hash that is consistent across time and processes:

    require 'zlib'

    def generate_id(label)
      Zlib.crc32(label.to_s) % (2 ** 30 - 1)
    end

+6

Ruby intentionally makes String#hash produce different values in different sessions; see "Why is Ruby String.hash incompatible between machines?"

+2
