I am processing a batch of data and have not yet implemented the duplicate check in the data processor, so I expected duplicates to appear. I ran the following SQL query:
SELECT body, COUNT(body) AS dup_count FROM comments GROUP BY body HAVING (COUNT(body) > 1)
It returns the list of duplicates. Looking through it, I found that these duplicates have several different hashes. The shortest comment string is "[deleted]", so let me use it as an example. There are nine instances of the "[deleted]" comment in my database, and they are stored with two different hashes: 1169143752200809218 and 1738115474508091027. The hash starting with 116 appears 6 times, and the one starting with 173 appears 3 times. But when I run it in IRB, I get the following:
a = '[deleted]'.hash
Here is the code I use to create the hash:
def comment_and_hash(chunk)
  comment = chunk.at_xpath('*/span[@class="comment"]').text
I have confirmed that I do not modify the comment anywhere else in my code. Here is my DataMapper class:
class Comment
  include DataMapper::Resource
  property :uid,    Serial
  property :author, String
  property :date,   Date
  property :body,   Text
  property :arank,  Float
  property :srank,  Float
  property :parent, Integer
Am I right to believe that calling .hash on a string returns the same value every time it is called on the same string?
Which value is the correct one if my string is "[deleted]"?
Is there a way for strings to be different inside Ruby while SQL treats them as the same string? This seems like the most plausible explanation for what is happening, but I'm really shooting in the dark.
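To narrow these questions down, here is a minimal sketch of the checks I could run. It assumes Ruby 1.9 or later, where String#hash is seeded per process, so it is only guaranteed to be stable within a single run. The sample strings (including the trailing-space variant) are hypothetical and not taken from my data, and whether the database treats such variants as equal in GROUP BY depends on the database and its collation.

```ruby
require 'digest'
require 'rbconfig'

# Hypothetical sample values: strings that look alike but differ in Ruby.
plain    = '[deleted]'
padded   = '[deleted] ' # trailing space, easy to miss when reading query output
variants = [plain, padded, plain.dup]

# Within one process, equal strings always produce equal String#hash values.
same_hash = plain.hash == plain.dup.hash

# Ask a fresh Ruby process for the hash of the same literal. On Ruby 1.9+
# the value is seeded per process, so it usually differs from this one.
child_hash = IO.popen([RbConfig.ruby, '-e', "print '[deleted]'.hash"], &:read).to_i
puts "this process: #{plain.hash}, child process: #{child_hash}"

# Group by exact bytes to expose hidden differences such as whitespace.
groups = variants.group_by(&:bytes)
groups.each_value { |strings| puts "#{strings.first.inspect}: #{strings.size}" }

# A content digest depends only on the bytes, so it is stable across runs.
digest = Digest::SHA256.hexdigest(plain)
puts digest
```

If the two printed hash values differ, then values stored by different runs of the processor would not be comparable, and a content digest such as the SHA-256 above might be a better fit for a persisted duplicate check.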
Noah clark