Fastest Java hash algorithm for strings

To keep this simple, my question is: how can I quickly hash a string (about 200 characters)? Security is not important, but collisions are a big deal.

Note: after a quick investigation, it seems that MurmurHash3 might be the best choice. I am open to any comment saying otherwise.
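For context, hashing a paragraph with MurmurHash3 is a one-liner if Guava is on the classpath; a minimal sketch (the sample string is just a placeholder):

```java
import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class Murmur3Example {
    public static void main(String[] args) {
        String paragraph = "Some ~200 character paragraph...";
        // 128-bit MurmurHash3 over the UTF-8 bytes of the string;
        // newer Guava versions also offer murmur3_32_fixed() for a smaller hash.
        HashCode hash = Hashing.murmur3_128()
                .hashString(paragraph, StandardCharsets.UTF_8);
        System.out.println(hash);
    }
}
```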

First, I know there are many similar questions, but so far I have not found a convincing answer.

I have a list of objects, each of which contains a list of approximately 3k paragraphs, all stored in the database. Every X hours this content is updated, and I need to find out whether any paragraphs have changed and, if so, push only those new paragraphs.

The fastest way to find the differences (knowing that most of the time the content will be identical) is to build a Merkle tree, save it in the database, and iterate over the Merkle tree to find the differences, instead of comparing the paragraphs themselves.
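To illustrate what I mean, a minimal sketch of a Merkle-style node, where a leaf hashes its own content and an inner node hashes its children's hashes, so two trees can be compared top-down without ever touching identical paragraphs (the names and the pluggable hash function are illustrative, not final code):

```java
import java.util.List;
import java.util.function.ToLongFunction;

// Illustrative Merkle-style node: a leaf hashes its own content,
// an inner node hashes the concatenation of its children's hashes.
class MerkleNode {
    final long hash;
    final List<MerkleNode> children;

    private MerkleNode(long hash, List<MerkleNode> children) {
        this.hash = hash;
        this.children = children;
    }

    static MerkleNode leaf(String content, ToLongFunction<String> hashFn) {
        return new MerkleNode(hashFn.applyAsLong(content), List.of());
    }

    static MerkleNode inner(List<MerkleNode> children, ToLongFunction<String> hashFn) {
        StringBuilder combined = new StringBuilder();
        for (MerkleNode child : children) combined.append(child.hash).append(':');
        return new MerkleNode(hashFn.applyAsLong(combined.toString()), children);
    }
}
```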

This means that in my case I will be creating ten thousand hashes per second to compare with what is in the database, so I need a very efficient way to create them. I don't care about security; I just need the number of collisions to stay very low.

What would be the best algorithm available in Java for this?


In my case, the main object consists of Sections, which consist of Languages, which consist of Paragraphs. Comparison strategy (a sketch follows the list):

1) If the hash of the main object is identical, stop; otherwise go to 2)

2) Loop over all Sections, keeping only the Sections whose hash differs

3) Loop over all Languages of those Sections, keeping only the Languages whose hash differs

4) Loop over all Paragraphs of those Languages; if a Paragraph's hash differs, push the new content.
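A minimal sketch of that walk, assuming the stored and fresh trees have the same shape (real code would need ID-based matching to handle added or removed entries; the record types are hypothetical and need Java 16+):

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical domain types mirroring the hierarchy described above:
// main object -> Sections -> Languages -> Paragraphs, each carrying a stored hash.
record Paragraph(long hash, String content) {}
record Language(long hash, List<Paragraph> paragraphs) {}
record Section(long hash, List<Language> languages) {}
record MainObject(long hash, List<Section> sections) {}

class DiffWalker {
    // Compare the stored tree against the freshly computed one; only descend
    // where hashes differ, and push the paragraphs that actually changed.
    static void diff(MainObject stored, MainObject fresh, Consumer<Paragraph> push) {
        if (stored.hash() == fresh.hash()) return;                   // step 1
        for (int s = 0; s < fresh.sections().size(); s++) {          // step 2
            Section oldS = stored.sections().get(s);
            Section newS = fresh.sections().get(s);
            if (oldS.hash() == newS.hash()) continue;
            for (int l = 0; l < newS.languages().size(); l++) {      // step 3
                Language oldL = oldS.languages().get(l);
                Language newL = newS.languages().get(l);
                if (oldL.hash() == newL.hash()) continue;
                for (int p = 0; p < newL.paragraphs().size(); p++) { // step 4
                    Paragraph oldP = oldL.paragraphs().get(p);
                    Paragraph newP = newL.paragraphs().get(p);
                    if (oldP.hash() != newP.hash()) push.accept(newP);
                }
            }
        }
    }
}
```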

1 answer

This excellent answer on Programmers Stack Exchange tells you everything you need to know.

Short version: use FNV-1a, a.k.a. the Fowler–Noll–Vo hash function. It has excellent performance, high randomness, and low collisions.
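FNV-1a is also simple enough to write by hand; a minimal 64-bit version over a string's UTF-8 bytes, using the published FNV offset basis and prime:

```java
import java.nio.charset.StandardCharsets;

public final class Fnv1a {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long PRIME = 0x100000001b3L;

    // 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
    public static long hash64(String s) {
        long hash = OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= PRIME;
        }
        return hash;
    }
}
```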

Anything further I could add to this question would just be a copy-and-paste from that Programmers.SE answer, which, by the way, is the second highest-voted answer on the entire site.

Some other thoughts:

  • Ultimately, you have a specific use case. Most people do not deal with billion-item data sets on a regular basis, so you may have to do your own benchmarking (a JMH sketch follows this list).
  • That said, its high randomness suggests the algorithm probably scales well for hashing English text.
  • You didn't really talk about other constraints: can you store the entire data set in memory? What are your memory-footprint requirements?
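As a starting point for that benchmarking, a minimal JMH sketch, assuming JMH is on the classpath and reusing the hypothetical Fnv1a class from the sketch above:

```java
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class StringHashBenchmark {
    private String paragraph;

    @Setup
    public void setup() {
        // Build a ~200 character string, roughly matching the question's input size.
        StringBuilder sb = new StringBuilder(200);
        for (int i = 0; i < 200; i++) sb.append((char) ('a' + (i % 26)));
        paragraph = sb.toString();
    }

    @Benchmark
    public long fnv1a() {
        return Fnv1a.hash64(paragraph);   // hypothetical class from the sketch above
    }

    @Benchmark
    public int jdkHashCode() {
        return paragraph.hashCode();      // baseline: the JDK's built-in string hash
    }
}
```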

See also: Fastest Hash Algorithm for Text Data
