Sorry for the long post. I have a question about common cryptographic hash algorithms, for example the SHA family, MD5, etc.
In general, such a hash algorithm cannot be injective, since the digest it produces has a fixed length (for example, 160 bits for SHA-1), while the space of possible input messages is practically unbounded.
However, if we only generate digests for messages that are no longer than the digest itself, what are the properties of the commonly used hashing algorithms? Can they be injective on this restricted message space? Are there algorithms that are known to produce collisions even for messages whose bit length is shorter than that of the digest?
I am actually looking for an algorithm that has this property, i.e. one that, at least in principle, can produce colliding hashes even for short input messages.
Background information: We have a browser plug-in that, for every website visited, asks our server whether the website belongs to one of our known partners. Of course we do not want to spy on our users. So, to make it hard to reconstruct any kind of surfing history, we do not send the visited URL itself, but a hash digest (currently SHA-1) of a cleaned-up version of it. On the server side we have a table of hashes of well-known URIs, which we match against the received hash. We can live with some uncertainty here, since the fact that we cannot track our users is a feature, not a bug.
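For concreteness, here is a rough sketch (in Python) of what the current scheme looks like; the function names and the partner table are invented purely for illustration:

```python
import hashlib

def fingerprint(cleaned_url: str) -> str:
    # Client side: hash the cleaned-up URL so the URL itself never leaves the client.
    return hashlib.sha1(cleaned_url.encode("utf-8")).hexdigest()

# Server side: map digests of known partner URIs back to partner IDs.
# PARTNER_URIS is a made-up placeholder for the real partner list.
PARTNER_URIS = {"www.apple.com/de/shop": "apple-de"}
DIGEST_TO_PARTNER = {fingerprint(u): p for u, p in PARTNER_URIS.items()}

def lookup(digest: str):
    # Return the partner ID for a known digest, or None if it is unknown.
    return DIGEST_TO_PARTNER.get(digest)
```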
For obvious reasons, this scheme is quite fuzzy and admits false positives, as well as false negatives (URIs that should have matched but did not).
So we are now considering changing the fingerprint generation to something with more structure. For example, instead of hashing the full (cleaned-up) URI, we could instead:
- split the host name into components at "." and hash each component individually
- split the path into components at "/" and hash each component individually
Then concatenate the resulting hashes into the fingerprint value. Example: hashing "www.apple.com/de/shop" with this scheme (and with Adler-32 as the hash) could result in "46989670.104268307.41353536/19857610/73204162" (a sketch of this construction follows below).
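A minimal sketch of that construction, assuming Adler-32 via zlib and a scheme-less, already cleaned-up URL (all names are illustrative):

```python
import zlib
from urllib.parse import urlsplit

def component_fingerprint(cleaned_url: str) -> str:
    # Input is assumed to have no scheme, e.g. "www.apple.com/de/shop".
    parts = urlsplit("//" + cleaned_url)
    host_components = parts.hostname.split(".")
    path_components = [c for c in parts.path.split("/") if c]
    host_hashes = [str(zlib.adler32(c.encode("utf-8"))) for c in host_components]
    path_hashes = [str(zlib.adler32(c.encode("utf-8"))) for c in path_components]
    # Join host hashes with "." and path hashes with "/", as in the example above.
    return ".".join(host_hashes) + "/" + "/".join(path_hashes)

print(component_fingerprint("www.apple.com/de/shop"))
```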
However, because such a fingerprint has a lot of structure (especially compared to a plain SHA-1 digest), we might accidentally make it easy again to compute the actual URI the user visited (for example, by using a pre-computed table of hash values for "common" component values such as "www").
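To illustrate the concern, here is a sketch of the kind of pre-computed reverse lookup I am worried about; the word list is invented and would of course be much larger in practice:

```python
import zlib

# Map Adler-32 values of "common" component strings back to the strings.
COMMON_COMPONENTS = ["www", "com", "net", "org", "de", "en", "shop", "apple"]
REVERSE = {zlib.adler32(c.encode("utf-8")): c for c in COMMON_COMPONENTS}

def guess_components(fingerprint: str):
    # Fingerprint format as in the example above: host hashes joined by ".",
    # then the path hashes joined by "/".
    tokens = fingerprint.replace("/", ".").split(".")
    return [REVERSE.get(int(t), "?") for t in tokens if t]

print(guess_components("46989670.104268307.41353536/19857610/73204162"))
```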
So now I am looking for a hash/digest algorithm with a high collision rate (Adler-32 is being seriously considered) even on short messages, so that the probability of any single component hash being unique is low. We hope that the additional structure we impose gives us enough extra information to improve the matching behaviour (i.e. to reduce the rate of false positives / false negatives).
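As a quick sanity check on the collision behaviour, here is a small brute-force experiment over 3-letter lowercase strings (24-bit messages, i.e. shorter than Adler-32's 32-bit digest); it only probes one tiny input space, so it is not a statement about the general collision rate:

```python
import zlib
from collections import defaultdict
from itertools import product
from string import ascii_lowercase

# Bucket all 3-letter lowercase strings by their Adler-32 value and count
# how many digest values are shared by more than one input.
buckets = defaultdict(list)
for chars in product(ascii_lowercase, repeat=3):
    s = "".join(chars)
    buckets[zlib.adler32(s.encode("utf-8"))].append(s)

colliding = {h: ss for h, ss in buckets.items() if len(ss) > 1}
print(f"{len(colliding)} digests are shared among {26 ** 3} 3-letter strings")
```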