Are cryptographic hashes injective under certain conditions?

Sorry for the long post. I have a question about cryptographic hash algorithms in general, for example the SHA family, MD5, etc.

In general, such a hash algorithm cannot be injective, since the digest it produces has a fixed length (for example, 160 bits for SHA-1), while the space of possible messages to be digested is practically infinite.

However, what are the properties of the commonly used hash algorithms if we restrict ourselves to messages no longer than the digest itself? Can they be injective on this limited message space? Are there algorithms known to produce collisions even on messages whose bit length is shorter than the bit length of the generated digest?

I am actually looking for an algorithm that has this property, i.e. one which, at least in principle, can generate colliding hashes even for short input messages.

Background: We have a browser plug-in that, for each website visited, asks our server whether the website belongs to one of our known partners. Of course, we do not want to spy on our users. Therefore, to make it hard to reconstruct any kind of surfing history, we do not send the visited URL itself, but a hash digest (currently SHA-1) of a cleaned-up version of it. On the server side we have a table of hashes of known URIs, which is matched against the received hash. We can live with some uncertainty here, because not being able to track our users matters more to us than the occasional mistake.

For obvious reasons, this scheme is rather fuzzy and admits false positives, as well as URIs that fail to match when they should.

So right now we are considering changing the fingerprint generation to something with more structure. For example, instead of hashing the full (cleaned) URI, we could instead:

  • split the host name into components at "." and hash these individually
  • split the path into components at "/" and hash these individually

Then concatenate the resulting hashes into the fingerprint value. Example: hashing "www.apple.com/de/shop" with this scheme (using Adler-32 as the hash) might result in "46989670.104268307.41353536/19857610/73204162".
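A minimal sketch of this component-wise fingerprint in Python, using the standard library's `zlib.adler32` (the `fingerprint` helper and the exact split rules are my own reading of the bullets above):

```python
import zlib

def fingerprint(host: str, path: str) -> str:
    """Hash each host component (split at '.') and each path component
    (split at '/') with Adler-32, then join the host hashes with '.'
    and the path hashes with '/'."""
    host_part = ".".join(str(zlib.adler32(c.encode())) for c in host.split("."))
    path_part = "/".join(str(zlib.adler32(c.encode())) for c in path.split("/") if c)
    return host_part + "/" + path_part

print(fingerprint("www.apple.com", "de/shop"))
# -> 46989670.104268307.41353536/19857610/72745403
```

The first four components match the question's example exactly; the last one differs slightly, so the example was presumably produced from a slightly different cleanup of the path.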

However, since such a fingerprint has a lot of structure (particularly compared to a plain SHA-1 digest), we might accidentally make it easy again to compute the actual URI the user visited (for example, using a pre-computed table of hash values for "common" component values such as "www").

So now I'm looking for a hash/digest algorithm with a high collision rate (Adler-32 is being seriously considered) even on short messages, so that the probability of any individual component hash being unique is low. We hope that the additional structure we impose gives us enough extra information to improve the matching behaviour (i.e., to reduce the rate of false positives/false negatives).

3 answers

I do not believe that hashes are guaranteed to be injective even for messages the same size as the digest. If they were, they would be bijective, which would defeat the point of a hash. This suggests that they are not injective for messages smaller than the digest either.

If you want to encourage collisions, I suggest you use any hash function you like and then throw away bits until it collides enough.

For example, throwing away 159 bits of a SHA-1 digest will give you a pretty high collision rate. You probably don't want to throw away quite that many.
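A sketch of this throw-bits-away idea using Python's `hashlib` (the helper names and the `url-N` test inputs are mine):

```python
import hashlib

def truncated_sha1(data: bytes, bits: int) -> int:
    """Compute a full SHA-1 digest, then keep only the first `bits` bits."""
    value = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return value >> (160 - bits)

def first_collision(bits: int, max_tries: int):
    """Hash 'url-0', 'url-1', ... and report the first colliding pair, if any."""
    seen = {}
    for i in range(max_tries):
        h = truncated_sha1(f"url-{i}".encode(), bits)
        if h in seen:
            return (seen[h], i)
        seen[h] = i
    return None

# Keeping only 16 bits, the birthday bound (~2**8 inputs) makes a
# collision near-certain within a couple of thousand distinct URLs.
print(first_collision(16, 2000))
```

The truncation length is the tuning knob: fewer bits kept means more collisions, so the server learns correspondingly less about which URL was visited.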

However, what you are trying to achieve seems inherently dubious. You want to be able to tell that a URL is one of yours, but not which one. That means you want your URLs to collide with each other, but not with URLs that are not yours. A hash function won't do that for you. Rather, since collisions will be random, and there are many more URLs that are not yours than ones that are (I presume!), any given level of collision will cause far more confusion over whether a URL is one of yours at all than over which of yours it is.

Instead, how about sending the list of URLs to the plugin at startup, and then just sending back a single bit indicating whether the visited URL is in the list? If you don't want to send the URLs explicitly, send their hashes (without trying to maximize collisions). If you want to save space, send a Bloom filter.


Since you are prepared to accept a rate of false positives (that is, random sites identified as whitelisted when in fact they are not), a Bloom filter might be just the thing.

Each client downloads a Bloom filter containing the whole whitelist. Then there is no need to contact the server at all, and hence no risk of spying.

With 2 bytes per URL, the false positive rate will be below 0.1%, and with 4 bytes per URL below 1 in 4 million.
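For illustration, a minimal Bloom filter along these lines (the class, parameter names, and the choice to derive all k index functions from a single SHA-1 digest via double hashing are mine, not the answer's). With 16 bits per item the optimal number of hash functions is k ≈ 0.693 × 16 ≈ 11, which gives roughly the sub-0.1% rate quoted above:

```python
import hashlib

class BloomFilter:
    def __init__(self, n_items: int, bits_per_item: int = 16):
        self.m = n_items * bits_per_item               # total filter bits
        self.k = max(1, round(0.693 * bits_per_item))  # optimal hash count
        self.bits = bytearray(self.m // 8 + 1)

    def _indexes(self, item: str):
        # Double hashing: derive k indexes from one SHA-1 digest.
        d = hashlib.sha1(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for i in self._indexes(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))

bf = BloomFilter(n_items=1000, bits_per_item=16)
bf.add("www.apple.com/de/shop")
print("www.apple.com/de/shop" in bf)  # True -- a Bloom filter never gives false negatives
```

Membership tests on added items always succeed; only items that were never added can (rarely) report a spurious hit.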

Downloading the entire filter (and probably regular updates to it) is a big up-front investment in bandwidth. But supposing the whitelist contains a million URLs (which seems a lot to me, given that you can probably apply some rules to canonicalize URLs before the lookup), that's a 4 MB download. Compare that with a list of a million 32-bit hashes: the same size, but with a false positive rate somewhere around 1 in 4 thousand, so the Bloom filter wins on compactness.

I don't know how the plugin talks to the server, but I doubt you can get an HTTP transaction much under 1 KB, even with a keep-alive connection. So if filter updates are less frequent than one per 4k URL visits by a given user (fewer if there are fewer than a million URLs, more if you want the 1-in-4-million false positive rate), this has a chance of using less bandwidth than the current scheme, and of course it leaks far less information about the user.

This does not work if you want new URLs to be whitelisted immediately, although I suppose the client could still hit the server on every page request, to check whether the filter has changed and, if so, download an update patch.

Even if the Bloom filter is too big to download in full (perhaps for cases where the client has no persistent storage and limited RAM), I think you could still introduce some information hiding by having the client compute which bits of the Bloom filter it needs to see and request those from the server. With a combination of caching in the client (the higher the proportion of the filter you have cached, the fewer bits you need to request and hence the less you tell the server), asking for a window around the actual bit you care about (so you don't tell the server exactly which bit you need), and the client requesting spurious bits it doesn't really need (to hide the information in noise), I expect you could obfuscate which URLs you are visiting. It would take some analysis to work out how well that actually performs: a spy would be looking for a pattern in your requests that correlates with browsing a particular site.
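A hypothetical sketch of the window-plus-decoy request idea (the function names, the filter size, and the window width are all assumptions of mine, not part of the answer):

```python
import hashlib
import random

FILTER_BITS = 1_000_000  # size of the server-side Bloom filter (assumed)

def needed_bits(url: str, k: int = 4) -> list:
    """Bloom-filter bit positions the client must inspect for this URL
    (double hashing from one SHA-1 digest, as in a typical Bloom filter)."""
    d = hashlib.sha1(url.encode()).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % FILTER_BITS for i in range(k)]

def request_ranges(url: str, window: int = 256) -> list:
    """For each needed bit, request a whole window placed so the bit sits
    at a random offset inside it, plus one decoy window at a random
    position to bury the real queries in noise."""
    ranges = []
    for bit in needed_bits(url):
        start = max(0, bit - random.randrange(window))
        ranges.append((start, min(FILTER_BITS, start + window)))
    decoy = random.randrange(FILTER_BITS - window)
    ranges.append((decoy, decoy + window))
    random.shuffle(ranges)  # don't reveal which range is the decoy
    return ranges

print(request_ranges("www.apple.com/de/shop"))
```

Each request thus narrows the server's knowledge only to "some bit in this window", and the decoys and cache hits blur even that.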


I get the impression that you actually want public-key cryptography, where you provide the visitor with a public key used to encrypt the URL, and you decrypt the URL with the secret key.
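To illustrate the flow only, here is a toy textbook-RSA sketch (tiny primes, no padding, completely insecure; a real plugin would use a vetted crypto library):

```python
# Toy textbook RSA -- illustration only, NOT secure (tiny key, no padding).
p, q = 61, 53                       # a real keypair would use ~2048-bit primes
n = p * q                           # modulus (3233)
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (modular inverse)

def encrypt(url: str, e: int, n: int) -> list:
    """Client side: encrypt the URL byte by byte with the public key."""
    return [pow(b, e, n) for b in url.encode()]

def decrypt(cipher: list, d: int, n: int) -> str:
    """Server side: recover the URL with the private key."""
    return bytes(pow(c, d, n) for c in cipher).decode()

cipher = encrypt("www.apple.com", e, n)
print(decrypt(cipher, d, n))  # www.apple.com
```

Note that deterministic, unpadded encryption like this would itself permit a dictionary attack on common URLs, so a real scheme needs randomized padding (e.g. OAEP).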

There are JavaScript implementations all over the place.

