Specific hash type for string concatenation

Question

I need a specialized hash function h(X, Y) in Java with the following properties.

1. X and Y are strings.
2. h(X, Y) = h(Y, X).
3. X and Y are strings of arbitrary length, and there is no length limit on the result of h(X, Y).
4. h(X, Y) and h(Y, X) must not collide with h(A, B) = h(B, A) if X is not equal to A and Y is not equal to B.
5. h() does not have to be a secure hash function unless security is needed to meet the requirements above.
6. Reasonably high performance, although this is an open-ended criterion.

It seems to me that requirements 2 and 4 are somewhat contradictory, but perhaps I am worrying too much.

Currently, what I am doing in Java is the following:

    public static BigInteger hashStringConcatenation(String str1, String str2) {
        BigInteger bA = BigInteger.ZERO;
        BigInteger bB = BigInteger.ZERO;
        // Weight the code point at index i by 127^(i+1) and sum the terms.
        for (int i = 0; i < str1.length(); i++) {
            bA = bA.add(BigInteger.valueOf(127L).pow(i + 1)
                    .multiply(BigInteger.valueOf(str1.codePointAt(i))));
        }
        for (int i = 0; i < str2.length(); i++) {
            bB = bB.add(BigInteger.valueOf(127L).pow(i + 1)
                    .multiply(BigInteger.valueOf(str2.codePointAt(i))));
        }
        // Multiplication is commutative, so h(X, Y) = h(Y, X).
        return bA.multiply(bB);
    }

I think this is ugly, which is why I am looking for a nicer solution. Thanks.

I forgot to mention that on a dual-core 2.53 GHz MacBook Pro with 8 GB of RAM and Java 1.6 on OS X 10.7, the hash function takes about 270 microseconds for two 8-character ASCII strings. I suspect this will increase as the strings get longer, or if Unicode characters are used.

+4

java string concatenation hash

xtremebytes Jul 31 '12 at 13:34

10 answers

Why not just add their hashCode together?
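
A minimal sketch of this suggestion, assuming the built-in String.hashCode() is used as the per-string hash (the class and method names are hypothetical). Addition is commutative, so requirement 2 holds, but requirement 4 does not: "Aa" and "BB" have the same hashCode, so any pair containing one collides with the same pair containing the other.

```java
final class SumOfHashCodes {
    // Hypothetical helper: combines the two built-in hash codes by addition.
    static long h(String x, String y) {
        // Widen to long so the sum itself cannot overflow.
        return (long) x.hashCode() + (long) y.hashCode();
    }

    public static void main(String[] args) {
        // Symmetric: the order of the arguments does not matter.
        System.out.println(h("foo", "bar") == h("bar", "foo")); // true
        // Not collision-free: "Aa" and "BB" share the hashCode 2112.
        System.out.println(h("Aa", "x") == h("BB", "x"));       // true
    }
}
```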

+3

Markus mikkolainen Jul 31 '12 at 13:37

How strict is requirement 4? If the answer is "not completely strict", you could simply concatenate the two strings, placing the smaller one first (this would lead to a collision between h("A", "B") and h("AB", "")).

If there are any characters that you are sure will never appear in the string values, you can use one of them as a delimiter, which would fix the collision above.
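
A minimal sketch of the delimiter idea, assuming the NUL character never occurs in the inputs (the class name, method name, and choice of delimiter are all hypothetical). The result is the combined string itself, which requirement 3 permits since the result has no length limit.

```java
final class OrderedConcat {
    // Hypothetical delimiter, assumed never to occur in the input strings.
    private static final char DELIMITER = '\u0000';

    static String combine(String x, String y) {
        if (x.indexOf(DELIMITER) >= 0 || y.indexOf(DELIMITER) >= 0) {
            throw new IllegalArgumentException("input contains the delimiter");
        }
        // Put the lexicographically smaller string first so that the order
        // of the arguments does not matter.
        return x.compareTo(y) <= 0 ? x + DELIMITER + y : y + DELIMITER + x;
    }

    public static void main(String[] args) {
        System.out.println(combine("A", "B").equals(combine("B", "A")));  // true
        System.out.println(combine("A", "B").equals(combine("AB", "")));  // false
    }
}
```

Because the delimiter cannot appear inside either part, the two inputs can be recovered from the output, so two different unordered pairs can never produce the same result.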

+1

codebox Jul 31 '12 at 13:42

3) h(X, Y) and h(Y, X) must not collide with h(A, B) = h(B, A) if X is not equal to A and Y is not equal to B.

I think this requirement rules out any hash function that produces numbers that are smaller (on average) than the source strings.

Any requirement for no collisions runs into the roadblock of the Pigeonhole Principle.

+1

Stephen c Jul 31 '12 at 13:52

From requirement 4 it follows that h(x, "") never collides with h(y, "") unless x.equals(y) is true. So there can be no size limit on what h(x, y) produces, because it is forced to produce a unique result for each unique x, and there are infinitely many unique strings. I do not think this is a proper hash function.

+1

gkuzmin Jul 31 '12 at 13:53

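This is based on String#hashCode, which is not a perfect hash function, so it does not fulfil condition 4.

    public static long hashStringConcatenation(String str1, String str2) {
        int h1 = str1.hashCode();
        int h2 = str2.hashCode();
        // Smaller hash code goes in the high 32 bits; the low half is
        // masked so that sign extension cannot overwrite the high half.
        if (h1 < h2) {
            return (((long) h1) << 32) | (h2 & 0xFFFFFFFFL);
        } else {
            return (((long) h2) << 32) | (h1 & 0xFFFFFFFFL);
        }
    }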

0

nabcos Jul 31 '12 at 13:55

Well, @gkuzmin's comment made me wonder why I was computing powers of 127. So here is a slightly simpler version of the code. The changes are as follows:

I no longer compute powers of 127; instead, I concatenate the codePointAt numbers as strings, convert the result to a BigInteger for each input string, and then add the two BigIntegers.
To keep the answer compact, I take the final result mod 2^1024.

The speed is no better (maybe a little worse!), but I think my way of measuring speed is flawed, because it probably also measures the time spent on calling the function.

Here is the modified code. Does it fulfil all the conditions, except for condition 4 in the unfortunate cases where repetitions occur within the 2^1024 result space?

    public static BigInteger hashStringConcatenation(String str1, String str2) {
        if (str1 == null || str1.isEmpty() || str2 == null || str2.isEmpty()) {
            return null;
        }
        BigInteger bA, bB;
        String codeA = "", codeB = "";
        // Concatenate the decimal code points of each string.
        for (int i = 0; i < str1.length(); i++) {
            codeA += str1.codePointAt(i);
        }
        for (int i = 0; i < str2.length(); i++) {
            codeB += str2.codePointAt(i);
        }
        bA = new BigInteger(codeA);
        bB = new BigInteger(codeB);
        return bA.add(bB).mod(BigInteger.valueOf(2).pow(1024));
    }

0

xtremebytes Jul 31 '12 at 23:25

I decided to add another answer because @Anirban Basu has posted a different solution. I do not know how to link to his post, so if someone knows how to do that, please correct me.

Anirban's solution is as follows:

    public static BigInteger hashStringConcatenation(String str1, String str2) {
        if (str1 == null || str1.isEmpty() || str2 == null || str2.isEmpty()) {
            return null;
        }
        BigInteger bA, bB;
        String codeA = "", codeB = "";
        for (int i = 0; i < str1.length(); i++) {
            codeA += str1.codePointAt(i);
        }
        for (int i = 0; i < str2.length(); i++) {
            codeB += str2.codePointAt(i);
        }
        bA = new BigInteger(codeA);
        bB = new BigInteger(codeB);
        return bA.add(bB).mod(BigInteger.valueOf(2).pow(1024));
    }

Your new solution now looks like a hash function, but it still has some problems. I suggest you think about the following:

Maybe it would be better to throw a NullPointerException or IllegalArgumentException when null is used as an argument to the function? And are you sure you do not want to compute a hash for empty strings?
To concatenate a large number of strings, it is better to use a StringBuffer instead of the + operator. Using this class will have a huge positive effect on the performance of your code.
Your hash function is not very safe: it is very simple to construct strings that cause a collision.

You can try this code, which demonstrates a collision in your hash function.

    public static void main(String[] args) {
        // Code points 11, 111 and 111, 11 both concatenate to "11111".
        String firstString = new StringBuffer().append((char) 11).append((char) 111).toString();
        String secondString = new StringBuffer().append((char) 111).append((char) 11).toString();
        BigInteger hash1 = hashStringConcatenation(firstString, "arbitrary_string");
        BigInteger hash2 = hashStringConcatenation(secondString, "arbitrary_string");
        System.out.println("Is hash equal: " + hash1.equals(hash2));
        System.out.println("Conflicted values: {" + firstString + "},{" + secondString + "}");
    }

So it is very easy to break your hash function. It is good that it has a result space of 2^1024, but many real-life collisions for your implementation occur between very similar and simple strings.

P.S. I think you should read about hash algorithms that have already been developed and about hash functions that failed in real life (for example, the hash function of the Java String class, which in the past computed the hash from at most 16 characters of the string), and then examine your solutions against your requirements and real-life usage. At the very least, try to find a hash collision by hand; if you succeed, your solution most likely has a problem.

0

gkuzmin Aug 1 '12 at 8:54

Here is my modified code as suggested by @gkuzmin:

    public static BigInteger hashStringConcatenation(String str1, String str2) {
        BigInteger bA = BigInteger.ZERO, bB = BigInteger.ZERO;
        StringBuffer codeA = new StringBuffer(), codeB = new StringBuffer();
        for (int i = 0; i < str1.length(); i++) {
            codeA.append(str1.codePointAt(i));
        }
        for (int i = 0; i < str2.length(); i++) {
            codeB.append(str2.codePointAt(i));
        }
        bA = new BigInteger(codeA.toString());
        bB = new BigInteger(codeB.toString());
        return bA.multiply(bB).mod(BigInteger.valueOf(2).pow(1024));
    }

Note that for the result I now multiply bA by bB instead of adding them.

In addition, I have added the test function suggested by @gkuzmin:

    public static void breakTest2() {
        String firstString = new StringBuffer().append((char) 11).append((char) 111).toString();
        String secondString = new StringBuffer().append((char) 111).append((char) 11).toString();
        BigInteger hash1 = hashStringConcatenation(firstString, "arbitrary_string");
        BigInteger hash2 = hashStringConcatenation(secondString, "arbitrary_string");
        System.out.println("Is hash equal: " + hash1.equals(hash2));
        System.out.println("Conflicted values: {" + firstString + "},{" + secondString + "}");
    }

and another test with strings having only numeric values:

    public static void breakTest1() {
        // Requires: import java.util.Hashtable;
        Hashtable<String, String> seenTable = new Hashtable<String, String>();
        for (int i = 0; i < 100; i++) {
            for (int j = i + 1; j < 100; j++) {
                String hash = hashStringConcatenation("" + i, "" + j).toString();
                // containsKey looks up the hash among the keys;
                // Hashtable.contains(Object) would search the values instead.
                if (seenTable.containsKey(hash)) {
                    System.out.println("Duplication for " + seenTable.get(hash) + " with " + i + "-" + j);
                } else {
                    seenTable.put(hash, i + "-" + j);
                }
            }
        }
    }

The code runs. This is not an exhaustive test, of course, but breakTest1() did not report any problems when I ran it. @gkuzmin's function prints the following:

    Is hash equal: true
    Conflicted values: { o},{o }

Why do the two strings produce the same hash? Because in both cases the first argument is effectively reduced to the digit string "11111" before it is combined with "arbitrary_string". This is a problem.
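
The multiplicative version also collides on purely numeric strings. A minimal sketch, assuming the encoding of the multiplicative function above (the class name is hypothetical): "12" and "44" map to the numbers 4950 and 5252, "22" and "30" map to 5050 and 5148, and 4950 × 5252 = 5050 × 5148 = 25997400. A lookup-based duplicate check only finds such cases if it searches the table's keys with containsKey, because Hashtable.contains(Object) searches the values.

```java
import java.math.BigInteger;

final class NumericCollisionDemo {
    // Same scheme as the multiplicative version: concatenate the decimal
    // code points of each string, then multiply the two numbers mod 2^1024.
    static BigInteger hash(String str1, String str2) {
        StringBuilder codeA = new StringBuilder();
        StringBuilder codeB = new StringBuilder();
        for (int i = 0; i < str1.length(); i++) {
            codeA.append(str1.codePointAt(i));
        }
        for (int i = 0; i < str2.length(); i++) {
            codeB.append(str2.codePointAt(i));
        }
        return new BigInteger(codeA.toString())
                .multiply(new BigInteger(codeB.toString()))
                .mod(BigInteger.valueOf(2).pow(1024));
    }

    public static void main(String[] args) {
        System.out.println(hash("12", "44")); // 25997400
        System.out.println(hash("22", "30")); // 25997400
    }
}
```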

0

xtremebytes Aug 2 '12 at 3:48

What about a slightly modified function now?

    public static BigInteger hashStringConcatenation(String str1, String str2) {
        BigInteger bA = BigInteger.ZERO, bB = BigInteger.ZERO;
        StringBuffer codeA = new StringBuffer(), codeB = new StringBuffer();
        // Append "0" after every code point as a separator.
        for (int i = 0; i < str1.length(); i++) {
            codeA.append(str1.codePointAt(i)).append("0");
        }
        for (int i = 0; i < str2.length(); i++) {
            codeB.append(str2.codePointAt(i)).append("0");
        }
        bA = new BigInteger(codeA.toString());
        bB = new BigInteger(codeB.toString());
        return bA.multiply(bB).mod(BigInteger.valueOf(2).pow(1024));
    }

Here we append the separator character "0" after each character code, so the character sequences 11, 111 and 111, 11 no longer confuse this function: the concatenation now produces 1101110 and 1110110 respectively. The function still satisfies requirement 2 of the original question.

Does this now solve the problem, albeit within the range of 2 ^ 1024?
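
Since the separator "0" is itself a digit, the encoding can still be ambiguous. A minimal sketch, assuming the encoding loop of the function above (the class and method names are hypothetical): the single character 'd' (code point 100) and the two-character string with code points 1 and 0 both encode to 1000, so they hash identically when paired with the same second argument.

```java
final class SeparatorAmbiguityDemo {
    // Hypothetical helper mirroring the encoding loop: each code point is
    // written in decimal and followed by the separator "0".
    static String encode(String s) {
        StringBuilder code = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            code.append(s.codePointAt(i)).append("0");
        }
        return code.toString();
    }

    public static void main(String[] args) {
        String single = "d";                         // code point 100
        String pair = new String(new char[] {1, 0}); // code points 1 and 0
        System.out.println(encode(single)); // 1000
        System.out.println(encode(pair));   // 1000
    }
}
```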

0

xtremebytes Aug 2 '12 at 5:36

gkuzmin · Accepted Answer · 2012-08-02T08:49:17+0000

Today I decided to add my own solution for this hash function. It has not been tested thoroughly and I have not measured its performance, so feedback is welcome. My solution is below:

    import java.math.BigInteger;
    import java.nio.ByteBuffer;
    import java.nio.IntBuffer;

    public abstract class HashUtil {

        // The hash has the size of 32 integers (32*32 bits).
        private static final int hash_size = 32;

        // Constants that can be changed to help avoid collisions.
        private static final BigInteger INITIAL_HASH = BigInteger.valueOf(7);
        private static final BigInteger HASH_MULTIPLIER = BigInteger.valueOf(31);
        private static final BigInteger HASH_DIVIDER = BigInteger.valueOf(2).pow(32 * hash_size);

        public static BigInteger computeHash(String arg) {
            BigInteger hash = INITIAL_HASH;
            // Process the string in blocks of hash_size characters.
            for (int i = 0; i < arg.length() / hash_size + 1; i++) {
                int[] tmp = new int[hash_size];
                for (int j = 0; j < Math.min(arg.length() - hash_size * i, hash_size); j++) {
                    // tmp[j] holds the j-th character of the current block.
                    tmp[j] = arg.codePointAt(i * hash_size + j);
                }
                // Reduce modulo HASH_DIVIDER to keep the result space bounded.
                hash = hash.multiply(HASH_MULTIPLIER)
                        .add(new BigInteger(convert(tmp)).abs())
                        .mod(HASH_DIVIDER);
            }
            return hash;
        }

        public static BigInteger computeHash(String arg1, String arg2) {
            // Addition is commutative, so the argument order does not matter;
            // the sum is reduced to the same result space.
            return computeHash(arg1).add(computeHash(arg2)).mod(HASH_DIVIDER);
        }

        // Packs the ints into a big-endian byte array.
        private static byte[] convert(int[] arg) {
            ByteBuffer byteBuffer = ByteBuffer.allocate(arg.length * 4);
            IntBuffer intBuffer = byteBuffer.asIntBuffer();
            intBuffer.put(arg);
            return byteBuffer.array();
        }

        public static void main(String[] args) {
            String firstString = "dslkjfaklsjdkfajsldfjaldsjflaksjdfklajsdlfjaslfj",
                    secondString = "unejrng43hti9uhg9rhe3gh9rugh3u94htfeiuwho894rhgfu";
            System.out.println(computeHash(firstString, secondString)
                    .equals(computeHash(secondString, firstString)));
        }
    }

I believe that my solution should not produce collisions for single strings shorter than 32 characters (more precisely, shorter than hash_size). I also think it is not that easy to find collisions. To control the probability of hash collisions for your specific task, you can try other primes instead of 7 and 31 for INITIAL_HASH and HASH_MULTIPLIER. What do you think? Is this good enough for you?

P.S. I think it will work much better if you try much larger primes.
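
One limitation of combining the two per-string hashes by addition is worth illustrating. A minimal sketch, assuming a compact re-implementation of the block scheme above with each code point stored at its own position within the block (the class name is hypothetical): for short strings the per-string hash is linear in the code points, so the pairs ("a", "c") and ("b", "b") produce the same sum, because 97 + 99 = 98 + 98.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;

final class AdditiveCollisionDemo {
    private static final int HASH_SIZE = 32;
    private static final BigInteger MODULUS = BigInteger.valueOf(2).pow(32 * HASH_SIZE);

    // Compact version of the per-string hash: blocks of 32 code points are
    // packed as big-endian ints and folded in with multiplier 31.
    static BigInteger computeHash(String arg) {
        BigInteger hash = BigInteger.valueOf(7);
        for (int i = 0; i < arg.length() / HASH_SIZE + 1; i++) {
            ByteBuffer block = ByteBuffer.allocate(HASH_SIZE * 4);
            int count = Math.min(arg.length() - HASH_SIZE * i, HASH_SIZE);
            for (int j = 0; j < count; j++) {
                block.putInt(arg.codePointAt(i * HASH_SIZE + j));
            }
            hash = hash.multiply(BigInteger.valueOf(31))
                    .add(new BigInteger(block.array()).abs())
                    .mod(MODULUS);
        }
        return hash;
    }

    // Symmetric combination by addition, as in the two-argument computeHash.
    static BigInteger computeHash(String a, String b) {
        return computeHash(a).add(computeHash(b)).mod(MODULUS);
    }

    public static void main(String[] args) {
        // The single-string hashes differ, but the two sums are equal.
        System.out.println(computeHash("a").equals(computeHash("b")));           // false
        System.out.println(computeHash("a", "c").equals(computeHash("b", "b"))); // true
    }
}
```

Larger primes do not remove this kind of collision, since it comes from the addition step rather than from the constants.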