How to create an (almost) unique hash identifier for objects?

How can I get an identifier for my objects that makes it easy to distinguish them from one another?

    class MyClass {
        private String s;
        private MySecondClass c;
        private Collection<someInterface> coll;
        // ..many more

        public Result calculate() {
            /* use all field values recursively to calculate the result */
            /* takes considerable amount of time. Implemented */
            return result;
        }

        public String hash() {
            /* use all field values recursively to generate a unique identifier */
            // ?????
        }
    }

calculate() usually takes about 40 seconds, so I do not want to call it more than once for the same input.

MyClass objects are pretty huge (~ 60 MB). The value of the Result calculation will be only ~ 100KB.

Whenever I am about to perform a calculation on an object, my program should check whether the same calculation has already been done earlier with exactly the same values, recursively. If so, it should look up the result in, for example, a HashMap. In principle, the MyClass objects themselves could be used as keys, but the HashMap will contain 30-200 elements, and I obviously do not want to store all of them in full size. Therefore, I want to store only 30-200 hash/result pairs.

So I thought I would create an identifier (hash) over all the values inside my MyClass object. How do I do that? With it, I could use the hash to look up the result. I know that a hash such as MD5 does not guarantee 100% uniqueness, because several objects can have the same hash. However, if I store (at most) 200 elements keyed by MD5, the likelihood of two of them sharing a hash seems negligible to me: there are 16^32 = 3.4e38 different hash values. I would be glad to hear anybody's comments or to see other approaches.
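The estimate above can be made a little more precise with the usual birthday bound. This is a rough sketch that assumes the hash output behaves as if uniformly distributed over its 128 bits (an idealization, not a property guaranteed for MD5), with n = 200 stored entries:

```latex
P(\text{collision}) \;\le\; \binom{n}{2} \cdot 2^{-128}
  \;=\; \frac{200 \cdot 199}{2} \cdot \frac{1}{16^{32}}
  \;\approx\; \frac{19900}{3.4 \times 10^{38}}
  \;\approx\; 5.8 \times 10^{-35}
```

Under that assumption, an accidental collision among 200 entries is far less likely than a hardware fault during the 40-second calculation.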

After creating the hash, I no longer need this object, just its Result value.

Two separate objects with the same values must return the same hash code, just like the original hashCode(), except that I am also trying to maintain uniqueness. The probability of two different objects having the same hash code should be absolutely negligible.

I do not know how to describe the problem in other words. If further clarification is required, please ask.

So how can I generate my MyClass.hash() ?

The problem is not how and where to store the hashes. I do not even know how to generate an (almost) unique hash for the whole object that will always be the same for the same values.


Clarification:

Speaking of size, I mean the serialized size on the hard drive.

I do not think that placing objects in a HashMap will reduce their size. That is why I want to store a hash string instead: HashMap<hashStringOfMyClassObject, resultValue>

When you put an object in a HashMap (either as a key or as a value), you are not creating a copy of it. Thus, storing 200 large objects in a HashMap consumes only a little more memory than the 200 objects themselves.

I do not store 200 large objects. I store only 200 different results (as values), which are small, and the 200 corresponding hash codes of the MyClass objects, which are also very small. The point of hashing the objects is to work with the hash, not with the values of the object itself.

+1
4 answers

If you want to create a hash of all your data, you need to make sure that you can obtain every value in byte form.

To do this, it is best if you have control over all the classes involved (with the exception of the built-in Java classes, perhaps), so that you can add a method to each of them for this purpose.

Given that your object is very large, it probably would not be a good idea to just collect it into one large byte array recursively and then calculate the digest. It is probably best to create a MessageDigest object and add a method, for example:

 void updateDigest( MessageDigest md ); 

to each of them. You can declare an interface for this if you want. Each such method will collect its own class data, which participate in the "big calculation" and update the md object with this data. After updating all its own data, it must recursively call the updateDigest method of any classes in it that have this method.

For example, if you have a class with fields:

    int myNumber;
    String myString;
    MyClass myObj; // MyClass has the updateDigest method
    Set<MyClass> otherObjects;

Then its updateDigest method should do something like this:

    // Update the "plain" values that are in the current object
    byte[] myStringBytes = myString.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buff = ByteBuffer.allocate(
            Integer.SIZE / 8      // For myNumber
            + Integer.SIZE / 8    // For myString length
            + myStringBytes.length );
    buff.putInt( myNumber );
    buff.putInt( myStringBytes.length );
    buff.put( myStringBytes );
    buff.flip();
    md.update(buff);

    // Recurse
    myObj.updateDigest(md);
    // Note: the Set must iterate in a deterministic order (for example a sorted set),
    // otherwise two equal objects can produce different digests
    for ( MyClass obj : otherObjects ) {
        obj.updateDigest(md);
    }

The reason I added the length of the string (in fact, its length of the byte representation) to the digest is to avoid situations where you have two String fields:

    String field1 = "ABCD";
    String field2 = "EF";

If you just put their bytes directly into the digest one after the other, it will have the same effect on the digest as:

    String field1 = "ABC";
    String field2 = "DEF";

And this can lead to an identical digest for two different data sets. Adding the length removes this ambiguity.
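The ambiguity described above can be reproduced with a small sketch. The class name `LengthPrefixDemo` and the choice of SHA-256 are assumptions made for illustration only; they are not part of the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class, used only to illustrate the point above.
final class LengthPrefixDemo {

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-256"); // assumed algorithm
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Feeds both strings into the digest back to back, with no separator.
    static byte[] digestPlain(String a, String b) {
        MessageDigest md = newDigest();
        md.update(a.getBytes(StandardCharsets.UTF_8));
        md.update(b.getBytes(StandardCharsets.UTF_8));
        return md.digest();
    }

    // Feeds each string preceded by the length of its byte representation.
    static byte[] digestWithLengths(String a, String b) {
        MessageDigest md = newDigest();
        for (String s : new String[] {a, b}) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            md.update(ByteBuffer.allocate(Integer.BYTES).putInt(bytes.length).array());
            md.update(bytes);
        }
        return md.digest();
    }
}
```

With this sketch, `digestPlain("ABCD", "EF")` and `digestPlain("ABC", "DEF")` see the same byte stream and therefore return equal arrays, while the two `digestWithLengths` calls see different byte streams.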

I used ByteBuffer because it is relatively convenient to add something like int and double .

If you have classes that you do not control and to which you cannot add a method, you will have to be creative. After all, you read values from each such class for the calculation, so you can call the same methods and digest their results. Or you can digest their serialized form if they are serializable.
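For the serialized-form option, one possible sketch streams the object through a `DigestOutputStream` so that the serialized bytes are never held in memory as one large array. The class name `SerializedFormDigest`, its helper methods, and the use of SHA-256 are assumptions for illustration. Note also that the serialized form of hash-based collections can depend on iteration order, so this is only reproducible for classes whose serialization is deterministic.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the original answer.
final class SerializedFormDigest {

    // Sink that throws the bytes away; only the digest side effect is wanted.
    private static final OutputStream DISCARD = new OutputStream() {
        @Override public void write(int b) { }
        @Override public void write(byte[] b, int off, int len) { }
    };

    // Updates md with the Java-serialized form of value.
    static void updateDigest(MessageDigest md, Serializable value) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new DigestOutputStream(DISCARD, md))) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: digest of a single serializable value.
    static byte[] digestOf(Serializable value) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256"); // assumed algorithm
            updateDigest(md, value);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In an `updateDigest` method of your own class, a field of a foreign serializable type could then be handled with `SerializedFormDigest.updateDigest(md, foreignField)`.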

So, in your class, you will create an md object using MessageDigest.getInstance("SHA") or another digest that you want to use.

    MessageDigest md = null;
    try {
        md = MessageDigest.getInstance("SHA");
    } catch (NoSuchAlgorithmException e) {
        // Handle properly
    }

    // Call md.update with class own data and recurse using
    // updateDigest methods of internal objects

    // Compute the digest
    byte[] result = md.digest();

    // Convert to string to be able to use in a hash map
    BigInteger mediator = new BigInteger(1, result);
    String key = String.format("%040x", mediator);

(In fact, you can use BigInteger itself as a key).
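Putting the pieces of this answer together, a minimal end-to-end sketch might look as follows. All names here (`Digestible`, `Part`, `Whole`, `ResultCache`) are hypothetical stand-ins for the asker's classes, and SHA-256 is swapped in for the "SHA" algorithm used above, which is why the key is formatted with 64 hex digits instead of 40.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical interface for everything that can feed itself into a digest.
interface Digestible {
    void updateDigest(MessageDigest md);
}

// Stand-in for a nested class such as MySecondClass.
final class Part implements Digestible {
    private final int number;
    private final String text;

    Part(int number, String text) {
        this.number = number;
        this.text = text;
    }

    @Override
    public void updateDigest(MessageDigest md) {
        byte[] textBytes = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buff = ByteBuffer.allocate(2 * Integer.BYTES + textBytes.length);
        buff.putInt(number).putInt(textBytes.length).put(textBytes);
        buff.flip();
        md.update(buff);
    }
}

// Stand-in for MyClass.
final class Whole implements Digestible {
    private final String name;
    private final List<Part> parts; // a List, so the iteration order is fixed

    Whole(String name, List<Part> parts) {
        this.name = name;
        this.parts = parts;
    }

    @Override
    public void updateDigest(MessageDigest md) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        md.update(ByteBuffer.allocate(Integer.BYTES).putInt(nameBytes.length).array());
        md.update(nameBytes);
        // The element count plays the same role as the string length prefix.
        md.update(ByteBuffer.allocate(Integer.BYTES).putInt(parts.size()).array());
        for (Part part : parts) {
            part.updateDigest(md);
        }
    }

    // Hex string of the digest over all fields, recursively.
    String hash() {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256"); // assumed algorithm
            updateDigest(md);
            return String.format("%064x", new BigInteger(1, md.digest()));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

// Keeps only hash -> result pairs, never the large input objects.
final class ResultCache<R> {
    private final Map<String, R> results = new HashMap<>();

    R getOrCompute(Whole input, Function<Whole, R> calculation) {
        return results.computeIfAbsent(input.hash(), key -> calculation.apply(input));
    }

    int size() {
        return results.size();
    }
}
```

A call such as `cache.getOrCompute(obj, o -> expensiveCalculation(o))`, where `expensiveCalculation` is a placeholder for the 40-second `calculate()`, runs the calculation only the first time a given combination of values is seen.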

+1

You call hash() on an object, and your goal is to remember the result, because the calculation is expensive and the result does not change as long as no state changes?

So why not store the result in an instance variable of the object? With some logic like:

    calculate() {
        if ( m_cachedResult == null ) {
            m_cachedResult = originalCalculate(); // refactored original
        }
        return m_cachedResult;
    }

Then, if you can guarantee that all relevant state is changed only through the setters of this class, clear the cache whenever a recalculation becomes necessary:

    setThing(newValues) {
        m_cachedResult = null;
        // process new state values
    }
+3

In fact, there is a class for this called UUID:

A class that represents an immutable universally unique identifier (UUID). A UUID represents a 128-bit value.

You can find some ideas here, for example:

    import java.util.UUID;

    public class GenerateUUID {
        public static UUID generate() {
            UUID idOne = UUID.randomUUID();
            return idOne;
        }
    }

Then just check whether the generated identifier already exists (which will be almost impossible) and generate a new one if necessary.
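One caveat with respect to the question: `UUID.randomUUID()` returns a different value on every call, so two objects with the same values would not get the same identifier. The standard library also offers `UUID.nameUUIDFromBytes`, which derives a name-based (version 3, MD5-based) UUID from a byte array and is deterministic. The sketch below assumes a hypothetical `canonicalForm` string that encodes all relevant field values; how to produce it is left open.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical demo class, used only to contrast random and name-based UUIDs.
final class NameBasedUuidDemo {

    // Same input bytes always yield the same UUID.
    static UUID idFor(String canonicalForm) {
        return UUID.nameUUIDFromBytes(canonicalForm.getBytes(StandardCharsets.UTF_8));
    }
}
```

Since this variant is built on MD5, it has the same 128-bit collision characteristics as the MD5 approach mentioned in the question.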

+2

Calculating some hash-like identifier is not the best way to do this overall. The likelihood of a collision is extremely low, but it can still happen. Keep in mind that a hash is not a 100% random number; in most cases it is somehow related to the input, so depending on your hash method, some hash values may never occur or, in the worst case, some may be shared by a fairly large set of input objects. The exact probability could be calculated, but that is a question for computer science and probability theory.

Using some function (MD5, SHA, etc.) can help a lot, but it still will not solve the problem completely.

The solution I prefer is similar to Jordi's: extend the class with some identifier. Depending on your project, I would use, for example, the creation date and/or the name of the task. A String name or task description can also make debugging easier.

If they are not unique enough, you can add a unique numeric counter (or an instance of UUID ).

+1