How can I evaluate a hash table implementation? (Using HashMap as a reference)

Problem:

  • I need to compare two hash table implementations (essentially HashMap against an alternative) and draw a reasonable conclusion.

  • I am not interested in 100% accuracy, just in my assessment pointing in the right direction.

  • I am interested in the difference not only per operation, but mainly in the hash table as a whole.

  • I do not have a strict speed requirement, so if the other implementation is reasonably slower I can accept that, but I do expect/require better memory usage (since one of the hash tables is backed by a primitive table).

What I have done so far:

Initially, I created my own “test” with loops and many GC hints to get an idea of the difference, but I read online that using a standard tool is more reliable/suitable.
An example of my approach (HashMapInterface is just a wrapper, so I can switch between implementations):

 int[] keys = new int[10_000_000];
 String[] values = new String[10_000_000];
 for (int i = 0; i < keys.length; ++i) {
     keys[i] = i;
     values[i] = "" + i;
 }
 if (operation.equals("put")) {
     runPutOperation(map, keys, values);
 }

 public long[] runPutOperation(HashMapInterface<String> map, int[] keys, String[] values) {
     long min = Long.MAX_VALUE;
     long max = Long.MIN_VALUE;
     long sum = 0;
     for (int run = 0; run < 10; ++run) {
         long start = System.currentTimeMillis();
         for (int i = 0; i < keys.length; ++i) {
             map.put(keys[i], values[i]);
         }
         long elapsed = System.currentTimeMillis() - start;
         System.out.println(elapsed / 1000d + " seconds");
         if (elapsed < min) { min = elapsed; }
         if (elapsed > max) { max = elapsed; }
         sum += elapsed;
         map = createNewHashMap(); // start each run with a fresh, empty map
         hintsToGC();
     }
     return new long[] {min, max, sum};
 }

 public void hintsToGC() {
     for (int i = 0; i < 20; ++i) {
         System.out.print(". ");
         System.gc();
         try {
             Thread.sleep(100);
         } catch (InterruptedException e) {
             e.printStackTrace();
         }
     }
 }

 private HashMapInterface<String> createNewHashMap() {
     if (jdk) {
         return new JDKHashMapWrapper();
     } else {
         return new AlternativeHashMapWrapper();
     }
 }

 public class JDKHashMapWrapper implements HashMapInterface<String> {
     HashMap<Integer, String> hashMap;

     JDKHashMapWrapper() {
         hashMap = new HashMap<Integer, String>();
     }

     public String put(Integer key, String value) {
         return hashMap.put(key, value);
     }
     // etc.
 }

(I want to test put, get, contains, and memory usage.)
Can I be sure that my approach gives reasonable measurements?
If not, what is the most suitable tool to use, and how?

Update:
- I also test with random keys (~10M random numbers) generated using SecureRandom.
- When the hash table resizes, I print the logical size / actual table length to get the load factor (a small sketch of this setup follows below).
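
For concreteness, a small sketch of that setup (all names are illustrative): the ~10M random keys come from SecureRandom, and the printed ratio is logical size over actual table length:

 import java.security.SecureRandom;

 public class RandomKeySetup {
     public static void main(String[] args) {
         // ~10M random keys, as described above
         SecureRandom rnd = new SecureRandom();
         int[] randomKeys = new int[10_000_000];
         for (int i = 0; i < randomKeys.length; i++) {
             randomKeys[i] = rnd.nextInt();
         }
         // load factor as described: logical size / actual table length
         int entries = randomKeys.length;
         int tableLength = 1 << 24; // example bucket count
         System.out.println("load factor = " + (double) entries / tableLength);
     }
 }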

Update:
For my particular case, where integer keys are what interest me, what pitfalls might there be with my approach?

UPDATE after @dimo414's comments:

Well, at least measuring a hash table “as a whole” doesn't make sense

I mean, how a hash table behaves under different loads, both in running time and in memory consumption.

Every data structure is a compromise between different factors.

I agree. The compromise in my case is an acceptable penalty on access time in exchange for memory.

You need to determine which features you want to test.

1) put(key, value);
2) get(key);
3) containsKey(key);
4) all of the above in the presence of a large number of entries in the hash table (a minimal wrapper interface covering these operations is sketched below)
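
For reference, a minimal sketch of a wrapper interface covering exactly these operations; the name HashMapInterface comes from the code above, but this particular shape is an assumption, not the actual interface:

 public interface HashMapInterface<V> {
     V put(Integer key, V value);      // 1) insert or replace a mapping
     V get(Integer key);               // 2) look up the value for a key
     boolean containsKey(Integer key); // 3) membership test
     int size();                       // helps with 4): observing behavior as the table grows
 }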

3 answers

Key considerations for hash tables are the bucket distribution, the collision resolution strategy, and the shape of your data. Essentially, a hash table takes the key supplied by the application and hashes it down to a bucket index, a value smaller than the number of buckets allocated. When two keys hash to the same bucket, the implementation must resolve the collision and still return the correct value. For example, it could keep a linked list for each bucket and search that list.
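
To make the mechanism concrete, here is a toy sketch (illustrative only, not any real implementation's code) of hashing a key to a bucket index and resolving collisions by searching a per-bucket linked list:

 class ToyChainedMap<V> {
     static class Entry<V> {
         final Integer key;
         V value;
         Entry<V> next;
         Entry(Integer key, V value, Entry<V> next) {
             this.key = key;
             this.value = value;
             this.next = next;
         }
     }

     private final Entry<V>[] table;

     @SuppressWarnings("unchecked")
     ToyChainedMap(int numBuckets) {
         table = new Entry[numBuckets];
     }

     private int bucketOf(Integer key) {
         return (key.hashCode() & 0x7fffffff) % table.length; // non-negative bucket index
     }

     V put(Integer key, V value) {
         int b = bucketOf(key);
         for (Entry<V> e = table[b]; e != null; e = e.next) {
             if (e.key.equals(key)) {  // key already present: replace its value
                 V old = e.value;
                 e.value = value;
                 return old;
             }
         }
         table[b] = new Entry<>(key, value, table[b]); // prepend to the collision chain
         return null;
     }

     V get(Integer key) {
         // walk the collision chain for this bucket
         for (Entry<V> e = table[bucketOf(key)]; e != null; e = e.next) {
             if (e.key.equals(key)) {
                 return e.value;
             }
         }
         return null;
     }
 }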

If your data has many collisions, your performance will suffer because the hash table implementation spends too much time resolving them. On the other hand, if you have a very large number of buckets, you solve the collision problem at the cost of memory. In addition, Java's built-in HashMap will be rehashed if the number of entries exceeds a certain threshold; I believe this is an expensive operation that should be avoided.

Since your key data is the positive integers from 1 to 10M, your test data looks good. I would also make sure the different hash table implementations are initialized to the same bucket count for this test; otherwise it is not a fair comparison. Finally, I would vary the bucket count over a fairly significant range and rerun the tests to see how the implementations change their behavior.
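
For instance, a sketch of what “initialized to the same bucket count” could look like; the alternative wrapper's constructor is a hypothetical stand-in based on the question's code:

 int initialCapacity = 1 << 24; // ~16.7M buckets: holds 10M entries without rehashing at the default 0.75 load factor
 java.util.HashMap<Integer, String> jdkMap = new java.util.HashMap<>(initialCapacity);
 // The alternative implementation should expose a comparable knob, e.g.:
 // AlternativeHashMap<String> altMap = new AlternativeHashMap<>(initialCapacity);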


As I understand it, you are interested both in the execution time of the operations and in the memory consumption of the maps under test.

I will start with memory consumption, since this seems not to have been addressed at all yet. I suggest using a small library called Classmexer. I have used it personally whenever I needed to get 100% correct memory consumption for an object. It takes a java agent approach (it is built on the Instrumentation API), which means you need to add it as a parameter to the JVM running your tests:

 -javaagent:[PATH_TO]/classmexer.jar 

Using Classmexer is very simple. At any time, you can get the memory consumption in bytes by doing:

 MemoryUtil.deepMemoryUsageOf(mapIamInterestedIn, VisibilityFilter.ALL) 

Please note that with the visibility filter you can specify whether to count the memory of the object alone (our map) or also of all other objects reachable through references, which is what VisibilityFilter.ALL does. However, this means that the size you get includes all the objects you used for keys and values: with 100 Integer/String entries, the reported size will include those as well.
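
Putting this together, a minimal sketch of a memory probe; it assumes the agent flag above is on the JVM command line, and the import paths should be double-checked against the Classmexer documentation:

 import com.javamex.classmexer.MemoryUtil;
 import com.javamex.classmexer.MemoryUtil.VisibilityFilter;

 import java.util.HashMap;
 import java.util.Map;

 public class MemoryProbe {
     public static void main(String[] args) {
         Map<Integer, String> map = new HashMap<>();
         for (int i = 0; i < 100; i++) {
             map.put(i, Integer.toString(i));
         }
         // Deep size: the map plus everything reachable from it, keys and values included
         long deepBytes = MemoryUtil.deepMemoryUsageOf(map, VisibilityFilter.ALL);
         System.out.println("deep size = " + deepBytes + " bytes");
     }
 }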

For the timing aspect, I would suggest JMH, since that tool is built for micro-benchmarking. There are many examples online; for instance, this article contains map benchmark examples that may help you a lot.
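
As an illustration, here is a minimal JMH benchmark along those lines; the class name, parameter values, and preloading scheme are my own choices, and the map construction is where each implementation would be swapped in:

 import java.util.HashMap;
 import java.util.Map;
 import java.util.concurrent.TimeUnit;

 import org.openjdk.jmh.annotations.*;
 import org.openjdk.jmh.infra.Blackhole;

 @State(Scope.Thread)
 @BenchmarkMode(Mode.AverageTime)
 @OutputTimeUnit(TimeUnit.NANOSECONDS)
 public class MapGetBenchmark {
     @Param({"10000", "1000000", "10000000"})
     public int size; // number of entries preloaded into the map

     public Map<Integer, String> map;

     @Setup(Level.Trial)
     public void setUp() {
         map = new HashMap<>(); // swap in the alternative implementation here
         for (int i = 0; i < size; i++) {
             map.put(i, Integer.toString(i));
         }
     }

     @Benchmark
     public void get(Blackhole bh) {
         // consume the result so the JIT cannot optimize the lookup away
         bh.consume(map.get(size / 2));
     }
 }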

Note that you have to be careful when calling the Classmexer memory utility, as it will distort the timing results if you call it during a time measurement. Also, I am sure there are many other tools similar to Classmexer, but I like it because it is small and simple.


I just did something similar to this, and I ended up using the profiler built into the NetBeans IDE. You can get very detailed information about both CPU and memory usage. I originally wrote all my code in Eclipse, but NetBeans has a feature for importing Eclipse projects, and it handled everything without problems, in case that is also your situation.

For timing, you might also look at the StopWatch class in Apache Commons Lang. It is a much more intuitive way of timing targeted operations, for example:

 import java.util.HashMap;
 import org.apache.commons.lang3.time.StopWatch; // Apache Commons Lang 3

 StopWatch myMapTimer = new StopWatch();
 HashMap<Integer, Integer> hashMap = new HashMap<>();

 myMapTimer.start();
 for (int i = 0; i < numElements; i++) {
     hashMap.put(i, i);
 }
 myMapTimer.stop();

 System.out.println(myMapTimer.getTime()); // elapsed time in milliseconds
