The most frequently repeated numbers in a huge list of numbers

I have a file with a lot of random numbers (about a million), each separated by a space. I need to find the 10 most frequently occurring numbers in that file. What is the most efficient way to do this in Java? I can think of:
1. Create a hash map where the key is a number from the file and the value is a counter. For each number in the file, check whether the key is already in the map; if so, increment its value, otherwise add a new entry.
2. Create a BST where each node holds a number from the file and a count. For each number, check whether it is already in the BST; if so, increment the count stored in that node, otherwise insert a new node.

I feel that a hash map is the better option, provided I can come up with a good hash function. Can someone suggest which approach is better? Is there another efficient algorithm I could use?

+7
java performance data-structures
11 answers

Edit #2:

OK, I broke my own first rule - never optimize prematurely. The worst case here is probably just using a stock HashMap over a wide range - so I did exactly that. It still runs in about a second, so forget everything else and just do that.

And I will make ANOTHER note to myself to ALWAYS test the speed before worrying about fancy implementations.

(Below is my old, obsolete post, which could still be valid if someone had MANY more data points than a million.)

A HashMap will work, but if your integers fall in a reasonable range (say 1-1000), it would be more efficient to create an array of 1000 integers and, for each of your million numbers, increment the corresponding array element. (Pretty much the same idea as a HashMap, but optimizing out the few unknowns that a hash has to handle should make it several times faster.)
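
A rough sketch of the counting-array idea, assuming the values are known to lie in 0-1000 (the range, file name, class name, and the way the top 10 is pulled out are assumptions, not from the question):

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.Scanner;

    public class ArrayCounter {
        public static void main(String[] args) throws Exception {
            int[] counts = new int[1001];                    // one counter per possible value 0..1000
            Scanner input = new Scanner(new FileInputStream(new File("numbers.txt")));
            while (input.hasNextInt()) {
                counts[input.nextInt()]++;                   // direct index: no hashing, no boxing
            }
            // Pull out the 10 largest counters with a simple repeated scan (cheap for 1001 slots).
            for (int rank = 0; rank < 10; rank++) {
                int best = 0;
                for (int v = 1; v <= 1000; v++) {
                    if (counts[v] > counts[best]) best = v;
                }
                System.out.println(best + " occurred " + counts[best] + " times");
                counts[best] = -1;                           // exclude this value from the next pass
            }
        }
    }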

You could also create a tree. Each node in the tree would contain (value, count), and the tree would be organized by value (smaller values on the left, larger on the right). Walk to the node for a value; if it doesn't exist, insert it; if it does, just increment its count.
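
A minimal sketch of such a node, just to show the insert-or-increment walk (the class and method names are made up, and the tree is left unbalanced):

    /** One node of the counting tree: ordered by value, carrying its own occurrence count. */
    class FreqNode {
        final int value;
        int count = 1;
        FreqNode left, right;

        FreqNode(int value) { this.value = value; }

        /** Insert-or-increment: bump the count if the value exists, attach a new node if not. */
        static FreqNode add(FreqNode node, int value) {
            if (node == null) return new FreqNode(value);
            if (value < node.value) {
                node.left = add(node.left, value);
            } else if (value > node.value) {
                node.right = add(node.right, value);
            } else {
                node.count++;
            }
            return node;
        }
    }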

The range and distribution of your values will determine which of these two (or the regular hash) works best. I think the regular hash would not have many "winning" cases (it would need a wide range and clumped data), and even then the tree might win.

Since this is pretty trivial - I recommend implementing more than one solution and testing the speed against the actual data set.

Edit: RE comment

A TreeMap would work, but it still adds a layer of indirection (and it is so amazingly easy and fun to implement the tree yourself). If you use the stock implementation, you have to use Integer objects and constantly convert to and from int for every increment. There is the indirection of a pointer to the Integer, and the fact that you are storing at least twice as many objects. That doesn't even count the overhead of the method calls, even if they get inlined with any luck.
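
For contrast, a single increment with the stock collection looks roughly like this (a sketch with made-up sample data); every increment looks up a boxed Integer, unboxes it, adds one, and boxes a brand-new object:

    import java.util.TreeMap;

    public class BoxedTreeMapCounter {
        public static void main(String[] args) {
            int[] data = {3, 1, 3, 7, 3, 1};                  // stand-in for the parsed file contents
            TreeMap<Integer, Integer> counts = new TreeMap<Integer, Integer>();
            for (int value : data) {
                Integer old = counts.get(value);              // value is auto-boxed just to do the lookup
                counts.put(value, old == null ? 1 : old + 1); // unbox, add one, box a new Integer
            }
            System.out.println(counts);                       // {1=2, 3=3, 7=1}
        }
    }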

Normally this would be (evil) premature optimization, but once you start approaching hundreds of thousands of nodes you occasionally do have to care about efficiency, so the built-in TreeMap will be inefficient for the same reasons the built-in HashMap would be.

+7

Java handles the hashing for you. You do not need to write a hash function. Just start pushing entries into the HashMap.
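
A minimal sketch of that, assuming Java 8+ and a space-separated file whose name is made up here; HashMap's default hashing of Integer keys does all the work:

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Scanner;

    public class TopTen {
        public static void main(String[] args) throws Exception {
            Map<Integer, Integer> counts = new HashMap<>();
            Scanner input = new Scanner(new FileInputStream(new File("numbers.txt")));
            while (input.hasNextInt()) {
                counts.merge(input.nextInt(), 1, Integer::sum);  // insert 1, or add 1 to the old count
            }
            counts.entrySet().stream()
                  .sorted((a, b) -> b.getValue() - a.getValue()) // most frequent first
                  .limit(10)
                  .forEach(e -> System.out.println(e.getKey() + " x " + e.getValue()));
        }
    }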

Also, if this is something that only needs to be done once (or only occasionally), then don't optimize at all. It will be fast enough. Only bother if it is something that will run repeatedly inside an application.

+5

HashMap

A million integers is not very many, even for an interpreted language, let alone for a fast language like Java. You will hardly notice the runtime. Try this first, and move on to something more complicated only if you find it too slow.

It will probably take longer to split the lines and parse them into integers than to run even the simplest frequency-counting algorithm with a HashMap.

+4

Why use a hash table at all? Just use an array that matches the range of your numbers; then you don't waste time executing a hash function. Then sort the counters when you are done: O(N log N).
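
A sketch of that idea, assuming the values fall in 0-999 (the range, file name, and class name are assumptions); the counters are sorted at the end, as suggested:

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.Scanner;

    public class RangeArrayTopTen {
        public static void main(String[] args) throws Exception {
            final int RANGE = 1000;                          // assumption: all values are 0..999
            int[] counts = new int[RANGE];
            Scanner in = new Scanner(new FileInputStream(new File("numbers.txt")));
            while (in.hasNextInt()) {
                counts[in.nextInt()]++;                      // no hash function, just an index
            }
            // Sort the values by their counters, descending, and print the first ten.
            Integer[] values = new Integer[RANGE];
            for (int v = 0; v < RANGE; v++) values[v] = v;
            Arrays.sort(values, Comparator.comparingInt((Integer v) -> -counts[v]));
            for (int i = 0; i < 10; i++) {
                System.out.println(values[i] + " occurred " + counts[values[i]] + " times");
            }
        }
    }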

+3
  • Allocate an array/vector the same size as the number of input elements
  • Fill the array with the numbers from your file, one number per element
  • Sort the list
  • Iterate through the sorted list and keep track of the 10 longest runs of identical numbers you encounter
  • Output those top ten runs at the end (a minimal sketch of these steps follows below)
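
A rough sketch of those five steps, assuming the numbers are space-separated in a file (the file and class names are made up):

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Scanner;

    public class SortAndCountRuns {
        public static void main(String[] args) throws Exception {
            // Steps 1-2: load every number into an array.
            List<Integer> raw = new ArrayList<>();
            Scanner in = new Scanner(new FileInputStream(new File("numbers.txt")));
            while (in.hasNextInt()) raw.add(in.nextInt());
            int[] data = raw.stream().mapToInt(Integer::intValue).toArray();

            // Step 3: sort, so equal numbers become adjacent "runs".
            Arrays.sort(data);

            // Steps 4-5: measure every run, then report the ten longest.
            List<int[]> runs = new ArrayList<>();            // each entry is {value, runLength}
            for (int i = 0; i < data.length; ) {
                int j = i;
                while (j < data.length && data[j] == data[i]) j++;
                runs.add(new int[] {data[i], j - i});
                i = j;
            }
            runs.sort((a, b) -> b[1] - a[1]);                // longest run first
            for (int[] run : runs.subList(0, Math.min(10, runs.size()))) {
                System.out.println(run[0] + " occurred " + run[1] + " times");
            }
        }
    }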

As a refinement, in step 4 you only need to step forward through the array in strides equal to the length of your 10th-longest run. Any run longer than that will overlap your sampling points. If the 10th-longest run is 100 elements, you only need to sample elements 100, 200, 300, and so on, and at each point count the run of the number you find there (both forwards and backwards). Any run longer than your 10th-longest will necessarily overlap your sample.

You should only apply this optimization once your 10th-longest run becomes very long compared to the other runs in the array.

A map is overkill for this question unless you have very few unique numbers, each with a large number of repeats.

NB: similar to gshauger's answer, but focused

+1

If you need to make this as efficient as possible, use an array of ints, with the position representing the value and the content representing the count. That way you avoid autoboxing and unboxing, the most likely killer of a standard Java collection.

If the range of numbers is too large for that, take a look at PJC and its IntKeyIntMap. It will also avoid autoboxing. I don't know whether it will be fast enough for you, though.

+1

If the range of numbers is small (e.g. 0-1000), use an array. Otherwise, use a HashMap<Integer, int[]> where all the values are arrays of length 1. It should be much faster to increment a value inside an array of primitives than to create a new Integer every time you want to increment a value. You are still creating Integer objects for the keys, but that is hard to avoid. It is not feasible to create an array of 2^31 - 1 ints, after all.
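
A sketch of the length-1 array trick (sample data and names are made up); the increment mutates the existing int[] instead of boxing a new Integer:

    import java.util.HashMap;
    import java.util.Map;

    public class HolderCounter {
        public static void main(String[] args) {
            int[] data = {5, 9, 5, 5, 2};                    // stand-in for the parsed file contents
            Map<Integer, int[]> counts = new HashMap<>();
            for (int value : data) {
                int[] holder = counts.get(value);
                if (holder == null) {
                    counts.put(value, new int[] {1});        // first sighting: a one-element array
                } else {
                    holder[0]++;                             // later sightings: mutate in place, no new objects
                }
            }
            counts.forEach((k, v) -> System.out.println(k + " x " + v[0]));
        }
    }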

If all the input is normalized, so you don't have values like 01 instead of 1, you could use Strings as the map keys, so you don't need to create Integer keys at all.

+1

Use a HashMap to build your data set (value-count pairs) in memory as you traverse the file. The HashMap should give you close to O(1) access to the elements while building the data set (technically, in the worst case a HashMap is O(n)). Once you have finished reading the file, use Collections.sort() on the value Collection returned by HashMap.values() to create a sorted list of value-count pairs. Using Collections.sort() is guaranteed to be O(n log n). For example:

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Scanner;

    public class FrequencyCounter {                      // enclosing class name is arbitrary

        public static class Count implements Comparable<Count> {
            int value;
            int count;

            public Count(int value) {
                this.value = value;
                this.count = 1;
            }

            public void increment() {
                count++;
            }

            public int compareTo(Count other) {
                return other.count - count;              // descending: larger counts sort first
            }
        }

        public static void main(String[] args) throws Exception {
            Scanner input = new Scanner(new FileInputStream(new File("...")));
            HashMap<Integer, Count> dataset = new HashMap<Integer, Count>();
            while (input.hasNextInt()) {
                int tempInt = input.nextInt();
                Count tempCount = dataset.get(tempInt);
                if (tempCount != null) {
                    tempCount.increment();
                } else {
                    dataset.put(tempInt, new Count(tempInt));
                }
            }
            List<Count> counts = new ArrayList<Count>(dataset.values());
            Collections.sort(counts);
            // counts is now ordered by frequency, most frequent first; print the top 10
            for (Count c : counts.subList(0, Math.min(10, counts.size()))) {
                System.out.println(c.value + " : " + c.count);
            }
        }
    }
+1

Actually, there is an O(n) algorithm that does exactly what you want. Your use case is similar to an LFU cache, where an element's access count determines whether it stays in the cache or gets evicted.

http://dhruvbird.blogspot.com/2009/11/o1-approach-to-lfu-page-replacement.html

+1

This is the source of java.lang.Integer.hashCode(), which is the hash function that will be used if you store the entries as a HashMap<Integer, Integer>:

 public int hashCode() { return value; } 

In other words, the (default) hash value of a java.lang.Integer is the integer itself.

What could be more efficient than that?

0

The right way to do this is with a linked list. When you insert an element, walk down the linked list; if the element is there, increment that node's count; otherwise create a new node with a count of 1. After you have inserted every element, you would have a sorted list of elements in O(n * log(n)).

With your methods, you do n inserts and then a sort in O(n * log(n)), so your complexity coefficient is larger.
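
One possible reading of that, as a sketch (names are made up); each insert walks the list and either bumps a count or appends a new node:

    /** One node of the counting list described above. */
    public class CountNode {
        final int value;
        int count = 1;
        CountNode next;

        CountNode(int value) { this.value = value; }

        /** Walk the list: increment the matching node's count, or append a new node with count 1. */
        static CountNode insert(CountNode head, int value) {
            if (head == null) return new CountNode(value);
            CountNode node = head;
            while (true) {
                if (node.value == value) { node.count++; return head; }
                if (node.next == null)   { node.next = new CountNode(value); return head; }
                node = node.next;
            }
        }
    }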

0
