Map equivalent data structure (in java) for large datasets

Question

Map equivalent data structure (in java) for large datasets

Is there an already implemented data structure that I can use to assign an integer to the object (in my case Edge)? I am reading a graph from a file, 10 mil vertices, 60 mil edges, and I assign to each edge the cost using the map (cost.put (e, cost)).

I create a cost map in this way:

costs = new HashMap<Edge,Integer>();

The exception he gives is:

 java.lang.OutOfMemoryError: Java heap space at java.util.HashMap.resize(Unknown Source) at java.util.HashMap.addEntry(Unknown Source) at java.util.HashMap.put(Unknown Source)

+7

java directed-graph

Radu stejerean Oct 31 '12 at 9:12

source share

8 answers

You have 6e7 ribs. A regular object accepts 24 bytes (64-bit HotSpot), so 1.44e9 bytes are right there (1.5 GB). Now you present the most effective map you can imagine, adding only 67 links plus 67 Integer objects. This is another 2.4e8 bytes for refs and 1.44e9 bytes for Integer s: another 1.5 GB, the total number is now 3 GB, and this is the theoretical bottom line for your problem (modulo caching, see below).

Based on this, I suggest that you simply extend the Edge class with another int field. This will significantly reduce the amount of your memory.

If this is not an option, and:

all your integers rarely exceed two digits,
you try never to use new Integer , but Integer.valueOf , autoboxing etc.,
you are using Java 7,

You will automatically benefit from the built-in cache with a small amount. If integers take values from a wider range, but still with a lot of duplication, a custom cache is highly recommended.

+6

Marko topolnik Oct 31 '12 at 9:24

source share

In addition to changing jvms memory settings, you can configure HashMap memory management with initial capacity and load balance.

Excerpt from Javadoc:

A HashMap instance has two parameters that affect its performance: initial power and load factor. Capacity is the number of buckets in the hash table, and the initial capacity is just at the time the hash table was created. A load factor is a measure of how full a hash table is allowed before capacity automatically increases. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rephrased (that is, the internal structure data is rearranged), so the hash table has approximately twice the number of buckets.

+3

atamanroman Oct 31 '12 at 9:20

source share

Instead of creating a HashMap

 costs = new HashMap<Edge,Integer>();

You can save the value in the Edge object.

 class Edge { Vertex a; Vertex b; int cost; }

This way you can save some memory in the system.

+3

Azodious Oct 31 '12 at 9:21

source share

Back to the original problem: you have Edges that have costs. Since your graph is sparse, why not use a sparse matrix? Perhaps Object-to-Integer mapping is not what you really need and need. You can look at apache.commons.math, I think they have sparse matrices. In addition, you need to think about how you access the costs in your algorithms, choose the appropriate sparse format (column-based coding / row-based coding, or other). Or you don't care, and use any, but then you have to convert the thing at the beginning of your algorithms.

+2

kutschkem Oct 31 '12 at 9:22

source share

You understand that it takes a whole series of RAM, right? Try increasing the heap size and everything will be fine ...

And to answer your initial question: yes, this is what Map is for ...

+1

ppeterka Oct 31 '12 at 9:17

source share

You must indicate for each project how much heap space your project wants.

I think you could take the next step:

 Right mouse click - Run As - Run Configuration - Arguments - Vm Arguments, then add this -Xmx2048m

+1

Ami Oct 31 '12 at 9:18

source share

It is possible that you are looking for TObjectIntHashMap. This is similar to HashMap<Edge, Integer> , except that it saves int as a primitive, potentially saving you some memory. This collection can also be a little faster when the collection is larger (since it fits better in the cache)

 TObjectIntHashMap<Edge> costs = new TObjectIntHashMap<Edge>(); costs.put(e, cost); // cost is stored as a primitive not its wrapper object.

+1

Peter Lawrey Oct 31 '12 at 9:46

source share

Tim bender · Accepted Answer · 2012-10-31T09:17:28+0000

HashMap is the right data structure for the underlying Map . The problem you are facing is that the JVM is not instructed to reserve enough space to store the contents of the file in memory. Start the JVM using the -Xmx flag. For example, the -Xmx1G allows you to use 1 gigabyte of memory.

Map equivalent data structure (in java) for large datasets

More articles: