Compact data structure for storing a large set of integral values

I am working on an application that needs to pass around large sets of Int32 values. The sets are expected to contain ~1,000,000-50,000,000 items, where each item is a database key in the range 0-50,000,000. I expect the distribution of identifiers in any given set to be effectively random over this range. The operations I need on the set are simple:

  • Add a new value.
  • Iterate over all values.

Memory usage is a serious problem for these sets, so I'm looking for a data structure that can store the identifiers more efficiently than a plain List<int> or HashSet<int>. I looked at BitArray, but it can be wasteful depending on how sparse the identifiers are. I also looked at a bitwise trie, but I'm not sure how to estimate the space efficiency of this approach for the expected data. A Bloom filter would be great if I could tolerate the false positives.

I would appreciate any suggestions for data structures suitable for this purpose. I am interested in both ready-made and non-standard solutions.

EDIT: To answer your questions:

  • No, items do not need to be sorted
  • By "transition" I mean how to get between methods and serialize and send by wire. I obviously should have mentioned this.
  • A decent number of these sets (~100) may be in memory at the same time.
c# algorithm data-structures
3 answers

Use BitArray. It uses only about 6 MB of memory; the only real problem is that iteration is Theta(N), i.e. you have to walk the whole key range. Locality of reference is good, and you can allocate the entire structure in one operation.

Regarding wasted space: you lose at most ~6 MB in the worst case.
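A minimal sketch of that in C# (the wrapper type and names are mine; 50,000,001 bits covers keys 0-50,000,000 and weighs in at roughly 6 MB):

    using System.Collections;
    using System.Collections.Generic;

    class BitArraySet
    {
        // One bit per possible key; ~6.25 MB for the 0-50,000,000 range.
        private readonly BitArray bits = new BitArray(50000001);

        public void Add(int value)
        {
            bits[value] = true;
        }

        // Theta(N) over the whole key range, but values come out in sorted order.
        public IEnumerable<int> Values()
        {
            for (int i = 0; i < bits.Length; i++)
                if (bits[i])
                    yield return i;
        }
    }

The sorted iteration order is what makes the range trick below possible.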

EDIT: OK, so you have a lot of sets and you serialize them. For serialization to disk, I suggest plain 6 MB files :)

To send them over the wire, just iterate, and consider sending ranges instead of individual elements. That requires a sorted structure.
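Something like this, assuming the values arrive in ascending order (as they do when you iterate the bit array); the (start, length) pair format is just one possible wire layout:

    using System.Collections.Generic;

    static class RangeEncoder
    {
        // Collapses a sorted sequence of keys into (start, length) runs.
        public static IEnumerable<KeyValuePair<int, int>> ToRanges(IEnumerable<int> sortedValues)
        {
            int start = 0, length = 0;
            foreach (int v in sortedValues)
            {
                if (length > 0 && v == start + length)
                {
                    length++;                                     // extends the current run
                }
                else
                {
                    if (length > 0)
                        yield return new KeyValuePair<int, int>(start, length);
                    start = v;
                    length = 1;
                }
            }
            if (length > 0)
                yield return new KeyValuePair<int, int>(start, length);
        }
    }

Dense sets compress very well this way; for sparse sets it degenerates to one pair per element.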

You need a lot of these sets. Consider whether you have 600 MB to spare. If not, look into:

  • A coarser-grained trie: O(1) insert, O(n) iteration, much lower constant factors than a bitwise trie (see the sketch after this list)
  • A custom hash table, possibly Google sparsehash via C++/CLI
  • A BST storing ranges/intervals
  • A heavy-duty BST
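For the first bullet, here is the kind of thing I have in mind: a two-level, wide-fan-out trie whose leaf pages are small bit blocks allocated lazily. The page size and names below are just illustrative:

    using System.Collections.Generic;

    class PagedBitSet
    {
        private const int PageBits = 1 << 16;                 // 65,536 keys per leaf page
        private readonly ulong[][] pages = new ulong[800][];  // 800 * 65,536 > 50,000,000

        public void Add(int value)
        {
            int page = value / PageBits, offset = value % PageBits;
            if (pages[page] == null)
                pages[page] = new ulong[PageBits / 64];        // 8 KB leaf, allocated on demand
            pages[page][offset / 64] |= 1UL << (offset % 64);
        }

        public IEnumerable<int> Values()
        {
            for (int p = 0; p < pages.Length; p++)
            {
                var page = pages[p];
                if (page == null) continue;                    // skip untouched regions entirely
                for (int w = 0; w < page.Length; w++)
                    for (int b = 0; b < 64; b++)
                        if ((page[w] & (1UL << b)) != 0)
                            yield return p * PageBits + w * 64 + b;
            }
        }
    }

For an almost-empty set this costs only a few KB plus the top-level array; for a full one it approaches the flat ~6 MB bitmap.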

I think the answer depends on what you mean by "pass around" and what you are trying to accomplish. You say that you only add to the set: how often do you add? How fast will it grow? What overhead is acceptable for memory usage, and for memory reallocation time?

In the worst case, 50,000,000 32-bit numbers = 200 megabytes using the most compact plain storage. Assuming you may eventually hit that worst case, is it acceptable to use that much memory all the time? Is that better than reallocating memory often? What do typical usage patterns look like? You could always just use an int[] preallocated for the full 50 million.

As for the speed of your operations, nothing is faster than iterating over and appending to a preallocated chunk of memory.

From the OP's edit: a decent number of these sets (~100) may be in memory at the same time.

Hey now. Do you need to keep 100 sets of 1 to 50 million numbers each in memory at once? I think the bitmap approach is the only plausible way this can work.

That is 600 megabytes. Not trivial, but unless the sets are (typically) mostly empty, it seems unlikely that you will find a more efficient storage mechanism.

Now, if instead of bitmaps you use dynamically sized structures that might somehow take up less space, you are looking at a really ugly memory allocation/deallocation/garbage-collection scenario.

Suppose you really need to do this, although I can only imagine why. So: your server has a ton of memory, just allocate as many of these 6-megabyte bit arrays as you need and recycle them. Allocation and garbage collection are no longer a problem. Yes, you use a ton of memory, but that seems inevitable.
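A minimal sketch of that recycling idea (the pool type and names are mine; the capacity matches the 0-50,000,000 key range from the question):

    using System.Collections;
    using System.Collections.Concurrent;

    static class BitArrayPool
    {
        private static readonly ConcurrentBag<BitArray> Pool = new ConcurrentBag<BitArray>();

        // Hands out a cleared ~6 MB BitArray, reusing a retired one when possible.
        public static BitArray Rent()
        {
            BitArray bits;
            if (Pool.TryTake(out bits))
            {
                bits.SetAll(false);              // clear before reuse
                return bits;
            }
            return new BitArray(50000001);
        }

        public static void Return(BitArray bits)
        {
            Pool.Add(bits);
        }
    }

With ~100 live sets you allocate the big buffers once and the garbage collector never has to touch them again.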


This will depend on the size distribution of your sets. Unless you expect most of the sets to be close to the minimum size you specified, I would probably use a bitmap. To cover a range of up to 50,000,000, the bitmap ends up at ~6 megabytes.

Compared to storing the numbers directly, this is slightly larger at the minimum set size (~6 megabytes instead of ~4), but dramatically smaller at the maximum size (1/32nd of the size).

The second option is to use delta coding. For example, instead of storing each number directly, store the difference between it and the previous number included. With a maximum value of 50,000,000 and a minimum size of 1,000,000 items, the average difference between one number and the next is ~50, which means you can theoretically store each difference in an average of about 6 bits. I would probably use the 7 least significant bits directly, and when a larger gap has to be encoded, set the msb and (for example) store the size of the gap in the lower 7 bits plus the next three bytes. Small gaps are by far the most common case, so most numbers take only one byte, roughly 4:1 compression compared to storing the numbers directly. In the best case this uses ~1 megabyte per set, and in the worst about 50 megabytes, again 4:1 compression compared to storing the numbers directly.
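A sketch of that encoder; the exact byte layout of the long-gap case is only one of the layouts the description leaves open, and a decoder would simply mirror it:

    using System.Collections.Generic;
    using System.IO;

    static class DeltaEncoder
    {
        // Writes a sorted sequence of keys as gaps: one byte per gap below 128,
        // otherwise four bytes with the msb of the first byte set.
        public static void Encode(IEnumerable<int> sortedValues, Stream output)
        {
            int previous = 0;
            foreach (int v in sortedValues)
            {
                int gap = v - previous;
                previous = v;
                if (gap < 128)
                {
                    output.WriteByte((byte)gap);
                }
                else
                {
                    output.WriteByte((byte)(0x80 | (gap & 0x7F)));   // msb flags a 4-byte gap
                    output.WriteByte((byte)((gap >> 7) & 0xFF));     // remaining 24 bits follow
                    output.WriteByte((byte)((gap >> 15) & 0xFF));
                    output.WriteByte((byte)((gap >> 23) & 0xFF));
                }
            }
        }
    }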

If you don't mind a little extra code, you can use an adaptive scheme: delta coding for small sets (up to ~6,000,000 numbers) and a bitmap for larger sets.

