I have a list of 120 million records of roughly 40-50 bytes each, which is about 5.5-6 gigabytes of raw data, not counting the extra overhead needed to keep the array in memory.
I need to make sure this list is unique. The way I tried to do this was to create a HashSet<string> and add all the entries to it one by one.
At around 33 million records I run out of memory, and building the set slows to a crawl.
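Roughly, what I am doing looks like the sketch below (simplified; `ReadRecords()` just stands in for wherever the records actually come from):

```csharp
using System;
using System.Collections.Generic;

class UniquenessCheck
{
    // Stand-in for the real record source (file, database reader, etc.).
    static IEnumerable<string> ReadRecords()
    {
        yield break;
    }

    static void Main()
    {
        var seen = new HashSet<string>();   // ~120M entries: this is where memory runs out
        long duplicates = 0;

        foreach (var record in ReadRecords())
        {
            // Add() returns false when the record is already in the set.
            if (!seen.Add(record))
                duplicates++;
        }

        Console.WriteLine("Duplicates found: " + duplicates);
    }
}
```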
Is there a better way to work through this massive amount of data in a timely manner? The only solution I can think of is to rent an Amazon EC2 High-Memory Quadruple Extra Large Instance for an hour.
Thanks.
If you are just trying to verify uniqueness, I would simply split the input sequence into buckets and then check each bucket separately.
For example, assuming you are loading the data from a file, you could stream the input in and write it out to 26 different files, one for each letter the record starts with (I'm assuming each record starts with A-Z; adjust for your real situation). Then you can check each of those smaller files for uniqueness with something like your existing code, since none of them will be too large to fit in memory at one time. The initial bucketing guarantees that duplicate records can never end up in different buckets.
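A rough sketch of that two-pass idea (untested; it assumes the records sit in records.txt, one per line, and that each record starts with an ASCII letter - adjust the bucket key for your real data):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class BucketedUniquenessCheck
{
    static void Main()
    {
        // Pass 1: stream the input and append each record to one of up to 26 bucket
        // files, keyed on its first letter. Identical records always land in the
        // same bucket, so duplicates can never hide across buckets.
        var writers = new Dictionary<char, StreamWriter>();
        foreach (var record in File.ReadLines("records.txt"))
        {
            char key = char.ToUpperInvariant(record[0]);   // assumes non-empty records starting A-Z
            StreamWriter writer;
            if (!writers.TryGetValue(key, out writer))
            {
                writer = new StreamWriter("bucket_" + key + ".txt");
                writers[key] = writer;
            }
            writer.WriteLine(record);
        }
        foreach (var w in writers.Values)
            w.Dispose();

        // Pass 2: each bucket is now small enough to check in memory with a HashSet.
        long duplicates = 0;
        foreach (var key in writers.Keys)
        {
            var seen = new HashSet<string>();
            foreach (var record in File.ReadLines("bucket_" + key + ".txt"))
                if (!seen.Add(record))
                    duplicates++;
        }

        Console.WriteLine("Duplicates found: " + duplicates);
    }
}
```

If 26 buckets are still too large, you can key on the first two characters (or on a few bits of a hash of the record) to get more, smaller buckets.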
Thanks, that kind of bucketing should work. If the first character turns out to be badly distributed, I could bucket on the first 5 bits of a hash instead, giving 32 buckets. :)
Right, and within each bucket your existing HashSet approach works as-is, since every bucket fits comfortably in memory.
You could always load the data into a SQLite database with a unique index; that can also help with further processing of the data set.
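A minimal sketch of that idea, assuming the Microsoft.Data.Sqlite package and records in a records.txt file (both are just placeholders): let a PRIMARY KEY enforce uniqueness, insert with INSERT OR IGNORE, then compare the table's row count with the number of records read.

```csharp
using System;
using System.IO;
using Microsoft.Data.Sqlite;

class SqliteUniquenessCheck
{
    static void Main()
    {
        using (var connection = new SqliteConnection("Data Source=records.db"))
        {
            connection.Open();

            var create = connection.CreateCommand();
            create.CommandText =
                "CREATE TABLE IF NOT EXISTS records (value TEXT PRIMARY KEY) WITHOUT ROWID;";
            create.ExecuteNonQuery();

            long total = 0;
            using (var tx = connection.BeginTransaction())
            {
                var insert = connection.CreateCommand();
                insert.Transaction = tx;
                insert.CommandText = "INSERT OR IGNORE INTO records (value) VALUES ($value);";
                var param = insert.Parameters.Add("$value", SqliteType.Text);

                foreach (var record in File.ReadLines("records.txt"))
                {
                    param.Value = record;
                    insert.ExecuteNonQuery();   // duplicates are silently ignored
                    total++;
                }
                // For 120M rows you would probably commit in batches instead of one transaction.
                tx.Commit();
            }

            var count = connection.CreateCommand();
            count.CommandText = "SELECT COUNT(*) FROM records;";
            long distinct = (long)count.ExecuteScalar();

            Console.WriteLine("Records read: " + total + ", distinct: " + distinct);
        }
    }
}
```

If the two numbers differ, the list contains duplicates, and the table itself gives you the deduplicated data for any further processing.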