I have a list of 120 million records of roughly 40-50 bytes each, which is about 5.5-6 gigabytes of raw data, not counting the extra overhead needed to keep the array in memory.
I need to make sure this list is unique. The way I tried to do this was to create a HashSet<string> and add all the entries to it one by one.
At around 33 million records I run out of memory, and building the set slows to a crawl.
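Roughly, what I am doing looks like the sketch below (simplified; `ReadRecords()` just stands in for wherever the records actually come from):

```csharp
using System;
using System.Collections.Generic;

class UniquenessCheck
{
    // Stand-in for the real record source (file, database reader, etc.).
    static IEnumerable<string> ReadRecords()
    {
        yield break;
    }

    static void Main()
    {
        var seen = new HashSet<string>();   // ~120M entries: this is where memory runs out
        long duplicates = 0;

        foreach (var record in ReadRecords())
        {
            // Add() returns false when the record is already in the set.
            if (!seen.Add(record))
                duplicates++;
        }

        Console.WriteLine("Duplicates found: " + duplicates);
    }
}
```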
Is there a better way to work through this massive amount of data in a timely manner? The only solution I can think of is to rent an Amazon EC2 High-Memory Quadruple Extra Large Instance for an hour.
Thanks.
If you are just trying to verify uniqueness, I would simply split the input sequence into buckets and then check each bucket separately.
For example, assuming you are loading the data from a file, you could stream the input in and write it out to 26 different files, one for each letter the record starts with (I'm assuming each record starts with A-Z; adjust for your real situation). Then you can check each of those smaller files for uniqueness with something like your existing code, since none of them will be too large to fit in memory at one time. The initial bucketing guarantees that duplicate records can never end up in different buckets.
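A rough sketch of that two-pass idea (untested; it assumes the records sit in records.txt, one per line, and that each record starts with an ASCII letter - adjust the bucket key for your real data):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class BucketedUniquenessCheck
{
    static void Main()
    {
        // Pass 1: stream the input and append each record to one of up to 26 bucket
        // files, keyed on its first letter. Identical records always land in the
        // same bucket, so duplicates can never hide across buckets.
        var writers = new Dictionary<char, StreamWriter>();
        foreach (var record in File.ReadLines("records.txt"))
        {
            char key = char.ToUpperInvariant(record[0]);   // assumes non-empty records starting A-Z
            StreamWriter writer;
            if (!writers.TryGetValue(key, out writer))
            {
                writer = new StreamWriter("bucket_" + key + ".txt");
                writers[key] = writer;
            }
            writer.WriteLine(record);
        }
        foreach (var w in writers.Values)
            w.Dispose();

        // Pass 2: each bucket is now small enough to check in memory with a HashSet.
        long duplicates = 0;
        foreach (var key in writers.Keys)
        {
            var seen = new HashSet<string>();
            foreach (var record in File.ReadLines("bucket_" + key + ".txt"))
                if (!seen.Add(record))
                    duplicates++;
        }

        Console.WriteLine("Duplicates found: " + duplicates);
    }
}
```

If 26 buckets are still too large, you can key on the first two characters (or on a few bits of a hash of the record) to get more, smaller buckets.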
Thanks, that kind of bucketing should work. If the first character turns out to be badly distributed, I could bucket on the first 5 bits of a hash instead, giving 32 buckets. :)
Right, and within each bucket your existing HashSet approach works as-is, since every bucket fits comfortably in memory.
You could always load the data into a SQLite database with a unique index; that can also help with further processing of the data set.
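A minimal sketch of that idea, assuming the Microsoft.Data.Sqlite package and records in a records.txt file (both are just placeholders): let a PRIMARY KEY enforce uniqueness, insert with INSERT OR IGNORE, then compare the table's row count with the number of records read.

```csharp
using System;
using System.IO;
using Microsoft.Data.Sqlite;

class SqliteUniquenessCheck
{
    static void Main()
    {
        using (var connection = new SqliteConnection("Data Source=records.db"))
        {
            connection.Open();

            var create = connection.CreateCommand();
            create.CommandText =
                "CREATE TABLE IF NOT EXISTS records (value TEXT PRIMARY KEY) WITHOUT ROWID;";
            create.ExecuteNonQuery();

            long total = 0;
            using (var tx = connection.BeginTransaction())
            {
                var insert = connection.CreateCommand();
                insert.Transaction = tx;
                insert.CommandText = "INSERT OR IGNORE INTO records (value) VALUES ($value);";
                var param = insert.Parameters.Add("$value", SqliteType.Text);

                foreach (var record in File.ReadLines("records.txt"))
                {
                    param.Value = record;
                    insert.ExecuteNonQuery();   // duplicates are silently ignored
                    total++;
                }
                // For 120M rows you would probably commit in batches instead of one transaction.
                tx.Commit();
            }

            var count = connection.CreateCommand();
            count.CommandText = "SELECT COUNT(*) FROM records;";
            long distinct = (long)count.ExecuteScalar();

            Console.WriteLine("Records read: " + total + ", distinct: " + distinct);
        }
    }
}
```

If the two numbers differ, the list contains duplicates, and the table itself gives you the deduplicated data for any further processing.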