Reading a large file into a dictionary

I have a 1 GB file containing pairs of strings and longs. What is the best way to read it into a dictionary, and how much memory would you say it requires?

The file has 62 million lines. I managed to read it using 5.5 GB of RAM.

Say 22 bytes per entry for dictionary overhead, that's about 1.5 GB. A long is 8 bytes, that's about 500 MB. The average string length is 15 characters, each char is 2 bytes, about 2 GB. That's only about 4 GB in total, so where did the other 1.5 GB go?
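
Spelled out, the rough arithmetic behind those estimates (using 62 million entries):

62,000,000 x 22 bytes (entry overhead)     = ~1.4 GB
62,000,000 x  8 bytes (long value)         = ~0.5 GB
62,000,000 x 15 chars x 2 bytes (string)   = ~1.9 GB
Total                                      = ~3.8 GB, i.e. roughly 4 GB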

The initial allocation for the dictionary is 256 MB. I noticed that every 10 million lines I read consume about 580 MB, which is quite consistent with the calculation above, but somewhere around the 6000th line memory usage jumps from 260 MB to 1.7 GB, so I'm missing 1.5 GB. Where does it go?

Thanks.

+3
10 answers

It is important to understand what happens when you populate a Hashtable. (The Dictionary uses a Hashtable as its underlying data structure.)

When you create a new Hashtable, .NET makes an array containing 11 buckets, which are linked lists of dictionary entries. When you add an entry, its key gets hashed, the hash code gets mapped onto one of the 11 buckets, and the entry (key + value + hash code) gets appended to the linked list.

At a certain point (and this depends on the load factor used when the Hashtable is first constructed), the Hashtable determines, during an Add operation, that it is encountering too many collisions and that the initial 11 buckets are not enough. So it creates a new array of buckets roughly twice the size of the old one (not exactly twice; the number of buckets is always prime) and then rebuilds the table from the old one into the new one.

Thus, in terms of memory usage, there are two things.

Firstly, every so often the Hashtable needs to use twice as much memory as it is currently using, so that it can copy the table while resizing. So if you have a Hashtable that is using 1.8 GB of memory and it needs to be resized, it will briefly need to use 3.6 GB, and, well, now you have a problem.

Secondly, every entry in the hash table carries about 12 bytes of overhead: pointers to the key, the value, and the next entry in the list, plus the hash code. For most uses that overhead is insignificant, but if you are building a Hashtable with 100 million entries in it, that is about 1.2 GB of overhead.

You can mitigate the first problem by using the overload of the Dictionary constructor that lets you provide an initial capacity. If you specify a capacity big enough to hold all the entries you are going to add, the Hashtable will never need to be rebuilt while you are populating it. There is almost nothing you can do about the second.
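
A minimal sketch of that first fix, using the entry count from the question (62,000,000 is just an illustration; use whatever your actual count is):

// Pre-sizing the dictionary means the bucket array never has to be
// rebuilt (and temporarily doubled) while the file is being loaded.
var dictionary = new Dictionary<string, long>(62000000);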

+12

Everyone here seems to agree that the best way to handle this is to read only a portion of the file into memory at a time. Speed, of course, is determined by which portion is in memory and which parts have to be read from disk when a particular piece of information is needed.

There is an easy way to decide which parts are best kept in memory:

Put the data in a database.

A real one, like MSSQL Express, MySQL, or Oracle XE (all free).

Databases cache the most commonly used information, so reading it is just a read from memory, and they give you a single access method for the data whether it is in memory or on disk.

+9

Maybe you can convert this 1 GB file into a SQLite database with two columns, key and value. Then create an index on the key column. After that you can query the database to get the values for the keys you need.
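
A rough sketch of that idea, assuming the Microsoft.Data.Sqlite package; the file name, table name, and column names are made up for illustration:

using Microsoft.Data.Sqlite;

// One-time conversion target: a table with a key column and a value column.
// PRIMARY KEY on the key column gives you the index for fast lookups.
using (var connection = new SqliteConnection("Data Source=pairs.db"))
{
    connection.Open();

    var create = connection.CreateCommand();
    create.CommandText =
        "CREATE TABLE IF NOT EXISTS pairs (pair_key TEXT PRIMARY KEY, pair_value INTEGER)";
    create.ExecuteNonQuery();

    // ... bulk-insert the 62 million rows here, inside a transaction ...

    // Later, look up a single value by key instead of holding everything in RAM.
    var lookup = connection.CreateCommand();
    lookup.CommandText = "SELECT pair_value FROM pairs WHERE pair_key = $key";
    lookup.Parameters.AddWithValue("$key", "someKey");
    object result = lookup.ExecuteScalar();   // null if the key is not in the table
    if (result != null)
    {
        long value = (long)result;
    }
}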

+5

Thinking about it, I wonder why you need this... (I know, I know... I shouldn't ask why, but hear me out...)

The main problem is that there is a huge amount of data that presumably needs to be accessed quickly... The question is whether access will be essentially random, or whether there is some pattern that can be used to predict which items will be needed.

Either way, I would implement this as a rolling cache. For example, I would load as much as possible into memory to start with (choosing what to load based on the expected access pattern), and then track elements by their last access time. If I hit something that is not in the cache, it gets loaded and replaces the oldest item in the cache.

That way the most frequently used items stay available in memory, at the cost of some extra work on a cache miss.
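
A rough sketch of such a rolling cache, assuming string keys and long values as in the question; LoadFromDisk is a hypothetical placeholder for however you fetch a single value from the file:

using System;
using System.Collections.Generic;

class RollingCache
{
    private readonly int capacity;
    private readonly Dictionary<string, LinkedListNode<KeyValuePair<string, long>>> map =
        new Dictionary<string, LinkedListNode<KeyValuePair<string, long>>>();
    private readonly LinkedList<KeyValuePair<string, long>> order =
        new LinkedList<KeyValuePair<string, long>>();

    public RollingCache(int capacity) { this.capacity = capacity; }

    public long Get(string key)
    {
        LinkedListNode<KeyValuePair<string, long>> node;
        if (map.TryGetValue(key, out node))
        {
            // Cache hit: move to the front, i.e. mark as most recently used.
            order.Remove(node);
            order.AddFirst(node);
            return node.Value.Value;
        }

        // Cache miss: load from disk, evicting the least recently used entry if full.
        long value = LoadFromDisk(key);
        if (map.Count >= capacity)
        {
            var oldest = order.Last;
            order.RemoveLast();
            map.Remove(oldest.Value.Key);
        }
        map[key] = order.AddFirst(new KeyValuePair<string, long>(key, value));
        return value;
    }

    private long LoadFromDisk(string key)
    {
        // Hypothetical: scan or index into the file to find the value for this key.
        throw new NotImplementedException();
    }
}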

In any case, without knowing a bit more about the problem, this is just a "generic solution".

It might be enough to just save it in a local sql db instance :)

+4

You will need to specify the file format, but if it is something like name = value, I would do:

Dictionary<string, long> dictionary = new Dictionary<string, long>();
using (TextReader reader = File.OpenText(filename))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] bits = line.Split('=');
        // Error checking would go here
        long value = long.Parse(bits[1]);
        dictionary[bits[0]] = value;
    }
}

Now, if this does not work, we will need to learn more about the file - how many lines there are, etc.

Are you using 64-bit Windows? (If not, you won't be able to use more than 3 GB per process, IIRC.)

The amount of memory required will depend on the length of the lines, the number of records, etc.

+2

I am not familiar with C#, but if you are having memory problems, you may need to roll your own memory container for this task.

Since you want to store it in a dict, I assume you need it for fast lookup? You did not specify which part should be the key, though.

Let's suppose you want to use the long values as keys. Then try this:

Allocate a buffer the size of the file and read the file into that buffer.

Then create a dictionary with the longs as keys (32-bit values, I guess?) and with values that are also 32 bits.

Now scan the data in the buffer like this: find the next key-value pair, calculate the offset of its value within the buffer, and add that information to the dictionary, with the long as the key and the offset as the value.

That way you end up with a dictionary that probably takes 10-20 bytes per record, plus one larger buffer that holds all your text data.

At least with C++, I think this would be quite memory-efficient.
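
A rough C# sketch of that layout, assuming one "key=value" pair per line in plain ASCII, with the long as the key (the file name and separator are illustrative):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// The whole file lives in one byte[]; the dictionary stores only a 4-byte
// offset per entry instead of a full string object.
byte[] buffer = File.ReadAllBytes("pairs.txt");
var offsets = new Dictionary<long, int>();

int pos = 0;
while (pos < buffer.Length)
{
    int end = Array.IndexOf(buffer, (byte)'\n', pos);            // end of this line
    if (end < 0) end = buffer.Length;

    int sep = Array.IndexOf(buffer, (byte)'=', pos, end - pos);  // key/value separator
    if (sep > pos)
    {
        long key = long.Parse(Encoding.ASCII.GetString(buffer, pos, sep - pos));
        offsets[key] = sep + 1;   // where this key's value starts in the buffer
    }
    pos = end + 1;
}

// To materialize a value on demand:
// int start = offsets[someKey];
// int stop  = Array.IndexOf(buffer, (byte)'\n', start);
// if (stop < 0) stop = buffer.Length;
// string value = Encoding.ASCII.GetString(buffer, start, stop - start);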

+1

Could you convert the 1 GB file into a more efficient indexed format, but leave it as a file on disk? Then you could access it as needed and do efficient lookups.

Perhaps you could memory-map the contents of this file (in its more efficient format) and get minimal RAM usage with demand loading, which could be a good compromise between always going to the file on disk and loading the whole thing into a big array of bytes.
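
If memory-mapping is an option, a minimal sketch follows; the file name, record size, and record number are purely illustrative and assume the file has already been rewritten into a fixed-width binary format:

using System.IO.MemoryMappedFiles;

// The OS pages parts of the mapped file in and out on demand, so only the
// pages you actually touch occupy physical memory.
using (var mmf = MemoryMappedFile.CreateFromFile("pairs.bin"))
using (var accessor = mmf.CreateViewAccessor())
{
    const int recordSize = 38;     // hypothetical fixed record width
    long recordIndex = 12345;      // hypothetical record number
    long value = accessor.ReadInt64(recordIndex * recordSize);
}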

+1

Loading a 1 GB file into memory doesn't strike me as a good idea at all. I would virtualize access to the file by loading it in smaller chunks, only when a specific chunk is needed. Of course, that will be slower than having the whole file in memory, but 1 GB is a real mastodon...

0
source

Do not read the 1 GB file into memory. Even if you have 8 GB of physical memory, you can still run into plenty of problems - based on personal experience.

I don't know what you need to do with it, but find a workaround: read it in parts and process it piece by piece. If that doesn't work, you can use a database.

0

If you decide to use a database, you might be better served by a dbm-style tool, such as Berkeley DB for .NET. They are specifically designed to represent disk-based hash tables.

Alternatively, you can roll your own solution using some database techniques.

Suppose your original data file looks like this (the dots indicate that the string lengths vary):

 [key2][value2...][key1][value1..][key3][value3....] 

Split it into an index file and a values file.

Values file:

 [value1..][value2...][value3....] 

Index file:

 [key1][value1-offset]
 [key2][value2-offset]
 [key3][value3-offset]

Entries in the index file are fixed-size key->value-offset pairs, ordered by key. The strings in the values file are ordered by key as well.

To get the value for key(N), do a binary search for key(N) in the index file, then read the string from the values file starting at value(N)-offset and ending just before value(N+1)-offset.

The index file can be read into an array in memory (less overhead and much more predictable memory consumption than a dictionary), or you can do the search directly on disk.
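
A sketch of the on-disk binary search, assuming the long is the key and each index entry is 16 bytes (an 8-byte key followed by an 8-byte value-offset), sorted by key; the file name and layout are illustrative:

using System;
using System.IO;

static long? FindValueOffset(string indexPath, long wantedKey)
{
    const int entrySize = 16;   // 8-byte key + 8-byte value-offset
    using (var index = new FileStream(indexPath, FileMode.Open, FileAccess.Read))
    using (var reader = new BinaryReader(index))
    {
        long lo = 0, hi = index.Length / entrySize - 1;
        while (lo <= hi)
        {
            long mid = lo + (hi - lo) / 2;
            index.Seek(mid * entrySize, SeekOrigin.Begin);
            long key = reader.ReadInt64();
            if (key == wantedKey) return reader.ReadInt64();   // the value-offset
            if (key < wantedKey) lo = mid + 1;
            else hi = mid - 1;
        }
    }
    return null;   // key not present
}

// With the offset in hand, read the value from the values file starting at that
// offset and ending at the next entry's offset (or at end of file for the last one).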

0
