Storing a huge hash table in a Python file

Hey. I have a function that I want to memoize, but it has too many possible values. Is there any convenient way to save the values in a text file and read them back from it? For example, something like storing a pre-computed list of primes up to 10^9 in a text file? I know reading from a text file is slow, but there is no other option if the amount of data is really huge. Thanks!

+4
source share
7 answers

For a list of primes up to 10**9 , why do you need a hash? What would the KEYS be ?! Sounds like a great opportunity for a plain, simple binary file! By the Prime Number Theorem there are about 10**9/ln(10**9) such primes - that is, 50 million or a little less. At 4 bytes per entry, that is only about 200 MB or less - perfect for array.array("L") and its methods, such as fromfile , etc. (see the docs ). In many cases you could slurp all 200 MB into memory, but, worst case, you can read just a slice of it (for example, via mmap and the fromstring method), do binary searches there (for example, via bisect ), etc. etc.
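To make that concrete, here is a minimal sketch of that approach, assuming the primes were previously written with array.tofile to a file called primes.bin on the same platform (the filename and helper names are illustrative, not from the answer):

 import bisect
 from array import array

 def load_primes(path="primes.bin"):
     """Read the fixed-width unsigned integers back with array.fromfile."""
     primes = array("L")
     with open(path, "rb") as f:
         f.seek(0, 2)                          # jump to the end to find the file size
         count = f.tell() // primes.itemsize   # assumes the file was written with the same typecode
         f.seek(0)
         primes.fromfile(f, count)             # slurp that many entries into memory
     return primes

 def is_prime(primes, n):
     """Binary search in the sorted array via bisect."""
     i = bisect.bisect_left(primes, n)
     return i < len(primes) and primes[i] == n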

When you do need to store huge key-value data - gigabytes, not a paltry 200 MB! - I used to recommend shelve , but after unpleasant real-life experience with huge shelves (performance, reliability, etc.), I currently recommend using a database engine instead - sqlite is good and comes with Python, PostgreSQL is even better, non-relational ones like CouchDB may be better still, and so on.
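For the gigabyte-scale key-value case, a minimal sqlite3 sketch (the database file and table names here are just placeholders):

 import sqlite3

 conn = sqlite3.connect("cache.db")
 conn.execute("CREATE TABLE IF NOT EXISTS memo (key TEXT PRIMARY KEY, value TEXT)")

 def put(key, value):
     # INSERT OR REPLACE keeps the table behaving like a persistent dict
     conn.execute("INSERT OR REPLACE INTO memo (key, value) VALUES (?, ?)", (key, value))
     conn.commit()

 def get(key):
     row = conn.execute("SELECT value FROM memo WHERE key = ?", (key,)).fetchone()
     return row[0] if row is not None else None

 put("12345", "some precomputed result")
 print(get("12345"))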

+11
source

You can use the shelve module to store a dictionary-like structure in a file. From the Python documentation:

 import shelve

 d = shelve.open(filename)  # open -- file may get suffix added by low-level library

 d[key] = data              # store data at key (overwrites old data if using an existing key)
 data = d[key]              # retrieve a COPY of data at key (raise KeyError if no such key)
 del d[key]                 # delete data stored at key (raises KeyError if no such key)
 flag = d.has_key(key)      # true if the key exists
 klist = d.keys()           # a list of all existing keys (slow!)

 # as d was opened WITHOUT writeback=True, beware:
 d['xx'] = range(4)         # this works as expected, but...
 d['xx'].append(5)          # *this doesn't!* -- d['xx'] is STILL range(4)!

 # having opened d without writeback=True, you need to code carefully:
 temp = d['xx']             # extracts the copy
 temp.append(5)             # mutates the copy
 d['xx'] = temp             # stores the copy right back, to persist it

 # or, d=shelve.open(filename, writeback=True) would let you just code
 # d['xx'].append(5) and have it work as expected, BUT it would also
 # consume more memory and make the d.close() operation slower.

 d.close()                  # close it
+6
source

You can also just go with maximum brute force and create a Python file with just one statement:

 seedprimes = [3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73, 79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173, ... 

and then just import it. (Here is a file with the primes up to 1e5: http://python.pastebin.com/f177ec30 ).

 from primes_up_to_1e9 import seedprimes 
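If you go this route, the module itself can be generated from whatever produced the primes; a rough sketch, with the filename and the stand-in list being purely illustrative:

 primes = [2, 3, 5, 7, 11, 13]        # stand-in for the real precomputed list

 # Write the list out as a single Python assignment statement.
 with open("primes_up_to_1e9.py", "w") as f:
     f.write("seedprimes = [\n")
     for p in primes:
         f.write("    %d,\n" % p)
     f.write("]\n")

 # Afterwards: from primes_up_to_1e9 import seedprimes

Bear in mind that importing a module containing tens of millions of literals is slow and memory-hungry, which is why the other answers lean towards binary or database formats.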
+3
source

For Project Euler, I saved a pre-computed list of primes up to 10 ** 8 in a text file, simply written in comma-separated format. It worked well for that size, but it does not scale well to anything much larger.
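A sketch of that comma-separated approach (the filename is illustrative):

 # Write the precomputed primes once, comma separated, to a text file.
 def save_primes(primes, path="primes.txt"):
     with open(path, "w") as f:
         f.write(",".join(map(str, primes)))

 # Read them back; the whole file is parsed in one go.
 def load_primes(path="primes.txt"):
     with open(path) as f:
         return [int(s) for s in f.read().split(",")]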

If your "huge" is actually not that huge, I would use something simple like that; otherwise I would go with shelve, as the others have said.

+1
source

Naively sticking a hash table onto disk will cost you about 5 orders of magnitude in performance compared to an in-memory implementation (or at least 3 if you have an SSD). When dealing with hard drives, you will want to extract every bit of data locality and caching you can get.

The right choice will depend on the details of your use case. How much performance do you need? What operations do you need on the data structure? Do you only need to check whether the table contains a key, or do you need to fetch a value based on the key? Can you precompute the table, or do you need to modify it on the fly? What hit rate do you expect? Can you filter out a significant number of operations with a Bloom filter? Are the requests uniformly distributed, or do you expect some kind of temporal locality? Can you predict locality clusters ahead of time?
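On the Bloom-filter point, here is a toy sketch of how one could sit in front of the on-disk table; the sizes and hashing scheme are arbitrary and not tuned:

 import hashlib

 class BloomFilter:
     """Toy Bloom filter used to skip pointless disk lookups."""
     def __init__(self, size_bits=10000000, num_hashes=7):
         self.size = size_bits
         self.num_hashes = num_hashes
         self.bits = bytearray(size_bits // 8 + 1)

     def _positions(self, key):
         # Derive one bit position per salted MD5 hash of the key.
         for i in range(self.num_hashes):
             digest = hashlib.md5(("%d:%s" % (i, key)).encode()).hexdigest()
             yield int(digest, 16) % self.size

     def add(self, key):
         for pos in self._positions(key):
             self.bits[pos // 8] |= 1 << (pos % 8)

     def might_contain(self, key):
         # False means definitely absent, so the expensive disk lookup can be skipped;
         # True means "maybe present", with some false-positive rate.
         return all(self.bits[pos // 8] & (1 << (pos % 8))
                    for pos in self._positions(key))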

If you don't need maximum performance, or you can parallelize and throw hardware at the problem, check out some distributed key-value stores.

+1
source

You could also go one step down the ladder and use pickle directly. Shelve is built on top of pickle, so if you do not need shelve's extra functionality, this may save you some clock cycles (although those really don't matter much here, since you chose Python for storing large amounts of data).
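A minimal pickle sketch along those lines (the filename and the stand-in dict are illustrative):

 import pickle

 table = {n: n * n for n in range(1000)}   # stand-in for the real precomputed mapping

 # Persist the whole dict in one go...
 with open("table.pkl", "wb") as f:
     pickle.dump(table, f, pickle.HIGHEST_PROTOCOL)

 # ...and load it all back later. Unlike shelve, there is no per-key access:
 # the entire structure comes back into memory at once.
 with open("table.pkl", "rb") as f:
     table = pickle.load(f)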

0
source

Let's see where the bottleneck is. When you are about to read from a file, the hard drive has to seek and spin to the right position; then it reads a large block and caches the result.

So, you need some way to determine exactly which position in the file you are going to read, and then do it exactly once. I'm pretty sure the standard DB modules will work for you, but you can do it yourself: just open the file in binary mode for reading/writing and store your values as, say, 30-digit numbers (= 100 bits = 13 bytes).

Then use the standard file seek/read methods.
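Here is a sketch of that fixed-width record idea, assuming each value is stored as a 13-byte big-endian integer (the record size and filename are illustrative):

 RECORD_SIZE = 13  # 13 bytes ~ 100 bits, enough for 30-digit numbers

 def write_value(f, index, value):
     # Each record lives at a fixed offset, so no scanning is needed.
     f.seek(index * RECORD_SIZE)
     f.write(value.to_bytes(RECORD_SIZE, "big"))

 def read_value(f, index):
     f.seek(index * RECORD_SIZE)
     return int.from_bytes(f.read(RECORD_SIZE), "big")

 with open("values.bin", "w+b") as f:     # binary read/write, as suggested above
     write_value(f, 0, 10**29 + 7)
     print(read_value(f, 0))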

0
source
