Python massive array of numbers using too much memory

Customization

  • Python 2.6
  • Ubuntu x64

I have a set of unique integers with values ​​from 1 to 50 million. New integers are added randomly, for example. numberset.add(random.randint(1, 50000000)) . I need to be able to quickly add new integers and quickly check if an integer is present.

Problem

After a while, the set gets too large for my low memory system and I experience MemoryError s.

Question

How can I achieve this using less memory? What is the fastest way to do this using a disk without reconfiguring the system, for example. swapfiles? Should I use a database file such as sqlite? Is there a library that will compress integers in memory?

+4
source share
6 answers

You can avoid dependencies on third-party bit array modules by writing your own - the required functionality is minimal:

 import array BITS_PER_ITEM = array.array('I').itemsize * 8 def make_bit_array(num_bits, initially=0): num_items = (num_bits + BITS_PER_ITEM - 1) // BITS_PER_ITEM return array.array('I', [initially]) * num_items def set_bit(bit_array, offset): item_index = offset // BITS_PER_ITEM bit_index = offset % BITS_PER_ITEM bit_array[item_index] |= 1 << bit_index def clear_bit(bit_array, offset): item_index = offset // BITS_PER_ITEM bit_index = offset % BITS_PER_ITEM bit_array[item_index] &= ~(1 << bit_index) def get_bit(bit_array, offset): item_index = offset // BITS_PER_ITEM bit_index = offset % BITS_PER_ITEM return (bit_array[item_index] >> bit_index) & 1 
+5
source

Use a bit array . This will reduce the need for huge space requirements.

Implemented SO Question:

+5
source

Use an array of bits as flags for each integer - the required memory will be only 50 million bits (about 6 MB). There are several modules that can help. This example uses bitstring , another bitarray option:

 from bitstring import BitArray i = BitArray(50000000) # initialise 50 million zero bits for x in xrange(100): v = random.randint(1, 50000000) if not i[v]: # Test if it already present i.set(1, v) # Set a single bit 

Setting and checking bits is very fast and it uses very little memory.

+2
source

Try using an array .

+1
source

If integers are unique, use bits. Example: binary 01011111 means that there are: 1, 3, 4, 5, 6 and 7. Thus, each bit is used to check whether its integer index is used (value 1) or not (value 0).

In the chapter (look for "A file contains no more than ten million entries, each entry is a seven-digit integer.")

It seems that the bitarray module mentioned by Emil works this way.

0
source

You may also consider a color filter, depending on your requirements. This is an effective data storage structure for testing if an item is in a set. The catch is that it can give false positives, although it will never give false negatives.

0
source

All Articles