Best way to handle a large list of dictionaries in Python

I am doing a statistical test that uses 10,000 permutations as a zero distribution.

Each permutation is a dictionary of 10,000 key-value pairs. Each key represents a gene; each value represents the set of patients associated with that gene. The dictionaries are programmatically generated and can be written to and read from a file.
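For concreteness, a single permutation might look like the sketch below (the gene names, patient IDs, and the file name permutations.jsonl are made up for illustration; the real data is generated programmatically):

import json

# One permutation: gene -> set of patient IDs (illustrative values only)
permutation = {
    "GENE_A": {"patient_01", "patient_07"},
    "GENE_B": {"patient_03"},
    # ... roughly 10,000 genes in total
}

# Sets are not JSON-serializable, so convert them to sorted lists
# before appending one permutation per line to a file.
with open("permutations.jsonl", "a") as f:
    serializable = {gene: sorted(patients) for gene, patients in permutation.items()}
    f.write(json.dumps(serializable) + "\n")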

I want to be able to iterate over these permutations to perform my statistical test; however, keeping this large list in memory slows my program down.

Is there a way to keep these dictionaries on disk and load each permutation only when I iterate over it?

Thanks!

1 answer

This is a common computational problem: you need the speed of data held in memory, but you do not have enough memory to hold it all. You have at least the following options:

  • Buy additional RAM (obviously)
  • Let the data swap. This leaves it to the operating system to decide which data is kept on disk and which in memory.
  • Do not load everything into memory at once

Since you are iterating over your dataset, one solution might be to load the data lazily:

def get_data(filename):
    # Read the file one line at a time, yielding each serialized
    # permutation instead of loading the whole file into memory.
    with open(filename) as f:
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                break

for item in get_data('my_genes.dat'):
    gather_statistics(deserialize(item))
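The deserialize step depends on how the permutations were written. Assuming each line is a JSON object mapping genes to lists of patient IDs (an assumption, matching the write-out sketch in the question), it could be as simple as:

import json

def deserialize(line):
    # Parse one JSON-encoded permutation and restore each patient list to a set.
    raw = json.loads(line)
    return {gene: set(patients) for gene, patients in raw.items()}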

Another option is to split your data across multiple files, or to store it in a database, so that you can process it n items at a time.
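A minimal sketch of the database approach, assuming the serialized permutations are stored one per row in an SQLite table named permutations with a single data column (the database path, table, and column names are hypothetical):

import json
import sqlite3

def iter_permutations(db_path, batch_size=100):
    # Fetch permutations from disk in batches of `batch_size`,
    # so only a small slice of the data is in memory at any time.
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute("SELECT data FROM permutations")
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            for (data,) in rows:
                yield json.loads(data)
    finally:
        conn.close()

# Usage: run the test on each stored permutation without loading all 10,000 at once.
# for perm in iter_permutations("permutations.db", batch_size=100):
#     gather_statistics(perm)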
