Python line-by-line file reading performance

I am writing Python code to work with massive Twitter files. These files are so large that they cannot fit into memory. To work with them, I basically have two options:

  • I could split the files into smaller files that fit into memory.

  • I could process a large file line by line, so I never need to load the entire file into memory. I would prefer the latter, to simplify the implementation.

However, I wonder whether it is faster to read the entire file into memory and then work with it from there. It seems like it could be slow to keep reading the file from disk line by line. Then again, I don't fully understand how these operations work in Python. Does anyone know whether reading line by line will make my code slower than reading the entire file into memory and manipulating it there?
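To make the comparison concrete, the two approaches I am weighing look roughly like this (the file name and the handle function are placeholders):

def handle(chunk):
    pass  # placeholder for whatever per-tweet processing is needed

# Approach 1: read the entire file into memory at once.
# Simple, but needs enough RAM to hold the whole file.
with open('tweets.txt', 'r') as f:
    handle(f.read())

# Approach 2: stream the file line by line; only one line is held in memory at a time.
with open('tweets.txt', 'r') as f:
    for line in f:
        handle(line)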

+7
2 answers

For really fast reading of files, check out the mmap module. It makes the entire file appear as one large chunk of virtual memory, even if it is much larger than your RAM. If your file is larger than 3 or 4 gigabytes, you will want to use a 64-bit OS (and a 64-bit Python build).

I did this for files larger than 30 GB with good results.
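A minimal sketch of how this could look (the file name is a placeholder; mmap needs the file opened in binary mode, and its readline() returns bytes):

import mmap

with open('huge_tweets.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The whole file now appears as one large bytes-like object backed by
        # virtual memory; pages are loaded from disk only as they are touched.
        count = 0
        line = mm.readline()      # mmap objects support readline()
        while line:               # readline() returns b'' at end of file
            count += 1
            line = mm.readline()
        print(count)              # e.g. count lines without loading the file into RAM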

+9

If you want to process the file line by line, you can simply use the file object as an iterator:

for line in open('file', 'r'):
    print(line)

This is fairly memory-efficient; if you want to work with a batch of lines at a time, you can also use the readlines() method of the file object with the sizehint parameter. This reads roughly sizehint bytes, plus enough extra bytes to complete the last line.
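For example, a batch-at-a-time loop might look roughly like this (the 100000-byte hint is an arbitrary choice):

with open('file', 'r') as f:
    while True:
        lines = f.readlines(100000)   # roughly 100000 bytes' worth of complete lines
        if not lines:
            break
        for line in lines:
            pass                      # placeholder: process each line in the batch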

+1
