Opening a 25 GB text file for processing

I have a 25 gigabyte file that I need to process. Here is what I am doing now, but it takes a very long time to open:

import os

collection_pricing = os.path.join(pricing_directory, 'collection_price')
with open(collection_pricing, 'r') as f:
    collection_contents = f.readlines()

length_of_file = len(collection_contents)

for num, line in enumerate(collection_contents):
    print '%s / %s' % (num+1, length_of_file)
    cursor.execute(...)

How could I improve this?

+4
3 answers
  • Unless the lines in your file are really, really big, do not print progress on each line. Printing to the terminal is very slow. Print progress only every 100 or every 1,000 lines, for example (see the sketch after this list).

  • Use the operating system to get the file size - os.path.getsize() - see "Get file size in Python?"

  • Do not use readlines(): on a 25 GB file it tries to hold every line in memory at once. Iterate over the file object line by line instead, since Python file objects yield one line at a time.
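
A minimal sketch of the first two points (collection_pricing is the path from the question; the per-line work is elided):

import os

total_bytes = os.path.getsize(collection_pricing)  # ask the OS for the size; no need to read the file

with open(collection_pricing, 'r') as f:
    for line_count, line in enumerate(f, 1):
        # ... per-line work goes here ...
        if line_count % 1000 == 0:  # terminal output is slow, so print rarely
            print 'processed %d lines (file is %d bytes)' % (line_count, total_bytes)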

+6

As noted above, the problem is readlines(): it reads the entire file into memory at once. (Calling readlines() on a 25 GB file is, frankly, silly.)

(Instead, iterate over the file object itself - a Python file is iterable, and iterating over it yields one line at a time without loading the rest of the file.)
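
A sketch of the difference (process_line is a hypothetical stand-in for whatever is done with each line):

with open(collection_pricing, 'r') as f:
    # collection_contents = f.readlines()  # builds a ~25 GB list in memory: do not do this
    for line in f:                         # streams one line at a time instead
        process_line(line)                 # hypothetical per-line work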

+3

Putting the suggestions together, I would do it like this:

import os

size_of_file = os.path.getsize(collection_pricing)  # total size in bytes, from the OS
progress = 0      # bytes processed so far
line_count = 0

with open(collection_pricing, 'r') as f:
    for line in f:
        line_count += 1
        progress += len(line)
        if line_count % 10000 == 0:
            print '%s / %s' % (progress, size_of_file)

This version:

  • Does not use readlines(), so the whole file is never held in memory
  • Prints progress only every 10,000 lines
  • Measures progress by file size rather than line count, so there is no need to iterate over the file twice
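
If a percentage reads better than raw byte counts, the same numbers can be formatted that way. A sketch, with collection_pricing being the path from the question and the question's cursor.execute() work elided inside the loop:

import os

size_of_file = os.path.getsize(collection_pricing)
progress = 0

with open(collection_pricing, 'r') as f:
    for line_count, line in enumerate(f, 1):
        progress += len(line)
        # the question's cursor.execute(...) call would go here
        if line_count % 10000 == 0:
            # multiply by 100.0 so Python 2 does float, not integer, division
            print '%.1f%% (%s / %s bytes)' % (100.0 * progress / size_of_file, progress, size_of_file)

Measuring bytes rather than lines also keeps the denominator cheap: os.path.getsize() is a single stat call, with no pre-pass over the file.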
+1