Opening a 25 GB text file for processing

I have a 25 gigabyte file that I need to process. Here is what I am doing now, but it takes a very long time to open:

import os

collection_pricing = os.path.join(pricing_directory, 'collection_price')
with open(collection_pricing, 'r') as f:
    collection_contents = f.readlines()

length_of_file = len(collection_contents)

for num, line in enumerate(collection_contents):
    print '%s / %s' % (num+1, length_of_file)
    cursor.execute(...)

How could I improve this?

+4
3 answers
  • Unless the lines in your file are really, really big, do not print progress on each line. Printing to the terminal is very slow. Print progress only every 100 or every 1,000 lines, for example (see the sketch after this list).

  • Use the operating system to get the file size - os.path.getsize() - see "Get file size in Python?"

  • Do not use readlines(): on a 25 GB file it tries to hold every line in memory at once. Iterate over the file object line by line instead, since Python file objects yield one line at a time.
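
A minimal sketch of the first two points (collection_pricing is the path from the question; the per-line work is elided):

import os

total_bytes = os.path.getsize(collection_pricing)  # ask the OS for the size; no need to read the file

with open(collection_pricing, 'r') as f:
    for line_count, line in enumerate(f, 1):
        # ... per-line work goes here ...
        if line_count % 1000 == 0:  # terminal output is slow, so print rarely
            print 'processed %d lines (file is %d bytes)' % (line_count, total_bytes)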

+6

As noted above, the problem is readlines(): it reads the entire file into memory at once. (Calling readlines() on a 25 GB file is, frankly, silly.)

(Instead, iterate over the file object itself - a Python file is iterable, and iterating over it yields one line at a time without loading the rest of the file.)
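
A sketch of the difference (process_line is a hypothetical stand-in for whatever is done with each line):

with open(collection_pricing, 'r') as f:
    # collection_contents = f.readlines()  # builds a ~25 GB list in memory: do not do this
    for line in f:                         # streams one line at a time instead
        process_line(line)                 # hypothetical per-line work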

+3

Putting the suggestions together, I would do it like this:

import os

size_of_file = os.path.getsize(collection_pricing)  # total size in bytes, from the OS
progress = 0      # bytes processed so far
line_count = 0

with open(collection_pricing, 'r') as f:
    for line in f:
        line_count += 1
        progress += len(line)
        if line_count % 10000 == 0:
            print '%s / %s' % (progress, size_of_file)

This version:

  • Does not use readlines(), so the whole file is never held in memory
  • Prints progress only every 10,000 lines
  • Measures progress by file size rather than line count, so there is no need to iterate over the file twice
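
If a percentage reads better than raw byte counts, the same numbers can be formatted that way. A sketch, with collection_pricing being the path from the question and the question's cursor.execute() work elided inside the loop:

import os

size_of_file = os.path.getsize(collection_pricing)
progress = 0

with open(collection_pricing, 'r') as f:
    for line_count, line in enumerate(f, 1):
        progress += len(line)
        # the question's cursor.execute(...) call would go here
        if line_count % 10000 == 0:
            # multiply by 100.0 so Python 2 does float, not integer, division
            print '%.1f%% (%s / %s bytes)' % (100.0 * progress / size_of_file, progress, size_of_file)

Measuring bytes rather than lines also keeps the denominator cheap: os.path.getsize() is a single stat call, with no pre-pass over the file.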
+1