Python: performance issues with islice

With the following code, the runtime grows as I increase the start row passed to islice. For example, a start_row of 4 executes in 1 s, but a start_row of 500004 takes 11 seconds. Why is this happening, and is there a faster way to do this? I want to be able to iterate over multiple ranges of lines in a large CSV file (several GB) and do some calculations.

import csv
import itertools
from collections import deque
import time

my_queue = deque()

start_row = 500004
stop_row = start_row + 50000

with open('test.csv', 'rb') as fin:
    # load into csv reader
    csv_f = csv.reader(fin)

    # start logging time for performance
    start = time.time()

    for row in itertools.islice(csv_f, start_row, stop_row):
        my_queue.append(float(row[4]) * float(row[10]))

    # stop logging time
    end = time.time()

    # display performance
    print "Initial queue populating time: %.2f" % (end - start)
Tags: performance, python, csv, itertools
2 answers

For example, start_row of 4 will execute in 1 s, but start_row of 500004 will take 11 seconds.

islice is already being as smart as it can be. Or as lazy, depending on which term you prefer.

Thing is, files are just strings of bytes on your hard drive. They have no internal organization: a \n is just another byte in that long, long string. There is no way to get to a particular line without reading all the data in front of it (unless your lines are all the same length, in which case you can use file.seek).

Line 4? Finding line 4 is fast; your computer just has to find 3 \n characters. Line 500004? Your computer has to read through the file until it has found 500003 \n characters. There is no way around it, and if someone tells you otherwise, they either have some kind of quantum computer or their computer is reading through the file just like every other computer in the world, only behind their back.
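To make that concrete, here is a minimal sketch of what any "jump to line N" code has to do under the hood (the function name skip_to_line is made up for illustration):

def skip_to_line(f, lineno):
    """Advance an open file object to the start of line `lineno` (1-based)
    by consuming every preceding line; there is no shortcut."""
    for _ in range(lineno - 1):
        next(f)  # each call reads bytes until it hits the next \n
    return f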

What you can do about it: try to be smart about how you grab the lines you want to iterate over. Smart, and lazy. Arrange your requests so that you only iterate through the file once, and close the file as soon as you have pulled out the data you need. (islice does all of this, by the way.)

In Python:

lines_I_want = [(start1, stop1), (start2, stop2), ...]
with open(filename) as f:
    for i, j in enumerate(f):
        if i >= lines_I_want[0][0]:
            if i >= lines_I_want[0][1]:
                lines_I_want.pop(0)
                if not lines_I_want:  # list is empty
                    break
            else:
                # j is a line I want. Do something
                pass

And if you have control over how this file is created, make every line the same length so you can seek. Or use a database.
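As a rough illustration of the fixed-length idea (the 20-byte record length and the file name are made-up assumptions), getting to line N then becomes simple arithmetic:

RECORD_LEN = 20  # every line padded to exactly 20 bytes, '\n' included

def read_record(filename, lineno):
    """Jump straight to 0-based line `lineno` without scanning the file."""
    with open(filename, 'rb') as f:
        f.seek(lineno * RECORD_LEN)  # O(1) no matter how far into the file
        return f.read(RECORD_LEN)

print(read_record('fixed_width.csv', 500004))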


The problem with using islice() for what you are doing is that it iterates through all the lines before the first one you want before returning anything. Obviously, the larger the starting row, the longer this takes. Another issue is that you are using csv.reader to read those lines, which incurs likely unnecessary overhead, since one line of the CSV file is generally one row of it. The only time that is not the case is when the CSV file has string fields containing embedded newline characters, which in my experience is uncommon.
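As a quick illustration of that overhead, here is a sketch (Python 3 style, and not a complete solution for reading many ranges) that skips the unwanted lines on the raw file object, where no CSV parsing happens, and only wraps the slice you actually want in csv.reader; the scan up to the start row is still linear, just cheaper per line:

import csv
from itertools import islice

start_row = 500004
stop_row = start_row + 50000

with open('test.csv', newline='') as fin:
    # skip the leading lines as plain text, with no csv parsing overhead
    for _ in islice(fin, start_row):
        pass
    # parse only the lines in the requested range
    for row in islice(csv.reader(fin), stop_row - start_row):
        value = float(row[4]) * float(row[10])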

If that is a valid assumption for your data, it would likely be much faster to first index the file and build a table of (filename, offset, number-of-rows) tuples indicating the approximately equally sized logical chunks of lines/rows in the file. With that, you can process them relatively quickly by first seeking to the starting offset and then reading the specified number of CSV rows from that point.

Another advantage of this approach is that it allows you to process the chunks in parallel, which I suspect is the real problem you are trying to solve, based on a previous question of yours. So, even though you have not mentioned multiprocessing here, the following has been written to be compatible with doing that, if that is the case.

import csv
from itertools import islice
import os
import sys

def open_binary_mode(filename, mode='r'):
    """Open a file the proper way (depends on the Python version)."""
    kwargs = (dict(mode=mode+'b') if sys.version_info[0] == 2
              else dict(mode=mode, newline=''))
    return open(filename, **kwargs)

def split(infilename, num_chunks):
    infile_size = os.path.getsize(infilename)
    chunk_size = infile_size // num_chunks
    offset = 0
    num_rows = 0
    bytes_read = 0
    chunks = []
    with open_binary_mode(infilename, 'r') as infile:
        for _ in range(num_chunks):
            while bytes_read < chunk_size:
                try:
                    bytes_read += len(next(infile))
                    num_rows += 1
                except StopIteration:  # end of infile
                    break
            chunks.append((infilename, offset, num_rows))
            offset += bytes_read
            num_rows = 0
            bytes_read = 0
    return chunks

chunks = split('sample_simple.csv', num_chunks=4)

for filename, offset, rows in chunks:
    print('processing: {} rows starting at offset {}'.format(rows, offset))
    with open_binary_mode(filename, 'r') as fin:
        fin.seek(offset)
        for row in islice(csv.reader(fin), rows):
            print(row)
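Since multiprocessing is the likely end goal, here is a minimal sketch of one way to feed those chunks to a worker pool. process_chunk and the column indices are hypothetical, and it assumes the split() and open_binary_mode() helpers defined above live in the same module:

import csv
from itertools import islice
from multiprocessing import Pool

def process_chunk(chunk):
    """Illustrative worker: seek to the chunk's offset and reduce its rows.
    Relies on open_binary_mode() from the code above."""
    filename, offset, rows = chunk
    total = 0.0
    with open_binary_mode(filename, 'r') as fin:
        fin.seek(offset)
        for row in islice(csv.reader(fin), rows):
            total += float(row[4]) * float(row[10])  # per-row calculation
    return total

if __name__ == '__main__':
    chunks = split('sample_simple.csv', num_chunks=4)
    pool = Pool(processes=len(chunks))
    results = pool.map(process_chunk, chunks)  # one chunk per worker
    pool.close()
    pool.join()
    print('chunk totals: {}'.format(results))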
