What is the fastest way to read text columns in a large data file?

I have a data file of almost 9 million lines (soon to be more than 500 million lines), and I'm looking for the fastest way to read it in. The five aligned columns are padded and separated by spaces, so I know where on each line to look for the two fields I want. My Python routine takes 45 seconds:

import sys, time

start = time.time()
filename = 'test.txt'  # space-delimited, aligned columns
trans = []
numax = 0
for line in open(filename, 'r'):
    nu = float(line[-23:-11]); S = float(line[-10:-1])
    if nu > numax: numax = nu
    trans.append((nu, S))
end = time.time()
print len(trans), 'transitions read in %.1f secs' % (end - start)
print 'numax =', numax

whereas the routine I came up with in C is more pleasing, at 4 seconds:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BPL 47
#define FILENAME "test.txt"
#define NTRANS 8858226

int main(void)
{
    size_t num;
    unsigned long i;
    char buf[BPL];
    char *sp;
    double *nu, *S;
    double numax;
    FILE *fp;
    time_t start, end;

    nu = (double *)malloc(NTRANS * sizeof(double));
    S  = (double *)malloc(NTRANS * sizeof(double));

    start = time(NULL);
    if ((fp = fopen(FILENAME, "rb")) != NULL) {
        i = 0; numax = 0.;
        do {
            if (i == NTRANS) { break; }
            num = fread(buf, 1, BPL, fp);
            buf[BPL-1] = '\0';
            sp = &buf[BPL-10]; S[i] = atof(sp);
            buf[BPL-11] = '\0';
            sp = &buf[BPL-23]; nu[i] = atof(sp);
            if (nu[i] > numax) { numax = nu[i]; }
            ++i;
        } while (num == BPL);
        fclose(fp);
        end = time(NULL);
        fprintf(stdout, "%d lines read; numax = %12.6f\n", (int)i, numax);
        fprintf(stdout, "that took %.1f secs\n", difftime(end, start));
    } else {
        fprintf(stderr, "Error opening file %s\n", FILENAME);
        free(nu); free(S);
        return EXIT_FAILURE;
    }
    free(nu); free(S);
    return EXIT_SUCCESS;
}

Solutions in Fortran, C++ and Java take intermediate amounts of time (27 seconds, 20 seconds, 8 seconds). My question is: have I made any outrageous mistakes in the above (particularly in the C code)? And is there any way to speed up the Python routine? I quickly realised that storing my data in an array of tuples is better than instantiating a class for each record.

+4
5 answers

A few points:

  • Your C routine is cheating; it is being tipped off with the file size and pre-allocates ...

  • Python: consider using array.array('d') ... one each for S and nu. Then try pre-allocation.

  • Python: write your routine as a function and call it; accessing function-local variables is rather faster than accessing module-global variables.

+4

In the C implementation, you could try swapping the fopen()/fread()/fclose() library functions for the lower-level system calls open()/read()/close(). A speedup may come from the fact that fread() does a lot of buffering, whereas read() does not.

Additionally, calling read() less often with bigger chunks will reduce the number of system calls, so you will have less switching between user space and kernel space. What the kernel does when you issue a read() system call (no matter whether it was invoked from the fread() library function) is read the data from disk and then copy it to user space. The copying part becomes expensive if you issue the system call very often in your code. By reading in larger chunks you will end up with fewer context switches and less copying.

Keep in mind that read() is not guaranteed to return a block of the exact number of bytes you wanted. That is why, in a reliable and correct implementation, you always have to check the return value of read().
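A minimal sketch of that idea, reusing the question's test.txt; the 1 MiB chunk size is an arbitrary choice, and the carry-over of a line that straddles two chunks is only noted in a comment:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>    /* open() */
#include <unistd.h>   /* read(), close() */

#define CHUNK (1 << 20)   /* read 1 MiB at a time; tune to taste */

int main(void)
{
    static char buf[CHUNK];
    ssize_t n;
    long long total = 0;
    int fd = open("test.txt", O_RDONLY);

    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    /* read() may return fewer bytes than requested, so loop
       until it returns 0 (EOF) or -1 (error) */
    while ((n = read(fd, buf, CHUNK)) > 0) {
        total += n;
        /* parse the chunk here; note that a line may straddle two
           chunks, so carry the tail over to the next iteration */
    }
    if (n < 0) perror("read");

    close(fd);
    printf("%lld bytes read\n", total);
    return EXIT_SUCCESS;
}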

+3

An approach that could probably be applied to the C, C++ and Python versions would be to memory-map the file. The most significant benefit is that it can reduce double-handling of the data as it is copied from one buffer to another. In many cases there are also benefits from the reduction in the number of system calls for I/O.
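For the C version, a minimal sketch of the memory-mapped approach might look like the following. It uses POSIX mmap(), assumes the same fixed 47-byte lines as the question's code, and, since the mapping is read-only, relies on strtod() stopping at the space after the field rather than writing a NUL terminator:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define BPL 47  /* bytes per line, as in the question's code */

int main(void)
{
    struct stat st;
    int fd = open("test.txt", O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return EXIT_FAILURE; }

    /* map the whole file read-only into our address space; the kernel
       pages it in on demand, with no copy into a user-space buffer */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    long nlines = st.st_size / BPL;
    double numax = 0.;
    for (long i = 0; i < nlines; i++) {
        const char *line = data + i * BPL;
        /* the fields sit at fixed offsets; assuming a space separates
           them, strtod() stops at the separator by itself */
        double nu = strtod(line + BPL - 23, NULL);
        if (nu > numax) numax = nu;
    }
    printf("%ld lines scanned; numax = %12.6f\n", nlines, numax);

    munmap(data, st.st_size);
    close(fd);
    return EXIT_SUCCESS;
}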

+3

You have the arguments 1 and BPL the wrong way round in fread() (as you have it, it could read a partial line, which you are not testing for). You should also be testing the return value of fread() before trying to use the returned data.

You might be able to speed up the C version a bit by reading more than one line at a time:

#define LINES_PER_READ 1000
char buf[LINES_PER_READ][BPL];

/* ... */

while (i < NTRANS && (num = fread(buf, BPL, LINES_PER_READ, fp)) > 0) {
    int line;
    for (line = 0; i < NTRANS && line < num; line++) {
        buf[line][BPL-1] = '\0';
        sp = &buf[line][BPL-10]; S[i] = atof(sp);
        buf[line][BPL-11] = '\0';
        sp = &buf[line][BPL-23]; nu[i] = atof(sp);
        if (nu[i] > numax) { numax = nu[i]; }
        ++i;
    }
}

On systems that support posix_fadvise(), you should also give the kernel a hint up front, right after opening the file:

 posix_fadvise(fileno(fp), 0, 0, POSIX_FADV_SEQUENTIAL); 
+1

Another possible speedup, given the number of times you need to do this, is to use pointers into S and nu instead of indexing into the arrays, e.g.

double *pS = S, *pnu = nu;
...
*pS++ = atof(sp);
*pnu++ = atof(sp);
...

Also, since you always convert from char to double at the same places in buf, pre-compute those addresses outside your loop rather than computing them every time around the loop.
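Putting both suggestions together, a sketch against the question's single-line buffer; sp_S and sp_nu are names introduced here for illustration, and the fread() arguments follow the corrected order from the earlier answer:

/* hoist the fixed field addresses out of the loop; they never change,
   because every line is read into the same buffer */
char *sp_S  = &buf[BPL-10];
char *sp_nu = &buf[BPL-23];
double *pS = S, *pnu = nu;

while (i < NTRANS && fread(buf, BPL, 1, fp) == 1) {
    buf[BPL-1]  = '\0';
    *pS++ = atof(sp_S);
    buf[BPL-11] = '\0';
    *pnu  = atof(sp_nu);
    if (*pnu > numax) { numax = *pnu; }
    ++pnu;
    ++i;
}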

-1
