I have a data file of almost 9 million lines (soon to be more than 500 million), and I'm looking for the fastest way to read it. The five aligned columns are padded with spaces, so I know exactly where on each row to find the two fields I want. My Python routine takes 45 seconds:
    import sys, time

    start = time.time()
    filename = 'test.txt'   # space-delimited, aligned columns
    trans = []
    numax = 0
    for line in open(filename, 'r'):
        nu = float(line[-23:-11]); S = float(line[-10:-1])
        if nu > numax: numax = nu
        trans.append((nu, S))
    end = time.time()
    print len(trans), 'transitions read in %.1f secs' % (end - start)
    print 'numax =', numax
whereas the procedure I came up with in C takes a much more pleasant 4 seconds:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BPL 47
    #define FILENAME "test.txt"
    #define NTRANS 8858226

    int main(void)
    {
        size_t num;
        unsigned long i;
        char buf[BPL];
        char *sp;
        double *nu, *S;
        double numax;
        FILE *fp;
        time_t start, end;

        nu = (double *)malloc(NTRANS * sizeof(double));
        S  = (double *)malloc(NTRANS * sizeof(double));

        start = time(NULL);
        if ((fp = fopen(FILENAME, "rb")) != NULL) {
            i = 0;
            numax = 0.;
            do {
                if (i == NTRANS) break;
                num = fread(buf, 1, BPL, fp);
                if (num < BPL) break;   /* short read at EOF: don't parse a stale buffer */
                buf[BPL-1] = '\0';      /* terminate the S field */
                sp = &buf[BPL-10];
                S[i] = atof(sp);
                buf[BPL-11] = '\0';     /* terminate the nu field */
                sp = &buf[BPL-23];
                nu[i] = atof(sp);
                if (nu[i] > numax) numax = nu[i];
                ++i;
            } while (num == BPL);
            fclose(fp);
            end = time(NULL);
            fprintf(stdout, "%d lines read; numax = %12.6f\n", (int)i, numax);
            fprintf(stdout, "that took %.1f secs\n", difftime(end, start));
        } else {
            fprintf(stderr, "Error opening file %s\n", FILENAME);
            free(nu);
            free(S);
            return EXIT_FAILURE;
        }

        free(nu);
        free(S);
        return EXIT_SUCCESS;
    }
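One variation I've been meaning to time properly reads the file in large blocks instead of issuing one BPL-sized fread() per line, parsing each fixed-width record in place. This is only a sketch: CHUNK_LINES is an arbitrary number I picked, it assumes the file length is an exact multiple of BPL (fread() silently drops a trailing partial record), and it doesn't store the arrays, so it only probes the read-and-parse pattern:

    #include <stdio.h>
    #include <stdlib.h>

    #define BPL 47                  /* bytes per line, newline included */
    #define FILENAME "test.txt"
    #define CHUNK_LINES 65536       /* lines per fread(); an arbitrary choice */

    int main(void)
    {
        static char buf[CHUNK_LINES * BPL];
        double numax = 0.0, Ssum = 0.0;
        unsigned long i = 0;
        size_t n, j;
        FILE *fp;

        if ((fp = fopen(FILENAME, "rb")) == NULL) {
            fprintf(stderr, "Error opening file %s\n", FILENAME);
            return EXIT_FAILURE;
        }
        /* Pull in many whole lines per call, then parse each record in place. */
        while ((n = fread(buf, BPL, CHUNK_LINES, fp)) > 0) {
            for (j = 0; j < n; ++j) {
                char *line = buf + j * BPL;
                double nu;
                line[BPL-1] = '\0';             /* terminate the S field      */
                Ssum += atof(&line[BPL-10]);    /* sink so S isn't optimized out */
                line[BPL-11] = '\0';            /* terminate the nu field     */
                nu = atof(&line[BPL-23]);
                if (nu > numax) numax = nu;
                ++i;
            }
        }
        fclose(fp);
        printf("%lu lines read; numax = %12.6f (Ssum = %g)\n", i, numax, Ssum);
        return EXIT_SUCCESS;
    }

If the per-call overhead of all those small fread() calls matters at all, a comparison against the version above should show it.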
Solutions in Fortran, C++ and Java take intermediate amounts of time (27 s, 20 s and 8 s respectively). My question is: did I make any blatant mistakes in the above (particularly in the C code)? And is there a way to speed up the Python routine? I quickly realized that storing my data in an array of tuples is faster than instantiating a class for each record.
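One idea I haven't benchmarked at scale yet is to push the slicing and float conversion into NumPy, so the per-line Python overhead disappears. This sketch assumes NumPy is available, that every line is exactly BPL bytes (Unix newlines, no short last line), and it materializes the whole file in memory, which clearly won't work as-is at 500 million lines:

    import numpy as np

    BPL = 47   # bytes per line, newline included

    # Read the whole file as raw bytes and view it as one row per line.
    raw = np.fromfile('test.txt', dtype=np.uint8).reshape(-1, BPL)

    # Slice the two fixed-width fields out of every row at once, then let
    # NumPy parse all of them in a single astype() call.
    nu = np.ascontiguousarray(raw[:, BPL-23:BPL-11]).view('S12').ravel().astype(np.float64)
    S  = np.ascontiguousarray(raw[:, BPL-10:BPL-1]).view('S9').ravel().astype(np.float64)
    numax = nu.max()

As a side effect this keeps nu and S as float64 arrays rather than a list of tuples, which should itself help; passing count= to np.fromfile() on an open file object should keep memory bounded for the 500-million-line case.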