Processing a large text file

Question

I need to implement lazy loading in Mathematica. I have a 600 MB CSV text file that I need to process. This file contains many duplicate entries:

1;0;0;13;6
1;0;0;13;6
..........
2;0;0;13;6
2;0;0;13;6
..........
etc.

Therefore, instead of loading them all into memory, I would like to build a list that pairs each distinct entry with the number of times it occurs in the file:

 {{10000,{1,0,0,13,6}}, {20000,{2,0,0,13,6}}, ...}

I could not find a way to do this with the Import function. I'm looking for something like

 Import["my_file.csv", "CSV", myProcessingFunction]

where myProcessingFunction would take one record at a time and build the dataset. Can this be done using Import or any other Mathematica function?

+6

import wolfram-mathematica text-processing

Max Nov 26 '10 at 12:00

4 answers

I think you need the Read[] function.
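As a rough sketch of how Read[] could be used for this, the loop below reads one line at a time and keeps only a counter per distinct line in memory. The helper name countLines is hypothetical, and the use of an Association assumes Mathematica 10 or later (it did not exist when this question was asked).

```mathematica
(* Sketch: count duplicate lines one record at a time with Read[].
   countLines is a hypothetical name; Association and Lookup need version 10+. *)
countLines[file_String] := Module[{stream, line, counts = <||>},
  stream = OpenRead[file];
  (* Read[stream, String] returns one line, or EndOfFile when the file is exhausted *)
  While[(line = Read[stream, String]) =!= EndOfFile,
    counts[line] = Lookup[counts, line, 0] + 1];
  Close[stream];
  counts  (* an Association of the form <|"1;0;0;13;6" -> 10000, ...|> *)
]
```

Memory use grows with the number of distinct lines rather than with the size of the file.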

+2

High performance mark Nov 26 '10 at 13:03

Perhaps there are better alternatives for this than Mathematica.

A small awk script:

  {a[$0]++} END { ... print loop ... }

will count the duplicate entries. Of course, the array may exhaust memory if the file contains too many distinct entries.

Or sort the file first, so that identical lines are adjacent and only one counter is needed at a time. In awk, such a program might look something like:

  NR > 1 && $0 != p { print p, i; i = 0 } { i++; p = $0 } END { if (NR > 0) print p, i }

Perl may be better, but I'm old fashioned.

HTH!

+2

Dr. belisarius Nov 26 '10 at 15:06

I would recommend that you first load it into a database system such as MySQL, and then you can access it from Mathematica using DatabaseLink.
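A minimal sketch of what that might look like, assuming the CSV has already been loaded into a MySQL table. The database name mydb, the table records, the columns a to e, and the credentials are all hypothetical placeholders:

```mathematica
(* Sketch only: connection string, credentials, table and column names are hypothetical *)
Needs["DatabaseLink`"];

conn = OpenSQLConnection[
   JDBC["MySQL(Connector/J)", "localhost:3306/mydb"],
   "Username" -> "user", "Password" -> "secret"];

(* Let the database do the counting; each result row is {count, a, b, c, d, e} *)
rows = SQLExecute[conn,
   "SELECT COUNT(*), a, b, c, d, e FROM records GROUP BY a, b, c, d, e"];

CloseSQLConnection[conn];

(* Reshape each row into {count, {a, b, c, d, e}} *)
{First[#], Rest[#]} & /@ rows
```

With this approach only the grouped result, not the 600 MB of raw rows, is transferred into Mathematica.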

0

gdelfino Nov 27 '10 at 15:57

Joshua martell · Accepted Answer · 2010-11-27T01:40:06+0000

If it were me, I would do it using unix sort and uniq, but since you are asking about Mathematica... I would use ReadList[] to read blocks of lines and define DownValues to record the unique lines, keeping track of how many times each one has been seen.

 (* Create some test data *)
 Export["/tmp/test.txt", Flatten[{Range[1000], Range[1000]}], "Lines"];

 countUniqueLines[file_String, blockSize_Integer] :=
  Module[{stream, map, block, keys, out},
   map[_] := 0;
   stream = OpenRead[file];
   CheckAbort[
    While[(block = ReadList[stream, String, blockSize]) =!= {},
     (map[#] = map[#] + 1) & /@ block;
    ];,
    Close[stream]; Clear[map]];
   Close[stream];
   keys = Cases[DownValues[map][[All, 1, 1, 1]], _String];
   out = {#, map[#]} & /@ keys;
   Clear[map];
   out
  ]

 countUniqueLines["/tmp/test.txt", 500]

 (* Alternative implementation if you have a little more memory *)
 Tally[Import["/tmp/test.txt", "Lines"]]
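The code above returns {line, count} pairs with each line still a string, whereas the question asks for {count, {fields}}. One possible post-processing step is sketched below. The helper toRecord and the sample pairs are hypothetical, and FromDigits assumes every field is a non-negative integer:

```mathematica
(* Hypothetical sample of {line, count} pairs, in the shape countUniqueLines returns *)
pairs = {{"1;0;0;13;6", 10000}, {"2;0;0;13;6", 20000}};

(* Split each line on ";" and convert the fields to integers
   (assumes non-negative integer fields) *)
toRecord[{line_String, n_Integer}] := {n, FromDigits /@ StringSplit[line, ";"]};

toRecord /@ pairs
(* {{10000, {1, 0, 0, 13, 6}}, {20000, {2, 0, 0, 13, 6}}} *)
```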
