Processing a large text file

I need to implement lazy loading in Mathematica. I have a 600 MB CSV text file that I need to process. This file contains many duplicate entries:

 1;0;0;13;6
 1;0;0;13;6
 ..........
 2;0;0;13;6
 2;0;0;13;6
 ..........
 etc.

Therefore, instead of loading them all into memory, I would like to build a list that contains each distinct entry together with the number of times it occurs in the file:

 {{10000,{1,0,0,13,6}}, {20000,{2,0,0,13,6}}, ...} 

I could not find a way to do this using the Import function. I'm looking for something like

 Import["my_file.csv", "CSV", myProcessingFunction] 

where myProcessingFunction takes one record at a time and builds the dataset. Can this be done using Import or any other Mathematica function?

+6
import wolfram-mathematica text-processing
4 answers

If it were me, I would do this with the Unix tools sort and uniq, but since you are asking about Mathematica: I would use ReadList[] to read the file in blocks of lines and define DownValues to record the unique lines, keeping track of how many times each one has been seen before.

 (* Create some test data *)
 Export["/tmp/test.txt", Flatten[{Range[1000], Range[1000]}], "Lines"];

 countUniqueLines[file_String, blockSize_Integer] :=
  Module[{stream, map, block, keys, out},
   map[_] := 0;
   stream = OpenRead[file];
   CheckAbort[
    While[(block = ReadList[stream, String, blockSize]) =!= {},
     (map[#] = map[#] + 1) & /@ block;
    ];,
    Close[stream]; Clear[map]];
   Close[stream];
   keys = Cases[DownValues[map][[All, 1, 1, 1]], _String];
   out = {#, map[#]} & /@ keys;
   Clear[map];
   out
  ]

 countUniqueLines["/tmp/test.txt", 500]

 (* Alternative implementation if you have a little more memory *)
 Tally[Import["/tmp/test.txt", "Lines"]]
+2

I think you need the Read[] function.
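A minimal sketch of what a Read[]-based version might look like, assuming Mathematica 10.1 or later (for Association and KeyValueMap) and assuming every field is a plain number. The function name countLinesWithRead is made up for illustration; it reads one line at a time, so only the distinct lines and their counts stay in memory.

```mathematica
(* Hypothetical helper: count identical lines while reading one record at a time. *)
countLinesWithRead[file_String] :=
 Module[{stream, counts = <||>, line},
  stream = OpenRead[file];
  (* Read[stream, String] returns one line, or EndOfFile when the file is exhausted *)
  While[(line = Read[stream, String]) =!= EndOfFile,
   counts[line] = Lookup[counts, line, 0] + 1];
  Close[stream];
  (* Turn each "1;0;0;13;6" key into {count, {1, 0, 0, 13, 6}};
     ToExpression assumes the fields are numeric literals *)
  KeyValueMap[{#2, ToExpression /@ StringSplit[#1, ";"]} &, counts]]
```

It would be called as countLinesWithRead["my_file.csv"]. Reading line by line is simpler than reading in blocks but may be slower; the block size in the ReadList approach trades memory for speed.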

+2

Perhaps there are better alternatives for this than Mathematica.

A little awk script:

  {a[$0]++} END { ... print loop ... } 

will count the duplicate entries. Of course, the array may exhaust memory if the number of distinct entries is large.

Or sort the file first, so that identical lines are adjacent and only one counter is needed at a time. In awk, a version that cannot run out of memory might be something like:

  BEGIN { p = ""; i = 0 }
  { if ($0 != p && i != 0) { print p, i; i = 0 } }
  { i++; p = $0 }
  END { if (i != 0) print p, i }

Perl may be better, but I'm old-fashioned.

HTH!

+2

I would recommend that you first load it into a database system such as MySQL, and then you can access it from Mathematica using DatabaseLink.
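As a rough sketch of the Mathematica side only, assuming the CSV has already been loaded into a MySQL table: the database name mydb, the table records, the column names a to e, and the credentials below are all hypothetical placeholders. The counting is pushed to the database with GROUP BY, so only the aggregated rows reach Mathematica.

```mathematica
Needs["DatabaseLink`"];

(* Hypothetical schema: table "records" with one column per CSV field *)
countQuery = "SELECT COUNT(*), a, b, c, d, e FROM records GROUP BY a, b, c, d, e";

countsFromDatabase[] :=
 Module[{conn, rows},
  (* Hypothetical host, database name, and credentials *)
  conn = OpenSQLConnection[
    JDBC["MySQL(Connector/J)", "localhost:3306/mydb"],
    "Username" -> "user", "Password" -> "password"];
  rows = SQLExecute[conn, countQuery];
  CloseSQLConnection[conn];
  (* Reshape each row {count, a, b, c, d, e} into {count, {a, b, c, d, e}} *)
  {First[#], Rest[#]} & /@ rows]
```

This needs a running MySQL server and a matching JDBC driver, so it is only worth the setup if the data will be queried more than once.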

0