Split large CSV text file based on column value

I have CSV files that have multiple columns that are sorted. For example, I might have lines like this:

 19980102,,PLXS,10032,Q,A,,,15.12500,15.00000,15.12500,2
 19980105,,PLXS,10032,Q,A,,,14.93750,14.75000,14.93750,2
 19980106,,PLXS,10032,Q,A,,,14.56250,14.56250,14.87500,2
 20111222,,PCP,63830,N,A,,,164.07001,164.09000,164.12000,1
 20111223,,PCP,63830,N,A,,,164.53000,164.53000,164.55000,1
 20111227,,PCP,63830,N,A,,,165.69000,165.61000,165.64000,1

I would like to split the file based on the third column, for example, putting the PLXS and PCP records into their own files called PLXS.csv and PCP.csv. Since the file is pre-sorted, all the PLXS entries come before the PCP entries, and so on.

I generally do things like this in C++, since that's the language I know best, but in this case my input CSV file is several gigabytes and too large to load into memory in C++.

Can someone show how this can be done? Perl/Python/PHP/bash solutions are fine; they just need to be able to process the huge file without excessive memory usage.

+10
split text csv large-data
6 answers

C++ is fine if it's what you know best. But why are you trying to load the entire file into memory at all?

Since the output depends on the column being read, you can easily keep a set of output file streams and write each record to the corresponding file as processing progresses, flushing as you go to keep the memory footprint relatively small.

I do this (albeit in Java) when I need to take massive extracts from a database. Records are pushed into a buffered file stream and flushed from memory, so the program's footprint never grows beyond where it started.

Seat-of-my-pants pseudo-code:

  • Create a data structure to hold the output file streams
  • Open a stream on the input file and start reading the contents one line at a time
  • Have we already opened an output file stream for this line's key?
    • Yes -
      • Get the saved file stream
      • write the record to that file
      • flush the stream
    • No -
      • create a stream and save it in our list of streams
      • write the record to the stream
      • flush the stream
  • Rinse and repeat...

Basically continue this processing until we finish the file.

Since we never hold anything more than pointers to the streams, and we flush as soon as we write to them, we never keep anything in application memory other than a single record from the input file. Thus the footprint stays manageable.
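
Roughly, that pseudo-code might look like the C++ sketch below (untested; it assumes the key is the third comma-separated field and that the fields themselves contain no quoted commas):

 #include <fstream>
 #include <iostream>
 #include <map>
 #include <sstream>
 #include <string>

 int main(int argc, char* argv[]) {
     if (argc < 2) {
         std::cerr << "usage: " << argv[0] << " input.csv\n";
         return 1;
     }

     std::ifstream in(argv[1]);
     std::map<std::string, std::ofstream> out;   // 3rd-column value -> open output file

     std::string line;
     while (std::getline(in, line)) {
         // Pull out the 3rd comma-separated field to use as the key.
         std::istringstream fields(line);
         std::string field, key;
         for (int i = 0; i < 3 && std::getline(fields, field, ','); ++i)
             key = field;

         // First time we see this key, open <key>.csv; then append and flush
         // so nothing but the current record stays in memory.
         auto it = out.find(key);
         if (it == out.end())
             it = out.emplace(key, std::ofstream(key + ".csv")).first;
         it->second << line << '\n';
         it->second.flush();
     }
     return 0;
 }

Only one file handle per distinct key stays open, so memory use stays small no matter how large the input file is.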

+1

Here is an old-school one-liner for you (just replace >> with > to truncate the output files each time you run it):

 awk -F, '{print >> ($3".csv")}' input.csv 

Due to popular demand (and an itch I just had), I have also written a version that will duplicate the header line into all of the output files:

 awk -F, '{fn=$3".csv"} NR==1 {hdr=$0} NR>1&&!($3 in p) {p[$3]; print hdr > fn} NR>1 {print >> fn}' input.csv 

But you could also just seed the output files with the header using this, and then finish with the first awk command:

 HDR=$(head -1 input.csv); for fn in $(tail -n+2 input.csv | cut -f3 -d, | sort -u); do echo $HDR > $fn.csv; done 

Most modern systems include an awk binary, but if you don't have one you can find an exe at Gawk for Windows.

+30
 perl -F, -lane '`echo $_ >> $F[2].csv`' < file 

The following command line options are used:

  • -n loops over each line of the input file
  • -l strips the newline before processing and adds it back afterwards
  • -a autosplit mode - splits each input line into the @F array. By default it splits on whitespace.
  • -e execute the perl code
  • -F autosplit modifier, in this case split on ,

@F is the array of fields on each line, indexed starting at $F[0]


If you want to keep the header, you will need a slightly fancier approach.

perl splitintofiles.pl file

Contents of splitintofiles.pl:

 open $fh, '<', $ARGV[0];
 while ($line = <$fh>) {
     print $line;
     if ($. == 1) {
         $header = $line;
     } else {
         # $fields[2] is the 3rd column
         @fields = split /,/, $line;
         # save line into hash %c
         $c{"$fields[2].csv"} .= $line;
     }
 }
 close $fh;
 for $file (keys %c) {
     print "$file\n";
     open $fh, '>', $file;
     print $fh $header;
     print $fh $c{$file};
     close $fh;
 }

input:

 a,b,c,d,e,f,g,h,i,j,k,l
 19980102,,PLXS,10032,Q,A,,,15.12500,15.00000,15.12500,2
 19980105,,PLXS,10032,Q,A,,,14.93750,14.75000,14.93750,2
 19980106,,PLXS,10032,Q,A,,,14.56250,14.56250,14.87500,2
 20111222,,PCP,63830,N,A,,,164.07001,164.09000,164.12000,1
 20111223,,PCP,63830,N,A,,,164.53000,164.53000,164.55000,1
 20111227,,PCP,63830,N,A,,,165.69000,165.61000,165.64000,1

PCP.csv output

 a,b,c,d,e,f,g,h,i,j,k,l
 20111222,,PCP,63830,N,A,,,164.07001,164.09000,164.12000,1
 20111223,,PCP,63830,N,A,,,164.53000,164.53000,164.55000,1
 20111227,,PCP,63830,N,A,,,165.69000,165.61000,165.64000,1

PLXS.csv output

 a,b,c,d,e,f,g,h,i,j,k,l
 19980102,,PLXS,10032,Q,A,,,15.12500,15.00000,15.12500,2
 19980105,,PLXS,10032,Q,A,,,14.93750,14.75000,14.93750,2
 19980106,,PLXS,10032,Q,A,,,14.56250,14.56250,14.87500,2
+1

An alternative solution would be to load the CSV into a Solr index, and then generate the CSV files based on your own search criteria.

Here's the main HOWTO:

Create report and upload to server for download

0

If there are no quoted commas in the first three columns of your file, a simple one-liner will do:

 cat file | perl -e 'while(<>){@a=split(/,/,$_,4);$key=$a[2];open($f{$key},">$key.csv") unless $f{$key};print {$f{$key}} $_;} for $key (keys %f) {close $f{$key}}' 

It does not consume a lot of memory (only the mapping of distinct 3rd-column values → file handles is kept), and the lines can come in any order.

If the columns are more complex (for example, they can contain quoted commas), use Text::CSV.

0

If there is no header line in the input file:

 awk -F, ' {fn = $3".csv" print > fn}' bigfile.csv 

If there is a header line that should be carried over to the split files:

 awk -F, '
 NR==1 {hdr=$0; next}
 {fn = $3".csv"}
 !seen[$3]++ {print hdr > fn}
 {print > fn}' bigfile.csv
0
