Using awk or perl to extract specific columns from CSV (parsing)

Background. I want to extract specific columns from a CSV file. The file is comma-separated, uses double quotes as a text qualifier (optional, but when a field contains special characters the field is quoted; see the example) and uses a backslash as an escape character. Some fields may be empty.


An example of input and the desired output. For example, I want columns 1, 3 and 4 to appear in the output file. The columns extracted from the CSV file must keep the formatting of the source file: no escape characters removed, no extra quotes added, etc.

Input

    "John \"Super\" Doe",25,"123 ABC Street",123-456-7890,"M",A
    "Jane, Mary","",132 CBS Street,333-111-5332,"F",B
    "Smith \"Jr.\", Jane",35,,555-876-1233,"F",
    "Lee, Jack",22,123 Sesame St,"","M",D

Desired output

    "John \"Super\" Doe","123 ABC Street",123-456-7890
    "Jane, Mary",132 CBS Street,333-111-5332
    "Smith \"Jr.\", Jane",,555-876-1233
    "Lee, Jack",123 Sesame St,""

Preliminary script (awk). Below is a preliminary script I found that works for the most part, but fails in one specific case I have noticed; there may be others I have not run into yet.

    #!/usr/xpg4/bin/awk -f
    BEGIN{ OFS = FS = "," }
    /"/{
        for(i=1;i<=NF;i++){
            if($i ~ /^"[^"]+$/){
                for(x=i+1;x<=NF;x++){
                    $i = $i "," $x
                    if($i ~ /"+$/){
                        z = x - (i + 1) + 1
                        for(y=i+1;y<=NF;y++)
                            $y = $(y + z)
                        break
                    }
                }
                NF = NF - z
                i = x
            }
        }
        print $1,$3,$4
    }

The above seems to work well until it hits a field containing both escaped double quotes and a comma. In that case the parsing breaks down and the output is incorrect.
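For comparison, Python's standard csv module parses that tricky line correctly once it is told about the backslash escapes. A minimal sketch (the escapechar and doublequote settings are assumptions based on the format described above):

```python
import csv
import io

# The kind of line that breaks the awk script: a field containing both
# escaped double quotes and a comma.
line = '"Smith \\"Jr.\\", Jane",35,,555-876-1233,"F",\n'

# escapechar='\\' and doublequote=False tell the reader that quotes
# inside a quoted field are backslash-escaped, not doubled.
reader = csv.reader(io.StringIO(line), escapechar='\\', doublequote=False)
row = next(reader)
print(row)
# ['Smith "Jr.", Jane', '35', '', '555-876-1233', 'F', '']
```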


Question / Comments. I have read that awk is not the best option for parsing CSV files and that perl is suggested instead. However, I do not know perl. I found some example perl scripts, but they do not give the result I am looking for, and I do not know how easy it would be to adapt them to what I want.

As for awk, I am familiar with it and sometimes use the basic functions, but I do not know many of the advanced features, such as some of the constructs used in the script above. Is my desired result possible using just awk? If so, can the script above be modified to fix the problem I am facing? Can someone explain, line by line, what exactly the script does?

Any help would be appreciated, thanks!

+7
7 answers

I'm not going to reinvent the wheel.

    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({
        binary      => 1,
        escape_char => '\\',
        eol         => "\n",
    });

    my $fh_in  = \*STDIN;
    my $fh_out = \*STDOUT;

    while (my $row = $csv->getline($fh_in)) {
        $csv->print($fh_out, [ @{$row}[0,2,3] ])
            or die("".$csv->error_diag());
    }
    $csv->eof() or die("".$csv->error_diag());

Output:

    "John \"Super\" Doe","123 ABC Street",123-456-7890
    "Jane, Mary","132 CBS Street",333-111-5332
    "Smith \"Jr.\", Jane",,555-876-1233
    "Lee, Jack","123 Sesame St",

It adds quotes around addresses that did not previously have them, but since some addresses already come quoted, you can presumably live with that.


Reinventing the wheel:

    my $field = qr/"(?:[^"\\]|\\.)*"|[^"\\,]*/s;

    while (<>) {
        my @fields = /^($field),$field,($field),($field),/
            or die;
        print(join(',', @fields), "\n");
    }

Output:

    "John \"Super\" Doe","123 ABC Street",123-456-7890
    "Jane, Mary",132 CBS Street,333-111-5332
    "Smith \"Jr.\", Jane",,555-876-1233
    "Lee, Jack",123 Sesame St,""
+10

I suggest the Python csv module:

    #!/usr/bin/env python3
    import csv

    rdr = csv.reader(open('input.csv'), escapechar='\\')
    wtr = csv.writer(open('output.csv', 'w'), escapechar='\\', doublequote=False)

    for row in rdr:
        wtr.writerow(row[0:1] + row[2:4])

output.csv

    John \"Super\" Doe,123 ABC Street,123-456-7890
    "Jane, Mary",132 CBS Street,333-111-5332
    "Smith \"Jr.\", Jane",,555-876-1233
    "Lee, Jack",123 Sesame St,
+2

The following command selects the required fields (for example the first, third and fourth), separated by the delimiter ',', from sample.csv and prints the output to the console:

    cut -f1,3,4 -d ',' sample.csv

If you want to save the output in a new CSV file, redirect the output to a file as shown below:

    cut -f1,3,4 -d ',' sample.csv > newSample.csv
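One caveat with cut: it knows nothing about quoting, so a comma inside a quoted field is still treated as a delimiter. In Python terms it behaves like a plain split(','), as this small illustration shows:

```python
# cut -d',' splits on every comma, exactly like str.split(',') --
# quoting is not understood.
line = '"Jane, Mary","",132 CBS Street,333-111-5332,"F",B'
fields = line.split(',')

# cut -f1,3,4 would select these pieces:
selected = [fields[i] for i in (0, 2, 3)]
print(selected)
# ['"Jane', '""', '132 CBS Street']  <- the quoted comma broke field 1
```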

0

Before posting I now see that this is an old question that has long since been answered. However, I thought I would still take the opportunity to show Tie::Array::CSV, which makes manipulating CSV files as easy as working with Perl arrays. Full disclosure: I am the author.

Anyway, this is a script. OP data required a change to the escape character and Perl index arrays starting at 0, but other than that it should be readable.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Tie::Array::CSV;

    my $opts = { text_csv => { escape_char => '\\' } };

    tie my @input, 'Tie::Array::CSV', 'data', $opts
        or die "Cannot open file 'data': $!";
    tie my @output, 'Tie::Array::CSV', 'out', $opts
        or die "Cannot open file 'out': $!";

    for my $row (@input) {
        my @slice = @{ $row }[0,2,3];
        push @output, \@slice;
    }

That said, I think the final loop does not lose too much readability if I convert it to a (in my opinion) more idiomatic form:

 push @output, [ @{$_}[0,2,3] ] for @input; 
0

csvkit is a tool for processing CSV files that supports such operations (among other functions).

See csvcut. Its command-line interface is compact, and it handles many CSV dialects (TSV, other delimiters, encodings, escape characters, etc.).

What you requested can be done with:

 csvcut --columns 0,2,3 input.csv 
0

I made some mistakes, which I hope are now fixed.

    awk '{sub(/y",""/,"y\42")sub(/,2.|,3./,"")sub(/,".",.*/,"")}1' file

which outputs:

    "John \"Super\" Doe","123 ABC Street",123-456-7890
    "Jane, Mary",132 CBS Street,333-111-5332
    "Smith \"Jr.\", Jane",,555-876-1233
    "Lee, Jack",123 Sesame St,""
0

GNU awk. Just reusing the wheel rather than reinventing it. You can define what a field should look like with FPAT, for example:

 $ awk -vFPAT='[^,]+|"[^"]*"' -vOFS=, '{print $1, $3, $4}' file 

which results in:

    "John \"Super\" Doe","123 ABC Street",123-456-7890
    "Jane, Mary",132 CBS Street,333-111-5332
    "Smith \"Jr.\",35,555-876-1233
    "Lee, Jack",123 Sesame St,""

Regular expression explanation:

    [^,]+       # 1 or more occurrences of anything that is not a comma,
    |           # OR
    "[^"]*"     # 0 or more characters other than '"', enclosed in '"'

Learn more about FPAT in the gawk manual.
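To see why this FPAT stumbles on escaped quotes, the same pattern can be transplanted into Python for experimentation (one caveat to this analogy: Python's re picks the first matching alternative while gawk picks the longest match, but for this line the two agree):

```python
import re

# The FPAT pattern, ported verbatim.
fpat = re.compile(r'[^,]+|"[^"]*"')

line = r'"Smith \"Jr.\", Jane",35,,555-876-1233,"F",'
fields = fpat.findall(line)
print(fields)
# ['"Smith \\"Jr.\\"', ' Jane"', '35', '555-876-1233', '"F"']
# The quoted field was split in two and the empty field vanished,
# matching the wrong output shown above.
```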

Now, walking through your script. It basically tries to reconstruct what your fields should look like. First, it splits on ",", which obviously breaks some fields apart. It then searches for fields that open with a '"' but are not properly closed.

    BEGIN{ OFS = FS = "," }           # set the field separator (FS) and the
                                      # output field separator (OFS) to ","
    /"/{                              # for each line containing '"'
        for(i=1;i<=NF;i++){           # loop through fields 1 to NF
            if($i ~ /^"[^"]+$/){      # IF field $i starts with '"' followed
                                      # only by non-quote characters
                for(x=i+1;x<=NF;x++){ # loop through ALL following fields
                    $i = $i "," $x    # concatenate field $i with the next
                                      # field, separated by ","
                    if($i ~ /"+$/){   # IF field $i now ends with '"'
                        z = x - (i + 1) + 1   # z = number of fields merged
                                              # into $i
                        for(y=i+1;y<=NF;y++)
                            $y = $(y + z)     # shift the following fields
                                              # z steps down the line
                        break         # break out of the for(x) loop
                    }
                }
                NF = NF - z           # reset the number of fields
                i = x                 # continue the for(i) loop at index x
            }
        }
        print $1,$3,$4
    }

Your script trips up on this input line:

 "Smith \"Jr.\", Jane",35,,555-876-1233,"F", 

simply because $i ~ /^"[^"]+$/ fails for $1 (the field contains escaped quotes, so [^"]+ cannot match).

I hope you agree that rewriting fields like that gets difficult. Moreover, it amounts to "oh, I like awk, but I will use it like C/perl/python". Using FPAT is at least a more compact solution.
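An escape-aware field pattern, essentially the one from the Perl answer above, can be sketched in Python like this (note that findall with a one-or-more unquoted branch still skips empty fields; the Perl answer avoids that by anchoring its match to the whole line):

```python
import re

# A quoted field may contain backslash-escaped characters; an unquoted
# field may not contain commas, quotes or backslashes.
field = re.compile(r'"(?:[^"\\]|\\.)*"|[^",\\]+')

line = r'"Smith \"Jr.\", Jane",35,,555-876-1233,"F",'
tokens = field.findall(line)
print(tokens)
# ['"Smith \\"Jr.\\", Jane"', '35', '555-876-1233', '"F"']
# The escaped quotes and the embedded comma are now kept inside one field.
```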

0
