Background. I want to extract specific columns from a CSV file. The file is comma-separated, uses double quotes as a text qualifier (optional, but a field that contains special characters is always quoted; see the example), and uses a backslash as the escape character. Some fields may be empty.
Example input and desired output. For example, I want columns 1, 3, and 4 to appear in the output file. The extracted columns must keep exactly the format they have in the source file: no escape characters removed, no extra quotes added, and so on.
Input

"John \"Super\" Doe",25,"123 ABC Street",123-456-7890,"M",A
"Jane, Mary","",132 CBS Street,333-111-5332,"F",B
"Smith \"Jr.\", Jane",35,,555-876-1233,"F",
"Lee, Jack",22,123 Sesame St,"","M",D
Desired output

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""
Preliminary script (awk). Below is a preliminary script that I found. It works for the most part, but it fails in one specific case that I have noticed, and possibly in others that I have not seen or thought of yet.
#!/usr/xpg4/bin/awk -f

BEGIN{  OFS = FS = ","  }

/"/{
    for(i=1;i<=NF;i++){
        if($i ~ /^"[^"]+$/){
            for(x=i+1;x<=NF;x++){
                $i=$i","$x
                if($i ~ /"+$/){
                    z = x - (i + 1) + 1
                    for(y=i+1;y<=NF;y++)
                        $y = $(y + z)
                    break
                }
            }
            NF = NF - z
            i=x
        }
    }
    print $1,$3,$4
}
The script above seems to work well until it hits a field that contains both escaped double quotes and a comma. In that case the parsing breaks and the output is incorrect.
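To make the failing case concrete, here is a small diagnostic sketch. It is not a fix. The helper name show_fields is my own, hypothetical, and is not part of the script above. It splits a record on plain commas, as the script does, and reports for each field whether it passes the `/^"[^"]+$/` test, which as far as I can tell is the test the script uses to decide when to start gluing fields back together.

```sh
#!/bin/sh
# Diagnostic only: show how awk splits a record when FS is a plain comma,
# and whether each field passes the script's "opening quote" test.
# show_fields is a hypothetical helper name, not part of the original script.
show_fields() {
    awk -F',' '{
        for (i = 1; i <= NF; i++) {
            # Same regex as in the script: a leading quote followed only
            # by non-quote characters up to the end of the field.
            verdict = ($i ~ /^"[^"]+$/) ? "merge" : "no merge"
            printf "$%d=[%s] %s\n", i, $i, verdict
        }
    }'
}

# A record the script handles (comma inside quotes, no escaped quotes).
printf '%s\n' '"Jane, Mary","",132 CBS Street,333-111-5332,"F",B' | show_fields

# The record that breaks (comma and escaped quotes in the same field).
printf '%s\n' '"Smith \"Jr.\", Jane",35,,555-876-1233,"F",' | show_fields
```

For the "Jane, Mary" record the first piece is `"Jane`, which passes the test, so the pieces get merged. For the "Smith" record the first piece is `"Smith \"Jr.\"`, which contains further quotes after the leading one, so the test fails, nothing is merged, and every later column is shifted by one.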
Question / comments. I have read that awk is not the best tool for parsing CSV files and that Perl is usually suggested instead. However, I do not know Perl. I found some example Perl scripts, but they do not produce the result I am looking for, and I do not know how easily they could be adapted to what I want.
As for awk, I am familiar with it and occasionally use its basic features, but I do not know many of the advanced ones, such as some of the constructs used in the script above. Is my desired result possible using awk alone? If so, can the script above be modified to fix the problem I am facing? Could someone explain, line by line, what exactly the script does?
Any help would be appreciated, thanks!