Matching CSV fields by name using awk

Suppose I have a CSV file with the headers of the following form:

    Field1,Field2
    3,262000
    4,449000
    5,650000
    6,853000
    7,1061000
    8,1263000
    9,1473000
    10,1683000
    11,1893000

I would like to write an awk script that takes a comma-delimited list of target field names, splits it into an array, and prints only the columns whose names I specify.

This is what I have tried so far. I have verified that the head array contains the desired headers and that the targets array contains the targets passed on the command line.

    BEGIN{
        FS=","
        split(target, targets, ",")
    }
    NR==1 {
        for (i = 1; i <= NF; i++) head[i] = $i
    }
    NR !=1{
        for (i = 1; i <= NF; i++) {
            if (head[i] in targets){
                print $i
            }
        }
    }

When I invoke this script with the command

    awk -v target=Field1 -f GetCol.awk Debug.csv

nothing is printed.

3 answers

I figured this out and am posting an answer in case others face the same problem.

The cause is the in keyword, which I used to test array membership. It checks whether the operand on the left is one of the indices of the array on the right, not one of its values. Since split stores the target names as values under the indices 1, 2, ..., the test never succeeds. The fix is to create a reverse lookup array, as shown below.

    BEGIN{
        OFS=FS=","
        split(target, t_targets, ",")
        for (i in t_targets)
            targets[t_targets[i]] = i
    }
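
The index-versus-value behavior of in can be checked in isolation. The following is only a minimal sketch; the array names t and idx and the shell variable out are illustrative and not part of the script above.

```shell
out=$(awk 'BEGIN {
    n = split("Field1,Field2", t, ",")         # t[1]="Field1", t[2]="Field2"
    print (("Field1" in t) ? "yes" : "no")     # "in" looks at the indices 1 and 2
    print ((1 in t) ? "yes" : "no")            # 1 is an index of t
    for (i = 1; i <= n; i++)
        idx[t[i]] = i                          # invert: name -> position
    print (("Field1" in idx) ? "yes" : "no")   # the name is now an index
}')
printf '%s\n' "$out"
```

This prints no, yes, yes: the name is only found once it has become an index of the inverted array.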

My two cents:

    BEGIN{
        OFS=FS=","
        split(target,fields,FS)         # We just set FS; don't hard-code the comma here
        for (i in fields)               # Distinct var name to avoid headaches
            field_idx[fields[i]] = i    # Reverse lookup
    }
    NR==1 {                             # Process header
        for (i=1;i<=NF;i++)             # For each field header
            head[i] = $i                # Add to hash for comparison with target
        next                            # Skip to next line
    }
    {                                   # Don't need invert condition (used next)
        sep=""                          # Set for leading separator
        for (i=1;i<=NF;i++)             # For each field
            if (head[i] in field_idx) { # Test for current field is a target field
                printf "%s%s",sep,$i    # Print the column if matched
                sep=OFS                 # Set separator to OFS
            }
        printf "\n"                     # Print newline character
    }
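
A quick way to try this approach is to wrap the same logic in a shell function and feed it a few rows from the question. This is a sketch under assumptions: the function name select_cols is hypothetical, and the data is shortened to two rows.

```shell
# select_cols is a hypothetical wrapper; $1 is the comma-delimited target list.
select_cols() {
    awk -v target="$1" '
    BEGIN {
        OFS = FS = ","
        split(target, fields, FS)
        for (i in fields) field_idx[fields[i]] = i   # reverse lookup: name -> position
    }
    NR == 1 { for (i = 1; i <= NF; i++) head[i] = $i; next }
    {
        sep = ""
        for (i = 1; i <= NF; i++)
            if (head[i] in field_idx) { printf "%s%s", sep, $i; sep = OFS }
        printf "\n"
    }'
}

# Select one column, then both columns with the targets given in reverse order.
printf '%s\n' 'Field1,Field2' '3,262000' '4,449000' | select_cols Field2
printf '%s\n' 'Field1,Field2' '3,262000' '4,449000' | select_cols Field2,Field1
```

The first call prints 262000 and 449000. The second prints 3,262000 and 4,449000: because the loop walks the fields of each record, columns come out in file order, not in the order the targets were listed.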

An extension of @Sudo_O's solution (thanks) that

  • displays fields from standard input based on command line arguments,
  • displays fields in the requested order (possibly several times),
  • displays the field name as a placeholder when a requested field is not found, and
  • warns on standard error about duplicate field names in the header.

    #!/usr/bin/awk -f
    # Process standard input outputting named columns provided as arguments.
    #
    # For example, given foo.dat containing
    #     a  b  c  c
    #     1a 1b 1c 1C
    #     2a 2b 2c 2C
    #     3a 3b 3c 3C
    # Running
    #     cat foo.dat | ./namedcols c b a a d
    # will output
    #     1c 1b 1a 1a d
    #     2c 2b 2a 2a d
    #     3c 3b 3a 3a d
    # and will warn on standard error that it
    #     Ignored duplicate 'c' in column 4
    # Notice that the requested but missing column d contains "d".
    #
    # Using awk -F feature it is possible to parse comma-separated data:
    #     cat foo.csv | ./namedcols -F, c b a a d
    BEGIN {
        for (i=1; i<ARGC; ++i)
            desired[i] = ARGV[i]
        delete ARGV
    }
    NR==1 {
        for (i=1; i<=NF; i++)
            if ($i in names)
                printf "Ignored duplicate '%s' in column %d\n", $i, i | "cat 1>&2"
            else
                names[$i] = i
        next
    }
    {
        for (i=1; i<ARGC; ++i)
            printf "%s%s",                                          \
                   (i==1 ? "" : OFS),                               \
                   ((ndx = names[name = desired[i]])>0 ? $ndx: name)
        printf RS
    }