BSD sed cannot replace all ", 0",

I am a little versed in the differences between GNU sed and BSD sed. Unfortunately, I do not have a Linux machine, only a Mac is available.

I have a large data file in csv format, separated by a comma. Even the first two lines of the file are too large to publish here, you can find the first two lines here .

I need to replace the values ​​"0.8.9", -999, -999.0 with "NA", because these values ​​are codes for missing values.

I used the following sed command in a bash prompt

sed -e 's/\-999\.?\0?/NA/g' \ -e 's/\-999/NA/g' \ -e 's/,9,/,NA,/g' \ -e 's/,8,/,NA,/g' \ -e 's/,0,/,NA,/g' \ firsttwolines.csv 

The result looks great, but there is another 0 . How to fix it? And how to put it in a bash script? Is there a better way to accomplish this task?

+4
source share
3 answers

I think awk would be the best choice for this task (tested with gawk and nawk):

 awk '{ for(i=1; i<=NF; i++) if($i == 0 || $i == 8 || $i == 9 || $i == -999) $i = "NA" } 1' FS=, OFS=, firsttwolines.csv 

During the loop, it goes through each field and checks it for equality with the NA list, if it is found true, the field is replaced by NA , note that awk converts the field to a number before testing. 1 at the end calls the default block { print $0 } .

0
source

In my experience, when dealing with the CSV file format, text scanning tools such as sed, awk are poorly equipped to handle all angular cases. I know that you specifically asked for solutions in sed, but this does not work well. I recommend a language that offers robust processing of CSV files like Python or Tcl (there are more, but this is what I know). Here is the solution in Python:

 # csvreplace.py import sys import csv if __name__ == '__main__': infilename = sys.argv[1] outfilename = sys.argv[2] with open(infilename) as infile, open(outfilename, 'w') as outfile: csvreader = csv.reader(infile) csvwriter = csv.writer(outfile) na_list = ['0', '8', '9', '-999', '-999.0'] for row in csvreader: row = [col in na_list and 'NA' or col for col in row] csvwriter.writerow(row) 

You can use it in your bash script like:

 python csvreplace.py data.csv out.csv 
+1
source

With sed, try:

 sed -e 's/\-999\(\.0\)*/NA/g' -e :a -e 's/,[089],/,NA,/; ta' file 
0
source

All Articles