Importing CSV data containing decimal commas, thousands separators, and trailing minus signs

R 2.13.1 on Mac OS X. I'm trying to import a data file that uses a dot as the thousands separator, a comma as the decimal point, and a trailing minus for negative values.

Basically, I am trying to convert from:

"A|324,80|1.324,80|35,80-" 

to

      V1     V2     V3     V4
    1  A 324.80 1324.8 -35.80

In interactive mode, both of the following substitutions work:

    gsub("\\.","","1.324,80")
    [1] "1324,80"
    gsub("(.+)-$","-\\1", "35,80-")
    [1] "-35,80"

as well as their combination:

    gsub("\\.", "", gsub("(.+)-$","-\\1","1.324,80-"))
    [1] "-1324,80"

However, I cannot get the thousands separator removed when the same conversion runs inside read.table:

    setClass("num.with.commas")
    setAs("character", "num.with.commas",
          function(from) as.numeric(gsub("\\.", "", sub("(.+)-$","-\\1",from))) )
    mydata <- "A|324,80|1.324,80|35,80-"
    mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="",
                          sep="|", dec=",", skip=0, fill=FALSE, strip.white=TRUE,
                          colClasses=c("character","num.with.commas",
                                       "num.with.commas", "num.with.commas"))
    Warning messages:
    1: In asMethod(object) : NAs introduced by coercion
    2: In asMethod(object) : NAs introduced by coercion
    3: In asMethod(object) : NAs introduced by coercion
    mytable
      V1 V2 V3 V4
    1  A NA NA NA

Note that if I change "\\." to "," in the function, the result looks a bit different:

    setAs("character", "num.with.commas",
          function(from) as.numeric(gsub(",", "", sub("(.+)-$","-\\1",from))) )
    mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="",
                          sep="|", dec=",", skip=0, fill=FALSE, strip.white=TRUE,
                          colClasses=c("character","num.with.commas",
                                       "num.with.commas", "num.with.commas"))
    mytable
      V1    V2     V3    V4
    1  A 32480 1.3248 -3580

I think the problem is that read.table with dec="," converts the incoming "," to "." BEFORE calling as(from, "num.with.commas"), so that the input string can be, for example, "1.324.80".

I want as("1.123.80-", "num.with.commas") to return -1123.80 and as("1.100.123,80", "num.with.commas") to return 1100123.80.

How can I make my num.with.commas remove every dot except the last one (the decimal point) in the input string?

Update: First, I added a negative lookahead and got as() to work in the console:

    setAs("character", "num.with.commas",
          function(from) as.numeric(gsub("(?!\\.\\d\\d$)\\.", "",
                                         gsub("(.+)-$","-\\1",from), perl=TRUE)) )
    as("1.210.123.80-","num.with.commas")
    [1] -1210124
    as("10.123.80-","num.with.commas")
    [1] -10123.8
    as("10.123.80","num.with.commas")
    [1] 10123.8
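
As a rough illustration of what the lookahead does, the sketch below applies the same pattern to a few dot-only strings. The helper name strip_all_but_last_dot is made up for this example. It also suggests that the -1210124 shown above is only a display effect of R's default of 7 significant digits, not a loss of the decimals.

```r
# Hypothetical helper (name invented for this sketch): delete every dot
# except one that is followed by exactly two digits at the end of the string.
strip_all_but_last_dot <- function(x) {
  gsub("(?!\\.\\d\\d$)\\.", "", x, perl = TRUE)
}

stripped <- strip_all_but_last_dot(c("1.210.123.80", "10.123.80", "324.80"))
print(stripped)   # "1210123.80" "10123.80" "324.80"

x <- as.numeric(strip_all_but_last_dot("1.210.123.80"))
print(x)               # shown with 7 significant digits: 1210124
print(x, digits = 10)  # the decimals are still there: 1210123.8
```

One consequence of this pattern is that a value without exactly two decimals, such as "1.234", loses its only dot and becomes 1234.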

However, read.table still had the same problem. Adding some print() calls to my function showed that num.with.commas actually receives the comma, not the period.
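
A minimal sketch of that check, assuming a throwaway class (raw.probe and the variable seen are hypothetical names used only here): the coercion method simply records the strings it is handed. In this sketch the recorded values still contain the decimal comma, which suggests that dec="," is not applied to columns that have a custom colClasses entry.

```r
library(methods)

# Hypothetical probe class: its coercion method records its input unchanged.
seen <- character(0)
setClass("raw.probe")
setAs("character", "raw.probe", function(from) {
  seen <<- c(seen, from)   # remember the strings exactly as received
  from                     # hand them back untouched
})

probe <- read.table(text = "A|324,80|1.324,80|35,80-", header = FALSE,
                    quote = "", comment.char = "", sep = "|", dec = ",",
                    colClasses = c("character", rep("raw.probe", 3)))

print(seen)   # the raw fields, commas and trailing minus included
```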

So my current solution is to replace "," with "." inside num.with.commas:

    setAs("character", "num.with.commas",
          function(from) as.numeric(gsub(",","\\.",
                                         gsub("(?!\\.\\d\\d$)\\.", "",
                                              gsub("(.+)-$","-\\1",from), perl=TRUE))) )
    mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="",
                          sep="|", dec=",", skip=0, fill=FALSE, strip.white=TRUE,
                          colClasses=c("character","num.with.commas",
                                       "num.with.commas", "num.with.commas"))
    mytable
      V1    V2      V3    V4
    1  A 324.8 1101325 -35.8
2 answers

You should first remove all the periods and then change the commas to decimal points before coercing with as.numeric(). You can later control how decimal points are printed with options(OutDec=","). I do not think R uses commas as decimal separators internally, even in locales where they are conventional.

    > tst <- c("A","324,80","1.324,80","35,80-")
    >
    > as.numeric( sub("\\,", ".", sub("(.+)-$","-\\1", gsub("\\.", "", tst)) ) )
    [1]     NA  324.8 1324.8  -35.8
    Warning message:
    NAs introduced by coercion
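
The same order of operations can be plugged back into the colClasses approach from the question. This is only a sketch under a few assumptions: the helper parse_euro_number and the class num.euro are hypothetical names, and dec="," is left out on purpose because the coercion method does the comma handling itself.

```r
library(methods)

# Hypothetical helper: turns strings such as "1.324,80-" into -1324.8.
parse_euro_number <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)   # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE)   # decimal comma -> decimal point
  x <- sub("^(.+)-$", "-\\1", x)        # trailing minus -> leading minus
  as.numeric(x)
}

# Hypothetical class name, used only as a colClasses label.
setClass("num.euro")
setAs("character", "num.euro", function(from) parse_euro_number(from))

mydata <- "A|324,80|1.324,80|35,80-"
mytable <- read.table(text = mydata, header = FALSE, sep = "|",
                      quote = "", comment.char = "", strip.white = TRUE,
                      colClasses = c("character", rep("num.euro", 3)))
print(mytable)

# Printing with a decimal comma is a display setting only.
old <- options(OutDec = ",")
print(mytable$V3)
options(old)   # restore the previous setting
```

Keeping the parsing in a plain function makes it easy to try on single strings before involving read.table.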

Here is a solution using regular expressions and substitutions:

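    mydata <- "A|324,80|1.324,80|35,80-"
    # Split data on the pipe separator
    mydata2 <- strsplit(mydata,"|",fixed=TRUE)[[1]]
    # Remove thousands separators (dots), then turn decimal commas into dots
    mydata3 <- sub(",",".",gsub(".","",mydata2,fixed=TRUE),fixed=TRUE)
    # Move negatives to front of string
    mydata4 <- gsub("^(.+)-$","-\\1",mydata3)
    # Convert the numeric fields; the label mydata4[1] is kept separate,
    # because c() would coerce the numbers back to character
    mydata.cleaned <- as.numeric(mydata4[2:4])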