R: splitting data frame factors with NA

I have a data frame (df) imported from the Internet, and I am interested in one of its columns (colname). The elements of colname are read in as a factor. A sample of the column, which also contains NA values, looks like this:

    colname
    57 +0.10
    55
    NA
    57,5 +2.00
    56,5 +0.50
    56,5
    58

I would like to split colname by "+" and get 3 numeric columns, as shown below. Desired Result:

    colname1 colname2 total
       57.00     0.10 57.10
       55.00     0.00 55.00
          NA       NA    NA
       57.50     2.00 59.50
       56.50     0.50 57.00
       56.50     0.00 56.50
       58.00     0.00 58.00

The result should also be a data frame, with all columns numeric. However, I am stuck: no matter what I try, I cannot get the desired result. The errors mainly come from the NA values and the factor type. I would be grateful for any help. Many thanks.

3 answers

I would replace "," with "." using sub (read.table/read.csv also have a dec parameter). Then, using cSplit from splitstackshape, split the column into two, specifying sep as "+". The output will be a data.table. Create the Total column using rowSums. If you want to return NA for rows in which all values are NA, that is also possible (one option is shown in the second solution).

    df$colname <- sub(',', '.', df$colname)
    library(splitstackshape)
    dt <- cSplit(df, 'colname', '+')
    dt[, Total := rowSums(.SD, na.rm = TRUE)][]
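The `dec` parameter mentioned above applies when a whole column uses a decimal comma. A minimal sketch, assuming a made-up one-column input `txt` (not the question's data):

```r
# Hypothetical input in which every value uses a decimal comma
txt <- "value\n57,5\n56,5\n58"

# dec = "," tells read.table to treat the comma as the decimal mark
d <- read.table(text = txt, header = TRUE, dec = ",")

str(d$value)  # numeric column
```

In the question's data the comma and the "+" occur in the same field, so `dec` alone is not enough there; the field still has to be split first.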

Or, using base R, split the column ("colname") with strsplit. The result will be a list. Convert each element from "character" to "numeric", pad with NAs so that all list elements have the same length, and rbind them (df2 <- do.call(...)). Create the "Total" column with rowSums, then set it to NA for the rows where both columns are NA.

    lst <- lapply(strsplit(df$colname, '[+]'), as.numeric)
    df2 <- do.call(rbind.data.frame, lapply(lst, `length<-`, max(sapply(lst, length))))
    names(df2) <- paste0('colname', 1:2)
    df2$Total <- (NA^!rowSums(!is.na(df2))) * rowSums(df2, na.rm = TRUE)
    df2
    #  colname1 colname2 Total
    #1     57.0      0.1  57.1
    #2     55.0       NA  55.0
    #3       NA       NA    NA
    #4     57.5      2.0  59.5
    #5     56.5      0.5  57.0
    #6     56.5       NA  56.5
    #7     58.0       NA  58.0
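The `NA^!rowSums(!is.na(df2))` expression is compact but hard to read. Here is a rough sketch of the same idea in a more explicit form. The names `all_na` and `total` are only illustrative, and `df2` is rebuilt by hand so that the snippet runs on its own:

```r
# Rebuild the two-column result by hand so the snippet is self-contained
df2 <- data.frame(
  colname1 = c(57, 55, NA, 57.5, 56.5, 56.5, 58),
  colname2 = c(0.1, NA, NA, 2, 0.5, NA, NA)
)

# TRUE for rows in which every column is NA
all_na <- rowSums(!is.na(df2)) == 0

# Sum each row ignoring NAs, then restore NA where nothing was observed
total <- rowSums(df2, na.rm = TRUE)
total[all_na] <- NA

df2$Total <- total
df2
```

This takes two more lines than the one-liner, but it makes clear that the only rows set back to NA are those with no observed value at all.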

Or, in this case, you can also use eval(parse(text = ...)), which avoids the step of changing the value 0 to NA:

  df2$Total <- unname(sapply(df$colname, function(x) eval(parse(text=x)))) 

Update

If you need to replace NA with 0 in "colname2":

    df2$colname2[with(df2, is.na(colname2) & !is.na(colname1))] <- 0
    df2
    #  colname1 colname2 Total
    #1     57.0      0.1  57.1
    #2     55.0      0.0  55.0
    #3       NA       NA    NA
    #4     57.5      2.0  59.5
    #5     56.5      0.5  57.0
    #6     56.5      0.0  56.5
    #7     58.0      0.0  58.0

data

    df <- structure(list(colname = structure(c(4L, 1L, NA, 5L, 3L, 2L, 6L),
        .Label = c("55", "56,5", "56,5 +0.50", "57 +0.10", "57,5 +2.00", "58"),
        class = "factor")),
        .Names = "colname", row.names = c(NA, -7L), class = "data.frame")
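Because colname is stored as a factor, it helps to convert it to character before any numeric parsing: calling `as.numeric()` directly on a factor returns the internal integer codes rather than the displayed values. A small sketch with a made-up factor `f` (not part of the data above):

```r
# Hypothetical factor, used only to illustrate the pitfall
f <- factor(c("57", "55", NA, "58"))

as.numeric(f)                # integer codes of the levels, not the values
as.numeric(as.character(f))  # the values themselves
```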

Here is another idea. You can take a step back and use the many arguments of read.table(). Here we can use sep = "+", so that each line is split into columns at the plus sign.

    df <- read.table(text = x,
                     col.names = c("V1", "V2"),
                     colClasses = c(V1 = "numeric", V2 = "character"),
                     dec = ",", skip = 1, fill = TRUE, sep = "+")

So V2 will be a character column with the + signs removed. A few more steps are needed to make that column numeric and to align the NAs. For this we can use

    within(df, {
        V2 <- replace(type.convert(V2), !nzchar(V2), 0)
        is.na(V2) <- is.na(V1)
        V3 <- V1 + V2
    })
    #     V1  V2   V3
    # 1 57.0 0.1 57.1
    # 2 55.0 0.0 55.0
    # 3   NA  NA   NA
    # 4 57.5 2.0 59.5
    # 5 56.5 0.5 57.0
    # 6 56.5 0.0 56.5
    # 7 58.0 0.0 58.0

where x is

 "colname\n57 +0.10\n55\nNA\n57,5 +2.00\n56,5 +0.50\n56,5\n58" 

Update / improvement: you can also do this with fread() and the new tstrsplit() function, available in data.table 1.9.5. This also lets you read the table from a file without first creating a data.frame.

    library(data.table)
    fread(x, sep = "\n")[, tstrsplit(colname, "\\s?[+]", fill = "0")][
      , lapply(.SD, function(x) type.convert(chartr(",", ".", x), as.is = TRUE))][
      , V3 := rowSums(.SD)][]
    #      V1  V2   V3
    # 1: 57.0 0.1 57.1
    # 2: 55.0 0.0 55.0
    # 3:   NA 0.0   NA
    # 4: 57.5 2.0 59.5
    # 5: 56.5 0.5 57.0
    # 6: 56.5 0.0 56.5
    # 7: 58.0 0.0 58.0

Using dplyr and tidyr :

    library(tidyr)
    library(dplyr)

    df %>%
      separate(colname, c("colname1", "colname2"), sep = '[+]',
               extra = "drop", convert = TRUE) %>%
      mutate(colname1 = as.numeric(gsub(",", ".", colname1)),
             colname2 = ifelse(is.na(colname1), NA, ifelse(is.na(colname2), 0, colname2)),
             total = colname1 + colname2)

This gives:

    #  colname1 colname2 total
    #1     57.0      0.1  57.1
    #2     55.0      0.0  55.0
    #3       NA       NA    NA
    #4     57.5      2.0  59.5
    #5     56.5      0.5  57.0
    #6     56.5      0.0  56.5
    #7     58.0      0.0  58.0

So you get 0 instead of NA in colname2 whenever colname1 is not NA (as shown in your desired output).


Source: https://habr.com/ru/post/1212316/

