Split row in each column for multiple columns

I have this table (data1) with four columns:

    SNP      rs6576700  rs17054099  rs7730126
    sample1  G-G        T-T         G-G

I need to split columns 2 to 4 into two columns each, so that the new output has 7 columns, like this:

    SNP      rs6576700  rs6576700  rs17054099  rs17054099  rs7730126  rs7730126
    sample1  G          G          T           T           C          C

With the following function I can split all the columns at once, but the result is not in the shape I need.

    split <- function(x){
      x <- as.character(x)
      strsplit(as.character(x), split="-")
    }
    data2=apply(data1[,-1], 2, split)
    data2
    $rs17054099
    $rs17054099[[1]]
    [1] "T" "T"

    $rs7730126
    $rs7730126[[1]]
    [1] "G" "G"

    $rs6576700
    $rs6576700[[1]]
    [1] "C" "C"

On Stack Overflow I found a method for converting strsplit output to a data frame, but it puts the rs numbers in rows rather than in columns. (I got similar output with the other methods in the thread "strsplit by row and distribute results by column in data.frame".)

    > n <- max(sapply(data2, length))
    > l <- lapply(data2, function(X) c(X, rep(NA, n - length(X))))
    > data.frame(t(do.call(cbind, l)))
               t.do.call.cbind..l..
    rs17054099                 T, T
    rs7730126                  G, G
    rs2061700                  C, C

If I leave out the transpose, t(do.call(...)), the output is a list that I cannot write to a file.

I would like to have a solution in R to make it part of the pipeline.

I forgot to say that I need to apply this to a million columns.

2 answers

This is straightforward with splitstackshape::cSplit. Pass the column indices in the splitCols parameter and the separator in the sep parameter, and you are done. It also numbers the new column names so you can tell them apart. I set type.convert = FALSE so that T values are not converted to TRUE. The default direction is "wide", so you do not need to specify it.

    library(splitstackshape)
    cSplit(data1, 2:4, sep = "-", type.convert = FALSE)
    #        SNP rs6576700_1 rs6576700_2 rs17054099_1 rs17054099_2 rs7730126_1 rs7730126_2
    # 1: sample1           G           G            T            T           G           G

Here is a solution along the lines of the linked thread, using the tstrsplit function from the GitHub version of data.table. We first define the index by subsetting the column names, and then number the new columns with paste0. This approach is slightly more cumbersome, but its advantage is that it updates the original data in place instead of creating a copy of all the data.

    library(data.table) ## V1.9.5+
    indx <- names(data1)[2:4]
    setDT(data1)[, paste0(rep(indx, each = 2), 1:2) := sapply(.SD, tstrsplit, "-"), .SDcols = indx]
    data1
    #        SNP rs6576700 rs17054099 rs7730126 rs65767001 rs65767002 rs170540991 rs170540992 rs77301261 rs77301262
    # 1: sample1       G-G        T-T       G-G          G          G           T           T          G          G
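If the unsplit genotype columns are not needed afterwards, a variant of the same idea can drop them once the allele columns exist. The following is only a sketch: the two-row data1 and the `_1`/`_2` suffixes are assumptions made for illustration, and it copies the data with as.data.table() rather than converting it by reference.

```r
library(data.table)

# Hypothetical example data; the "-" separator is taken from the question's code.
data1 <- data.frame(
  SNP = c("sample1", "sample2"),
  rs6576700 = c("G-G", "C-C"),
  rs17054099 = c("T-T", "T-T"),
  rs7730126 = c("G-G", "G-C"),
  stringsAsFactors = FALSE
)

dt <- as.data.table(data1)
indx <- names(dt)[2:4]
new_cols <- paste0(rep(indx, each = 2), "_", 1:2)

# tstrsplit() returns one vector per allele; flatten the per-column
# results into a single list of six vectors before assigning them.
dt[, (new_cols) := unlist(lapply(.SD, tstrsplit, "-", fixed = TRUE),
                          recursive = FALSE),
   .SDcols = indx]

# Remove the original, unsplit genotype columns.
dt[, (indx) := NULL]
```

This assumes every cell holds exactly two alleles separated by "-"; cells with a different number of pieces would need extra handling.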

Here you want to apply the function row by row instead of column by column:

    df <- rbind(c("SNP", "rs6576700", "rs17054099", "rs7730126"),
                c("sample1", "G-G", "T-T", "G-G"),
                c("sample2", "C-C", "T-T", "G-C"))
    t(apply(df[-1,], 1, function(row_vals) unlist(strsplit(row_vals, "-"))))
    #     [,1]      [,2] [,3] [,4] [,5] [,6] [,7]
    #[1,] "sample1" "G"  "G"  "T"  "T"  "G"  "G"
    #[2,] "sample2" "C"  "C"  "T"  "T"  "G"  "C"
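The matrix above has no column names. If the input is a data frame, as data1 is in the question, the same splitting can be done column by column in base R and the rs numbers kept as labels. This is a minimal sketch: the example data and the `_1`/`_2` suffixes are assumptions, and it expects exactly two alleles in every cell.

```r
# Hypothetical example data; the "-" separator is taken from the question's code.
data1 <- data.frame(
  SNP = c("sample1", "sample2"),
  rs6576700 = c("G-G", "C-C"),
  rs17054099 = c("T-T", "T-T"),
  rs7730126 = c("G-G", "G-C"),
  stringsAsFactors = FALSE
)

snp_cols <- names(data1)[-1]

# Split every genotype column into a two-column character matrix.
pieces <- lapply(data1[snp_cols], function(x) {
  do.call(rbind, strsplit(as.character(x), "-", fixed = TRUE))
})

# Bind the matrices side by side and label each allele column.
alleles <- do.call(cbind, pieces)
colnames(alleles) <- paste0(rep(snp_cols, each = 2), "_", 1:2)

result <- data.frame(SNP = data1$SNP, alleles, stringsAsFactors = FALSE)
```

The result is an ordinary data frame with seven columns, so it can be written out with write.table() like any other.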