Split the strings in each column of a data frame into separate columns

I have a matrix (1000 x 2830):

        9178 3574 3547
    160  B_B  B_B  A_A
    301  B_B  A_B  A_B
    303  B_B  B_B  A_A
    311  A_B  A_B  A_A
    312  B_B  A_B  A_A
    314  B_B  A_B  A_A

and I want to get the following (duplicate columns and split each element of each column):

        9178 9178 3574 3574 3547 3547
    160    B    B    B    B    A    A
    301    B    B    A    B    A    B
    303    B    B    B    B    A    A
    311    A    B    A    B    A    A
    312    B    B    A    B    A    A
    314    B    B    A    B    A    A

I tried using strsplit but got error messages because the input is a matrix, not a string. Could you suggest some ideas for solving this problem?

+7
split r dataframe
4 answers

This option uses dplyr (for bind_cols) and tidyr (for separate_) together with lapply from base R. It assumes your data is a data.frame, so you may need to convert the matrix to a data.frame first:

    library(dplyr)
    library(tidyr)

    lapply(names(df), function(x) separate_(df[x], x, paste0(x, "_", 1:2), sep = "_")) %>%
      bind_cols
    #   X9178_1 X9178_2 X3574_1 X3574_2 X3547_1 X3547_2
    # 1       B       B       B       B       A       A
    # 2       B       B       A       B       A       B
    # 3       B       B       B       B       A       A
    # 4       A       B       A       B       A       A
    # 5       B       B       A       B       A       A
    # 6       B       B       A       B       A       A
+7

I am biased, but I would recommend using cSplit from my splitstackshape package. Since you have rownames in your input, use as.data.table(., keep.rownames = TRUE):

    library(splitstackshape)

    cSplit(as.data.table(mydf, keep.rownames = TRUE), names(mydf), "_")
    #     rn X9178_1 X9178_2 X3574_1 X3574_2 X3547_1 X3547_2
    # 1: 160       B       B       B       B       A       A
    # 2: 301       B       B       A       B       A       B
    # 3: 303       B       B       B       B       A       A
    # 4: 311       A       B       A       B       A       A
    # 5: 312       B       B       A       B       A       A
    # 6: 314       B       B       A       B       A       A

A less convenient alternative to cSplit, though likely a faster one, is stri_split_fixed from "stringi", for example:

    library(stringi)

    `dimnames<-`(
      do.call(cbind, lapply(mydf, stri_split_fixed, "_", simplify = TRUE)),
      list(rownames(mydf), rep(colnames(mydf), each = 2)))
    #     X9178 X9178 X3574 X3574 X3547 X3547
    # 160 "B"   "B"   "B"   "B"   "A"   "A"
    # 301 "B"   "B"   "A"   "B"   "A"   "B"
    # 303 "B"   "B"   "B"   "B"   "A"   "A"
    # 311 "A"   "B"   "A"   "B"   "A"   "A"
    # 312 "B"   "B"   "A"   "B"   "A"   "A"
    # 314 "B"   "B"   "A"   "B"   "A"   "A"

If speed matters, I would suggest checking out the "iotools" package, especially the mstrsplit function. The approach is similar to the stringi one:

    library(iotools)

    `dimnames<-`(
      do.call(cbind, lapply(mydf, mstrsplit, "_", ncol = 2, type = "character")),
      list(rownames(mydf), rep(colnames(mydf), each = 2)))

You may need to add lapply(mydf, as.character) if you forgot to use stringsAsFactors = FALSE when converting from matrix to data.frame, but it should still beat even the stri_split approach.
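The conversion step described above might look like the following minimal sketch in base R. The names mymat and mydf_fac and the 2 x 2 sample are hypothetical and only illustrate the two ways of ending up with character columns.

```r
# Hypothetical 2 x 2 sample in the same "X_Y" format as the question.
mymat <- matrix(c("B_B", "A_B", "A_A", "A_B"), nrow = 2,
                dimnames = list(c("160", "301"), c("9178", "3574")))

# Option 1: keep the strings as character while converting.
mydf <- as.data.frame(mymat, stringsAsFactors = FALSE)

# Option 2: the columns are already factors, so convert them back.
# Assigning into mydf_fac[] keeps the data.frame structure and row names.
mydf_fac <- as.data.frame(mymat, stringsAsFactors = TRUE)
mydf_fac[] <- lapply(mydf_fac, as.character)
```

Either way, every column is a character vector before it is passed to the splitting function.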

+6

Here is something you can do, although it looks a little "twisted" (yourmat is your matrix):

    # split every cell on "_" and collect the halves in a two-column data frame
    inter <- data.frame(
      t(sapply(as.vector(yourmat), function(x) {
        strsplit(x, "_")[[1]]
      })),
      row.names = paste0(rep(colnames(yourmat), each = nrow(yourmat)), 1:nrow(yourmat)),
      stringsAsFactors = FALSE)

    # regroup the rows by original column name and bind the groups side by side
    res <- do.call("cbind",
                   split(inter, factor(substr(row.names(inter), 1, 4),
                                       levels = colnames(yourmat))))
    res
    #       9178.X1 9178.X2 3574.X1 3574.X2 3547.X1 3547.X2
    # 91781       B       B       B       B       A       A
    # 91782       B       B       A       B       A       B
    # 91783       B       B       B       B       A       A
    # 91784       A       B       A       B       A       A
    # 91785       B       B       A       B       A       A
    # 91786       B       B       A       B       A       A

Edit
If you want the row.names of res to be the same as in yourmat, you can do:

 row.names(res)<-row.names(yourmat) 

NB: If yourmat is a data.frame instead of a matrix, the as.vector call in the first line should be changed to unlist.
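To try this on the question's sample, the six rows shown there can be rebuilt as a character matrix. This is only a sketch: the object name yourmat follows the answer, and storing the row and column labels as dimnames is an assumption about how the original data was held.

```r
# Rebuild the question's 6 x 3 sample as a character matrix (row-wise input).
yourmat <- matrix(
  c("B_B", "B_B", "A_A",
    "B_B", "A_B", "A_B",
    "B_B", "B_B", "A_A",
    "A_B", "A_B", "A_A",
    "B_B", "A_B", "A_A",
    "B_B", "A_B", "A_A"),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("160", "301", "303", "311", "312", "314"),
                  c("9178", "3574", "3547")))

# Splitting a single cell shows the two halves each element should yield.
parts <- strsplit(yourmat["301", "3574"], "_")[[1]]
parts
# [1] "A" "B"
```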

+4

A base R solution that does not use data frames:

    # split
    z <- unlist(strsplit(m,'_'))
    M <- matrix(c(z[c(T,F)],z[c(F,T)]),nrow=nrow(m))

    # properly order columns
    i <- 1:ncol(M)
    M <- M[,order(c(i[c(T,F)],i[c(F,T)]))]

    # set dimnames
    rownames(M) <- rownames(m)
    colnames(M) <- rep(colnames(m),each=2)

    #     9178 9178 3574 3574 3547 3547
    # 160 "B"  "B"  "A"  "B"  "B"  "A"
    # 301 "B"  "A"  "A"  "B"  "B"  "B"
    # 303 "B"  "B"  "A"  "B"  "B"  "A"
    # 311 "A"  "A"  "A"  "B"  "B"  "A"
    # 312 "B"  "A"  "A"  "B"  "B"  "A"
    # 314 "B"  "A"  "A"  "B"  "B"  "A"
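The column-reordering step is the least obvious part of the code above, so here is a minimal sketch of what order() does there, assuming three original columns. The labels L1..L3 and R1..R3 are hypothetical stand-ins for the left and right halves of each column.

```r
# After the split, the left halves come first and the right halves last.
halves <- c("L1", "L2", "L3", "R1", "R2", "R3")

i <- 1:6
odd_even <- c(i[c(TRUE, FALSE)], i[c(FALSE, TRUE)])  # 1 3 5 2 4 6
idx <- order(odd_even)                                # 1 4 2 5 3 6

# Indexing with idx interleaves each left half with its right half.
halves[idx]
# [1] "L1" "R1" "L2" "R2" "L3" "R3"
```

Sending the odd positions to the left block and the even positions to the right block, and then taking order() of that arrangement, gives the permutation that puts each pair back side by side.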

[Update] Here is a small benchmark of the proposed solutions (I left the cSplit solution out of the run because it was too slow):

Setup:

    m <- matrix('A_B',nrow=1000,ncol=2830)
    d <- as.data.frame(m, stringsAsFactors = FALSE)

    #####
    f.mtrx <- function(m) {
      z <- unlist(strsplit(m,'_'))
      M <- matrix(c(z[c(T,F)],z[c(F,T)]),nrow=nrow(m))
      # properly order columns
      i <- 1:ncol(M)
      M <- M[,order(c(i[c(T,F)],i[c(F,T)]))]
      # set dimnames
      rownames(M) <- rownames(m)
      colnames(M) <- rep(colnames(m),each=2)
      M
    }

    library(stringi)
    f.mtrx2 <- function(m) {
      z <- unlist(stri_split_fixed(m,'_'))
      M <- matrix(c(z[c(T,F)],z[c(F,T)]),nrow=nrow(m))
      # properly order columns
      i <- 1:ncol(M)
      M <- M[,order(c(i[c(T,F)],i[c(F,T)]))]
      # set dimnames
      rownames(M) <- rownames(m)
      colnames(M) <- rep(colnames(m),each=2)
      M
    }

    #####
    library(splitstackshape)
    f.cSplit <- function(mydf) cSplit(as.data.table(mydf, keep.rownames = TRUE), names(mydf), "_")

    #####
    library(stringi)
    f.stringi <- function(mydf) `dimnames<-`(
      do.call(cbind, lapply(mydf, stri_split_fixed, "_", simplify = TRUE)),
      list(rownames(mydf), rep(colnames(mydf), each = 2)))

    #####
    library(dplyr)
    library(tidyr)
    f.dplyr <- function(df) lapply(names(df), function(x)
      separate_(df[x], x, paste0(x,"_",1:2), sep = "_" )) %>% bind_cols

    #####
    library(iotools)
    f.mstrsplit <- function(mydf) `dimnames<-`(
      do.call(cbind, lapply(mydf, mstrsplit, "_", ncol = 2, type = "character")),
      list(rownames(mydf), rep(colnames(mydf), each = 2)))

    #####
    library(rbenchmark)
    benchmark(f.mtrx(m), f.mtrx2(m), f.dplyr(d), f.stringi(d), f.mstrsplit(d), replications = 10)

Results:

                test replications elapsed relative user.self sys.self user.child sys.child
    3     f.dplyr(d)           10  27.722   10.162    27.360    0.269          0         0
    5 f.mstrsplit(d)           10   2.728    1.000     2.607    0.098          0         0
    1      f.mtrx(m)           10  37.943   13.909    34.885    0.799          0         0
    2     f.mtrx2(m)           10  15.176    5.563    13.936    0.802          0         0
    4   f.stringi(d)           10   8.107    2.972     7.815    0.247          0         0

In the updated test, the winner is f.mstrsplit: it is about 3 times faster than f.stringi and about 14 times faster than the base R f.mtrx.

+2
