Creating new columns by splitting a variable into many variables (in R)

I want to create new columns by splitting one column of a data frame.

I have a data frame like this:

    YEAR Variable1 Variable2
    2009 000000    00000001
    2010 000000    00000001
    2011 000000    00000001
    2009 000000    00000002
    2010 000000    00000002
    2009 000000    00000003
    ...
    2009 100000    10000001
    2010 100000    10000001
    ...
    2009 100000    10000011
    ....

As you can see, Variable2 is built from Variable1 (Variable2 = Variable1 + two trailing digits, for example 01, 02, 03, which identify subcategories). I want to split Variable2 into several variables, one per subcategory. The result should be:

    YEAR Variable1 Variable2 Variable3 Variable4 ...
    2009 000000    00000001  0         0
    2010 000000    00000001  0         0
    2011 000000    00000001  0         0
    2009 000000    0         00000002  0
    2010 000000    0         00000002  0
    2009 000000    0         0         00000003
    ...
    2009 100000    10000001  0         0
    2010 100000    10000001  0         0
    ...
    2009 100000    0         0         0         ... 10000011

How would you do this? I thought about recoding Variable2 in a loop and tried manipulating the strings, but I could not solve the problem.

+7
string split r dataframe
6 answers

This will work. First, create the data.

    library(data.table)
    values <- paste0("0000000", 1:4)
    dt <- data.table(val = sample(values, 10, replace = TRUE))

Then a for loop is enough to define the new columns.

    # one new column per distinct value: the value where it matches, 0 elsewhere
    for (level_var in dt[, unique(val)]) {
      dt[, eval(level_var) := ifelse(val == level_var, level_var, 0)]
    }
+4

Using reshape2, this is a one-line solution, plus one more line if we want to replace the NA values with 0.

    library(reshape2)
    df <- data.frame(
      YEAR = c(2009, 2010, 2011, 2009, 2010, 2009, 2009, 2010, 2009),
      Var1 = c('000000', '000000', '000000', '000000', '000000', '000000', '100000', '100000', '100000'),
      Var2 = c('0000001', '0000001', '0000001', '0000002', '0000002', '0000003', '1000001', '1000001', '1000011')
    )
    # cast Var2 into one column per value; [, -3] drops the original Var2 column
    df <- dcast(df, YEAR + Var1 + Var2 ~ Var2, value.var = "Var2")[, -3]
    df[is.na(df)] <- 0

Result:

      YEAR   Var1 0000001 0000002 0000003 1000001 1000011
    1 2009 000000 0000001       0       0       0       0
    2 2009 000000       0 0000002       0       0       0
    3 2009 000000       0       0 0000003       0       0
    4 2009 100000       0       0       0 1000001       0
    5 2009 100000       0       0       0       0 1000011
    6 2010 000000 0000001       0       0       0       0
    7 2010 000000       0 0000002       0       0       0
    8 2010 100000       0       0       0 1000001       0
    9 2011 000000 0000001       0       0       0       0
+1

Here is another suggestion. The code is slightly longer, but I think it does the trick, and I hope it is easy to understand. I assume that the original data is stored in a tab-delimited file named "data.dat". The output is stored in the matrix "new_matrix". Its entries are character strings, but converting them to integers should not be a problem.

    data <- read.table('data.dat', sep = '\t', header = TRUE, colClasses = "character")
    var2 <- data[3]
    nc <- nchar(var2[1, 1])
    # the last two digits of Variable2 identify the subcategory
    last2 <- substr(var2[, 1], nc - 1, nc)
    subcat <- levels(factor(last2))
    mrows <- nrow(data)
    mcols <- length(subcat)
    varnames <- paste0("Variable", as.character(c(1:(mcols + 1))))
    # start from a matrix filled with all-zero strings of width nc
    new_matrix <- matrix(paste(replicate(nc, "0"), collapse = ""), nrow = mrows, ncol = mcols + 2)
    colnames(new_matrix) <- c("YEAR", varnames)
    new_matrix[, 1] <- data[, 1]
    new_matrix[, 2] <- data[, 2]
    # copy each Variable2 value into the column of its subcategory
    for (i in 1:mcols) {
      relevant_rows <- which(last2 == subcat[i])
      new_matrix[relevant_rows, i + 2] <- data[relevant_rows, 3]
    }

Hope this helps.
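
The answer above mentions that the character entries can be converted to integers. A minimal sketch of one way to do that, using a small hypothetical matrix (the values and the names m and m_int are made up for illustration, not taken from "data.dat"):

```r
# Hypothetical character matrix shaped like new_matrix in the answer above
m <- matrix(c("2009", "000000", "00000001", "00000000",
              "2010", "000000", "00000000", "00000002"),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("YEAR", "Variable1", "Variable2", "Variable3")))

# as.integer() drops the dimensions, so rebuild the matrix around the result
m_int <- matrix(as.integer(m), nrow = nrow(m), dimnames = dimnames(m))
```

Note that the conversion discards leading zeros, so "00000001" becomes 1. If the zero-padded codes matter, it is probably safer to keep the character version.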

+1

Here is another approach. Note that I chose to turn the subcategory variables into binary indicator variables to reduce redundancy:

Input:

    data <- read.table(header = TRUE, text = '
    year var1 var2
    2009 000000 00000001
    2010 000000 00000001
    2009 000000 00000002
    2010 000000 00000002
    2009 000000 00000003
    2009 100000 10000001
    2009 100000 10000004
    2010 100000 10000010
    ', colClasses = c('character', 'character', 'character'))

Simplifying the var2 column to its last two digits:

    # keep only the last two characters of each code
    subCat <- function(s) {
      substr(s, nchar(s) - 1, nchar(s))
    }
    data$var2 <- subCat(data$var2)
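
To make the effect of this step concrete, here is a small sketch that applies the same helper to a few of the codes from the input above. The vector name codes is only for illustration:

```r
# Same helper as in the answer: keep the last two characters of each string
subCat <- function(s) {
  substr(s, nchar(s) - 1, nchar(s))
}

codes <- c("00000001", "10000004", "10000010")
suffixes <- subCat(codes)
print(suffixes)  # "01" "04" "10"
```

Because substr() and nchar() are vectorized, the helper works on a whole column at once, and it does not depend on all codes having the same length.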

Creating dummy variables:

Method 1:

    # cross-tabulate row index against subcategory to get a 0/1 indicator table
    t <- table(1:length(data$var2), data$var2)
    data <- cbind(data, as.data.frame.matrix(t))
    data$var2 <- NULL

Output:

      year   var1 01 02 03 04 10
    1 2009 000000  1  0  0  0  0
    2 2010 000000  1  0  0  0  0
    3 2009 000000  0  1  0  0  0
    4 2010 000000  0  1  0  0  0
    5 2009 000000  0  0  1  0  0
    6 2009 100000  1  0  0  0  0
    7 2009 100000  0  0  0  1  0
    8 2010 100000  0  0  0  0  1

==================================================== ==========

Method 2:

    library(dummies)
    data$var2 <- subCat(data$var2)
    data3 <- cbind(data, dummy(data$var2))
    data3$var2 = NULL

Output:

      year   var1 data01 data02 data03 data04 data10
    1 2009 000000      1      0      0      0      0
    2 2010 000000      1      0      0      0      0
    3 2009 000000      0      1      0      0      0
    4 2010 000000      0      1      0      0      0
    5 2009 000000      0      0      1      0      0
    6 2009 100000      1      0      0      0      0
    7 2009 100000      0      0      0      1      0
    8 2010 100000      0      0      0      0      1

==================================================== ==========

Method 3:

    # one 0/1 column per distinct subcategory
    dummies <- sapply(unique(data$var2), function(x) as.numeric(data$var2 == x))
    data <- cbind(data, dummies)
    data$var2 = NULL

Output:

      year   var1 X01 X02 X03 X04 X10
    1 2009 000000   1   0   0   0   0
    2 2010 000000   1   0   0   0   0
    3 2009 000000   0   1   0   0   0
    4 2010 000000   0   1   0   0   0
    5 2009 000000   0   0   1   0   0
    6 2009 100000   1   0   0   0   0
    7 2009 100000   0   0   0   1   0
    8 2010 100000   0   0   0   0   1
0
    library(dplyr)
    library(reshape2)
    df <- data.frame(
      YEAR = c(2009, 2010, 2011, 2009, 2010, 2009, 2009, 2010, 2009),
      Var1 = c('000000', '000000', '000000', '000000', '000000', '000000', '100000', '100000', '100000'),
      Var2 = c('0000001', '0000001', '0000001', '0000002', '0000002', '0000003', '1000001', '1000001', '1000011')
    )
    # tag each row with its full key (YEAR-Var1-Var2)
    df <- mutate(df, tag = paste(YEAR, Var1, Var2, sep = '-'))
    df <- dcast(df, YEAR + Var1 + tag ~ Var2, fun.aggregate = NULL)
    df$tag <- NULL
    # strip the "YEAR-Var1-" prefix so only the Var2 value remains
    df <- apply(df, 2, function(x) sub('^(.*)-(.*)-', '', x))
    df[is.na(df)] <- 0
    df <- as.data.frame(df)

Output:

      YEAR   Var1 0000001 0000002 0000003 1000001 1000011
    1 2009 000000 0000001       0       0       0       0
    2 2009 000000       0 0000002       0       0       0
    3 2009 000000       0       0 0000003       0       0
    4 2009 100000       0       0       0 1000001       0
    5 2009 100000       0       0       0       0 1000011
    6 2010 000000 0000001       0       0       0       0
    7 2010 000000       0 0000002       0       0       0
    8 2010 100000       0       0       0 1000001       0
    9 2011 000000 0000001       0       0       0       0
0

Thanks for all these answers. I found a solution by combining Michele Uswelli's answer with Synergist's comment on it. I also learned more about data.table.

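    library(data.table)
    # Netz and namesvec are objects from my own data (see below)
    NbTabelle <- data.table(val = Netz)
    # attach() is not needed: data.table evaluates val inside the table
    for (level_var in namesvec) {
      # characters 7-8 of val are the two-digit subcategory
      NbTabelle[, eval(level_var) := ifelse(substr(eval(val), 7, 8) == level_var, val, 0)]
    }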

Here namesvec is the vector of variable names that I created from the previously generated tables, excluding the variable val. I appreciated the generality of Synergist's code, but for my purpose I needed only the last two digits.
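
For readers without my data, the same idea can be sketched in base R on a small made-up data frame. The sample rows and the "Sub" column prefix are assumptions for illustration only, and this version takes the last two characters of each code rather than fixed positions 7 and 8:

```r
# Hypothetical sample data modeled on the layout in the question
df <- data.frame(
  YEAR = c(2009, 2010, 2009, 2009),
  Variable1 = c("000000", "000000", "000000", "100000"),
  Variable2 = c("00000001", "00000001", "00000002", "10000011"),
  stringsAsFactors = FALSE
)

# Subcategory = last two characters of Variable2
n <- nchar(df$Variable2)
subcat <- substr(df$Variable2, n - 1, n)

# One new column per subcategory: the original code where it matches, "0" elsewhere
for (s in sort(unique(subcat))) {
  df[[paste0("Sub", s)]] <- ifelse(subcat == s, df$Variable2, "0")
}
```

This adds the columns Sub01, Sub02 and Sub11 to df. The prefix keeps the new column names syntactically valid, which makes them easier to reference later with `$`.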

0
