How to prevent data.table from coercing numeric variables to character without manually specifying them?

Consider the following data set:

dt <- structure(list(lllocatie = structure(c(1L, 6L, 2L, 4L, 3L), .Label = c("Assen", "Oosterwijtwerd", "Startenhuizen", "t-Zandt", "Tjuchem", "Winneweer"), class = "factor"), lat = c(52.992, 53.32, 53.336, 53.363, 53.368), lon = c(6.548, 6.74, 6.808, 6.765, 6.675), mag.cat = c(3L, 2L, 1L, 2L, 2L), places = structure(c(2L, 4L, 5L, 6L, 3L), .Label = c("", "Amen,Assen,Deurze,Ekehaar,Eleveld,Geelbroek,Taarlo,Ubbena", "Eppenhuizen,Garsthuizen,Huizinge,Kantens,Middelstum,Oldenzijl,Rottum,Startenhuizen,Toornwerd,Westeremden,Zandeweer", "Loppersum,Winneweer", "Oosterwijtwerd", "t-Zandt,Zeerijp"), class = "factor")), .Names = c("lllocatie", "lat", "lon", "mag.cat", "places"), class = c("data.table", "data.frame"), row.names = c(NA, -5L)) 

When I want to split the comma-separated strings in the last column into separate rows, I use (with data.table version 1.9.5+):

 dt.new <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=list(lllocatie,lat,lon,mag.cat)] 

However, when I use:

 dt.new2 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=lllocatie] 

I get the same result, except that all columns are coerced to character. For small datasets it is not much trouble to list the variables that should not be split in the by argument, but for datasets with many columns / variables this quickly becomes impractical. I know it is possible to do this with the splitstackshape package (as @ColonelBeauvel mentioned in his answer), but I am looking for a data.table solution because I want to chain more operations onto this.

How can I prevent this without manually specifying the variables that should not be split in by?

2 answers

Two solutions with data.table:

1. Use the type.convert=TRUE argument of tstrsplit(), as suggested by @Arun:

 dt.new1 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE, type.convert=TRUE))), by=lllocatie] 

2. Use setdiff(names(dt),"places") in the by argument, as suggested by @Frank:

 dt.new2 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=setdiff(names(dt),"places")] 

Both approaches give the same result:

 > identical(dt.new1,dt.new2)
 [1] TRUE
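The coercion described in the question comes from tstrsplit() itself: it splits strings, so whatever goes in comes back as character unless type.convert=TRUE asks it to guess the type again. A minimal sketch of that behaviour on single values taken from the example data (the variable names as_chr, as_num and as_int are made up for illustration):

```r
library(data.table)

# Without type.convert, a numeric value comes back as a character string
as_chr <- tstrsplit(52.992, ",", fixed = TRUE)[[1]]
class(as_chr)

# With type.convert = TRUE, the pieces are converted back to a suitable type
as_num <- tstrsplit(52.992, ",", fixed = TRUE, type.convert = TRUE)[[1]]
as_int <- tstrsplit(3L, ",", fixed = TRUE, type.convert = TRUE)[[1]]
class(as_num)
class(as_int)
```

This is why the first solution restores lat, lon and mag.cat after every column in .SD has been passed through tstrsplit().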

The advantage of the second solution is that when you have several columns with string values, only the one you exclude in setdiff(names(dt),"places") is split (supposing you want to split only that particular column, in this case places). The splitstackshape package also offers this benefit.
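To illustrate that point, here is a small sketch on a made-up table (toy, and its columns id, score, label and places, are hypothetical and not part of the question's data). The label column also contains a comma, but because it is part of by it is left untouched, and score keeps its numeric type:

```r
library(data.table)

# Hypothetical data: two string columns, only "places" should be split
toy <- data.table(
  id     = 1:2,
  score  = c(1.5, 2.5),
  label  = c("a,b", "c"),   # contains a comma but must stay intact
  places = c("X,Y", "Z")
)

# Group by every column except the one to split, so .SD holds only "places"
res <- toy[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE))),
           by = setdiff(names(toy), "places")]

sapply(res, class)
```

With type.convert=TRUE and by=id instead, label would presumably be split as well, which may or may not be what you want.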


This is exactly the job for cSplit from the splitstackshape package:

 library(splitstackshape)
 cSplit(dt, 'places', ',')
