You will get a great speedup if you just ditch str_split() from "stringr" and use base strsplit() instead.
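(The timings below assume a data.table dt with a single whitespace-separated string_column. Since the original data isn't shown, here is a minimal hypothetical setup just so the benchmark is runnable; the row count and string sizes are arbitrary, so your exact timings will differ.)

    library(data.table)
    library(stringr)

    ## hypothetical stand-in for the real data: 10,000 rows of
    ## space-separated letters (sizes chosen arbitrarily)
    set.seed(1)
    dt <- data.table(string_column = replicate(
      1e4, paste(sample(letters, 5, TRUE), collapse = " ")))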
    fun1 <- function() dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]
    fun2 <- function() dt[, list(name = unlist(strsplit(string_column, '\\s+'))), by = string_column]

    system.time(fun1())
    #    user  system elapsed
    #  172.41    0.05  172.82
    system.time(fun2())
    #    user  system elapsed
    #   11.22    0.01   11.23
Whether this alone will bring your processing time down from an hour to four minutes, I can't say. But at the very least you won't have to remember to type those annoying underscores in your function names :-)
If you can split on a fixed search pattern, you can pass the fixed = TRUE argument, which will give you another significant speed boost.
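For example, if the values happen to be separated by exactly one literal space (an assumption, since the sample data isn't shown), a hypothetical fun3 in the same style would look like this:

    ## fixed = TRUE treats " " as a literal string and skips
    ## the regex engine entirely
    fun3 <- function() dt[, list(name = unlist(strsplit(string_column, " ", fixed = TRUE))), by = string_column]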
Another thing to consider is performing the split manually:
    ## split each string into its component parts
    x <- strsplit(dt$string_column, "\\s+")
    ## repeat each row of dt once for every part produced by its split
    DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
    ## attach the parts as a new column
    DT[, name := unlist(x, use.names = FALSE)]
    DT
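To see why that indexing works, here is a tiny made-up illustration: if row 1 splits into 2 parts and row 2 into 3, the rep() call builds exactly the row indices needed to expand dt.

    rep(sequence(2), c(2L, 3L))
    # [1] 1 1 2 2 2   (row 1 repeated twice, row 2 three times)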
With your sample data:
    fun4 <- function() {
      x <- strsplit(dt$string_column, "\\s+")
      DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
      DT[, name := unlist(x, use.names = FALSE)]
      DT
    }
    system.time(fun4())
    #    user  system elapsed
    #    1.79    0.01    1.82
However, the output does not match what you get from fun2(), but that is because you have duplicated values in "string_column". If you add an id column and do the same comparison, you will get identical results.
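A sketch of that check (res2, res4, and the seq_len() id column are names introduced here for illustration, not from the original code):

    ## add a row id so duplicated strings are kept apart
    dt[, id := seq_len(nrow(dt))]

    ## strsplit-by-group version, now grouped by the unique id
    res2 <- dt[, list(name = unlist(strsplit(string_column, "\\s+"))),
               by = list(id, string_column)]

    ## manual expansion version
    x <- strsplit(dt$string_column, "\\s+")
    res4 <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
    res4[, name := unlist(x, use.names = FALSE)]

    identical(res2$name, res4$name)  # should now be TRUE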