A quick way to split a string and convert to long format in data.table

I currently do the following:

    library(data.table)
    library(stringr)

    dt <- data.table(string_column = paste(sample(c(letters, " "), 500000, replace = TRUE),
                                           sample(c(letters, " "), 500000, replace = TRUE),
                                           sample(1:500000),
                                           sep = " "),
                     key = "string_column")

    split_res <- dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]

On my real data, it takes approximately 1 hour to process dt (10M rows) and create split_res (18M rows). Out of curiosity, is there a way to speed up the process? Maybe unlist + str_split is the wrong way to do this?

1 answer

You will get a big speedup if you simply ditch str_split() from "stringr" and use base R's strsplit() instead.

    fun1 <- function() dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]
    fun2 <- function() dt[, list(name = unlist(strsplit(string_column, '\\s+'))), by = string_column]

    system.time(fun1())
    #    user  system elapsed
    #  172.41    0.05  172.82

    system.time(fun2())
    #    user  system elapsed
    #   11.22    0.01   11.23

Whether this will bring your processing time down from one hour to 4 minutes, I'm not sure. But at least you won't have to remember to type those annoying underscores in your function names :-)


If you can split on a fixed search pattern rather than a regular expression, you can use the fixed = TRUE argument of strsplit(), which will give you another significant speed boost.
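As a minimal sketch of the fixed = TRUE idea, assuming the fields are always separated by exactly one space (an assumption; the toy vector x below is hypothetical and not taken from the question's data):

```r
# Toy input (hypothetical): fields separated by exactly one space
x <- c("a b 1", "c d 22")

# Regex split: the pattern is treated as a regular expression
res_regex <- strsplit(x, "\\s+")

# Fixed split: the pattern is matched literally, with no regex matching
res_fixed <- strsplit(x, " ", fixed = TRUE)

identical(res_regex, res_fixed)
# [1] TRUE

# Caveat: with runs of spaces the two differ, because the fixed split
# keeps the empty strings between consecutive separators
strsplit("a  b", " ", fixed = TRUE)[[1]]
# [1] "a" ""  "b"
strsplit("a  b", "\\s+")[[1]]
# [1] "a" "b"
```

So fixed = TRUE is only a drop-in replacement when the separator really is a single literal string. If the real data can contain repeated whitespace, the empty strings would need to be filtered out afterwards.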


Another thing to consider is to perform this process manually:

    x <- strsplit(dt$string_column, "\\s+")
    DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
    DT[, name := unlist(x, use.names = FALSE)]
    DT

With your sample data:

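    fun4 <- function() {
      # Split every string once, outside of any grouping
      x <- strsplit(dt$string_column, "\\s+")
      # Repeat each row of dt once per piece it was split into
      DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
      # Attach the pieces as a new column, by reference
      DT[, name := unlist(x, use.names = FALSE)]
      DT
    }

    system.time(fun4())
    #    user  system elapsed
    #    1.79    0.01    1.82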

However, the result does not match what I get with fun2(), but that is because you have duplicate values in "string_column". If you add an id column and group by it as well, both approaches give the same result.
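A small sketch of the id-column check, on hypothetical toy data with a deliberately duplicated string. The column name id and the objects res_by and res_manual are illustrative names, and lengths(x) is used here as shorthand for vapply(x, length, 1L):

```r
library(data.table)

# Toy data (hypothetical) containing a duplicated string
dt <- data.table(string_column = c("a b 1", "c d 2", "a b 1"))
dt[, id := .I]  # unique row id, so duplicates are no longer grouped together

# Grouped approach, grouping by id as well as by the string
res_by <- dt[, list(name = unlist(strsplit(string_column, "\\s+"))),
             by = list(id, string_column)]

# Manual approach: split once, then repeat each row once per piece
x <- strsplit(dt$string_column, "\\s+")
res_manual <- dt[rep(seq_len(nrow(dt)), lengths(x))]
res_manual[, name := unlist(x, use.names = FALSE)]

# Align the column order before comparing
setcolorder(res_manual, names(res_by))
all.equal(res_by, res_manual)
```

Both tables have one row per piece, with the id identifying which original row each piece came from.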
