A quick way to split a string and convert to long format in data.table

Question

I do the following

    library(data.table)
    library(stringr)

    dt <- data.table(
      string_column = paste(sample(c(letters, " "), 500000, replace = TRUE),
                            sample(c(letters, " "), 500000, replace = TRUE),
                            sample(1:500000),
                            sep = " "),
      key = "string_column")

    split_res <- dt[, list(name = unlist(str_split(string_column, '\\s+'))),
                    by = string_column]

On real data, it takes approximately 1 hour to process dt (10M rows) and create split_res (18M rows). Out of curiosity, is there a way to speed up the process? Maybe unlist + str_split is the wrong way to do this?

substring r data.table data-manipulation

Rinatm Mar 27 '14 at 4:20

1 answer

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2014-03-27T09:29:46+0000

You will get a large speedup if you ditch str_split() from "stringr" and just use base R's strsplit().

    fun1 <- function() dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]
    fun2 <- function() dt[, list(name = unlist(strsplit(string_column, '\\s+'))), by = string_column]

    system.time(fun1())
    #    user  system elapsed
    #  172.41    0.05  172.82

    system.time(fun2())
    #    user  system elapsed
    #   11.22    0.01   11.23

Whether this will bring the processing time down from one hour to 4 minutes, I'm not sure. But at least you won't have to remember to put those annoying underscores in your function names :-)

If you can split on a fixed (literal) search pattern rather than a regular expression, you can use the fixed = TRUE argument of strsplit(), which will give you another significant speed boost.
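As a rough sketch of the fixed = TRUE variant: this assumes the fields are always separated by exactly one space, which is an assumption and not true of the question's sample data (where runs of spaces can occur and the regex '\\s+' is still needed). The vector `strings` is a hypothetical toy input.

```r
# Toy input (hypothetical): fields separated by exactly one space
strings <- c("a b 1", "c d 2", "e f 3")

# Regular-expression split, as in the question
regex_split <- strsplit(strings, "\\s+")

# Literal split on a single space; skips the regex engine
fixed_split <- strsplit(strings, " ", fixed = TRUE)

# With single-space separators, both produce the same pieces
same_result <- identical(regex_split, fixed_split)
```

If the separators can vary in width, the two calls no longer agree, because the literal split produces empty strings between consecutive spaces.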

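Another thing to consider is doing the expansion manually: split the whole column once, repeat each row once per resulting piece, and then add the pieces as a new column: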

    x <- strsplit(dt$string_column, "\\s+")
    DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
    DT[, name := unlist(x, use.names = FALSE)]
    DT
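To see what the row index in the manual approach looks like, here is a small base R illustration on a hypothetical two-element input. The names `n_pieces` and `row_index` are only for illustration.

```r
# Toy input (hypothetical): two strings with 2 and 3 pieces
x <- strsplit(c("a b", "c d e"), "\\s+")

# Number of pieces per original string
n_pieces <- vapply(x, length, 1L)

# Repeat each original row number once per piece: 1 1 2 2 2
row_index <- rep(seq_along(x), n_pieces)

# The flattened pieces line up with row_index element by element
pieces <- unlist(x, use.names = FALSE)
```

Indexing the table with `row_index` gives one row per piece, so the flattened `pieces` vector can be assigned as a column without any grouping.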

With your sample data:

    fun4 <- function() {
      x <- strsplit(dt$string_column, "\\s+")
      DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
      DT[, name := unlist(x, use.names = FALSE)]
      DT
    }
    #    user  system elapsed
    #    1.79    0.01    1.82

However, the result does not match what I get with fun2(), but that is because you have duplicate values in "string_column". If you add an id column and do the same, you will get the same results.
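A minimal sketch of that comparison on toy data, assuming the data.table package is installed. The table `dt_small` and the column name `id` are hypothetical and chosen only for illustration.

```r
library(data.table)

# Toy data (hypothetical) containing a duplicated string
dt_small <- data.table(string_column = c("a b 1", "a b 1", "c d 2"))
dt_small[, id := seq_len(.N)]  # unique row id; the name 'id' is illustrative

# Grouped split, by the unique id instead of by the string itself
by_id <- dt_small[, list(name = unlist(strsplit(string_column, "\\s+"))), by = id]

# Manual approach: split once, then repeat each row once per piece
x <- strsplit(dt_small$string_column, "\\s+")
manual <- dt_small[rep(seq_len(nrow(dt_small)), lengths(x))]
manual[, name := unlist(x, use.names = FALSE)]

# Both approaches yield the same id/name pairs in the same order
same_ids <- identical(by_id$id, manual$id)
same_names <- identical(by_id$name, manual$name)
```

On this toy input both checks come out TRUE; whether the timings carry over to the real 10M-row table would need to be measured.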
