Improving text processing speed using R and data.table

As a relatively inexperienced user of the data.table package in R, I am trying to process one text column into a large number of indicator columns, with a 1 in each column indicating that a particular substring was found in that row's string. For example, I want to turn this:

    ID String
     1    a$b
     2    b$c
     3      c

into this:

    ID String a b c
     1    a$b 1 1 0
     2    b$c 0 1 1
     3      c 0 0 1

I have figured out how to do this, but it takes longer to run than I would like, and I suspect my code is inefficient. A reproducible version of my code with dummy data is below. Note that in the real data there are over 2,000 substrings to search for, each substring is roughly 30 characters long, and there may be up to a few million rows. If need be, I can parallelize and throw a lot of resources at the problem, but I want to optimize the code as much as possible. I have tried running Rprof, which suggested no obvious (to me) improvements.

    set.seed(10)
    elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))

    random_string <- function(min_length, max_length, separator) {
      selection <- paste(sample(elements_list,
                                ceiling(runif(1, min_length, max_length))),
                         collapse = separator)
      return(selection)
    }

    dt <- data.table(id = c(1:1000), messy_string = "")
    dt[, messy_string := random_string(2, 5, "$"), by = id]

    create_indicators <- function(search_list, searched_string) {
      y <- rep(0, length(search_list))
      for (j in 1:length(search_list)) {
        x <- regexpr(search_list[j], searched_string)
        x <- x[1]
        y[j] <- ifelse(x > 0, 1, 0)
      }
      return(y)
    }

    timer <- proc.time()
    indicators <- matrix(0, nrow = nrow(dt), ncol = length(elements_list))
    for (n in 1:nrow(dt)) {
      indicators[n, ] <- dt[n, create_indicators(elements_list, messy_string)]
    }
    indicators <- data.table(indicators)
    setnames(indicators, elements_list)
    dt <- cbind(dt, indicators)
    proc.time() - timer

    #    user  system elapsed
    #   13.17    0.08   13.29

EDIT

Thanks for the great answers; all of them are far superior to my method. The results of some speed tests are below, with minor changes made to each function: using 0L and 1L in my own code, storing results in separate tables by method, and standardizing the ordering. These are elapsed times from single speed tests (rather than medians over many tests), because the larger runs take a long time.

    Number of rows in dt:    2K     10K     50K    250K      1M
    OP                     28.6   149.2   717.0
    eddi                    5.1    24.6   144.8  1950.3
    RS                      1.8     6.7    29.7   171.9   702.5
    Original GT             1.4     7.4    57.5   809.4
    Modified GT             0.7     3.9    18.1   115.2   473.9
    GT4                     0.1     0.4    2.26    16.9    86.9

It's pretty clear that the modified version of GeekTrader's approach is best. I'm still a bit vague on what each step is doing, but I can go over that at my leisure. Although it goes somewhat beyond the original question, if anyone wants to explain why the GeekTrader and Ricardo Saporta methods work more efficiently, it would be appreciated both by me and probably by anyone who visits this page in the future. I'm particularly interested in understanding why some methods scale better than others.

EDIT #2

I tried to edit GeekTrader's answer with this comment, but that doesn't seem to work. I made two very minor modifications to the GT3 function: (a) ordering the columns, which adds a small amount of time, and (b) replacing 0 and 1 with 0L and 1L, which speeds things up a bit. Call the resulting function GT4. The table above has been edited to add GT4's times at the various table sizes. It is clearly the winner by a mile, and it has the added advantage of being intuitive. A sketch of GT4 follows.
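A minimal sketch of what GT4 might look like, reconstructed from the description above (the exact code was not posted; funcGT3 itself is shown in the first answer below):

    funcGT4 <- function() {
      # (a) sort the result columns
      resCol <- sort(unique(dt[, unlist(strsplit(messy_string, split = "\\$"))]))
      nresRow <- nrow(dt)

      # (b) use integer storage: 0L instead of 0
      mat <- matrix(0L, nrow = nresRow, ncol = length(resCol),
                    dimnames = list(as.character(1:nresRow), resCol))

      ll <- strsplit(dt[, messy_string], split = "\\$")
      coords <- do.call(rbind,
                        lapply(seq_along(ll),
                               function(i) cbind(rep(i, length(ll[[i]])), ll[[i]])))

      # (b) use integer storage: 1L instead of 1
      mat[coords] <- 1L
      cbind(dt, mat)
    }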

6 answers

UPDATE: VERSION 3

I found an even faster way. This function is also highly memory efficient. The main reason the previous function was slow was the copying/assignment happening inside the lapply loop, as well as the rbinding of the result.
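To see why that matters, here is a minimal sketch (illustrative only, not from the answer) contrasting growth-by-rbind with pre-allocation:

    grow_by_rbind <- function(n) {
      res <- NULL
      for (i in 1:n)
        res <- rbind(res, c(i, 2 * i))  # copies all of `res` on every iteration
      res
    }

    prealloc <- function(n) {
      res <- matrix(0, nrow = n, ncol = 2)  # allocate once
      for (i in 1:n)
        res[i, ] <- c(i, 2 * i)             # write in place
      res
    }

    system.time(grow_by_rbind(10000))  # roughly quadratic in n
    system.time(prealloc(10000))       # roughly linear in n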

In the following version, we pre-allocate a matrix of the appropriate size, and then set the values at the appropriate coordinates. This makes it very fast compared to the looping versions.

    funcGT3 <- function() {
      # Get list of column names in result
      resCol <- unique(dt[, unlist(strsplit(messy_string, split = "\\$"))])

      # Get dimensions of result
      nresCol <- length(resCol)
      nresRow <- nrow(dt)

      # Create empty matrix with dimensions same as desired result
      mat <- matrix(rep(0, nresRow * nresCol), nrow = nresRow,
                    dimnames = list(as.character(1:nresRow), resCol))

      # Split each messy_string by $
      ll <- strsplit(dt[, messy_string], split = "\\$")

      # Get coordinates of mat which we need to set to 1
      coords <- do.call(rbind,
                        lapply(1:length(ll),
                               function(i) cbind(rep(i, length(ll[[i]])), ll[[i]])))

      # Set mat to 1 at appropriate coordinates
      mat[coords] <- 1

      # Bind the mat to original data.table
      return(cbind(dt, mat))
    }

    result <- funcGT3()  # result for 1000 rows in dt
    result

          ID   messy_string zn tc sv db yx st ze qs wq oe cv ut is kh kk im le qg rq po wd kc un ft ye if zl zt wy et rg iu
       1:    1 zn$tc$sv$db$yx 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
       2:    2    st$ze$qs$wq 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
       3:    3    oe$cv$ut$is 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
       4:    4 kh$kk$im$le$qg 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
       5:    5    rq$po$wd$kc 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0
      ---
     996:  996    rp$cr$tb$sa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     997:  997    cz$wy$rj$he 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     998:  998       cl$rr$bm 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     999:  999    sx$hq$zy$zd 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1000: 1000    bw$cw$pw$rq 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
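The step that makes this work is a base R feature: a matrix can be indexed by a two-column matrix of (row, column) coordinates, and those coordinates can be given by dimname. A small illustration of that mechanism, with made-up names:

    m <- matrix(0, nrow = 2, ncol = 3,
                dimnames = list(c("1", "2"), c("aa", "bb", "cc")))
    coords <- cbind(c("1", "1", "2"), c("aa", "cc", "bb"))  # (row, column) pairs
    m[coords] <- 1
    m
    #   aa bb cc
    # 1  1  0  1
    # 2  0  1  0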

Benchmark against version 2, run by Ricardo (this is for 250K rows):

    Unit: seconds
     expr       min        lq    median        uq       max neval
      GT2 104.68672 104.68672 104.68672 104.68672 104.68672     1
      GT3  15.15321  15.15321  15.15321  15.15321  15.15321     1

VERSION 1

Below is version 1 of the proposed answer.

    set.seed(10)
    elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))

    random_string <- function(min_length, max_length, separator) {
      selection <- paste(sample(elements_list,
                                ceiling(runif(1, min_length, max_length))),
                         collapse = separator)
      return(selection)
    }

    dt <- data.table(ID = c(1:1000), messy_string = "")
    dt[, messy_string := random_string(2, 5, "$"), by = ID]

    myFunc <- function() {
      ll <- strsplit(dt[, messy_string], split = "\\$")

      COLS <- do.call(rbind,
                      lapply(1:length(ll),
                             function(i) {
                               data.frame(
                                 ID  = rep(i, length(ll[[i]])),
                                 COL = ll[[i]],
                                 VAL = rep(1, length(ll[[i]]))
                               )
                             }))

      res <- as.data.table(tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length))
      dt <- cbind(dt, res)
      for (j in names(dt))
        set(dt, which(is.na(dt[[j]])), j, 0)
      return(dt)
    }

    create_indicators <- function(search_list, searched_string) {
      y <- rep(0, length(search_list))
      for (j in 1:length(search_list)) {
        x <- regexpr(search_list[j], searched_string)
        x <- x[1]
        y[j] <- ifelse(x > 0, 1, 0)
      }
      return(y)
    }

    OPFunc <- function() {
      indicators <- matrix(0, nrow = nrow(dt), ncol = length(elements_list))
      for (n in 1:nrow(dt)) {
        indicators[n, ] <- dt[n, create_indicators(elements_list, messy_string)]
      }
      indicators <- data.table(indicators)
      setnames(indicators, elements_list)
      dt <- cbind(dt, indicators)
      return(dt)
    }

    library(plyr)
    plyrFunc <- function() {
      indicators <- do.call(rbind.fill,
                            sapply(1:dim(dt)[1],
                                   function(i) dt[i, data.frame(t(as.matrix(table(
                                     strsplit(messy_string, split = "\\$")))))]))
      dt <- cbind(dt, indicators)
      # dt[is.na(dt)] = 0  # THIS DOESN'T WORK. USING THE FOLLOWING INSTEAD:
      for (j in names(dt))
        set(dt, which(is.na(dt[[j]])), j, 0)
      return(dt)
    }
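The heart of myFunc is the tapply() call, which builds the ID-by-substring count matrix in one step; NAs mark absent combinations and are zeroed out afterwards. A small illustration on the question's 3-row example, with made-up inputs:

    COLS <- data.frame(ID  = c(1, 1, 2, 2, 3),
                       COL = c("a", "b", "b", "c", "c"),
                       VAL = 1)
    tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length)
    #    a  b  c
    # 1  1  1 NA
    # 2 NA  1  1
    # 3 NA NA  1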

BENCHMARKS

    system.time(res <- myFunc())
    ##  user  system elapsed
    ##  1.01    0.00    1.01

    system.time(res2 <- OPFunc())
    ##  user  system elapsed
    ## 21.58    0.00   21.61

    system.time(res3 <- plyrFunc())
    ##  user  system elapsed
    ##  1.81    0.00    1.81

VERSION 2: Suggested by Ricardo

I'm posting this here rather than in my own answer, since the framework is really @GeekTrader's. -Rick

    myFunc.modified <- function() {
      ll <- strsplit(dt[, messy_string], split = "\\$")

      ## MODIFICATION: using `rbindlist` instead of `do.call(rbind, ...)`
      COLS <- rbindlist(lapply(1:length(ll),
                               function(i) {
                                 data.frame(
                                   ID  = rep(i, length(ll[[i]])),
                                   COL = ll[[i]],
                                   VAL = rep(1, length(ll[[i]])),
                                   # MODIFICATION: not coercing to factors
                                   stringsAsFactors = FALSE
                                 )
                               }))

      # MODIFICATION: preserve the output of tapply as a matrix
      res2 <- tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length)

      # FLATTEN into a data.table
      resdt <- data.table(r = c(res2))

      # FIND & REPLACE NA of the single column
      resdt[is.na(r), r := 0L]

      # cbind with dt, a matrix with the same attributes as `res2`
      cbind(dt, matrix(resdt[[1]], ncol = ncol(res2), byrow = FALSE,
                       dimnames = dimnames(res2)))
    }

    ### Benchmarks:
    orig     <- quote({dt <- copy(masterDT); myFunc()})
    modified <- quote({dt <- copy(masterDT); myFunc.modified()})
    microbenchmark(Modified = eval(modified), Orig = eval(orig), times = 20L)

    # Unit: milliseconds
    #       expr      min        lq   median       uq      max
    # 1 Modified  895.025  971.0117 1011.216 1189.599 2476.972
    # 2     Orig 1953.638 2009.1838 2106.412 2230.326 2356.802
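The biggest single win in the modified version is rbindlist(), which binds a list of data.frames without the repeated copying that do.call(rbind, ...) incurs. A rough, self-contained comparison (illustrative only; timings will vary by machine):

    library(data.table)
    library(microbenchmark)

    lst <- replicate(1000,
                     data.frame(a = 1:5, b = letters[1:5], stringsAsFactors = FALSE),
                     simplify = FALSE)

    microbenchmark(
      base_rbind = do.call(rbind, lst),
      rbindlist  = rbindlist(lst),
      times = 10L
    )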
-----------------
    # split the `messy_string` and create a long table, keeping track of the ID
    DT2 <- setkey(DT[, list(val = unlist(strsplit(messy_string, "\\$"))),
                     by = list(ID, messy_string)], "val")

    # add the columns, initialized to 0
    # (a warning is expected, re: adding a large number of columns)
    DT2[, c(elements_list) := 0L]

    # iterate over each value in elements_list, assigning 1 as appropriate
    for (el in elements_list)
      DT2[el, c(el) := 1L]

    # sum by ID
    DT2[, lapply(.SD, sum), by = list(ID, messy_string), .SDcols = elements_list]

Note that we carry the messy_string column along, since that is cheaper than leaving it behind and then joining on ID to get it back. If you don't need it in the final output, just delete it above.
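A toy illustration (not from the answer) of the keyed sub-assign idiom used above: with the table keyed on val, DT2[el, c(el) := 1L] is a binary-search join that sets column el to 1 exactly on the rows where val == el. (This is the older character-i key-lookup idiom; recent data.table versions also accept DT2[.(el), ...].)

    library(data.table)
    DT2 <- data.table(val = c("a", "b", "a", "c"), key = "val")
    DT2[, c("a", "b", "c") := 0L]   # add indicator columns, initialized to 0
    for (el in c("a", "b", "c"))
      DT2[el, c(el) := 1L]          # keyed lookup on `val`
    DT2
    #    val a b c
    # 1:   a 1 0 0
    # 2:   a 1 0 0
    # 3:   b 0 1 0
    # 4:   c 0 0 1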


Benchmarks:

Creating sample data:

    # sample data, using the OP's example
    set.seed(10)
    N <- 1e6  # number of rows
    elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))

    # create the messy strings in a single shot
    messy_string_vec <- random_string_fast(N, 2, 5, "$")

    # create the data.table
    masterDT <- data.table(ID = c(1:N), messy_string = messy_string_vec, key = "ID")

Side note: it is much faster to create the random strings all at once and assign them as a single column than to call the function N times and assign the results one at a time.

    # Faster way to create the `messy_string`s
    random_string_fast <- function(N, min_length, max_length, separator) {
      ints <- seq(from = min_length, to = max_length)
      replicate(N, paste(sample(elements_list, sample(ints)), collapse = separator))
    }

Comparison of four methods:

  • this answer: "DT.RS"
  • @eddi's answer: "Plyr.eddi"
  • @GeekTrader's answer: "DT.GT"
  • @GeekTrader's answer, with some modifications: "DT.GT_Mod"

Here is the setup:

    library(data.table); library(plyr); library(microbenchmark)

    # data.table method - RS
    usingDT.RS <- quote({
      DT <- copy(masterDT)
      DT2 <- setkey(DT[, list(val = unlist(strsplit(messy_string, "\\$"))),
                       by = list(ID, messy_string)], "val")
      DT2[, c(elements_list) := 0L]
      for (el in elements_list)
        DT2[el, c(el) := 1L]
      DT2[, lapply(.SD, sum), by = list(ID, messy_string), .SDcols = elements_list]
    })

    # data.table method - GeekTrader
    usingDT.GT <- quote({dt <- copy(masterDT); myFunc()})

    # data.table method - GeekTrader, modified by RS
    usingDT.GT_Mod <- quote({dt <- copy(masterDT); myFunc.modified()})

    # plyr method from below
    usingPlyr.eddi <- quote({
      dt <- copy(masterDT)
      indicators <- do.call(rbind.fill,
                            sapply(1:dim(dt)[1],
                                   function(i) dt[i, data.frame(t(as.matrix(table(
                                     strsplit(messy_string, split = "\\$")))))]))
      dt <- cbind(dt, indicators)
      dt[is.na(dt)] = 0
      dt
    })

Below are the test results:

    microbenchmark(usingDT.RS     = eval(usingDT.RS),
                   usingDT.GT     = eval(usingDT.GT),
                   usingDT.GT_Mod = eval(usingDT.GT_Mod),
                   usingPlyr.eddi = eval(usingPlyr.eddi),
                   times = 5L)

    ## On smaller data:

    N = 600
    Unit: milliseconds
                expr       min        lq    median        uq       max
    1     usingDT.GT 1189.7549 1198.1481 1200.6731 1202.0972 1203.3683
    2 usingDT.GT_Mod  581.7003  591.5219  625.7251  630.8144  650.6701
    3     usingDT.RS 2586.0074 2602.7917 2637.5281 2819.9589 3517.4654
    4 usingPlyr.eddi 2072.4093 2127.4891 2225.5588 2242.8481 2349.6086

    N = 1,000
    Unit: seconds
            expr      min       lq   median       uq      max
    1 usingDT.GT 1.941012 2.053190 2.196100 2.472543 3.096096
    2 usingDT.RS 3.107938 3.344764 3.903529 4.010292 4.724700
    3  usingPlyr 3.297803 3.435105 3.625319 3.812862 4.118307

    N = 2,500
    Unit: seconds
                expr      min       lq   median       uq       max
    1     usingDT.GT 4.711010 5.210061 5.291999 5.307689  7.118794
    2 usingDT.GT_Mod 2.037558 2.092953 2.608662 2.638984  3.616596
    3     usingDT.RS 5.253509 5.334890 6.474915 6.740323  7.275444
    4 usingPlyr.eddi 7.842623 8.612201 9.142636 9.420615 11.102888

    N = 5,000
                expr       min        lq    median        uq       max
    1     usingDT.GT  8.900226  9.058337  9.233387  9.622531 10.839409
    2 usingDT.GT_Mod  4.112934  4.293426  4.460745  4.584133  6.128176
    3     usingDT.RS  8.076821  8.097081  8.404799  8.800878  9.580892
    4 usingPlyr.eddi 13.260828 14.297614 14.523016 14.657193 16.698229

    # dropping the slower two from the tests:
    microbenchmark(usingDT.RS     = eval(usingDT.RS),
                   usingDT.GT     = eval(usingDT.GT),
                   usingDT.GT_Mod = eval(usingDT.GT_Mod),
                   times = 6L)

    N = 10,000
    Unit: seconds
                expr       min        lq    median        uq       max
    1 usingDT.GT_Mod  8.426744  8.739659  8.750604  9.118382  9.848153
    2     usingDT.RS 15.260702 15.564495 15.742855 16.024293 16.249556

    N = 25,000
    ... (still running)

-----------------

Functions used in benchmarking:

    # original random string function
    random_string <- function(min_length, max_length, separator) {
      selection <- paste(sample(elements_list,
                                ceiling(runif(1, min_length, max_length))),
                         collapse = separator)
      return(selection)
    }

    # GeekTrader's function
    myFunc <- function() {
      ll <- strsplit(dt[, messy_string], split = "\\$")
      COLS <- do.call(rbind,
                      lapply(1:length(ll),
                             function(i) {
                               data.frame(
                                 ID  = rep(i, length(ll[[i]])),
                                 COL = ll[[i]],
                                 VAL = rep(1, length(ll[[i]]))
                               )
                             }))
      res <- as.data.table(tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length))
      dt <- cbind(dt, res)
      for (j in names(dt))
        set(dt, which(is.na(dt[[j]])), j, 0)
      return(dt)
    }

    # improvements to @GeekTrader's `myFunc`  -RS
    myFunc.modified <- function() {
      ll <- strsplit(dt[, messy_string], split = "\\$")

      ## MODIFICATION: using `rbindlist` instead of `do.call(rbind, ...)`
      COLS <- rbindlist(lapply(1:length(ll),
                               function(i) {
                                 data.frame(
                                   ID  = rep(i, length(ll[[i]])),
                                   COL = ll[[i]],
                                   VAL = rep(1, length(ll[[i]])),
                                   # MODIFICATION: not coercing to factors
                                   stringsAsFactors = FALSE
                                 )
                               }))

      # MODIFICATION: preserve the output of tapply as a matrix
      res2 <- tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length)

      # FLATTEN into a data.table
      resdt <- data.table(r = c(res2))

      # FIND & REPLACE NA of the single column
      resdt[is.na(r), r := 0L]

      # cbind with dt, a matrix with the same attributes as `res2`
      cbind(dt, matrix(resdt[[1]], ncol = ncol(res2), byrow = FALSE,
                       dimnames = dimnames(res2)))
    }

    ### Benchmarks comparing the two versions of GeekTrader's function:
    orig     <- quote({dt <- copy(masterDT); myFunc()})
    modified <- quote({dt <- copy(masterDT); myFunc.modified()})
    microbenchmark(Modified = eval(modified), Orig = eval(orig), times = 20L)

    # Unit: milliseconds
    #       expr      min        lq   median       uq      max
    # 1 Modified  895.025  971.0117 1011.216 1189.599 2476.972
    # 2     Orig 1953.638 2009.1838 2106.412 2230.326 2356.802
-----------------

Here is a version that is roughly 10x faster, using rbind.fill.

    library(plyr)

    indicators <- do.call(rbind.fill,
                          sapply(1:dim(dt)[1],
                                 function(i) dt[i, data.frame(t(as.matrix(table(
                                   strsplit(messy_string, split = "\\$")))))]))
    dt <- cbind(dt, indicators)

    # dt[is.na(dt)] = 0
    # faster NA replace (thanks geektrader):
    for (j in names(dt))
      set(dt, which(is.na(dt[[j]])), j, 0L)
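The per-row building block here is table(strsplit(...)), which turns one messy string into a named count vector; rbind.fill() then stacks the resulting one-row data.frames, aligning columns by name and filling the gaps with NA, which the set() loop afterwards replaces with 0. For a single string:

    s <- "a$b"
    table(strsplit(s, split = "\\$"))
    # a b
    # 1 1

    data.frame(t(as.matrix(table(strsplit(s, split = "\\$")))))
    #   a b
    # 1 1 1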
-----------------

Here is an approach using rapply and table. I'm sure there would be a slightly faster approach than table here, but this is still slightly faster than myFunc.modified from the answer above:

    # a copy with enough column pointers available
    dtr <- alloc.col(copy(dt), 1000L)

    rapplyFun <- function() {
      ll <- strsplit(dtr[, messy_string], '\\$')
      Vals <- rapply(ll, classes = 'character', f = table, how = 'replace')
      Names <- unique(rapply(Vals, names))

      dtr[, (Names) := 0L]

      for (ii in seq_along(Vals)) {
        for (jj in names(Vals[[ii]])) {
          set(dtr, i = ii, j = jj, value = Vals[[ii]][jj])
        }
      }
    }

    microbenchmark(myFunc.modified(), rapplyFun(), times = 5)
    # Unit: milliseconds
    #               expr      min       lq   median       uq      max neval
    #  myFunc.modified() 395.1719 396.8706 399.3218 400.6353 401.1700     5
    #        rapplyFun() 308.9103 309.5763 309.9368 310.2971 310.3463     5
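For intuition, rapply() with how = "replace" applies the function to every character leaf of the list while keeping the list structure, so each row ends up with its own little named count table:

    ll <- list(c("a", "b"), c("b", "c", "b"))
    rapply(ll, classes = "character", f = table, how = "replace")
    # [[1]]
    # a b
    # 1 1
    #
    # [[2]]
    # b c
    # 1 2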
-----------------

Here is a slightly newer approach using cSplit_e() from the splitstackshape package.

    library(splitstackshape)
    cSplit_e(dt, split.col = "String", sep = "$", type = "character",
             mode = "binary", fixed = TRUE, fill = 0)
    #   ID String String_a String_b String_c
    # 1  1    a$b        1        1        0
    # 2  2    b$c        0        1        1
    # 3  3      c        0        0        1
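Applied to the larger dummy data from the question, the call would be the same apart from the column name (a sketch; untested here at the million-row scale, so treat its timing as an open question):

    library(splitstackshape)
    cSplit_e(dt, split.col = "messy_string", sep = "$", type = "character",
             mode = "binary", fixed = TRUE, fill = 0)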
-----------------

Here's another solution, one that builds a sparse matrix object instead of the dense structure you have. This saves a lot of time AND memory.

It produces ordered results, and even with the conversion to data.table it is faster than GT3 with 0L and 1L and without reordering (this could be because I use a different method for arriving at the required coordinates; I didn't go through the GT3 algorithm). If you skip the conversion and leave it as a sparse matrix, it is about 10-20x faster than GT3 (and has a much smaller memory footprint).

    library(Matrix)

    strings <- strsplit(dt$messy_string, split = "$", fixed = TRUE)
    element.map <- data.table(el = elements_list, n = seq_along(elements_list), key = "el")

    # finding the row and column indices of the 1s
    tmp <- data.table(n = seq_along(strings), each = unlist(lapply(strings, length)))
    rows <- tmp[, rep(n, each = each), by = n][, V1]
    cols <- element.map[J(unlist(strings))][, n]

    dt.sparse <- sparseMatrix(rows, cols, x = 1,
                              dims = c(max(rows), length(elements_list)))

    # optional; should be avoided until absolutely necessary
    dt <- cbind(dt, as.data.table(as.matrix(dt.sparse)))
    setnames(dt, c('id', 'messy_string', elements_list))

The idea is to split the strings, then use a keyed data.table as a map object to map each substring to its correct column position. From there it is just a matter of figuring out the rows correctly and filling in the matrix.
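A toy version of the same idea on the question's 3-row example (using match() in place of the keyed data.table map, purely for brevity):

    library(Matrix)

    strings  <- strsplit(c("a$b", "b$c", "c"), split = "$", fixed = TRUE)
    elements <- c("a", "b", "c")

    rows <- rep(seq_along(strings), sapply(strings, length))
    cols <- match(unlist(strings), elements)

    sparseMatrix(rows, cols, x = 1,
                 dims = c(length(strings), length(elements)),
                 dimnames = list(NULL, elements))
    # 3 x 3 sparse Matrix of class "dgCMatrix"
    #      a b c
    # [1,] 1 1 .
    # [2,] . 1 1
    # [3,] . . 1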
