Can I use the paste0() function in R to make this code run as fast as the original Python example?

I'm playing around with some R code I recently wrote that mimics parts of Norvig's spell checker (written in Python); in particular, I'm trying to work out the right way to implement the edit2 function in R:

    def splits(word):
        return [(word[:i], word[i:]) for i in range(len(word)+1)]

    def edits1(word):
        pairs = splits(word)
        deletes = [a+b[1:] for (a, b) in pairs if b]
        transposes = [a+b[1]+b[0]+b[2:] for (a, b) in pairs if len(b) > 1]
        replaces = [a+c+b[1:] for (a, b) in pairs for c in alphabet if b]
        inserts = [a+c+b for (a, b) in pairs for c in alphabet]
        return set(deletes + transposes + replaces + inserts)

    def edits2(word):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

However, from my measurements, generating thousands of small strings in R using paste0 (or str_c from stringr, or stri_join from stringi) results in code that is roughly 10x (or ~100x, or ~50x, respectively) slower than the Python implementation shown by Norvig. (Yes, interestingly, the stringr and stringi functions are even slower than paste0.) My questions are (with #3 being the main one I would like to resolve):

  • Am I doing this right (is this code "correct")?

  • If so, is this a known R problem (extremely slow string concatenation)?

  • Can I do something to make this much faster (at least one or more orders of magnitude) without rewriting the whole function in Rcpp11 or the like?

Here is the R code I came up with for the edit2 function:

    library(stringi)   # for stri_sub and stri_reverse

    # 1. generate a list of all binary splits of a word
    binary.splits <- function(w) {
      n <- nchar(w)
      lapply(0:n, function(x) c(stri_sub(w, 0, x), stri_sub(w, x + 1, n)))
    }

    # 2. generate a list of all bigrams for a word
    bigram.unsafe <- function(word)
      sapply(2:nchar(word), function(i) substr(word, i-1, i))
    bigram <- function(word)
      if (nchar(word) > 1) bigram.unsafe(word) else word

    # 3. four edit types: deletion, transposition, replacement, and insertion
    alphabet = letters

    deletions <- function(splits)
      if (length(splits) > 1) {
        sapply(1:(length(splits)-1), function(i)
          paste0(splits[[i]][1], splits[[i+1]][2]), simplify=FALSE)
      } else {
        splits[[1]][2]
      }

    transpositions <- function(splits)
      if (length(splits) > 2) {
        swaps <- rev(bigram.unsafe(stri_reverse(splits[[1]][2])))
        sapply(1:length(swaps), function(i)
          paste0(splits[[i]][1], swaps[i], splits[[i+2]][2]), simplify=FALSE)
      } else {
        stri_reverse(splits[[1]][2])
      }

    replacements <- function(splits)
      if (length(splits) > 1) {
        sapply(1:(length(splits)-1), function(i)
          lapply(alphabet, function(symbol)
            paste0(splits[[i]][1], symbol, splits[[i+1]][2])))
      } else {
        alphabet
      }

    insertions <- function(splits)
      sapply(splits, function(pair)
        lapply(alphabet, function(symbol)
          paste0(pair[1], symbol, pair[2])))

    # 4. create a vector of all words at edit distance 1 given the input word
    edit.1 <- function(word) {
      splits <- binary.splits(word)
      unique(unlist(c(deletions(splits), transpositions(splits),
                      replacements(splits), insertions(splits))))
    }

    # 5. create a simple function to generate all words of edit distance 1 and 2
    edit.2 <- function(word) {
      e1 <- edit.1(word)
      unique(c(unlist(lapply(e1, edit.1)), e1))
    }
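
For reference, a minimal usage sketch (not part of the original post; it simply calls the functions above on the example word used in the benchmark further down):

    e1 <- edit.1("abcd")
    head(e1)       # a few candidate strings at edit distance 1
    length(e1)     # number of unique candidates at distance 1

    length(edit.2("abcd"))   # unique candidates at distance 1 or 2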

If you profile this code, you will see that replacements and insertions have nested "loops" (an lapply inside an sapply) and seem to take about ten times as long as deletions or transpositions, because they generate many more spelling variants.
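
As a rough sanity check on why that happens, the raw candidate counts (before unique() removes duplicates) produced by the four edit types for a word of length n can be tallied as follows; candidate.counts is a small helper introduced here for illustration only:

    # deletions: n, transpositions: n - 1,
    # replacements: 26 * n, insertions: 26 * (n + 1)
    candidate.counts <- function(n)
      c(deletions      = n,
        transpositions = n - 1,
        replacements   = 26 * n,
        insertions     = 26 * (n + 1))

    candidate.counts(4)   # for "abcd": 4, 3, 104, 130 -- replacements and
                          # insertions clearly dominate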

    library(rbenchmark)
    benchmark(edit.2('abcd'), replications=20)

This takes about 8 seconds on my Core i5 MacBook Air, while the corresponding Python benchmark (running the corresponding edits2 function 20 times) takes about 0.6 seconds, i.e. it is about 10-15 times faster!

I tried using expand.grid to get rid of the inner lapply, but this made the code slower, not faster. And I know that using lapply instead of sapply makes my code a bit faster, but I don't see the point of using the "wrong" function (I want a vector back) for a small speed gain. But perhaps generating the result of the edit.2 function can be done much faster in pure R?

2 answers

Performance of R's paste0 vs. Python's ''.join

The original post asks whether paste0 in R is 10 times slower than string concatenation in Python. If it is, then there is little hope of writing an algorithm that relies heavily on string concatenation in R and runs as fast as the corresponding Python algorithm.

I have

    > R.version.string
    [1] "R version 3.1.0 Patched (2014-05-31 r65803)"

and

    >>> sys.version
    '3.4.0 (default, Apr 11 2014, 13:05:11) \n[GCC 4.8.2]'

Here is the first comparison

    > library(microbenchmark)
    > microbenchmark(paste0("a", "b"), times=1e6)
    Unit: nanoseconds
                 expr min   lq median   uq      max neval
     paste0("a", "b") 951 1071   1162 1293 21794972 1e+06

(approximately 1 s for all replications) compared to

    >>> import timeit
    >>> timeit.timeit("''.join(x)", "x=('a', 'b')", number=int(1e6))
    0.119668865998392

I'm guessing this is roughly the 10x performance difference the original poster observed. However, R works better on vectors, and the algorithm involves vectors of words anyway, so we might be more interested in comparing

    > x = y = sample(LETTERS, 1e7, TRUE); system.time(z <- paste0(x, y))
       user  system elapsed
      1.479   0.009   1.488

and

    >>> setup = '''
    import random
    import string
    y = x = [random.choice(string.ascii_uppercase) for _ in range(10000000)]
    '''
    >>> timeit.Timer("map(''.join, zip(x, y))", setup=setup).repeat(1)
    [0.362522566007101]

This suggests we would be on the right track if our R algorithm ran at about 1/4 the speed of Python; the OP found a 10-fold difference, so there appears to be room for improvement.

Iteration versus vectorization

The OP uses iteration (lapply and friends) rather than vectorization. We can compare a vectorized version against different iteration approaches with the following

    f0 = paste0

    f1 = function(x, y)
        vapply(seq_along(x),
               function(i, x, y) paste0(x[i], y[i]),
               character(1), x, y)

    f2 = function(x, y) Map(paste0, x, y)

    f3 = function(x, y) {
        z = character(length(x))
        for (i in seq_along(x))
            z[i] = paste0(x[i], y[i])
        z
    }
    f3c = compiler::cmpfun(f3)    # explicitly compile

    f4 = function(x, y) {
        z = character()
        for (i in seq_along(x))
            z[i] = paste0(x[i], y[i])
        z
    }
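
Before timing, a quick sanity check (a sketch added here, not part of the original answer) that the variants agree on a small input:

    x <- y <- sample(LETTERS, 10, TRUE)
    stopifnot(identical(f0(x, y), f1(x, y)),
              identical(f0(x, y), unlist(f2(x, y), use.names = FALSE)),
              identical(f0(x, y), f3(x, y)),
              identical(f0(x, y), f3c(x, y)),
              identical(f0(x, y), f4(x, y)))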

Scaling back the data, defining the "vectorized" solution as f0, and comparing these approaches

    > x = y = sample(LETTERS, 100000, TRUE)
    > library(microbenchmark)
    > microbenchmark(f0(x, y), f1(x, y), f2(x, y), f3(x, y), f3c(x, y), times=5)
    Unit: milliseconds
          expr       min        lq    median        uq       max neval
      f0(x, y)  14.69877  14.70235  14.75409  14.98777  15.14739     5
      f1(x, y) 241.34212 250.19018 268.21613 279.01582 292.21065     5
      f2(x, y) 198.74594 199.07489 214.79558 229.50684 271.77853     5
      f3(x, y) 250.64388 251.88353 256.09757 280.04688 296.29095     5
     f3c(x, y) 174.15546 175.46522 200.09589 201.18543 214.18290     5

with f4 (which grows the result vector element by element rather than preallocating it, as f3 does) being too painfully slow to include:

    > system.time(f4(x, y))
       user  system elapsed
     24.325   0.000  24.330

So from this we see the point of Dr. Tierney's advice: there may well be a benefit to vectorizing those lapply calls.

Further vectorizing the updated original post

@fnl adapted the original code by partly unrolling the loops. There remain opportunities for more of the same; for example,

    replacements <- function(splits)
      if (length(splits$left) > 1) {
        lapply(1:(length(splits$left)-1), function(i)
          paste0(splits$left[i], alphabet, splits$right[i+1]))
      } else {
        splits$right[1]
      }

can be revised to make a single paste0 call, relying on argument recycling (short vectors are recycled until their lengths match the longer vectors):

    replacements1 <- function(splits)
      if (length(splits$left) > 1) {
        len <- length(splits$left)
        paste0(splits$left[-len],
               rep(alphabet, each = len - 1),
               splits$right[-1])
      } else {
        splits$right[1]
      }

The values come out in a different order, but that is not important for the algorithm. Dropping indexes (prefixing them with -) is potentially more memory efficient. Along the same lines:

    deletions1 <- function(splits)
      if (length(splits$left) > 1) {
        paste0(splits$left[-length(splits$left)], splits$right[-1])
      } else {
        splits$right[1]
      }

    insertions1 <- function(splits)
      paste0(splits$left,
             rep(alphabet, each=length(splits$left)),
             splits$right)
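
For readers less familiar with the two idioms these revisions rely on, here is a minimal illustration (toy values, not part of the algorithm) of argument recycling in paste0 and of dropping elements with negative indexes:

    # argument recycling: shorter arguments are reused to match the longest one
    paste0("x", c("a", "b", "c"))              # "xa" "xb" "xc"
    paste0(c("a", "b"), rep(1:3, each = 2))    # "a1" "b1" "a2" "b2" "a3" "b3"

    # negative indexing: drop the last / the first element
    v <- c("w", "o", "r", "d")
    v[-length(v)]   # "w" "o" "r"
    v[-1]           # "o" "r" "d"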

Then we have

    edit.1.1 <- function(word) {
      splits <- binary.splits(word)
      unique(c(deletions1(splits), transpositions(splits),
               replacements1(splits), insertions1(splits)))
    }

with some speed-up:

    > identical(sort(edit.1("word")), sort(edit.1.1("word")))
    [1] TRUE
    > microbenchmark(edit.1("word"), edit.1.1("word"))
    Unit: microseconds
                 expr     min       lq   median       uq     max neval
       edit.1("word") 354.125 358.7635 362.5260 372.9185 521.337   100
     edit.1.1("word") 296.575 298.9830 300.8305 307.3725 369.419   100
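
For completeness, a distance-2 wrapper built on edit.1.1 might look like the sketch below; it simply mirrors the OP's edit.2, and the name edit.2.1 is introduced here rather than taken from the original answer:

    edit.2.1 <- function(word) {
      e1 <- edit.1.1(word)
      unique(c(unlist(lapply(e1, edit.1.1)), e1))
    }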

The OP indicates that their original version was 10x slower than Python, and that their original modifications resulted in a 5x speed-up. We gain a further 1.2x speed-up, so we are perhaps at the expected performance of the algorithm using R's paste0. A next step is to ask whether alternative algorithms or implementations are more performant; in particular, substr might be promising.


Following @LukeTierney's advice in the comments on the question about vectorizing the paste0 calls and returning two vectors from binary.splits, I have edited the functions so that they are properly vectorized. I have also added the further changes described in @MartinMorgan's answer: dropping elements via negative indexes instead of selecting ranges (i.e., "[-1]" instead of "[2:n]", etc.; but NB: for multiple dropped indexes, as used in transpositions, this is actually slower), and, in particular, using rep to further vectorize the paste0 calls in replacements and insertions.

The result is the best answer (so far?) for implementing edit.2 in R (thanks, Luke and Martin!). In other words, with Luke's essential hints and Martin's subsequent improvements, the R implementation ends up about twice as fast as the Python code (but see Martin's closing comments in his answer above). (The functions edit.1, edit.2 and bigram.unsafe remain unchanged, as shown above.)

    binary.splits <- function(w) {
      n <- nchar(w)
      list(left  = stri_sub(w, rep(0, n + 1), 0:n),
           right = stri_sub(w, 1:(n + 1), rep(n, n + 1)))
    }

    deletions <- function(splits) {
      n <- length(splits$left)
      if (n > 1) paste0(splits$left[-n], splits$right[-1])
      else       splits$right[1]
    }

    transpositions <- function(splits)
      if (length(splits$left) > 2) {
        swaps <- rev(bigram.unsafe(stri_reverse(splits$right[1])))
        paste0(splits$left[1:length(swaps)], swaps,
               splits$right[3:length(splits$right)])
      } else {
        stri_reverse(splits$right[1])
      }

    replacements <- function(splits) {
      n <- length(splits$left)
      if (n > 1) paste0(splits$left[-n], rep(alphabet, each=n-1), splits$right[-1])
      else       alphabet
    }

    insertions <- function(splits)
      paste0(splits$left,
             rep(alphabet, each=length(splits$left)),
             splits$right)
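
To reproduce the comparison, the benchmark from the question can be re-run with these vectorized helpers in place (a sketch; the absolute timings quoted in this thread were measured on the posters' machines and will differ elsewhere):

    library(stringi)
    library(rbenchmark)
    alphabet <- letters

    # same call as in the question, now dispatching to the vectorized helpers
    benchmark(edit.2("abcd"), replications = 20)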

Overall, and to conclude this exercise, Luke's and Martin's suggestions made the R implementation about twice as fast as the Python code shown at the beginning, improving on my original code by roughly a factor of six. What bothers me in the end, however, are two issues: (1) the R code seems much more verbose (in lines of code, though it could be polished a bit), and (2) even a slight deviation from "correct vectorization" makes the R code perform horribly, whereas in Python small deviations from "correct Python" usually do not have such an extreme impact. Nevertheless, I will keep working on my "efficient R coding" skills - thanks to all the participants!
