Performance of R's paste0 vs. Python's ''.join
The original title asks whether paste0 in R is 10 times slower than string concatenation in Python. If so, there would be no hope of writing an algorithm that depends heavily on string concatenation in R and runs as fast as the corresponding Python algorithm.
I have
    > R.version.string
    [1] "R version 3.1.0 Patched (2014-05-31 r65803)"
and
    >>> sys.version
    '3.4.0 (default, Apr 11 2014, 13:05:11) \n[GCC 4.8.2]'
Here is the first comparison
    > library(microbenchmark)
    > microbenchmark(paste0("a", "b"), times=1e6)
    Unit: nanoseconds
                 expr min   lq median   uq      max neval
     paste0("a", "b") 951 1071   1162 1293 21794972 1e+06
(roughly 1 µs per call, or about 1 s for all 10^6 replications) compared to
    >>> import timeit
    >>> timeit.timeit("''.join(x)", "x=('a', 'b')", number=int(1e6))
    0.119668865998392
I assume this is roughly the 10x performance difference the original poster observed. However, R works better on vectors, and the algorithm involves vectors of words anyway, so we might be more interested in the comparison
    > x = y = sample(LETTERS, 1e7, TRUE); system.time(z <- paste0(x, y))
       user  system elapsed
      1.479   0.009   1.488
and
    >>> setup = '''
    import random
    import string
    y = x = [random.choice(string.ascii_uppercase) for _ in range(10000000)]
    '''
    >>> timeit.Timer("map(''.join, zip(x, y))", setup=setup).repeat(1)
    [0.362522566007101]
This suggests (about 1.49 s in R versus about 0.36 s in Python) that we are on the right track if our R algorithm runs at roughly 1/4 of the Python speed; the OP found a 10-fold difference, so there looks to be room for improvement.
Iteration versus vectorization
The OP iterates (lapply and friends) rather than operating on vectors. We can compare the vectorized version with several iteration approaches using the following functions:
    f0 = paste0

    f1 = function(x, y)
        vapply(seq_along(x),
               function(i, x, y) paste0(x[i], y[i]),
               character(1), x, y)

    f2 = function(x, y)
        Map(paste0, x, y)

    f3 = function(x, y) {
        z = character(length(x))
        for (i in seq_along(x))
            z[i] = paste0(x[i], y[i])
        z
    }

    f3c = compiler::cmpfun(f3)    # explicitly compile

    f4 = function(x, y) {
        z = character()
        for (i in seq_along(x))
            z[i] = paste0(x[i], y[i])
        z
    }
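Before benchmarking, it is worth confirming that the implementations agree; a minimal sanity check of my own (not part of the original answer):

    ## All variants should produce the same strings on a small input;
    ## f2 returns a named list, so unlist it for the comparison.
    x0 <- y0 <- sample(LETTERS, 10, TRUE)
    stopifnot(
        identical(f0(x0, y0), f1(x0, y0)),
        identical(f0(x0, y0), unlist(f2(x0, y0), use.names = FALSE)),
        identical(f0(x0, y0), f3(x0, y0)),
        identical(f0(x0, y0), f3c(x0, y0)),
        identical(f0(x0, y0), f4(x0, y0))
    )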
Scaling the data back, taking f0 as the "vectorized" solution, and comparing these approaches:
    > x = y = sample(LETTERS, 100000, TRUE)
    > library(microbenchmark)
    > microbenchmark(f0(x, y), f1(x, y), f2(x, y), f3(x, y), f3c(x, y), times=5)
    Unit: milliseconds
          expr       min        lq    median        uq       max neval
      f0(x, y)  14.69877  14.70235  14.75409  14.98777  15.14739     5
      f1(x, y) 241.34212 250.19018 268.21613 279.01582 292.21065     5
      f2(x, y) 198.74594 199.07489 214.79558 229.50684 271.77853     5
      f3(x, y) 250.64388 251.88353 256.09757 280.04688 296.29095     5
     f3c(x, y) 174.15546 175.46522 200.09589 201.18543 214.18290     5
All of the iterated versions are roughly an order of magnitude slower than the vectorized f0, and f4 is too painfully slow to include in the benchmark:
    > system.time(f4(x, y))
       user  system elapsed
     24.325   0.000  24.330
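The difference is that f4 grows its result one element at a time, which can force a reallocation and copy on each extension, so (at least in R versions of this vintage) its cost grows much faster than linearly, whereas f3 pre-allocates. A rough illustration of my own (timings will vary):

    ## My illustration (not from the original answer): time the pre-allocating
    ## f3 against the growing f4 at increasing sizes; f4's elapsed time grows
    ## much faster than f3's.
    for (n in c(10000, 20000, 40000)) {
        xs <- ys <- sample(LETTERS, n, TRUE)
        t3 <- system.time(f3(xs, ys))[["elapsed"]]
        t4 <- system.time(f4(xs, ys))[["elapsed"]]
        cat("n =", n, "-- f3:", round(t3, 2), "s  f4:", round(t4, 2), "s\n")
    }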
From this one can see the merit of Dr. Tierney's advice that there may be a benefit to vectorizing those lapply calls.
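As an aside of my own (not part of the original comparison), the explicit compiler::cmpfun step used for f3c can also be applied globally through the byte-code JIT; for code dominated by paste0 calls the gain is likely modest, as the f3c timings already suggest.

    ## Optional (my aside): ask R to byte-compile closures automatically on
    ## first use; later R versions enable a JIT by default.
    library(compiler)
    enableJIT(3)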
Further vectorizing the updated original post
@fnl adapted the original code by partially unrolling the loops. Opportunities for more of the same remain; for example,
    replacements <- function(splits) if (length(splits$left) > 1) {
        lapply(1:(length(splits$left)-1), function(i)
            paste0(splits$left[i], alphabet, splits$right[i+1]))
    } else {
        splits$right[1]
    }
can be revised to make a single call to paste0, relying on argument recycling (shorter vectors are recycled until they match the length of the longest vector):
    replacements1 <- function(splits) if (length(splits$left) > 1) {
        len <- length(splits$left)
        paste0(splits$left[-len],
               rep(alphabet, each = len - 1),
               splits$right[-1])
    } else {
        splits$right[1]
    }
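As a toy illustration of the recycling rule being relied on here (my example, not from the original post):

    ## The length-1 "-" and the length-2 c("A", "B") are both recycled to
    ## the length of the longest argument (4):
    paste0(c("p", "q", "r", "s"), "-", c("A", "B"))
    ## [1] "p-A" "q-B" "r-A" "s-B"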
The values come out in a different order than in the original, but that is not important for the algorithm. Dropping elements with negative subscripts (the - prefix) is potentially more memory efficient. Similarly, the deletions and insertions become
    deletions1 <- function(splits) if (length(splits$left) > 1) {
        paste0(splits$left[-length(splits$left)], splits$right[-1])
    } else {
        splits$right[1]
    }

    insertions1 <- function(splits)
        paste0(splits$left,
               rep(alphabet, each = length(splits$left)),
               splits$right)
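The [-len] and [-1] subscripts above are negative indexes, which drop elements rather than select them; a toy example of my own:

    v <- c("", "w", "wo", "wor")
    v[-length(v)]   # all but the last:  ""  "w"  "wo"
    v[-1]           # all but the first: "w" "wo" "wor"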
Then we have
    edit.1.1 <- function(word) {
        splits <- binary.splits(word)
        unique(c(deletions1(splits),
                 transpositions(splits),
                 replacements1(splits),
                 insertions1(splits)))
    }
with a modest speed-up:
    > identical(sort(edit.1("word")), sort(edit.1.1("word")))
    [1] TRUE
    > microbenchmark(edit.1("word"), edit.1.1("word"))
    Unit: microseconds
                 expr     min       lq   median       uq     max neval
       edit.1("word") 354.125 358.7635 362.5260 372.9185 521.337   100
     edit.1.1("word") 296.575 298.9830 300.8305 307.3725 369.419   100
The OP reports that their initial implementation was 10 times slower than Python, and that their initial modifications produced about a 5x speed-up. We gain a further ~1.2x here, so we are perhaps near the expected performance of this algorithm when it relies on R's paste0. The next step is to ask whether alternative algorithms or implementations are more efficient; in particular, substr might be promising.
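For example, here is a sketch of my own of what a substr-based split might look like (binary.splits itself is not shown above, so this is speculative and the helper name is hypothetical):

    ## Hypothetical sketch (mine, not the author's code): build all
    ## prefix/suffix pairs of a word with vectorized substr calls.
    binary.splits.substr <- function(word) {
        n <- nchar(word)
        i <- 0:n
        list(left  = substr(rep(word, n + 1), 1, i),
             right = substr(rep(word, n + 1), i + 1, n))
    }
    binary.splits.substr("word")
    ## $left:  ""  "w"  "wo"  "wor"  "word"
    ## $right: "word"  "ord"  "rd"  "d"  ""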