The Rcpp tab version is slower; where is it how to understand

Question

The Rcpp tab version is slower; where is it how to understand

In the process of creating some sampling functions for already aggregated data, I found that the table was quite slow in relation to the size data I am working with. I tried two improvements, first the Rcpp function as follows

// [[Rcpp::export]] IntegerVector getcts(NumericVector x, int m) { IntegerVector cts(m); int t; for (int i = 0; i < x.length(); i++) { t = x[i] - 1; if (0 <= t && t < m) cts[t]++; } return cts; }

And then, trying to understand why the table was rather slow, I found that it was based on tabulation. Tablature works well for me and faster than the Rcpp version. Tab Code:

https://github.com/wch/r-source/blob/545d365bd0485e5f0913a7d609c2c21d1f43145a/src/main/util.c#L2204

With key line:

 for(R_xlen_t i = 0 ; i < n ; i++) if (x[i] != NA_INTEGER && x[i] > 0 && x[i] <= nb) y[x[i] - 1]++;

Now the key parts of the tab and my version of Rcpp look pretty close (I didn't worry about dealing with NA).

Q1: Why is my version of Rcpp 3 times slower?

Q2: How can I find out where this time is going?

I would really appreciate it to be time, but a good way to profile the code would be even better. My C ++ skills are just that, but it seems simple enough that I should (cross my fingers) avoid any stupid things that triple my time.

My time code:

 max_x <- 100 xs <- sample(seq(max_x), size = 50000000, replace = TRUE) system.time(getcts(xs, max_x)) system.time(tabulate(xs))

This gives 0.318 for getcts and 0.126 for tabs.

+5

r rcpp

kasterma Jun 23 '15 at 11:20

source share

1 answer

Artem klevtsov · Accepted Answer · 2016-05-11T07:56:00+0000

Your function calls the length method in each iteration of the loop. The compiler does not seem to cache it. To fix this vector storage size in a separate variable or use a range-based loop. Also note that we really do not need to explicitly check for missing values, because in C ++ all comparisons with NaN always return false .

Let me compare the performance:

 #include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] IntegerVector tabulate1(const IntegerVector& x, const unsigned max) { IntegerVector counts(max); for (std::size_t i = 0; i < x.size(); i++) { if (x[i] > 0 && x[i] <= max) counts[x[i] - 1]++; } return counts; } // [[Rcpp::export]] IntegerVector tabulate2(const IntegerVector& x, const unsigned max) { IntegerVector counts(max); std::size_t n = x.size(); for (std::size_t i = 0; i < n; i++) { if (x[i] > 0 && x[i] <= max) counts[x[i] - 1]++; } return counts; } // [[Rcpp::plugins(cpp11)]] // [[Rcpp::export]] IntegerVector tabulate3(const IntegerVector& x, const unsigned max) { IntegerVector counts(max); for (auto& now : x) { if (now > 0 && now <= max) counts[now - 1]++; } return counts; } // [[Rcpp::plugins(cpp11)]] // [[Rcpp::export]] IntegerVector tabulate4(const IntegerVector& x, const unsigned max) { IntegerVector counts(max); for (auto it = x.begin(); it != x.end(); it++) { if (*it > 0 && *it <= max) counts[*it - 1]++; } return counts; } /***R library(microbenchmark) x <- sample(10, 1e5, rep = TRUE) microbenchmark( tabulate(x, 10), tabulate1(x, 10), tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10) ) x[sample(10e5, 10e3)] <- NA microbenchmark( tabulate(x, 10), tabulate1(x, 10), tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10) ) */

tabulate1 is the original version.

Test results:

Without NA :

 Unit: microseconds expr min lq mean median uq max neval tabulate(x, 10) 143.557 146.8355 169.2820 156.1970 177.327 286.370 100 tabulate1(x, 10) 390.706 392.6045 437.7357 416.5655 443.065 748.767 100 tabulate2(x, 10) 108.149 111.4345 139.7579 118.2735 153.118 337.647 100 tabulate3(x, 10) 107.879 111.7305 138.2711 118.8650 139.598 300.023 100 tabulate4(x, 10) 391.003 393.4530 436.3063 420.1915 444.048 777.862 100

With NA :

 Unit: microseconds expr min lq mean median uq max neval tabulate(x, 10) 943.555 1089.5200 1614.804 1333.806 2042.320 3986.836 100 tabulate1(x, 10) 4523.076 4787.3745 5258.490 4929.586 5624.098 7233.029 100 tabulate2(x, 10) 765.102 931.9935 1361.747 1113.550 1679.024 3436.356 100 tabulate3(x, 10) 773.358 914.4980 1350.164 1140.018 1642.354 3633.429 100 tabulate4(x, 10) 4241.025 4466.8735 4933.672 4717.016 5148.842 8603.838 100

The tabulate4 function, which uses an iterator, is also slower than tabulate . We can improve it:

 // [[Rcpp::plugins(cpp11)]] // [[Rcpp::export]] IntegerVector tabulate4(const IntegerVector& x, const unsigned max) { IntegerVector counts(max); auto start = x.begin(); auto end = x.end(); for (auto it = start; it != end; it++) { if (*(it) > 0 && *(it) <= max) counts[*(it) - 1]++; } return counts; }

The Rcpp tab version is slower; where is it how to understand

More articles: