Fast escaping / deparsing of character vectors in R

Question

To encode strings in JSON, several reserved characters must be escaped with a backslash, and each string must be wrapped in double quotes. The jsonlite package currently implements this using the deparse function in base R:

    deparse_vector <- function(x) {
      stopifnot(is.character(x))
      vapply(x, deparse, character(1), USE.NAMES=FALSE)
    }

This does the trick:

    test <- c("line\nline", "foo\\bar", "I said: \"hi!\"")
    cat(deparse_vector(test))

However, deparse is slow for large vectors. An alternative implementation is to gsub each character separately:

    deparse_vector2 <- function(x) {
      stopifnot(is.character(x))
      if(!length(x)) return(x)
      x <- gsub("\\", "\\\\", x, fixed=TRUE)
      x <- gsub("\"", "\\\"", x, fixed=TRUE)
      x <- gsub("\n", "\\n", x, fixed=TRUE)
      x <- gsub("\r", "\\r", x, fixed=TRUE)
      x <- gsub("\t", "\\t", x, fixed=TRUE)
      x <- gsub("\b", "\\b", x, fixed=TRUE)
      x <- gsub("\f", "\\f", x, fixed=TRUE)
      paste0("\"", x, "\"")
    }

This is a bit faster, but not by much, and it is quite ugly. What would be a better way to do this (preferably without additional dependencies)?

This script can be used to compare implementations:

    > system.time(out1 <- deparse_vector1(strings))
       user  system elapsed
      6.517   0.000   6.523
    > system.time(out2 <- deparse_vector2(strings))
       user  system elapsed
      1.194   0.000   1.194

+7

regex r escaping gsub

Jeroen Sep 01 '14 at 15:38

4 answers

Here is a C++ version of Winston's code. It is much simpler because you can grow a std::string efficiently. It is also less likely to crash because Rcpp takes care of memory management for you.

    #include <Rcpp.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    std::string escape_one(std::string x) {
      std::string out = "\"";

      int n = x.size();
      for (int i = 0; i < n; ++i) {
        char cur = x[i];

        switch(cur) {
          case '\\': out += "\\\\"; break;
          case '"':  out += "\\\""; break;
          case '\n': out += "\\n"; break;
          case '\r': out += "\\r"; break;
          case '\t': out += "\\t"; break;
          case '\b': out += "\\b"; break;
          case '\f': out += "\\f"; break;
          default:   out += cur;
        }
      }

      out += '"';

      return out;
    }

    // [[Rcpp::export]]
    CharacterVector escape_chars(CharacterVector x) {
      int n = x.size();
      CharacterVector out(n);

      for (int i = 0; i < n; ++i) {
        String cur = x[i];
        out[i] = escape_one(cur);
      }

      return out;
    }

In your test, deparse_vector2(strings) takes 0.8 s and escape_chars(strings) takes 0.165 s.

+6

hadley Sep 2 '14 at 12:59

You can also try stri_escape_unicode from the stringi package (I know you prefer a solution without additional dependencies, but it may be useful for future readers), which is about 3 times faster than deparse_vector2 and about 7 times faster than deparse_vector.

    require(stringi)

Function definition

    deparse_vector3 <- function(x){
      paste0("\"", stri_escape_unicode(x), "\"")
    }

Checking that all functions give the same result

    all.equal(deparse_vector2(test), deparse_vector3(test))
    ## [1] TRUE
    all.equal(deparse_vector(test), deparse_vector3(test))
    ## [1] TRUE

Some benchmarks

    library(microbenchmark)
    microbenchmark(deparse_vector(test),
                   deparse_vector2(test),
                   deparse_vector3(test), times = 1000L)
    # Unit: microseconds
    #                   expr    min      lq  median      uq      max neval
    #   deparse_vector(test) 98.548 102.654 104.707 111.380 2500.653  1000
    #  deparse_vector2(test) 43.114  46.707  48.761  51.327  401.377  1000
    #  deparse_vector3(test) 14.885  16.938  18.991  20.018  240.211  1000  <-- Clear winner

+3

David Arenburg Sep 2 '14 at 23:21

Another take on this problem, which exploits a couple of facts.

For a string x of length n, we know that the escaped contents will be at least n and at most 2 * n characters long (plus the two enclosing quotes). We can take advantage of this to ensure that memory is allocated only once, rather than relying on containers that grow (albeit efficiently).

Note that I use a C++11 shared_ptr here, since I am doing ugly things with raw memory (and want it to be cleaned up automatically). Over-allocating lets me avoid an initial pass that counts the matches, at the cost of allocating somewhat more than needed (the case where every single character needs to be escaped will be rare).

It would be relatively easy to adapt this to a pure C solution, I think, but it would be harder to ensure that the memory is freed properly.

    #include <cstring>
    #include <memory>
    #include <Rcpp.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    void escape_one_fill(CharacterVector const& x, int i, CharacterVector& output) {

      auto xi = CHAR(STRING_ELT(x, i));
      int n = strlen(xi);

      // Over-allocate memory -- in the worst case every character is escaped
      // (2 * n), plus 2 for the enclosing quotes and 1 for the trailing \0.
      // The array deleter makes sure the buffer is released with delete[].
      std::shared_ptr<char> out(new char[n * 2 + 3], std::default_delete<char[]>());
      char* buf = out.get();
      int counter = 0;

      buf[counter++] = '"';

      #define HANDLE_CASE(X, Y) \
        case X: \
          buf[counter++] = '\\'; \
          buf[counter++] = Y; \
          break;

      for (int j = 0; j < n; ++j) {
        switch (xi[j]) {
          HANDLE_CASE('\\', '\\');
          HANDLE_CASE('"', '"');
          HANDLE_CASE('\n', 'n');
          HANDLE_CASE('\r', 'r');
          HANDLE_CASE('\t', 't');
          HANDLE_CASE('\b', 'b');
          HANDLE_CASE('\f', 'f');
          default: buf[counter++] = xi[j];
        }
      }

      buf[counter++] = '"';

      // Set a NUL so that Rf_mkChar does what it should
      buf[counter++] = '\0';
      SET_STRING_ELT(output, i, Rf_mkChar(buf));
    }

    // [[Rcpp::export]]
    CharacterVector escape_chars_with_fill(CharacterVector x) {
      int n = x.size();
      CharacterVector out(n);

      for (int i = 0; i < n; ++i) {
        escape_one_fill(x, i, out);
      }

      return out;
    }

Benchmarking this (comparing only against Hadley's implementation), I get:

    > mychars <- c(letters, " ", '"', "\\", "\t", "\n", "\r", "'", "/", "#", "$");
    > createstring <- function(length){
    +   paste(mychars[ceiling(runif(length, 0, length(mychars)))], collapse="")
    + }
    > strings <- vapply(rep(1000, 10000), createstring, character(1), USE.NAMES=FALSE)
    > system.time(escape_chars(strings))
       user  system elapsed
       0.14    0.00    0.14
    > system.time(escape_chars_with_fill(strings))
       user  system elapsed
      0.080   0.001   0.081
    > identical(escape_chars(strings), escape_chars_with_fill(strings))
    [1] TRUE
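The single-allocation idea can also be sketched in plain C++ without raw memory. This is only a minimal illustration, not a drop-in replacement for the Rcpp code in this answer: it assumes the same worst-case bound (every character escaped, plus the two quotes) and lets std::string::reserve request the space up front. The name escape_reserved is hypothetical.

```cpp
#include <string>

// Hypothetical helper (not part of the answers in this thread): escapes one
// string after reserving the worst-case output size up front.
std::string escape_reserved(const std::string& x) {
  std::string out;
  // Worst case: every character becomes two, plus the two enclosing quotes.
  // With this capacity reserved, the appends below never reallocate.
  out.reserve(x.size() * 2 + 2);

  out += '"';
  for (char c : x) {
    switch (c) {
      case '\\': out += "\\\\"; break;
      case '"':  out += "\\\""; break;
      case '\n': out += "\\n";  break;
      case '\r': out += "\\r";  break;
      case '\t': out += "\\t";  break;
      case '\b': out += "\\b";  break;
      case '\f': out += "\\f";  break;
      default:   out += c;
    }
  }
  out += '"';
  return out;
}
```

Compared with the raw buffer, this trades a little bookkeeping overhead inside std::string for automatic cleanup and no manual NUL termination; whether it is as fast as the hand-rolled version would need to be measured.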

+2

Kevin Ushey Sep 10 '14 at 4:47

wch · Accepted Answer · 2014-09-01T23:06:52+0000

I don't know of a faster way to do this using only R code, but I decided to try my hand at implementing it in C, wrapped in an R function called deparse_vector3. It's crude (and I'm far from an expert C programmer), but it seems to work for your examples: https://gist.github.com/wch/e3ec5b20eb712f1b22b2

On my system (Mac, R 3.1.1), deparse_vector2 is over 20x faster than deparse_vector, which is much more than the 5x you got in your test.

My deparse_vector3 function is only about 3x faster than deparse_vector2. There is probably room for improvement.

    > system.time(out1 <- deparse_vector1(strings))
       user  system elapsed
      8.459   0.009   8.470
    > system.time(out2 <- deparse_vector2(strings))
       user  system elapsed
      0.368   0.007   0.374
    > system.time(out3 <- deparse_vector3(strings))
       user  system elapsed
      0.120   0.001   0.120

I don't think this will correctly handle character encodings other than ASCII. Here is an example of how encodings are handled in the R source: https://github.com/wch/r-source/blob/trunk/src/main/grep.c#L704-L739

Edit: It seems to handle UTF-8 fine, although it is possible that I missed something in my testing.
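One likely reason byte-wise escaping works for UTF-8: every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for one of the ASCII characters being escaped. The sketch below, in plain C++ with hypothetical helper names (it is not the code from the gist), escapes byte by byte without decoding. The same guarantee does not hold for every multi-byte encoding.

```cpp
#include <string>

// Hypothetical helper: returns the escape letter for a byte, or 0 if the
// byte should be copied through unchanged. Only ASCII bytes ever match.
char escape_letter(unsigned char byte) {
  switch (byte) {
    case '\\': return '\\';
    case '"':  return '"';
    case '\n': return 'n';
    case '\r': return 'r';
    case '\t': return 't';
    case '\b': return 'b';
    case '\f': return 'f';
    default:   return 0;
  }
}

// Escapes byte by byte without decoding. Bytes of multi-byte UTF-8
// sequences are all >= 0x80, so they always take the copy-through branch.
std::string escape_bytes(const std::string& utf8) {
  std::string out = "\"";
  for (unsigned char byte : utf8) {
    char letter = escape_letter(byte);
    if (letter != 0) {
      out += '\\';
      out += letter;
    } else {
      out += static_cast<char>(byte);
    }
  }
  out += '"';
  return out;
}
```

For example, the two bytes of "é" (0xC3 0xA9) pass through untouched, while a tab next to them is still escaped.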
