Order column data block in Rcpp

Is there an easy way to order a DataFrame with two (or more or one) of its columns in RCpp?

There are many sorting algorithms available on the net, or can I use std::sort with a wrapper for a DataFrame, but I was wondering if there is anything already in RCpp or RCppArmadillo?

I need to do this sorting / ordering as part of another function

 DataFrame myFunc(DataFrame myDF, NumericVector x) { //// some code here DataFrame myDFsorted = sort (myDF, someColName1, someColName2) // how to sort?? //// some code here } 

I would like to avoid accessing the R order function in RCpp (to preserve the speed of the RCpp code).

Many thanks

+8
r rcpp
source share
2 answers

The difficulty is that a data frame is a collection of vectors of potentially different types; We need a way to arrange them independently of these types (integer, character, ...). At dplyr, we developed what we call bills visitors. For this specific task, we need the OrderVisitor set, which has the following interface:

 class OrderVisitor { public: virtual ~OrderVisitor(){} /** are the elements at indices i and j equal */ virtual bool equal(int i, int j) const = 0 ; /** is the i element less than the j element */ virtual bool before( int i, int j) const = 0 ; virtual SEXP get() = 0 ; } ; 

dplyr then has OrderVisitor implementations for all types that we support in this file , and we have the dispatch function order_visitor , which makes OrderVisitor* from the vector.

With this we can save a set of visitors to vectors in std::vector<OrderVisitor*> ; OrderVisitors has a constructor with the DataFrame and CharacterVector vector names that we want to use for ordering.

 OrderVisitors o(data, names ) ; 

Then we can use the OrderVisitors.apply method, which essentially does lexicographic ordering:

 IntegerVector index = o.apply() ; 

The apply method is implemented by simply initializing the IntegerVector with 0..n and then std::sort according to the visitors.

 inline Rcpp::IntegerVector OrderVisitors::apply() const { IntegerVector x = seq(0, nrows -1 ) ; std::sort( x.begin(), x.end(), OrderVisitors_Compare(*this) ) ; return x ; } 

It is important here how the OrderVisitors_Compare class implements operator()(int,int) :

 inline bool operator()(int i, int j) const { if( i == j ) return false ; for( int k=0; k<n; k++) if( ! obj.visitors[k]->equal(i,j) ) return obj.visitors[k]->before(i, j ) ; return i < j ; } 

So, at this point, index gives us integer indices of sorted data, we just need to make a new DataFrame from data by a subset of data with these indices. To do this, we have another kind of visitors, enclosed in the DataFrameVisitors class. First we create a DataFrameVisitors :

 DataFrameVisitors visitors( data ) ; 

This encapsulates a std::vector<VectorVisitor*> . Each of these VectorVisitor* knows how to subset itself using an integer vector index. This is used from DataFrameVisitors.subset :

 template <typename Container> DataFrame subset( const Container& index, const CharacterVector& classes ) const { List out(nvisitors); for( int k=0; k<nvisitors; k++){ out[k] = get(k)->subset(index) ; } structure( out, Rf_length(out[0]) , classes) ; return (SEXP)out ; } 

To wrap this up, here is a simple function using tools developed by dplyr:

 #include <dplyr.h> // [[Rcpp::depends(dplyr)]] using namespace Rcpp ; using namespace dplyr ; // [[Rcpp::export]] DataFrame myFunc(DataFrame data, CharacterVector names) { OrderVisitors o(data, names ) ; IntegerVector index = o.apply() ; DataFrameVisitors visitors( data ) ; DataFrame res = visitors.subset(index, "data.frame" ) ; return res ; } 
+11
source share

Since data.frame is really a list of columns in C ++, you will have to reorder all your columns individually based on the new ording index. This is different from how indexing [.., ..] works in R for a data.frame .

See this Rcpp Gallery article on sorting vectors for some pointers. You will probably have to provide a new ordering index that will be used, after which it will be just a matter of indexing - and this one also has some entries in the Gallery.

This SO post can help you get started creating the index; this bytes.com comment discusses the same idea.

Edit: both Armadillo has a function sort_index() and stable_sort_index() to create the index needed to reinstall the columns. This applies only to the single-column case and is limited to numeric columns, but is the beginning.

+3
source share

All Articles