The difficulty is that a data frame is a collection of vectors of potentially different types, so we need a way to order them independently of those types (integer, character, ...). In dplyr, we developed what we call vector visitors. For this specific task, we need the OrderVisitor set of classes, which share the following interface:
class OrderVisitor {
public:
    virtual ~OrderVisitor(){}
    virtual bool equal(int i, int j) const = 0 ;
    virtual bool before(int i, int j) const = 0 ;
    virtual SEXP get() = 0 ;
} ;
dplyr then has OrderVisitor implementations for all the types that we support in this file, and we have the dispatch function order_visitor, which creates an OrderVisitor* from a vector.
With this, we can store a set of vector visitors in a std::vector<OrderVisitor*>. The OrderVisitors class holds such a set; its constructor takes a DataFrame and a CharacterVector with the names of the vectors that we want to use for ordering.
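To make the idea concrete without pulling in R headers, here is a minimal sketch of the same pattern in plain C++. It is not dplyr's code: OrderVisitorSketch, TypedOrderVisitor and make_order_visitor are hypothetical names, std::vector stands in for R vectors, and the factory dispatches at compile time through a template, whereas order_visitor has to pick an implementation from the runtime type of the vector.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified interface: the same equal/before contract as
// OrderVisitor, without the R-specific get() method.
class OrderVisitorSketch {
public:
    virtual ~OrderVisitorSketch() {}
    virtual bool equal(int i, int j) const = 0;
    virtual bool before(int i, int j) const = 0;
};

// One template covers every element type that supports == and <.
template <typename T>
class TypedOrderVisitor : public OrderVisitorSketch {
public:
    explicit TypedOrderVisitor(const std::vector<T>& column) : column_(column) {}

    bool equal(int i, int j) const override { return column_[i] == column_[j]; }
    bool before(int i, int j) const override { return column_[i] < column_[j]; }

private:
    const std::vector<T>& column_;  // not owned: the column must outlive the visitor
};

// Stand-in for a dispatch function: callers only ever see the base class,
// so visitors for different element types can share one container.
template <typename T>
std::unique_ptr<OrderVisitorSketch> make_order_visitor(const std::vector<T>& column) {
    return std::unique_ptr<OrderVisitorSketch>(new TypedOrderVisitor<T>(column));
}
```

Once the element type is hidden behind the base class, code that compares rows i and j no longer needs to know whether it is looking at integers or strings.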
OrderVisitors o(data, names ) ;
Then we can use the OrderVisitors.apply method, which essentially does lexicographic ordering:
IntegerVector index = o.apply() ;
The apply method is implemented by simply initializing an IntegerVector with 0..n-1 and then sorting it with std::sort according to the visitors.
inline Rcpp::IntegerVector OrderVisitors::apply() const {
    IntegerVector x = seq(0, nrows - 1) ;
    std::sort( x.begin(), x.end(), OrderVisitors_Compare(*this) ) ;
    return x ;
}
The important part is how the OrderVisitors_Compare class implements operator()(int,int):
inline bool operator()(int i, int j) const {
    if( i == j ) return false ;
    for( int k=0; k<n; k++)
        if( ! obj.visitors[k]->equal(i,j) )
            return obj.visitors[k]->before(i, j) ;
    return i < j ;
}
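The comparison logic can be tried outside of R with a small self-contained sketch. The names ColumnOrder, column_order and order_rows are made up for this illustration and are not part of dplyr; each ColumnOrder plays the role of one visitor, and std::vector replaces IntegerVector.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-in for one visitor: two callbacks over row indices.
struct ColumnOrder {
    std::function<bool(int, int)> equal;
    std::function<bool(int, int)> before;
};

// Builds the callbacks for a column of any type supporting == and <.
// The column is captured by reference, so it must outlive the result.
template <typename T>
ColumnOrder column_order(const std::vector<T>& column) {
    return ColumnOrder{
        [&column](int i, int j) { return column[i] == column[j]; },
        [&column](int i, int j) { return column[i] < column[j]; }
    };
}

// Returns the 0-based row order: sort by the first column, break ties with
// the next one, and fall back to the row index when every column ties.
std::vector<int> order_rows(const std::vector<ColumnOrder>& columns, int nrows) {
    std::vector<int> index(nrows);
    std::iota(index.begin(), index.end(), 0);  // 0, 1, ..., nrows - 1
    std::sort(index.begin(), index.end(),
              [&columns](int i, int j) -> bool {
                  if (i == j) return false;
                  for (const ColumnOrder& c : columns) {
                      if (!c.equal(i, j)) return c.before(i, j);
                  }
                  return i < j;  // all keys equal: keep the original order
              });
    return index;
}
```

The final `i < j` fallback is what keeps the result deterministic: std::sort is not a stable sort, but with the row index as the last key no two distinct rows ever compare as equivalent.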
At this point, index holds the integer indices of the rows in sorted order, so we just need to build a new DataFrame by subsetting data with these indices. To do this, we have another kind of visitor, encapsulated in the DataFrameVisitors class. First we create a DataFrameVisitors:
DataFrameVisitors visitors( data ) ;
This encapsulates a std::vector<VectorVisitor*>. Each of these VectorVisitor* knows how to subset itself using an integer vector index. This is used by DataFrameVisitors.subset:
template <typename Container>
DataFrame subset( const Container& index, const CharacterVector& classes ) const {
    List out(nvisitors);
    for( int k=0; k<nvisitors; k++){
        out[k] = get(k)->subset(index) ;
    }
    structure( out, Rf_length(out[0]), classes) ;
    return (SEXP)out ;
}
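The subsetting step follows the same visitor idea, and a plain C++ analogue may help show it in isolation. Everything named here (ColumnSketch, TypedColumn, FrameSketch, subset_frame) is hypothetical and only mirrors the shape of the dplyr code: each column knows how to subset itself, and the frame-level function just loops over the columns.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type-erased column: it can produce a subset of itself.
class ColumnSketch {
public:
    virtual ~ColumnSketch() {}
    virtual std::unique_ptr<ColumnSketch> subset(const std::vector<int>& index) const = 0;
    virtual std::size_t size() const = 0;
};

template <typename T>
class TypedColumn : public ColumnSketch {
public:
    explicit TypedColumn(std::vector<T> values) : values_(std::move(values)) {}

    // Picks the elements at the given 0-based positions, in that order.
    std::unique_ptr<ColumnSketch> subset(const std::vector<int>& index) const override {
        std::vector<T> out;
        out.reserve(index.size());
        for (int i : index) out.push_back(values_[i]);
        return std::unique_ptr<ColumnSketch>(new TypedColumn<T>(std::move(out)));
    }

    std::size_t size() const override { return values_.size(); }
    const std::vector<T>& values() const { return values_; }

private:
    std::vector<T> values_;
};

// A "data frame" in this sketch is just a list of columns.
using FrameSketch = std::vector<std::unique_ptr<ColumnSketch>>;

// Applies the same row index to every column, whatever its element type.
FrameSketch subset_frame(const FrameSketch& frame, const std::vector<int>& index) {
    FrameSketch out;
    out.reserve(frame.size());
    for (const auto& column : frame) out.push_back(column->subset(index));
    return out;
}
```

Passing an index produced by a sort therefore reorders all columns consistently, which is the whole point of computing the order once and subsetting afterwards.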
To wrap this up, here is a simple function that uses the tools developed in dplyr:
#include <dplyr.h>
// [[Rcpp::depends(dplyr)]]

using namespace Rcpp ;
using namespace dplyr ;

// [[Rcpp::export]]
DataFrame myFunc(DataFrame data, CharacterVector names) {
    OrderVisitors o(data, names ) ;
    IntegerVector index = o.apply() ;

    DataFrameVisitors visitors( data ) ;
    DataFrame res = visitors.subset(index, "data.frame" ) ;
    return res ;
}