Testing if rows of a matrix or data frame are sorted in R

What is an effective way to check the sorting of rows in a matrix? [Update: see Aaron Rcpp's answer - directly and very quickly.]

I am porting code that uses issorted(,'rows') from Matlab . It seems that is.unsorted doesn't go beyond vectors, I'm writing or looking for something else. The naive method is to verify that the sorted version of the matrix (or data frame) is the same as the original, but which is clearly inefficient.

NB: to sort aa sortrows() in Matlab, my code essentially uses SortedDF <- DF[do.call(order, DF),] (it completes a large function that converts matrices into data frames, passes parameters to order , etc. d.). I would not be surprised if there are faster implementations (a data table comes to mind).


Update 1: To clarify: I am not testing the sorting of intra-root or intra-column columns. (This sorting usually results in an algebraically different matrix.)

As an example, to create an unsorted matrix:

 set.seed(0) x <- as.data.frame(matrix(sample(3, 60, replace = TRUE), ncol = 6, byrow = TRUE)) 

Its sorted version:

 y <- x[do.call(order, x),] 

A testSorted test, say testSorted , will return FALSE for testSorted(x) and TRUE for testSorted(y) .

Update 2: The answers below are all good - they are concise and do the test. In terms of efficiency, it looks like they sort the data in the end.

I tried them with fairly large matrices such as 1M x 10 (just changing the creation of x above), and they all have the same cost of time and memory. What is special, they all consume more time for unsorted objects (about 5.5 seconds for 1Mx10) than for sorted objects (about 0.5 seconds for y ). This suggests that they are sorted before testing.

I tested by creating a z matrix:

 z <- y z[,2] <- y[,1] z[,1] <- y[,2] 

In this case, all methods take about 0.85 seconds. In any case, the completion in 5.5 seconds is not scary (in fact, it seems correct in terms of the time required to sort the object), but knowing that the sorted matrix is ​​11 times faster suggests that a test that does not sort can be even faster. In the case of the row matrix 1M, the first three rows x :

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 1 3 1 2 2 3 1 3 3 2 2 2 1 1 1 3 2 3 2 3 3 2 3 3 3 1 2 1 1 2 1 2 3 

There is no need to look at line 2, although vectorization is a good idea.

(I also added a byrow argument to create x , so the string values ​​are independent of x size.)

Update 3: Another comparison for this test can be found using the sort -c command on Linux. If the file is already written (using write.table() ), with 1M lines, then time sort -c myfile.txt takes 0.003 seconds for unsorted data and 0.101 seconds for sorted data. I am not going to write the file, but this is a useful comparison.

Update 4: The Aaron Rcpp method used all the other methods suggested here and what I tried (including sort -c comparison above: it is expected that it will hit the disk in memory). As for the attitude towards other methods, it is difficult to say: the denominator is too small to give an accurate measurement, and I have not studied its microbenchmark in detail. The acceleration can be very large (4-5 orders of magnitude) for some matrices (for example, made using rnorm ), but this is misleading - the check can only complete after a few lines. I had acceleration with examples of matrices about 25-60 for unsorted and about 1.1X for sorted ones, since competing methods were very fast if the data was sorted.

Since this does the right thing (i.e. sorting, just testing), and does it very quickly, this is an accepted answer.

+7
source share
5 answers

Newer: I decided to use Rcpp practice ...

 library(Rcpp) library(inline) isRowSorted <- cxxfunction(signature(A="numeric"), body=' Rcpp::NumericMatrix Am(A); for(int i = 1; i < Am.nrow(); i++) { for(int j = 0; j < Am.ncol(); j++) { if( Am(i-1,j) < Am(i,j) ) { break; } if( Am(i-1,j) > Am(i,j) ) { return(wrap(false)); } } } return(wrap(true)); ', plugin="Rcpp") rownames(y) <- NULL # because as.matrix is faster without rownames isRowSorted(as.matrix(y)) 

New: this R-only hack the same speed for all matrices; it is definitely faster for sorted matrices; for unsorted it depends on the nature of unsortable.

 iss3 <- function(x) { x2 <- sign(do.call(cbind, lapply(x, diff))) x3 <- t(x2)*(2^((ncol(x)-1):0)) all(colSums(x3)>=0) } 

Original: for some unsorted matrices this is faster. How much faster will depend on where the unsorted items are; this looks at the matrix column by column, so the unsorted order on the left side will be noticed much faster than the unsorted one on the right, while the upper / lower value doesn't matter much the same.

 iss2 <- function(y) { b <- c(0,nrow(y)) for(i in 1:ncol(y)) { z <- rle(y[,i]) b2 <- cumsum(z$lengths) sp <- split(z$values, cut(b2, breaks=b)) for(spi in sp) { if(is.unsorted(spi)) return(FALSE) } b <- c(0, b2) } return(TRUE) } 
+4
source

If y is sorted, then do.call (order, y) returns 1: nrow (y).

  testSorted = function(y){all(do.call(order,y)==1:nrow(y))} 

Note that this does not compare matrices, but it does not drop out as soon as it finds a mismatch.

+6
source

Ok, why don't you use:

 all(do.call(order, y)==seq(nrow(y))) 

This avoids creating an ordered matrix and ensures that it checks your ordering style.

+6
source

Well, brute force approach is a cycle and comparison, interruption as soon as a violation is detected.

This approach can be easily implemented and tested in R, and then ported to a simple C ++ function that we can connect to R via inline and Rcpp (or just C if you need), since the loop is what is really beneficial for implementation in a compiled language.

Otherwise, can't you use something like diff() and check if all the increments are non-negative?

+4
source

You can use the do.call operator with is.unsorted :

 issorted.matrix <- function(A) {!is.unsorted(do.call("order",data.frame(A)))} 
+2
source

All Articles