Efficiently find columns that are constant within groups of a data frame

How can I efficiently extract the columns of a data frame that are constant within each group? I have included a plyr implementation below to make precise what I'm trying to do, but it is slow. How can I do this as efficiently as possible? (Ideally, without splitting the data frame at all.)

    library(plyr)

    base <- data.frame(group = 1:1000, a = sample(1000), b = sample(1000))
    df <- data.frame(
      base[rep(seq_len(nrow(base)), length = 1e6), ],
      c = runif(1e6),
      d = runif(1e6)
    )

    is.constant <- function(x) length(unique(x)) == 1
    constant_cols <- function(x) head(Filter(is.constant, x), 1)
    system.time(constant <- ddply(df, "group", constant_cols))
    #   user  system elapsed
    # 20.531   1.670  22.378

    stopifnot(identical(names(constant), c("group", "a", "b")))
    stopifnot(nrow(constant) == 1000)

In my actual use case (deep inside ggplot2) there can be an arbitrary number of constant and non-constant columns. The size of the data in the example is about the right order of magnitude.

+7
6 answers

Inspired by @Joran's answer, here is a similar strategy that is slightly faster (1 s versus 1.5 s on my machine)

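    # TRUE wherever a value differs from the previous one (the first row always counts)
    changed <- function(x) c(TRUE, x[-1] != x[-length(x)])

    constant_cols2 <- function(df, grp) {
      df <- df[order(df[, grp]), ]
      changes <- lapply(df, changed)
      # assumes the grouping column is the first column
      vapply(changes[-1], identical, changes[[1]], FUN.VALUE = logical(1))
    }
    system.time(cols <- constant_cols2(df, "group")) # about 1 s

    # sort first so that changed() marks the first row of each group
    sorted <- df[order(df$group), ]
    system.time(constant <- sorted[changed(sorted$group), c("group", names(cols)[cols])])
    #  user  system elapsed
    # 1.057   0.230   1.314

    stopifnot(identical(names(constant), c("group", "a", "b")))
    stopifnot(nrow(constant) == 1000)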

It has the same drawback: it will not detect columns that have the same value in adjacent groups (for example, df$f <- 1).
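To make that drawback concrete, here is a minimal sketch on a tiny, made-up data frame (the `toy` data and the column `f` are assumptions for illustration only, not part of the question's data). It compares each column's change pattern with the group's pattern using `identical()`, as in the strategy above.

```r
# Toy data (hypothetical): two groups, already sorted by group
toy <- data.frame(
  group = c(1, 1, 2, 2),
  a     = c(5, 5, 7, 7),  # constant within each group, differs between groups
  f     = c(1, 1, 1, 1)   # constant within each group, same value in both groups
)

# TRUE wherever a value differs from the one before it (first row counts as a change)
changed <- function(x) c(TRUE, x[-1] != x[-length(x)])

changes <- lapply(toy, changed)

# Require an exact match with the group's change pattern
exact <- vapply(changes[-1], identical, changes[[1]], FUN.VALUE = logical(1))
print(exact)
```

Here `a` is reported as constant but `f` is not, even though `f` never varies inside a group: its change pattern has fewer breaks than the group's, so the exact comparison fails.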

With a bit more thought, plus @David's ideas:

    constant_cols3 <- function(df, grp) {
      # If col == TRUE and group == FALSE, not constant
      matching_breaks <- function(group, col) {
        !any(col & !group)
      }

      n <- nrow(df)
      changed <- function(x) c(TRUE, x[-1] != x[-n])

      df <- df[order(df[,grp]),]
      changes <- lapply(df, changed)
      vapply(changes[-1], matching_breaks, group = changes[[1]], FUN.VALUE = logical(1))
    }

    system.time(x <- constant_cols3(df, "group"))
    #  user  system elapsed
    # 1.086   0.221   1.413

And it gives the correct result.

+3

(Edited to possibly solve the problem of consecutive groups with the same value)

I'm posting this tentatively, because I have not fully convinced myself that it correctly identifies within-group constant columns in all cases. But it is definitely faster (and could probably be improved):

    constant_cols1 <- function(df,grp){
      df <- df[order(df[,grp]),]

      #Adjust values based on max diff in data
      rle_group <- rle(df[,grp])
      vec <- rep(rep(c(0,ceiling(diff(range(df)))),
                     length.out = length(rle_group$lengths)),
                 times = rle_group$lengths)
      m <- matrix(vec,nrow = length(vec),ncol = ncol(df)-1)
      df_new <- df
      df_new[,-1] <- df[,-1] + m

      rles <- lapply(df_new,FUN = rle)
      nms <- names(rles)
      tmp <- sapply(rles[nms != grp],
                    FUN = function(x){identical(x$lengths,rles[[grp]]$lengths)})
      return(tmp)
    }

My basic idea was to use rle, obviously.

+4

I'm not sure if this is exactly what you are looking for, but it identifies columns a and b.

    require(data.table)
    is.constant <- function(x) identical(var(x), 0)
    dtOne <- data.table(df)
    system.time({dtTwo <- dtOne[, lapply(.SD, is.constant), by=group]
    result <- apply(X=dtTwo[, list(a, b, c, d)], 2, all)
    result <- result[result == TRUE] })
    stopifnot(identical(names(result), c("a", "b")))
    result

+4

(edit: best answer)

Something like

    is.constant <- function(x) length(which(x == x[1])) == length(x)

This seems like a good improvement. Compare the following.

    > a <- rnorm(5000000)
    > system.time(is.constant(a))
       user  system elapsed
      0.039   0.010   0.048
    >
    > system.time(is.constantOld(a))
       user  system elapsed
      1.049   0.084   1.125

+3

A bit slower than what was suggested above, but I think it should handle the case of equal adjacent groups.

    findBreaks <- function(x) cumsum(rle(x)$lengths)

    constantGroups <- function(d, groupColIndex=1) {
      d <- d[order(d[, groupColIndex]), ]
      breaks <- lapply(d, findBreaks)
      groupBreaks <- breaks[[groupColIndex]]
      numBreaks <- length(groupBreaks)
      isSubset <- function(x) length(x) <= numBreaks && length(setdiff(x, groupBreaks)) == 0
      unlist(lapply(breaks[-groupColIndex], isSubset))
    }

The intuition is that if a column is constant within each group, then the breaks in the column's values (after sorting by group) will be a subset of the breaks in the group values.
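The subset idea can be checked by hand on a small example. The sketch below uses a made-up data frame (`toy`, with columns `a` and `v`, all assumed for illustration) and the same `rle`-based break positions as `findBreaks` above; `is_subset` is a simplified, hypothetical stand-in for the `isSubset` helper.

```r
# Toy data (hypothetical): three groups, already sorted by group
toy <- data.frame(
  group = c(1, 1, 2, 2, 3, 3),
  a     = c(5, 5, 5, 5, 9, 9),  # constant within groups; groups 1 and 2 share a value
  v     = c(1, 2, 3, 3, 4, 4)   # varies inside group 1
)

# Position of the last row of every run of equal values
findBreaks <- function(x) cumsum(rle(x)$lengths)

group_breaks <- findBreaks(toy$group)  # run ends of the grouping column
a_breaks     <- findBreaks(toy$a)
v_breaks     <- findBreaks(toy$v)

# A column is constant within groups if each of its breaks is also a group break
is_subset <- function(x) length(setdiff(x, group_breaks)) == 0

print(c(a = is_subset(a_breaks), v = is_subset(v_breaks)))
```

Column `a` passes because its runs end only where a group ends, even though two adjacent groups share the same value; column `v` fails because one of its runs ends inside group 1.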

Now compare it with hadley's version (with a small modification so that n is defined):

    # df defined as in the question

    n <- nrow(df)
    changed <- function(x) c(TRUE, x[-1] != x[-n])

    constant_cols2 <- function(df,grp){
      df <- df[order(df[,grp]),]
      changes <- lapply(df, changed)
      vapply(changes[-1], identical, changes[[1]], FUN.VALUE = logical(1))
    }

    > system.time(constant_cols2(df, 1))
       user  system elapsed
      1.779   0.075   1.869
    > system.time(constantGroups(df))
       user  system elapsed
      2.503   0.126   2.614
    > df$f <- 1
    > constant_cols2(df, 1)
        a     b     c     d     f
     TRUE  TRUE FALSE FALSE FALSE
    > constantGroups(df)
        a     b     c     d     f
     TRUE  TRUE FALSE FALSE  TRUE

+3

How fast does is.unsorted(x) run on x? Unfortunately, I do not have access to R at the moment. It does not seem to be your bottleneck, though.

+1
