A function for the median, similar to "which.max" and "which.min" / Extracting median rows from a data.frame

I sometimes need to extract specific rows from a data.frame based on the values of one of its variables. R has built-in functions for the maximum (which.max()) and the minimum (which.min()), which make it easy to extract those rows.

Is there an equivalent for the median? Or is it best to write my own function?

Here is an example data.frame and how I would use which.max() and which.min():

    set.seed(1) # so you can reproduce this example
    dat = data.frame(V1 = 1:10, V2 = rnorm(10), V3 = rnorm(10),
                     V4 = sample(1:20, 10, replace = TRUE))

    # To return the first row, which contains the max value in V4
    dat[which.max(dat$V4), ]
    # To return the seventh row, which contains the min value in V4
    dat[which.min(dat$V4), ]

In this particular example, since there is an even number of observations, I would need to return two rows, in this case rows 2 and 10.

Update

It would seem that there is no built-in function for this. Thus, using the answer from Sacha as a starting point, I wrote this function:

    which.median = function(x) {
      if (length(x) %% 2 != 0) {
        which(x == median(x))
      } else if (length(x) %% 2 == 0) {
        a = sort(x)[c(length(x)/2, length(x)/2+1)]
        c(which(x == a[1]), which(x == a[2]))
      }
    }

I can use it as follows:

    # make one data.frame with an odd number of rows
    dat2 = dat[-10, ]

    # Median rows from 'dat' (even number of rows) and 'dat2' (odd number of rows)
    dat[which.median(dat$V4), ]
    dat2[which.median(dat2$V4), ]

Are there any suggestions for improving this?

+8
r dataframe subset
4 answers

While Sacha's solution is fairly general, the median (like any other quantile) is an order statistic, so you can compute the corresponding indices from order(x) (instead of using sort(x) to get the quantile values).
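As a minimal sketch of why order() is the useful tool here, the snippet below uses a made-up three-element vector (not from the question): sort() returns the values, while order() returns their positions in the original vector, which is what a which.* style function needs.

```r
x <- c(30, 10, 20)        # illustrative data only

sort(x)                   # the values in increasing order
order(x)                  # positions in x of those values: 2 3 1
x[order(x)]               # indexing by order(x) reproduces sort(x)

# For an odd-length vector, the middle entry of order(x) is the
# position of the median in x.
middle <- order(x)[(length(x) + 1) / 2]
middle                    # 3, because x[3] == 20 is the median
```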

Looking at quantile(), you can use types 1 or 3; all the other types return a (weighted) average of two values in some cases.
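A small illustration of that point, assuming a made-up even-length vector: the default type interpolates between the two middle values, whereas types 1 and 3 return a value that actually occurs in the data.

```r
x <- c(1, 2, 3, 4)   # even length, so the middle falls between 2 and 3

q_default <- unname(quantile(x, 0.5))            # default type 7 interpolates
q_type1   <- unname(quantile(x, 0.5, type = 1))  # discontinuous, no averaging
q_type3   <- unname(quantile(x, 0.5, type = 3))  # discontinuous, no averaging

q_default %in% x     # FALSE: 2.5 is not an element of x
q_type1 %in% x       # TRUE
q_type3 %in% x       # TRUE
```

Only a quantile that is itself an element of x can be mapped back to a row index.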

I chose type 3, and copying and pasting a little from quantile results in:

    which.quantile <- function (x, probs, na.rm = FALSE){
      if (! na.rm && any (is.na (x)))
        return (rep (NA_integer_, length (probs)))

      o <- order (x)
      n <- sum (! is.na (x))
      o <- o [seq_len (n)]

      nppm <- n * probs - 0.5
      j <- floor(nppm)
      h <- ifelse((nppm == j) & ((j%%2L) == 0L), 0, 1)
      j <- j + h

      j [j == 0] <- 1
      o[j]
    }

A small test:

    > x <- c (2.34, 5.83, NA, 9.34, 8.53, 6.42, NA, 8.07, NA, 0.77)
    > probs <- c (0, .23, .5, .6, 1)
    > which.quantile (x, probs, na.rm = TRUE)
    [1] 10  1  6  6  4
    > x [which.quantile (x, probs, na.rm = TRUE)] == quantile (x, probs, na.rm = TRUE, type = 3)
      0%  23%  50%  60% 100%
    TRUE TRUE TRUE TRUE TRUE

Here is your example:

    > dat [which.quantile (dat$V4, c (0, .5, 1)),]
      V1         V2          V3 V4
    7  7  0.4874291 -0.01619026  1
    2  2  0.1836433  0.38984324 13
    1  1 -0.6264538  1.51178117 17
+15

I think just:

    which(dat$V4 == median(dat$V4))

But be careful: median() averages the two middle values when there is no single middle value. For example, median(1:4) gives 2.5, which does not match any of the elements.
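To make the caveat concrete, here is a short sketch with two made-up vectors (the names x_odd and x_even are illustrative only). With an odd length the comparison finds the median's position; with an even length it finds nothing.

```r
x_odd  <- c(4, 1, 3)      # median is 3, an element of the vector
x_even <- c(4, 1, 3, 2)   # median is 2.5, not an element of the vector

idx_odd  <- which(x_odd == median(x_odd))
idx_even <- which(x_even == median(x_even))

idx_odd           # 3
length(idx_even)  # 0: no element equals 2.5, so no row would be selected
```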

Edit

Here is a function that gives you either the position of the median element or, when the median is the average of two values, the position of the first element closest to it. This is similar to how which.min() returns only the first element equal to the minimum:

    whichmedian <- function(x) which.min(abs(x - median(x)))

For example:

    > whichmedian(1:4)
    [1] 2
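One consequence worth noting, sketched here with made-up vectors: for an even-length input both middle values are equally close to the median, so the result is whichever of them appears first in the vector, which is not necessarily the smaller one.

```r
whichmedian <- function(x) which.min(abs(x - median(x)))

odd_pos  <- whichmedian(c(10, 30, 20))  # median 20 sits at position 3
even_pos <- whichmedian(c(4, 3, 2, 1))  # 3 and 2 are both 0.5 away from 2.5

odd_pos    # 3
even_pos   # 2: the first of the two ties, here the larger middle value 3
```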
+7

I wrote a more complete function that satisfies my needs:

    row.extractor = function(data, extract.by, what) {
      # data       = your data.frame
      # extract.by = the variable that you are extracting by, either
      #              as its index number or by name
      # what       = either "min", "max", "median", or "all", with quotes

      # Convert a column name to its index; a numeric index is used as is
      if (!is.numeric(extract.by)) {
        extract.by = which(colnames(data) %in% extract.by)
      }
      which.median = function(data, extract.by) {
        a = data[, extract.by]
        if (length(a) %% 2 != 0) {
          which(a == median(a))
        } else if (length(a) %% 2 == 0) {
          b = sort(a)[c(length(a)/2, length(a)/2+1)]
          c(max(which(a == b[1])), min(which(a == b[2])))
        }
      }
      X1 = data[which(data[extract.by] == min(data[extract.by])), ]
      X2 = data[which(data[extract.by] == max(data[extract.by])), ]
      X3 = data[which.median(data, extract.by), ]
      if (what == "min") {
        X1
      } else if (what == "max") {
        X2
      } else if (what == "median") {
        X3
      } else if (what == "all") {
        rbind(X1, X3, X2)
      }
    }

Example usage:

    > row.extractor(dat, "V4", "max")
      V1         V2       V3 V4
    1  1 -0.6264538 1.511781 17
    > row.extractor(dat, 4, "min")
      V1        V2          V3 V4
    7  7 0.4874291 -0.01619026  1
    > row.extractor(dat, "V4", "all")
       V1         V2          V3 V4
    7   7  0.4874291 -0.01619026  1
    2   2  0.1836433  0.38984324 13
    10 10 -0.3053884  0.59390132 14
    1   1 -0.6264538  1.51178117 17
+2

Suppose the vector from which you want to get the median is x .

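The expression which.min(x[x >= median(x)]) picks out the median if length(x) = 2*n+1, or the larger of the two middle values if length(x) = 2*n. Note that the index it returns refers to the subset x[x >= median(x)], not to x itself, so it has to be mapped back before it is used to index the original data. You can change it a little if you want the smaller of the two middle values.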
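A minimal sketch of this idea, assuming a made-up vector and hypothetical helper names (upper_candidates, lower_candidates). It keeps the candidate positions from which() so that the result indexes x directly rather than the subset.

```r
x <- c(5, 1, 4, 2, 3, 6)   # even length; median(x) is 3.5
med <- median(x)

# Positions in x of the values at or above the median, then the
# position of the smallest of them: the larger middle value.
upper_candidates <- which(x >= med)
upper_idx <- upper_candidates[which.min(x[upper_candidates])]

# Mirror image for the smaller middle value.
lower_candidates <- which(x <= med)
lower_idx <- lower_candidates[which.max(x[lower_candidates])]

upper_idx   # 3, since x[3] == 4
lower_idx   # 5, since x[5] == 3
```

For an odd-length vector both indices point at the median itself, provided the median value is not duplicated.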

+2