A function for the median, like "which.max" and "which.min" / Extract median rows from a data.frame

Question

I sometimes need to extract specific rows from a data.frame based on the values of one of its variables. R has built-in functions for the maximum (which.max()) and the minimum (which.min()) that let me extract those rows easily.

Is there an equivalent for the median? Or is it best to write my own function?

Here is an example data.frame and how I would use which.max() and which.min():

 set.seed(1) # so you can reproduce this example
 dat = data.frame(V1 = 1:10, V2 = rnorm(10), V3 = rnorm(10),
                  V4 = sample(1:20, 10, replace=T))
 # To return the first row, which contains the max value in V4
 dat[which.max(dat$V4), ]
 # To return the seventh row, which contains the min value in V4
 dat[which.min(dat$V4), ]

In this particular example, since there is an even number of observations, I would need to return two rows, in this case rows 2 and 10.

Update

It would seem that there is no built-in function for this. Thus, using the answer from Sacha as a starting point, I wrote this function:

 which.median = function(x) {
   if (length(x) %% 2 != 0) {
     which(x == median(x))
   } else if (length(x) %% 2 == 0) {
     a = sort(x)[c(length(x)/2, length(x)/2+1)]
     c(which(x == a[1]), which(x == a[2]))
   }
 }

I can use it as follows:

 # make one data.frame with an odd number of rows
 dat2 = dat[-10, ]
 # Median rows from 'dat' (even number of rows) and 'dat2' (odd number of rows)
 dat[which.median(dat$V4), ]
 dat2[which.median(dat2$V4), ]

Are there any suggestions for improving this?

+8

r dataframe subset

A5C1D2H2I1M1N2O1R2T1 Apr 21 '12 at 5:30

4 answers

I would simply use:

 which(dat$V4 == median(dat$V4))

But be careful: median() averages the two middle values when there is no single middle element. For example, median(1:4) gives 2.5, which does not match any of the elements.
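
To make this concrete, here is a small sketch with made-up vectors (the names `odd` and `even` are illustrative, not from the question) showing that the exact-match approach finds nothing when the length is even and the two middle values differ:

```r
# Illustrative vectors; not taken from the question's data
odd  <- c(5, 1, 9)       # median is 5, an actual element
even <- c(5, 1, 9, 7)    # median is 6, the mean of 5 and 7

idx_odd  <- which(odd == median(odd))     # position of the median element
idx_even <- which(even == median(even))   # no element equals 6

print(idx_odd)    # 1
print(idx_even)   # integer(0)
```

Subsetting a data.frame with an empty index returns zero rows, so this case fails silently rather than with an error.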

Edit

Here is a function that gives you either the median element or, when the median is the average of two values, the first element closest to that average, similar to how which.min() gives you only the first element that equals the minimum:

 whichmedian <- function(x) which.min(abs(x - median(x)))

For example:

 > whichmedian(1:4)
 [1] 2

+7

Sacha epskamp Apr 21 '12 at 5:48

I wrote a more complete function that satisfies my needs:

 row.extractor = function(data, extract.by, what) {
   # data       = your data.frame
   # extract.by = the variable that you are extracting by, either
   #              as its index number or by name
   # what       = either "min", "max", "median", or "all", with quotes
   if (is.character(extract.by)) {
     # translate a column name into its index (look it up in 'data', not a global)
     extract.by = which(colnames(data) %in% extract.by)
   }
   which.median = function(data, extract.by) {
     a = data[, extract.by]
     if (length(a) %% 2 != 0) {
       which(a == median(a))
     } else {
       # even length: last match of the lower and first match of the upper middle value
       b = sort(a)[c(length(a)/2, length(a)/2+1)]
       c(max(which(a == b[1])), min(which(a == b[2])))
     }
   }
   X1 = data[which(data[extract.by] == min(data[extract.by])), ]
   X2 = data[which(data[extract.by] == max(data[extract.by])), ]
   X3 = data[which.median(data, extract.by), ]
   if (what == "min") {
     X1
   } else if (what == "max") {
     X2
   } else if (what == "median") {
     X3
   } else if (what == "all") {
     rbind(X1, X3, X2)
   }
 }

Example usage:

 > row.extractor(dat, "V4", "max")
   V1         V2       V3 V4
 1  1 -0.6264538 1.511781 17
 > row.extractor(dat, 4, "min")
   V1        V2          V3 V4
 7  7 0.4874291 -0.01619026  1
 > row.extractor(dat, "V4", "all")
    V1         V2          V3 V4
 7   7  0.4874291 -0.01619026  1
 2   2  0.1836433  0.38984324 13
 10 10 -0.3053884  0.59390132 14
 1   1 -0.6264538  1.51178117 17

+2

A5C1D2H2I1M1N2O1R2T1 Apr 21 '12 at 9:23

Suppose the vector from which you want to get the median is x.

The expression which.min(x[x>=median(x)]) will pick out the median if length(x)=2*n+1, or the larger of the two middle values if length(x)=2*n. You can change it a little if you want the smaller of the two middle values.
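
One thing to watch for: which.min() here returns a position within the filtered vector x[x>=median(x)], not within x itself, so it has to be mapped back before it can be used for subsetting. A minimal sketch with a made-up vector (the helper names `keep`, `pos_in_subset` and `pos_in_x` are illustrative only):

```r
x <- c(4, 1, 3, 2)    # even length; the middle values are 2 and 3

keep          <- x >= median(x)              # TRUE for 4 and 3
pos_in_subset <- which.min(x[keep])          # position within c(4, 3)
pos_in_x      <- which(keep)[pos_in_subset]  # map back to a position in x

x[pos_in_x]           # 3, the larger of the two middle values
```

Using `pos_in_subset` directly as an index into `x` would select the wrong element here.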

+2

Yimai Oct 25 '15 at 9:39

cbeleites · Accepted Answer · 2012-04-21T10:14:09+0000

While Sacha's solution is fairly general, the median (like other quantiles) is an order statistic, so you can calculate the corresponding indices from order(x) (instead of using sort(x) to get the quantile values).
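
As a minimal illustration of the idea (the vector is made up), order() returns the positions that would sort x, so the middle entry of order(x) is the index of the median element when the length is odd:

```r
x <- c(30, 10, 50, 20, 40)        # odd length, so the median is an element of x

o   <- order(x)                   # positions of x in ascending order of value
mid <- o[(length(x) + 1) / 2]     # position in x of the middle order statistic

x[mid] == median(x)               # TRUE
```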

Looking at quantile, you can use types 1 or 3; all the others return a (weighted) average of two values in some cases.
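
The difference between the types shows up on any even-length vector. A small sketch using quantile() from base R's stats package, with a made-up input (the variable names are illustrative):

```r
x <- c(10, 20, 30, 40)

q_default <- quantile(x, 0.5, names = FALSE)             # default type 7 interpolates
q_type1   <- quantile(x, 0.5, type = 1, names = FALSE)   # returns an element of x
q_type3   <- quantile(x, 0.5, type = 3, names = FALSE)   # returns an element of x

c(default = q_default, type1 = q_type1, type3 = q_type3)
```

Only the discontinuous types return a value that is guaranteed to occur in x, which is what makes an index lookup possible.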

I chose type 3, and copying and pasting a little from quantile results in:

 which.quantile <- function (x, probs, na.rm = FALSE){
   if (! na.rm & any (is.na (x)))
     return (rep (NA_integer_, length (probs)))

   o <- order (x)
   n <- sum (! is.na (x))
   o <- o [seq_len (n)]

   nppm <- n * probs - 0.5
   j <- floor(nppm)
   h <- ifelse((nppm == j) & ((j%%2L) == 0L), 0, 1)
   j <- j + h

   j [j == 0] <- 1
   o[j]
 }

A small test:

 > x <-c (2.34, 5.83, NA, 9.34, 8.53, 6.42, NA, 8.07, NA, 0.77)
 > probs <- c (0, .23, .5, .6, 1)
 > which.quantile (x, probs, na.rm = TRUE)
 [1] 10  1  6  6  4
 > x [which.quantile (x, probs, na.rm = TRUE)] == quantile (x, probs, na.rm = TRUE, type = 3)
   0%  23%  50%  60% 100%
 TRUE TRUE TRUE TRUE TRUE

Here is your example:

 > dat [which.quantile (dat$V4, c (0, .5, 1)),]
   V1         V2          V3 V4
 7  7  0.4874291 -0.01619026  1
 2  2  0.1836433  0.38984324 13
 1  1 -0.6264538  1.51178117 17
