R: filter a matrix based on variance

See the edit below. Using R, I would like to filter a matrix (of gene expression data) and keep only the rows (genes / probes) that have high variance. For example, I would like to keep only rows that have values in the lower and upper percentiles (say, below 20% and above 80%). I want to restrict my downstream analysis to high-variance genes. Are there common ways to filter genes in R?

My matrix has 18 samples (columns) and 47,000 probes (rows), with values that are log2-transformed and normalized. I know that the quantile() function can identify the 20% and 80% cutoffs within each sample column. I cannot figure out how to find these values for the entire matrix, and then subset the original matrix to remove all the "invariant" rows.

Here is an example matrix with a mean value of 5.97. The last three rows should be removed, because all of their values lie between the 20% and 80% cutoffs:

    > m
                 sample1 sample2 sample3 sample4 sample5 sample6
    ILMN_1762337    7.86    5.05    4.89    5.74    6.78    6.41
    ILMN_2055271    5.72    4.29    4.64    5.00    6.30    8.02
    ILMN_1736007    3.82    6.48    6.06    7.13    8.20    4.06
    ILMN_2383229    6.34    4.34    6.12    6.83    4.82    5.57
    ILMN_1806310    6.15    6.37    5.54    5.22    4.59    6.28
    ILMN_1653355    7.01    4.73    6.62    6.27    4.77    6.12
    ILMN_1705025    6.09    6.68    6.80    6.85    8.35    4.15
    ILMN_1814316    5.77    5.17    5.94    6.51    7.12    7.20
    ILMN_1814317    5.97    5.97    5.97    5.97    5.97    5.97
    ILMN_1814318    5.97    5.97    5.97    5.97    5.97    5.97
    ILMN_1814319    5.97    5.97    5.97    5.97    5.97    5.97

I would appreciate any suggestions or features that I should study. Thanks!

EDIT

Sorry, I was not very clear in the OP. (1) I would like to know the 20% and 80% cutoff values for the entire matrix (not only for each individual sample). (2) Then, if a row contains any value in the upper or lower percentiles, R should keep that row. If all of a row's values (across all samples) are close to the mean, the row should be discarded.

3 answers

Well, assuming you have a matrix (so I assume your identifier column is actually the row names), this is very simple to do.

    # First find the desired quantile breaks for the entire matrix
    qt <- quantile( m , probs = c(0.2,0.8) )
    #  20%  80%
    # 5.17 6.62

    # Next get a logical vector of the rows that have any values outside these breaks
    rows <- apply( m , 1 , function(x) any( x < qt[1] | x > qt[2] ) )

    # Subset on this vector
    m[ rows , ]
    #             sample1 sample2 sample3 sample4 sample5 sample6
    #ILMN_1762337    7.86    5.05    4.89    5.74    6.78    6.41
    #ILMN_2055271    5.72    4.29    4.64    5.00    6.30    8.02
    #ILMN_1736007    3.82    6.48    6.06    7.13    8.20    4.06
    #ILMN_2383229    6.34    4.34    6.12    6.83    4.82    5.57
    #ILMN_1806310    6.15    6.37    5.54    5.22    4.59    6.28
    #ILMN_1653355    7.01    4.73    6.62    6.27    4.77    6.12
    #ILMN_1705025    6.09    6.68    6.80    6.85    8.35    4.15
    #ILMN_1814316    5.77    5.17    5.94    6.51    7.12    7.20

The anonymous function any( x < qt[1] | x > qt[2] ) passed to apply (which applies a function over the margins of a matrix, here the rows) returns TRUE if any value in that row lies outside the 20% and 80% quantiles of the whole matrix. If no value in the row is outside these bounds, it returns FALSE, so that row is dropped when we subset in the next line.
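For a matrix with tens of thousands of rows, the same selection can also be sketched without apply, by comparing the whole matrix at once and counting the hits per row with rowSums. This is only a minimal sketch: the matrix toy and its values are made up for illustration and are not the OP's data.

```r
# Toy matrix: rows g1 and g2 contain extreme values, g3 and g4 are flat
# (illustrative values only).
toy <- matrix(
  c(1, 9, 5, 5,
    2, 5, 8, 5,
    5, 5, 5, 5,
    5, 5, 5, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("g", 1:4), paste0("s", 1:4))
)

# Cutoffs computed over every value in the matrix, not per column.
qt <- quantile(toy, probs = c(0.2, 0.8))

# Logical matrix marking values outside the cutoffs;
# a row is kept if it has at least one such value.
outside <- toy < qt[1] | toy > qt[2]
keep <- rowSums(outside) > 0

# drop = FALSE keeps the result a matrix even if only one row survives.
filtered <- toy[keep, , drop = FALSE]
rownames(filtered)
```

Here keep should agree with the logical vector produced by the apply version above; the vectorized form simply avoids calling an R function once per row.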


The Bioconductor genefilter package provides common filters for microarray analysis. A typical filter based on row variability would be

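    library(genefilter)   # Bioconductor package that provides varFilter

    m = matrix(rnorm(47000 * 6), 47000)   # simulated data: 47000 probes x 6 samples
    varFilter(m)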

The package landing page contains links to vignettes that illustrate basic operations and give diagnostic guidance on the use of filtering.

The basic principle of microarray analysis is that values within a row are comparable, but values between rows are not. This is because the probes behind each row have different characteristics, which introduce row-specific biases: a value in the first row may reasonably indicate greater, lesser, or equal expression of the gene compared with the value for the same sample in the second row. This means that @Todd's wish to filter based on comparisons between rows (the largest and smallest values in the entire matrix) is not recommended. Instead, varFilter calculates a measure of the variability of each row (by default the inter-quartile range of the row's values) and selects the fraction of rows (controlled by the var.cutoff argument) with greater variability.

A quick peek at the definition of varFilter shows that, in the general case, it is no more complicated than the following, given some measure of row variability var.func and a (single) quantile var.cutoff

    vars <- apply(m, 1, var.func)
    m[vars > quantile(vars, var.cutoff), ]
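To try this idea without installing genefilter, a minimal base-R sketch could look like the following. The simulated matrix, the choice of IQR for var.func, the value 0.5 for var.cutoff, and the NA guard are all assumptions made for this illustration, not the package's actual implementation.

```r
set.seed(1)  # reproducible toy data

# Hypothetical stand-in for an expression matrix: 100 probes x 6 samples.
m <- matrix(rnorm(100 * 6), nrow = 100,
            dimnames = list(paste0("probe", 1:100), paste0("sample", 1:6)))

# Shrink the first ten probes so they are almost constant across samples.
m[1:10, ] <- m[1:10, ] * 0.001

var.func   <- IQR   # per-row variability measure (assumed choice)
var.cutoff <- 0.5   # quantile of variability below which rows are dropped

# One variability score per row, then keep rows above the chosen quantile.
vars <- apply(m, 1, var.func)
selected <- !is.na(vars) & vars > quantile(vars, var.cutoff)
m.filtered <- m[selected, , drop = FALSE]

dim(m.filtered)
```

Because the score is computed within each row, probes are only ever ranked by their own spread across samples, which is consistent with the point above that raw values should not be compared between rows.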

I am not a statistician, so I don't know whether there is a standard method for this. For me, the problem becomes easier if you reshape your data into long format.

    library(reshape2)
    dat.m <- melt(dat)
    dat.m$value <- as.numeric(dat.m$value)
    head(dat.m)
    #             ID variable value
    # 1 ILMN_1762337  sample1  7.86
    # 2 ILMN_2055271  sample1  5.72
    # 3 ILMN_1736007  sample1  3.82
    # 4 ILMN_2383229  sample1  6.34
    # 5 ILMN_1806310  sample1  6.15
    # 6 ILMN_1653355  sample1  7.01

Then, for each variable (that is, each sample), you do the following:

  • Calculate the limits using quantile.
  • Remove the genes that do not satisfy the condition.

You can do this, for example, using ddply from plyr:

    library(plyr)
    res <- ddply(dat.m,.(variable),function(x){
      ## compute limits for each sample
      z <- x$value
      qq <- quantile(z, probs = c(0.2,0.8))
      ## keep only values below the 20% or above the 80% quantile
      dd <- x[z < qq[1] | z > qq[2],]
    })

    ## return to the wide format
    acast(res,ID~variable)
    #              sample1 sample2 sample3 sample4 sample5 sample6
    # ILMN_1653355    7.01      NA    6.62      NA    4.77      NA
    # ILMN_1705025      NA    6.68    6.80    6.85    8.35    4.15
    # ILMN_1736007    3.82    6.48      NA    7.13    8.20    4.06
    # ILMN_1762337    7.86      NA    4.89      NA      NA      NA
    # ILMN_1806310      NA      NA      NA    5.22    4.59      NA
    # ILMN_1814316      NA      NA      NA      NA      NA    7.20
    # ILMN_2055271    5.72    4.29    4.64    5.00      NA    8.02
    # ILMN_2383229      NA    4.34      NA      NA      NA      NA

EDIT: after the OP's clarification, if you want the 20% and 80% cutoff values for the entire matrix rather than for each individual sample, compute qq outside ddply:

  qq <- quantile(dat.m$value, probs = c(0.2,0.8)) 

Then remove (or comment out) the corresponding line inside the function, for example:

    res <- ddply(dat.m,.(variable),function(x){
      z <- x$value
      ## keep only values outside the global 20% and 80% quantiles
      dd <- x[z < qq[1] | z > qq[2],]
    })
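Filtering the long table this way keeps individual values, so casting back to wide format leaves NA wherever a value fell between the cutoffs. If whole rows should be retained instead, as the OP's edit asks, one option is to collect the IDs that have at least one extreme value and subset the wide table with them. The following is a minimal base-R sketch under that assumption; the data frame toy and the names long, extreme, keep.ids and kept are made up for illustration.

```r
# Small illustrative data frame in the same layout as dat (values are made up).
toy <- data.frame(
  ID      = c("g1", "g2", "g3"),
  sample1 = c(1, 5, 5),
  sample2 = c(5, 5, 9),
  sample3 = c(5, 5, 5)
)

# Long format built with base R: one row per (ID, sample) pair.
long <- data.frame(
  ID       = rep(toy$ID, times = ncol(toy) - 1),
  variable = rep(names(toy)[-1], each = nrow(toy)),
  value    = unlist(toy[-1], use.names = FALSE)
)

# Global cutoffs, then the IDs with at least one value outside them.
qq <- quantile(long$value, probs = c(0.2, 0.8))
extreme <- long$value < qq[1] | long$value > qq[2]
keep.ids <- unique(long$ID[extreme])

# Keep the complete wide rows for those IDs.
kept <- toy[toy$ID %in% keep.ids, ]
kept
```

In this toy example g2 is dropped because all of its values sit between the two cutoffs, while g1 and g3 are kept with all of their columns intact.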

PS: here is dat:

    dat <- read.table(text='
    ID sample1 sample2 sample3 sample4 sample5 sample6
    ILMN_1762337 7.86 5.05 4.89 5.74 6.78 6.41
    ILMN_2055271 5.72 4.29 4.64 5.00 6.30 8.02
    ILMN_1736007 3.82 6.48 6.06 7.13 8.20 4.06
    ILMN_2383229 6.34 4.34 6.12 6.83 4.82 5.57
    ILMN_1806310 6.15 6.37 5.54 5.22 4.59 6.28
    ILMN_1653355 7.01 4.73 6.62 6.27 4.77 6.12
    ILMN_1705025 6.09 6.68 6.80 6.85 8.35 4.15
    ILMN_1814316 5.77 5.17 5.94 6.51 7.12 7.20
    ILMN_1814317 5.97 5.97 5.97 5.97 5.97 5.97
    ILMN_1814318 5.97 5.97 5.97 5.97 5.97 5.97
    ILMN_1814319 5.97 5.97 5.97 5.97 5.97 5.97',header=TRUE)