I am trying to RMA normalize a specific gene expression dataset with respect to diffuse large B-cell lymphoma using a custom CDF annotation file (chip definition file) with Brainarray .
Unfortunately, the normalized RMA expression matrix is ββall NaNs, and I donβt understand why.
The dataset (GSE31312) is freely available on the GEO website and uses the Affymetrix HG-U133 Plus 2.0 array platform. I use the affy package to perform RMA normalization.
Since the problem is specific to the data set, the following R-code to reproduce the problem is, unfortunately, rather cumbersome (loading 2 GB, 8.8 GB without unpacking).
Set the working directory.
setwd("~/Desktop/GEO")
Download the required packages. Uncomment the installation of packages.
#source("http://bioconductor.org/biocLite.R") #biocLite(pkgs = c("GEOquery", "affy", "AnnotationDbi", "R.utils")) library("GEOquery") # To automatically download the data library("affy") library("R.utils") # For file handling
Download the array data to the working directory.
files <- getGEOSuppFiles("GSE31312")
Disconnect the data in the CEL directory
#Sys.setenv(TAR = '/usr/bin/tar') # For (some) OS X uncommment this line untar(tarfile = "GSE31312/GSE31312_RAW.tar", exdir = "CEL")
Unzip all .gz files
gz.files <- list.files("CEL", pattern = "\\.gz$", ignore.case = TRUE, full.names = TRUE) for (file in gz.files) gunzip(file, skip = TRUE, remove = TRUE) cel.files <- list.files("CEL", pattern = "\\.cel$", ignore.case = TRUE, full.names = TRUE)
Download, install, and download Brainarray Ensembl custom ENSG gene annotation package
download.file(paste0("http://brainarray.mbni.med.umich.edu/Brainarray/", "Database/CustomCDF/18.0.0/ensg.download/", "hgu133plus2hsensgcdf_18.0.0.tar.gz"), destfile = "hgu133plus2hsensgcdf_18.0.0.tar.gz") install.packages("hgu133plus2hsensgcdf_18.0.0.tar.gz", repos = NULL, type = "source") library(hgu133plus2hsensgcdf)
Perform RMA normalization with and without a custom CDF.
affy.rma <- justRMA(filenames = cel.files, verbose = TRUE) ensg.rma <- justRMA(filenames = cel.files, verbose = TRUE, cdfname = "HGU133Plus2_Hs_ENSG")
As you can see, the function returns without warning the expression matrix exprs(ensg.ram) , where all entries in the expression matrix are NaN.
sum(is.nan(exprs(ensg.rma)))
Interestingly, when using standard CDF, there are quite a few exprs(affy.rma) in exprs(affy.rma) .
# Show some NaNs na.rows <- apply(is.na(exprs(affy.rma)), 1, any) exprs(affy.rma)[which(na.rows)[1:10], 1:4]
Actually there should not be any NaNs. I tried to use the expresso function to perform only background correction (without normalization and summation), which also give NaN. Therefore, the problem is apparently related to background correction. However, you might think that the cause is one or more incorrect arrays. Can someone help me track the origin of NaN and get some useful expression values?
Thank you and best regards
Anders
Edit: it seems that one file is a problem (but not quite)
I decided to check what happens if each .CEL file is normalized independently. What actually happens under the RMA hood when justRMA given a single array that I'm not sure about. But I believe that the quantile normalization step becomes a function of identity, and the summation (median Polish) simply stops after the first iteration.
In any case, to perform a one-time normalization of RMA, we run:
ensg <- vector("list", length(cel.files))
Looking at head(X) , it looks like GSM776462.CEL is all NaN. Indeed, it is almost like this:
sum(!is.nan(X$GSM776462.CEL))
Only 18 of 20009 are not NaN. Then I count the amount of NaN appearing elsewhere.
sum(is.na(X[, -grep("GSM776462.CEL", colnames(X))]))
which is 0. Therefore, GSM776462.CEL is the culprit.
But a regular CDF annotation gives no problems:
affy <- justRMA(filenames = "CEL/GSM776462.CEL") any(is.nan(exprs(affy)))
It is also strange that the use of conventional CDF seems to have random scattering of NaN in the expression matrix. I still do not quite understand what to do with this.
Edit2: NaNs disappear if sample is excluded, but not with standard CDF
Why is it worth it when I exclude that the GSM776462.CEL and RMA files are normalized with and without user CDF files, NaN disappear only in the user case CDF.
# Trying with all other CEL than GSM776462.CEL cel.files2 <- grep("GSM776462.CEL", cel.files, invert = TRUE, value = TRUE) affy.rma.no.776462 <- justRMA(filenames = cel.files2) ensg.rma.no.776462 <- justRMA(filenames = cel.files2, verbose = TRUE, cdfname = "HGU133Plus2_Hs_ENSG") sum(is.na(exprs(affy.rma.no.776462)))
Incomprehensible.
Edit3: No "NA" or "NaNs" in the "source data"
For what it's worth, I tried reading the values ββof the source level at the probe level to check if they contain NA or NaN s. The following .CEL files do one after another and check for any missing values.
for (file in cel.files) { affybtch <- suppressWarnings(read.affybatch(filenames = file)) tmp <- exprs(affybtch) cat(file, "done.\n") if (any(is.na(tmp))) stop(paste("NAs or NaNs are present in", file)) }
No NA or NaN .