All NaNs in RMA GSE31312 Normalization Using Custom Brainarray CDFs

I am trying to RMA normalize a specific gene expression dataset with respect to diffuse large B-cell lymphoma using a custom CDF annotation file (chip definition file) with Brainarray .

Unfortunately, the normalized RMA expression matrix is ​​all NaNs, and I don’t understand why.

The dataset (GSE31312) is freely available on the GEO website and uses the Affymetrix HG-U133 Plus 2.0 array platform. I use the affy package to perform RMA normalization.

Since the problem is specific to the data set, the following R-code to reproduce the problem is, unfortunately, rather cumbersome (loading 2 GB, 8.8 GB without unpacking).

Set the working directory.

 setwd("~/Desktop/GEO") 

Download the required packages. Uncomment the installation of packages.

 #source("http://bioconductor.org/biocLite.R") #biocLite(pkgs = c("GEOquery", "affy", "AnnotationDbi", "R.utils")) library("GEOquery") # To automatically download the data library("affy") library("R.utils") # For file handling 

Download the array data to the working directory.

 files <- getGEOSuppFiles("GSE31312") 

Disconnect the data in the CEL directory

 #Sys.setenv(TAR = '/usr/bin/tar') # For (some) OS X uncommment this line untar(tarfile = "GSE31312/GSE31312_RAW.tar", exdir = "CEL") 

Unzip all .gz files

 gz.files <- list.files("CEL", pattern = "\\.gz$", ignore.case = TRUE, full.names = TRUE) for (file in gz.files) gunzip(file, skip = TRUE, remove = TRUE) cel.files <- list.files("CEL", pattern = "\\.cel$", ignore.case = TRUE, full.names = TRUE) 

Download, install, and download Brainarray Ensembl custom ENSG gene annotation package

 download.file(paste0("http://brainarray.mbni.med.umich.edu/Brainarray/", "Database/CustomCDF/18.0.0/ensg.download/", "hgu133plus2hsensgcdf_18.0.0.tar.gz"), destfile = "hgu133plus2hsensgcdf_18.0.0.tar.gz") install.packages("hgu133plus2hsensgcdf_18.0.0.tar.gz", repos = NULL, type = "source") library(hgu133plus2hsensgcdf) 

Perform RMA normalization with and without a custom CDF.

 affy.rma <- justRMA(filenames = cel.files, verbose = TRUE) ensg.rma <- justRMA(filenames = cel.files, verbose = TRUE, cdfname = "HGU133Plus2_Hs_ENSG") 

As you can see, the function returns without warning the expression matrix exprs(ensg.ram) , where all entries in the expression matrix are NaN.

 sum(is.nan(exprs(ensg.rma))) # == prod(dim(ensg.rma)) == 9964482 

Interestingly, when using standard CDF, there are quite a few exprs(affy.rma) in exprs(affy.rma) .

 # Show some NaNs na.rows <- apply(is.na(exprs(affy.rma)), 1, any) exprs(affy.rma)[which(na.rows)[1:10], 1:4] # A particular bad probeset: exprs(affy.rma)["1553575_at", ] # There are relatively few NaNs in total (but the really should be none) sum(is.na(exprs(affy.rma))) # == 12305 # Probesets of with all NaNs sum(apply(is.na(exprs(affy.rma)), 1, all)) 

Actually there should not be any NaNs. I tried to use the expresso function to perform only background correction (without normalization and summation), which also give NaN. Therefore, the problem is apparently related to background correction. However, you might think that the cause is one or more incorrect arrays. Can someone help me track the origin of NaN and get some useful expression values?

Thank you and best regards

Anders

Edit: it seems that one file is a problem (but not quite)

I decided to check what happens if each .CEL file is normalized independently. What actually happens under the RMA hood when justRMA given a single array that I'm not sure about. But I believe that the quantile normalization step becomes a function of identity, and the summation (median Polish) simply stops after the first iteration.

In any case, to perform a one-time normalization of RMA, we run:

 ensg <- vector("list", length(cel.files)) # Initialize a list names(ensg) <- basename(cel.files) for (file in cel.files) { ensg[[basename(file)]] <- justRMA(filenames = file, verbose = TRUE, cdfname = "HGU133Plus2_Hs_ENSG") cat("File", which(file == cel.files), "done.\n\n") } # Extract the expression matrices for each file and combine them X <- as.data.frame(do.call(cbind, lapply(ensg, exprs))) 

Looking at head(X) , it looks like GSM776462.CEL is all NaN. Indeed, it is almost like this:

 sum(!is.nan(X$GSM776462.CEL)) # == 18 

Only 18 of 20009 are not NaN. Then I count the amount of NaN appearing elsewhere.

 sum(is.na(X[, -grep("GSM776462.CEL", colnames(X))])) # == 0 

which is 0. Therefore, GSM776462.CEL is the culprit.

But a regular CDF annotation gives no problems:

 affy <- justRMA(filenames = "CEL/GSM776462.CEL") any(is.nan(exprs(affy))) # == FALSE 

It is also strange that the use of conventional CDF seems to have random scattering of NaN in the expression matrix. I still do not quite understand what to do with this.

Edit2: NaNs disappear if sample is excluded, but not with standard CDF

Why is it worth it when I exclude that the GSM776462.CEL and RMA files are normalized with and without user CDF files, NaN disappear only in the user case CDF.

 # Trying with all other CEL than GSM776462.CEL cel.files2 <- grep("GSM776462.CEL", cel.files, invert = TRUE, value = TRUE) affy.rma.no.776462 <- justRMA(filenames = cel.files2) ensg.rma.no.776462 <- justRMA(filenames = cel.files2, verbose = TRUE, cdfname = "HGU133Plus2_Hs_ENSG") sum(is.na(exprs(affy.rma.no.776462))) # == 12275 sum(is.na(exprs(ensg.rma.no.776462))) # == 0 

Incomprehensible.

Edit3: No "NA" or "NaNs" in the "source data"

For what it's worth, I tried reading the values ​​of the source level at the probe level to check if they contain NA or NaN s. The following .CEL files do one after another and check for any missing values.

 for (file in cel.files) { affybtch <- suppressWarnings(read.affybatch(filenames = file)) tmp <- exprs(affybtch) cat(file, "done.\n") if (any(is.na(tmp))) stop(paste("NAs or NaNs are present in", file)) } 

No NA or NaN .

+6
source share

All Articles