Using multicore in R to analyze GWAS data

I use R to analyze genome-wide association study (GWAS) data. I have about 500,000 potential predictor variables (single nucleotide polymorphisms, or SNPs), and I want to check the relationship between each of them and a continuous outcome (in this case, the concentration of low-density lipoprotein, LDL, in the blood).

I already wrote a script that does this without problems. Briefly: I have a data object called Data. Each row corresponds to a patient in the study. There are columns for age, gender, body mass index (BMI) and blood LDL concentration. There are also half a million more columns with SNP data.

I am currently using a for loop to run a linear model half a million times, as shown below:

    # Preallocate a matrix to hold the p value and estimate for each SNP
    results <- matrix(NA_real_, 500000, 2)

    # Repeat loop half a million times
    for (i in 1:500000) {
        # Select the appropriate SNP column
        SNP <- Data[[i]]
        # For each iteration, perform linear regression adjusted for age,
        # gender, and BMI and save the result in an object called "GenoMod"
        GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
        # For each model, save the p value and estimate for the SNP term
        # in columns 1 and 2 of the "results" matrix
        results[i, 1] <- summary(GenoMod)$coefficients["SNP", "Pr(>|t|)"]
        results[i, 2] <- summary(GenoMod)$coefficients["SNP", "Estimate"]
    }

It all works great. However, I would really like to speed up the analysis, so I have experimented with the multicore, doMC and foreach packages.

My question is: could someone help me adapt this code using the foreach scheme?

I am running the script on a Linux server that has 16 cores. I experimented with the foreach package, but my results with it were worse: the analysis actually took more time with foreach than with the plain loop.

For example, I tried to save linear model objects as shown:

    library(doMC)
    registerDoMC()

    results <- foreach(i = 1:500000) %dopar% {
        SNP <- Data[[i]]
        lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
    }

This takes more than twice as long as the regular loop. Any tips on how to do this better or faster would be appreciated! I understand that a parallel version of lapply may be an option, but I don't know how to use it.
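For reference, the usual culprit when a foreach loop over hundreds of thousands of tiny tasks is slower than the serial loop is per-task scheduling overhead. One common workaround is to chunk the work so each parallel task runs many regressions. This is only a sketch of that idea: the data below are simulated stand-ins (small sizes, made-up column contents) for the real half-million-column Data object.

```r
# Sketch: chunk the SNP indices so each foreach task fits many models,
# amortizing the per-task overhead. All data here are simulated.
library(foreach)
library(doMC)        # assumed available on the Linux server
registerDoMC(2)

set.seed(42)
n <- 100; p <- 20    # patients, SNPs (tiny stand-ins for 500,000)
Data <- as.data.frame(matrix(rbinom(n * p, 2, 0.3), n, p))
Data$Age    <- rnorm(n, 50, 10)
Data$Gender <- rbinom(n, 1, 0.5)
Data$BMI    <- rnorm(n, 27, 4)
Data$bloodLDLlevel <- rnorm(n, 130, 25)

# Split the SNP indices into 4 chunks; each chunk becomes one parallel task
chunks <- split(1:p, cut(1:p, 4, labels = FALSE))
results <- foreach(idx = chunks, .combine = rbind) %dopar% {
    out <- matrix(NA_real_, length(idx), 2)
    for (j in seq_along(idx)) {
        SNP <- Data[[idx[j]]]
        co  <- summary(lm(bloodLDLlevel ~ SNP + Age + Gender + BMI,
                          data = Data))$coefficients
        out[j, ] <- co["SNP", c("Pr(>|t|)", "Estimate")]
    }
    out
}
```

With 500,000 SNPs and 16 cores, one chunk per core (or a small multiple of the core count) keeps the number of tasks low while balancing the load.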

All the best

Alex

1 answer

For starters: since you are on Linux, you can use the multicore approach from the parallel package, which ships with R. Unlike the foreach setup, there is nothing to register or configure. Your code will run on 16 cores simply by doing:

    require(parallel)

    mylm <- function(i) {
        SNP <- Data[[i]]
        GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
        # Return the p value and estimate for the SNP term as a vector
        c(summary(GenoMod)$coefficients["SNP", "Pr(>|t|)"],
          summary(GenoMod)$coefficients["SNP", "Estimate"])
    }

    Out <- mclapply(1:500000, mylm, mc.cores = 16)  # returns a list
    Result <- do.call(rbind, Out)                   # bind the list into a matrix

Here you define a function that returns a vector with the required quantities, and mclapply applies it over all the indices. I could not verify this, since I do not have access to the data, but it should work.
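As a quick self-contained check of the same pattern, you can run it on simulated data. Everything below is made up for illustration (sizes, seed, column contents); only the structure matches your problem.

```r
# Minimal check of the mclapply pattern on simulated data.
require(parallel)

set.seed(1)
n <- 100; p <- 20    # patients, SNPs (tiny stand-ins for 500,000)
Data <- as.data.frame(matrix(rbinom(n * p, 2, 0.3), n, p))
Data$Age    <- rnorm(n, 50, 10)
Data$Gender <- rbinom(n, 1, 0.5)
Data$BMI    <- rnorm(n, 27, 4)
Data$bloodLDLlevel <- rnorm(n, 130, 25)

mylm <- function(i) {
    SNP <- Data[[i]]
    GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
    # p value and estimate for the SNP term
    c(summary(GenoMod)$coefficients["SNP", "Pr(>|t|)"],
      summary(GenoMod)$coefficients["SNP", "Estimate"])
}

Out    <- mclapply(1:p, mylm, mc.cores = 2)  # list of length p
Result <- do.call(rbind, Out)                # p x 2 matrix
```

One caveat: mclapply forks the R process, so it works as written on Linux but falls back to serial execution (mc.cores = 1 only) on Windows.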
