Creating multiple datasets and applying a function and outputting multiple datasets

Question

Here is my problem, which is just hard for me ...

I want to create several datasets, apply a function to each of them, and collect the corresponding output in one or more datasets (whichever is possible).

Here is my example, although in practice I need to create a large number of variables and datasets:

    seed <- round(runif(10)*1000000)

    datagen <- function(x){
      set.seed(x)
      var <- rep(1:3, c(rep(3, 3)))
      yvar <- rnorm(length(var), 50, 10)
      matrix <- matrix(sample(1:10, c(10*length(var)), replace = TRUE), ncol = 10)
      mydata <- data.frame(var, yvar, matrix)
    }

    gdt <- lapply (seed, datagen)

    # resulting list (I believe is correct term) has 10 dataframes:
    # gdt[1] .......to gdt[10]

    # my function, this will perform anova in every component data frames and
    # output probability coefficients...
    anovp <- function(x){
      ind <- 3:ncol(x)
      out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]])
      pval <- out$coefficients[,4][2]
      pval <- do.call(rbind,pval)
    }

    plist <- lapply (gdt, anovp)

    Error in gdt[x] : invalid subscript type 'list'

This does not work. I tried different options but could not figure it out, so I finally decided to bother the experts. Sorry for that ...

My questions:

(1) Is it possible to deal with this situation in this way, or are there other alternatives for processing such multiple data sets?

(2) If it is correct, how can I do it?

Thank you for your attention, and I will appreciate your help ...

+4

r

jon Sep 04 '11 at 13:38

1 answer

Richie Cotton · Accepted Answer · 2011-09-04T14:20:05+0000

You have the basic idea right: create a list of data frames and then use lapply to apply a function to each element of the list. Unfortunately, there are a few oddities in your code.

It makes no sense to randomly generate a seed and then set it. You only need set.seed to make random numbers reproducible. Cut the line

 seed <- round(runif(10)*1000000)

and, perhaps,

 set.seed(x)

rep(1:3, c(rep(3, 3))) is the same as rep(1:3, each = 3).

Don't name your variables var or matrix, ~~because they will mask the names of these functions~~ since it is confusing.
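As a rough sketch of how the three points above could be combined, here is one possible rewrite of datagen. The names group, predictors and n_per_group, the seed value 2011, and the use of replicate in place of lapply over seeds are all my own choices for illustration, not something from the question.

```r
# Seed once, up front, so the whole run is reproducible.
set.seed(2011)  # arbitrary value, chosen only for this illustration

# Hypothetical rewrite of datagen(): no per-call seed, rep(..., each = ),
# and no variables named var or matrix.
datagen <- function(n_per_group = 3) {
  group <- rep(1:3, each = n_per_group)
  yvar <- rnorm(length(group), mean = 50, sd = 10)
  predictors <- matrix(
    sample(1:10, 10 * length(group), replace = TRUE),
    ncol = 10
  )
  data.frame(group, yvar, predictors)
}

# replicate() with simplify = FALSE returns a plain list of 10 data frames.
gdt <- replicate(10, datagen(), simplify = FALSE)
```

Each element of gdt is then a data frame with 9 rows, just as in the question.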

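3:ncol(x) is dangerous. If x has fewer than 3 columns, it does not do what you think: the colon operator counts down, so with two columns you get 3:2, which is c(3, 2), rather than an empty index.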
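A small illustration of the pitfall, using a made-up two-column data frame (small is a hypothetical name). One possible safer alternative, which is my suggestion rather than the only option, is to build the full index with seq_len and drop the first two positions.

```r
# A made-up data frame with only two columns
small <- data.frame(a = 1:3, b = 4:6)

# The colon operator counts down when the left side is larger
risky <- 3:ncol(small)                  # 3 2

# seq_len() never counts down, so dropping the first two gives nothing
safe <- seq_len(ncol(small))[-(1:2)]    # integer(0)
```

With 12 columns, as in the question, both expressions give 3 to 12.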

... and now the problem you really wanted to solve.

The problem is the line out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]]).

lapply passes the data frames themselves to anovp, not their indices, so the x in gdt[x] is a data frame. That is what causes the error.

One more thing. While you are rewriting this line, note that lm takes a data argument, so you don't need to do things like gdt$some_column ; you can just reference some_column directly.
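A minimal sketch of what the repaired function might look like. The toy data here is made up for the example (make_df and toy_list are hypothetical names), and I have used only two predictors and 30 rows to keep it short; the point is just that x is used directly and passed as the data argument.

```r
set.seed(1)  # arbitrary, only so the illustration is reproducible

# Hypothetical stand-in for one element of gdt
make_df <- function() {
  data.frame(
    var  = rep(1:3, each = 10),
    yvar = rnorm(30, 50, 10),
    X1   = sample(1:10, 30, replace = TRUE),
    X2   = sample(1:10, 30, replace = TRUE)
  )
}
toy_list <- list(make_df(), make_df())

# x is already a data frame, so there is no need to index back into the list;
# columns are referenced by name through the data argument.
anovp <- function(x) {
  lm(yvar ~ X1 + X2, data = x)
}

fits <- lapply(toy_list, anovp)
```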

EDIT: further tips.

You always use the formula yvar ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10. Since it is the same every time, create it once before calling lapply.

 independent_vars <- paste(colnames(gdt[[1]])[-1:-2], collapse = " + ")
 model_formula <- formula(paste("yvar", independent_vars, sep = " ~ "))

I probably wouldn't bother with the anovp function. Just do

 models <- lapply(gdt, function(data) lm(model_formula, data))

Then add a further call to lapply to play with the coefficients if necessary. The next line replicates your anovp code, but it will not work, because model$coefficients is a vector (so the dimensions are wrong). Adjust it to get the bit you really want.

 coeffs <- lapply(models, function(model) do.call(rbind, model$coefficients[,4][2]))
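If the bit you want is the p-values, one way to get them (a sketch, not necessarily what you are after) is from the coefficient matrix returned by summary, which has a "Pr(>|t|)" column. The toy data below is made up for the example: it uses 30 rows and two predictors, because with 9 rows and 10 predictors, as in the question, there are no residual degrees of freedom left to compute p-values from. The names toy, pvals and pval_table are hypothetical.

```r
set.seed(42)  # arbitrary seed for the illustration

# Made-up data: a list of 3 data frames, 30 rows each
toy <- replicate(
  3,
  data.frame(
    yvar = rnorm(30, 50, 10),
    X1   = sample(1:10, 30, replace = TRUE),
    X2   = sample(1:10, 30, replace = TRUE)
  ),
  simplify = FALSE
)

models <- lapply(toy, function(data) lm(yvar ~ X1 + X2, data = data))

# summary() gives a matrix of estimates, standard errors, t values and p-values
pvals <- lapply(models, function(model) {
  summary(model)$coefficients[, "Pr(>|t|)"]
})

# One row per model, one column per coefficient
pval_table <- do.call(rbind, pvals)
```

Here do.call(rbind, ...) is applied to the whole list of p-value vectors, which gives a single matrix of results.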
