Cubic spline method for given longitudinal rows?

I have serial data formatted as follows:

time milk Animal_ID 30 25.6 1 31 27.2 1 32 24.4 1 33 17.4 1 34 33.6 1 35 25.4 1 33 29.4 2 34 25.4 2 35 24.7 2 36 27.4 2 37 22.4 2 80 24.6 3 81 24.5 3 82 23.5 3 83 25.5 3 84 24.4 3 85 23.4 3 . . . 

As a rule, 300 animals have records of milk at different points in time of a short period. However, if we combine their data and do not care about different animal_IDs, we will have a curve between such milky times, as shown in the figure below: enter image description here In addition, in the above figure, we have data for 1 example of an animal, they are short and vary greatly. Mine is designed to smooth out animal data, but that would be if this model allowed you to include a common dataset from whole data. I used another smooth model (ns, bs, smooth.spline) with the following format, but it just didn't work:

 mod <- lme(milk ~ bs(time, df=3), data=dat, random = ~1|Animal_ID) 

I hope that if someone has already dealt with this problem, I would advise. Thanks Access to the complete dataset can be obtained here: https://www.dropbox.com/s/z9b5teh3su87uu7/dat.txt?dl=0

+6
source share
1 answer

I suggest you use the mgcv package. This is one of the recommended R packages that runs a class of models called generic mixed mixed models . You can simply download it using library(mgcv) . This is a very powerful library that can work with the simplest linear regression model, with generalized linear models, for additive models, with generalized additive models, as well as models with mixed effects (fixed effects + random effects). You can list all the (exported) mgcv functions through

 ls("package:mgcv") 

And you can see that there are many.

For your specific data and problems you can use a model with the formula:

 model <- milk ~ s(time, bs = 'cr', k = 100) + s(Animal_ID, bs = 're') 

In mgcv , s() is the setting for smooth functions, represented by the spline base, implied by bs . "cr" is the cubic spline basis that you want. k is the number of nodes. It should be selected depending on the number of unique values ​​of the time variable in your dataset. If you set k exactly this number, you will get a smoothing spline; while any value is less than the regression spline. However, both will be punished (if you know what punishment means). I read your details in:

 dat <- na.omit(read.csv("data.txt", header = TRUE)) ## I saved you data into file "data.txt" dat$Animal_ID <- factor(dat$Animal_ID) nrow(dat) ## 12624 observations length(unique(dat$time)) ## 157 unique time points length(ID <- levels(dat$Animal_ID)) ## 355 cows 

There are 157 unique values, so I think k = 100 is probably appropriate.

For Animal_ID (forced as a factor), we need a model term for a random effect. "re" is a special class for the random iid effect. It is passed to bs for some internal reason for constructing the matrix (so this is not a smooth function!).

Now, to fit the GAM model, you can call the obsolete gam or the ever-evolving bam (big data gam). I think you will use the latter. They have the same convention as for lm and glm . For example, you can:

 fit <- bam(model, data = dat, family = "gaussian", discrete = TRUE, nthreads = 2) 

As you can see, bam allows multi-core parallel computing via nthreads . Although discrete is a newly developed feature that can accelerate matrix formation.

Since you are dealing with time series data, finally, you can consider some temporal autocorrelation. mgcv allows mgcv to configure the correlation AR1, the correlation coefficient of which is passed by the argument bam rho . However, you need an additional AR_start index to tell mgcv how the time series is broken into pieces. For example, when you reach another Animal_ID , AR_start get a TRUE to indicate a new time series segment. See ?bam more details.

mgcv also provides

  • summary.gam function for model summary
  • gam.check to check the base model
  • plot.gam function for building individual terms
  • predict.gam (or predict.bam ) to predict new data.

For example, a summary of the above model:

 > summary(fit) Family: gaussian Link function: identity Formula: milk ~ s(time, bs = "cr", k = 100) + s(Animal_ID, bs = "re") Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 26.1950 0.2704 96.89 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Ref.df F p-value s(time) 10.81 13.67 5.908 1.99e-11 *** s(Animal_ID) 351.43 354.00 136.449 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-sq.(adj) = 0.805 Deviance explained = 81.1% fREML = 29643 Scale est. = 5.5681 n = 12624 

edf (effective degree of freedom) can be considered as a measure of the degree of non-linearity. Therefore, we put k = 100 , and ending with edf = 10.81 . This suggests that the s(time) spline was heavily fined. You can see what s(time) looks like by:

 plot.gam(fit, page = 1) 

Note that the random effect s(Animal_ID) also has a β€œsmooth” effect, that is, a constant depending on the cows. For random effects, a Gaussian QQ plot is returned.

Diagnostic data returned

 invisible(gam.check(fit)) 

It looks good, so I think the model is acceptable (I do not offer you a choice of model, so come up with a better model if you think you have one).

If you want to make a prediction for Animal_ID = 26 , you can do

 newd <- data.frame(time = 1:150, Animal_ID = 26) oo <- predict.gam(fit, newd, type = `link`, se.fit = TRUE) 

note that

  • You need to include both variables in newd (otherwise mgcv complains about the missing variable)
  • since you only have one spline smooth s(time) , and the random effect s(Animal_ID) is a constant on Animal_ID . therefore, you can use type = 'link' for individual forecasting. By the way, type = 'terms' slower than type = 'link' .

If you want to make a prediction for more than one cow, try something like this:

 pred.ID <- ID[1:10] ## predict first 10 cows newd <- data.frame (time = rep (1:150, times = n), Animal_ID = factor (rep (pred.ID, each = 150))) oo <- predict.bam (fit, newd, type = "link", se.fit = TRUE) 

note that

  • I used predict.bam here, since now we have 150 * 10 = 1500 data points for forecasting. Plus: we need se.fit = TRUE . It's quite expensive, so use predict.bam faster than predict.gam . In particular, if you set your model using bam(..., discrete = TRUE) , you can have predict.bam(..., discrete = TRUE) . The forecasting process goes through the same steps of matrix formation as in fitting the model (see ?smoothCon , used in fitting the model and ?PredictMat , used in forecasting, if you want to know more internal structure of mgcv .)
  • I indicated Animal_ID as factors because it is a random effect.

More information about mgcv can be found in the library manual. Check specifically ?mgcv ?gam


Final update

Although I said that I would not help you in the model section, but I think this model is better (it gives a higher adj-Rsquared ), and also more reasonable in the sense of:

 model <- milk ~ s(time, bs = 'cr', k = 20) + s(Animal_ID, bs = 're') + s(Animal_ID, time, bs = 're') 

The latter term imposes a random rollback . This implies that we assume that each individual cow has a different growing / decreasing structure for milk production . This is a more reasonable assumption in your problem. An early model with only occasional interception is not enough. After adding this random dump, the smooth term s(time) looks more smoothly. This is a good sign, not a bad sign, because we need a simple explanation of s(time) , right? Compare the s(time) that you get from both models and see what you find.

I also reduced k = 100 to k = 20 . As we saw in the previous approach, edf for this term is about 10, so k = 20 enough.

+5
source

All Articles