Using dplyr and broom to compute kmeans on a training and test set

Question

I am using dplyr and broom to compute kmeans on my data. My data contains a test set and a training set of X and Y coordinates, grouped by a parameter value (lambda in this case):

    mds.test = data.frame()
    for(l in seq(0.1, 0.9, by=0.2)) {
      new.dist <- run.distance.model(x, y, lambda=l)
      mds <- preform.mds(new.dist, ndim=2)
      mds.test <- rbind(mds.test, cbind(mds$space, design[,c(1,3,4,5)], lambda=rep(l, nrow(mds$space)), data="test"))
    }

    > head(mds.test)
                            Comp1       Comp2 Transcripts Genes Timepoint Run lambda data
    7A_0_AAGCCTAGCGAC -0.06690476 -0.25519106       68125  9324     Day 0  7A    0.1 test
    7A_0_AAATGACTGGCC -0.15292848  0.04310200       28443  6746     Day 0  7A    0.1 test
    7A_0_CATCTCGTTCTA -0.12529445  0.13022908       27360  6318     Day 0  7A    0.1 test
    7A_0_ACCGGCACATTC -0.33015913  0.14647857       23038  5709     Day 0  7A    0.1 test
    7A_0_TATGTCGGAATG -0.25826098  0.05424976       22414  5878     Day 0  7A    0.1 test
    7A_0_GAAAAAGGTGAT -0.24349387  0.08071162       21907  6766     Day 0  7A    0.1 test

The head of the test data set is shown above; I also have one called mds.train, which contains my training data coordinates. My ultimate goal is to run k-means on both sets grouped by lambda, then compute the within.ss, between.ss and total.ss for the test data on the training centers. Thanks to the excellent resource on broom, I can run kmeans for each lambda on the test set by simply doing the following:

    test.kclusts = mds.test %>%
      group_by(lambda) %>%
      do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))

Then I can calculate the centers of this data for each cluster in each lambda:

    test.clusters = test.kclusts %>%
      group_by(lambda) %>%
      do(tidy(.$kclust[[1]]))

This is where I am stuck. How do I compute the cluster assignments as shown on the reference page (for example, kclusts %>% group_by(k) %>% do(augment(.$kclust[[1]], points.matrix)) ), where my points.matrix is mds.test, a data.frame with length(unique(mds.test$lambda)) times as many rows as it should have? And is there a way to use the centers from the training set to compute glance() statistics based on the test assignments?
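To make the first question concrete, here is a minimal sketch of pairing each lambda's points with its own fit, so that every fit only sees the rows belonging to its group. It uses base R (split and lapply) rather than broom's augment(), and the data frame toy, its values, and the .cluster column are made up for illustration; only the column names Comp1, Comp2 and lambda follow the data above.

```r
# Toy stand-in for mds.test: two lambda groups of 20 points each (hypothetical data).
set.seed(42)
toy <- data.frame(
  Comp1  = rnorm(40),
  Comp2  = rnorm(40),
  lambda = rep(c(0.1, 0.3), each = 20)
)

# Split the rows by lambda so each kmeans fit is matched with exactly its own points.
pieces <- split(toy, toy$lambda)

assigned <- lapply(pieces, function(d) {
  fit <- kmeans(d[, c("Comp1", "Comp2")], centers = 2)
  d$.cluster <- factor(fit$cluster)  # one cluster label per row of this group
  d
})

# Stack the groups back into a single data frame with one row per original point.
assignments <- do.call(rbind, assigned)
```

Because each group is fitted and labelled on its own subset, the result has the same number of rows as the input instead of one copy of the data per lambda.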

Any help would be greatly appreciated! Thanks!

EDIT: Progress update. I figured out how to aggregate the test/training assignments, but I am still having trouble computing kmeans statistics across both sets (training assignments on the test centers and test assignments on the training centers). Updated code is below:

    test.kclusts = mds.test %>%
      group_by(lambda) %>%
      do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
    test.clusters = test.kclusts %>%
      group_by(lambda) %>%
      do(tidy(.$kclust[[1]]))
    test.clusterings = test.kclusts %>%
      group_by(lambda) %>%
      do(glance(.$kclust[[1]]))
    test.assignments = left_join(test.kclusts, mds.test) %>%
      group_by(lambda) %>%
      do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))

    train.kclusts = mds.train %>%
      group_by(lambda) %>%
      do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
    train.clusters = train.kclusts %>%
      group_by(lambda) %>%
      do(tidy(.$kclust[[1]]))
    train.clusterings = train.kclusts %>%
      group_by(lambda) %>%
      do(glance(.$kclust[[1]]))
    train.assignments = left_join(train.kclusts, mds.train) %>%
      group_by(lambda) %>%
      do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))

    test.assignments$data = "test"
    train.assignments$data = "train"
    merge.assignments = rbind(test.assignments, train.assignments)

    merge.assignments %>% filter(., data=='test') %>% group_by(lambda) ... ?
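The statistic being asked for can be sketched in base R as follows. This is only an illustration under stated assumptions: ss_on_centers is a hypothetical helper (it is not part of broom or stats), each point is assigned to its nearest center by Euclidean distance, and betweenss is defined as totss - tot.withinss, following the convention kmeans() itself uses. Since the training centers are generally not the means of the test clusters, that last value is a convention rather than a true decomposition.

```r
# Hypothetical helper (not part of broom): sums of squares for `points`
# when each point is assigned to its nearest row of `centers`.
ss_on_centers <- function(points, centers) {
  points  <- as.matrix(points)
  centers <- as.matrix(centers)

  # Squared Euclidean distance from every point to every center
  # (rows = points, columns = centers).
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(points, 2, centers[k, ])^2)
  })
  d2 <- matrix(d2, nrow = nrow(points))  # keep matrix shape even for one point

  cluster      <- apply(d2, 1, which.min)  # index of the nearest center
  tot.withinss <- sum(d2[cbind(seq_len(nrow(points)), cluster)])
  totss        <- sum(sweep(points, 2, colMeans(points))^2)

  list(cluster      = cluster,
       totss        = totss,
       tot.withinss = tot.withinss,
       betweenss    = totss - tot.withinss)  # same convention as kmeans()
}

# Toy demonstration with made-up data: fit on "train", score "test".
set.seed(1)
train <- cbind(Comp1 = rnorm(60), Comp2 = rnorm(60))
test  <- cbind(Comp1 = rnorm(30), Comp2 = rnorm(30))
fit   <- kmeans(train, centers = 3)
res   <- ss_on_centers(test, fit$centers)
```

Applied per lambda, the same helper would take the Comp1/Comp2 columns of the test rows for one lambda together with the centers of the training fit for that same lambda.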

I've attached a plot below that illustrates my progress so far. To reiterate, I would like to compute kmeans statistics (within sum of squares, total sum of squares, and between sum of squares) for the training data centers on the test assignments/coordinates (the plots where the centers look off):