R kmeans (stats) vs Kmeans (amap)

Hi Stack Overflow community,

I am running kmeans (stats package) and Kmeans (amap package) on the iris dataset. In both cases I use the same algorithm (Lloyd-Forgy), the same distance (Euclidean), the same number of random starts (50), the same maximum number of iterations (1000), and the same set of k values (from 2 to 15). I also use the same seed in both cases (4358).

I do not understand why, under these conditions, I get different WSS (within-cluster sum of squares) curves. In particular, the "elbow" is much less pronounced with the stats package than with the amap package.

Could you help me understand why? Thank you very much!

Here is the code:

    # data load and scaling
    newiris <- iris
    newiris$Species <- NULL
    newiris <- scale(newiris)

    # using kmeans (stats)
    wss1 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
    for (i in 2:15) {
      set.seed(4358)
      wss1[i] <- sum(kmeans(newiris, centers=i, iter.max=1000, nstart=50, algorithm="Lloyd")$withinss)
    }

    # using Kmeans (amap)
    library(amap)
    wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
    for (i in 2:15) {
      set.seed(4358)
      wss2[i] <- sum(Kmeans(newiris, centers=i, iter.max=1000, nstart=50, method="euclidean")$withinss)
    }

    # plots
    plot(1:15, wss1, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares", main="kmeans (stats package)")
    plot(1:15, wss2, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares", main="Kmeans (amap package)")

EDIT: I emailed the author of the amap package and will post the answer when / if I receive it. https://cran.r-project.org/web/packages/amap/index.html

1 answer

The author of the amap package changed the code, so the withinss value it returns is computed with the distance selected by method (for example, the Euclidean distance). It is therefore not the same quantity as the withinss returned by kmeans (stats).
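To see why two definitions of "within" can produce differently shaped curves, here is a minimal sketch that uses only stats::kmeans and computes both quantities by hand for one clustering: the sum of squared Euclidean distances to the cluster centres (what kmeans reports as withinss) and the sum of plain, unsquared distances. That amap reports something like the second quantity is an assumption based on the explanation above, not something verified against the amap source; the variable names are illustrative.

```r
# Scaled iris data, as in the question
x <- scale(iris[, 1:4])

set.seed(4358)
fit <- kmeans(x, centers = 3, iter.max = 1000, nstart = 50, algorithm = "Lloyd")

# Euclidean distance from every observation to the centre of its own cluster
d <- sqrt(rowSums((x - fit$centers[fit$cluster, , drop = FALSE])^2))

# Sum of squared distances per cluster: this is what kmeans() calls withinss
ss_squared <- as.numeric(tapply(d^2, fit$cluster, sum))

# Sum of unsquared distances per cluster: a different quantity
# (ASSUMPTION: closer to what amap reports with method = "euclidean")
ss_plain <- as.numeric(tapply(d, fit$cluster, sum))

print(all.equal(ss_squared, as.numeric(fit$withinss)))
print(rbind(squared = ss_squared, plain = ss_plain))
```

Because squaring weights large distances more heavily, the two totals shrink at different rates as k grows, which is enough to change how sharp the elbow looks.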

One way to work around this is to take the result returned by the Kmeans (amap) function and recompute withinss yourself as the error sum of squares (SSE).

Here is my suggestion:

    # using Kmeans (amap)
    library(amap)
    wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
    for (i in 2:15) {
      set.seed(4358)
      ans.Kmeans <- Kmeans(newiris, centers=i, iter.max=1000, nstart=50, method="euclidean")
      wss <- vector(mode = "numeric", length=i)
      for (j in 1:i) {
        # rows assigned to cluster j; drop = FALSE keeps a matrix even for a single row
        km <- newiris[ans.Kmeans$cluster == j, , drop = FALSE]
        ## average = as.matrix( t(apply(km,2,mean) ))
        ## wss[j] = sum( apply(km, 1, function(x) sum((x-average) ^ 2 )))
        ## or (note: gives NA for a cluster that contains a single point)
        wss[j] <- (nrow(km)-1) * sum(apply(km,2,var))
      }
      wss2[i] <- sum(wss)
    }
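The same recomputation can be wrapped in a small function and checked against stats::kmeans, whose withinss is known to be the sum of squares. This is only a sketch: wss_from_clusters is a hypothetical helper name, not part of stats or amap, and it measures deviations from the cluster centroid directly so that single-point clusters contribute 0.

```r
# Hypothetical helper: within-cluster sum of squares from a data matrix
# and a vector of cluster labels, independent of the clustering package.
wss_from_clusters <- function(x, cluster) {
  x <- as.matrix(x)
  sapply(split(seq_len(nrow(x)), cluster), function(idx) {
    block <- x[idx, , drop = FALSE]          # keep matrix shape for 1-row clusters
    sum(sweep(block, 2, colMeans(block))^2)  # squared deviations from the centroid
  })
}

x <- scale(iris[, 1:4])
set.seed(4358)
fit <- kmeans(x, centers = 4, iter.max = 1000, nstart = 50, algorithm = "Lloyd")

# Sanity check: should agree with the withinss reported by stats::kmeans
print(all.equal(unname(wss_from_clusters(x, fit$cluster)), fit$withinss))
```

Applied to the amap result from the loop above, the call would be sum(wss_from_clusters(newiris, ans.Kmeans$cluster)).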

Note: the pearson method in this package is incorrect in version 0.8-14, so be careful.

See line 325 of the code at this link:

https://github.com/cran/amap/blob/master/src/distance_T.inl
