Outlier detection using the k-means algorithm

I hope you can help me with my problem. I am trying to detect outliers using the k-means algorithm. First, I run the algorithm and flag as possible outliers those objects that lie far from their cluster center. Instead of the absolute distance, I want to use a relative distance, i.e. the ratio of an object's absolute distance to its cluster center to the average distance of all objects in that cluster to the center. The code for outlier detection based on absolute distance is as follows:

    # remove species from the data to cluster
    iris2 <- iris[,1:4]
    kmeans.result <- kmeans(iris2, centers=3)

    # cluster centers
    kmeans.result$centers

    # calculate distances between objects and cluster centers
    centers <- kmeans.result$centers[kmeans.result$cluster, ]
    distances <- sqrt(rowSums((iris2 - centers)^2))

    # pick top 5 largest distances
    outliers <- order(distances, decreasing=TRUE)[1:5]

    # who are outliers
    print(outliers)

But how can I use relative instead of absolute distance to find outliers?

2 answers

You just need to calculate the average distance of the observations in each cluster from their cluster center. You already have the individual distances, so you only need to average them by cluster. The rest is a simple indexed division:

    # calculate mean distances by cluster:
    m <- tapply(distances, kmeans.result$cluster, mean)

    # divide each distance by the mean for its cluster:
    d <- distances/(m[kmeans.result$cluster])

Your outliers (the labels above the values are cluster numbers, not row indices):

    > d[order(d, decreasing=TRUE)][1:5]
           2        3        3        1        3
    2.706694 2.485078 2.462511 2.388035 2.354807
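Since the output above is labelled by cluster rather than by observation, here is a minimal sketch of one way to get the row indices of the same kind of ranking. It is an alternative phrasing rather than the code that produced the numbers above: it uses `ave()` in place of `tapply()` plus indexing, and the seed value and the names `km`, `rel` and `top` are assumptions added for illustration. Because k-means starts from random centers, the exact values may differ from run to run without a fixed seed.

```r
set.seed(42)  # assumed seed, only so the clustering is reproducible

# same setup as in the question
iris2 <- iris[, 1:4]
km <- kmeans(iris2, centers = 3)

# Euclidean distance of each row to its own cluster center
centers <- km$centers[km$cluster, ]
distances <- sqrt(rowSums((iris2 - centers)^2))

# ave() returns, for every row, the mean distance within that row's cluster,
# so this is the relative distance asked for in the question
rel <- distances / ave(distances, km$cluster)

# row indices of the five largest relative distances
top <- order(rel, decreasing = TRUE)[1:5]
print(data.frame(row = top, cluster = km$cluster[top], relative = rel[top]))
```

By construction, the relative distances average to 1 within each cluster, so a value of about 2.5 means the point is roughly two and a half times farther from its center than a typical member of that cluster.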

How would you deal with overlapping points in this scenario? In addition to Thomas's answer, there are various algorithms that take relative distance into account, such as LOF (local outlier factor), KNN, and CBOF (connectivity-based outlier detection).
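To give a rough idea of the KNN variant, here is a minimal base-R sketch that scores each observation by the distance to its k-th nearest neighbour. It is only an illustration under stated assumptions: `k <- 5` is an arbitrary choice, and the names `dm`, `knn.score` and `knn.outliers` are made up for this example. It does not implement LOF or CBOF.

```r
# features only; Species is dropped as in the question
x <- as.matrix(iris[, 1:4])

# full pairwise Euclidean distance matrix (fine for 150 rows)
dm <- as.matrix(dist(x))

k <- 5  # assumed neighbourhood size, not taken from any answer here

# for each row, sort its distances and take the k-th nearest neighbour;
# position 1 is the point itself (distance 0), hence index k + 1
knn.score <- apply(dm, 1, function(row) sort(row)[k + 1])

# rows with the largest k-NN distance are the most isolated
knn.outliers <- order(knn.score, decreasing = TRUE)[1:5]
print(knn.outliers)
```

Unlike the k-means approach, this score does not depend on cluster assignments, so duplicated or overlapping observations simply count as very close neighbours of each other.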
