Multivariate outlier detection using R with probability

I have searched everywhere for the best method to identify multivariate outliers using R, but I don't think I have found a convincing approach yet.

We can take the iris data as an example, since my data also contains several fields:

    data(iris)
    df <- iris[, 1:4]  # only taking the four numeric fields

First, I used the Mahalanobis distance from the MVN library:

    library(MVN)
    result <- mvOutlier(df, qqplot = TRUE, method = "quan")      # non-adjusted
    result <- mvOutlier(df, qqplot = TRUE, method = "adj.quan")  # adjusted Mahalanobis distance

Both flagged a large number of outliers (50 out of 150 for the unadjusted method and 49 out of 150 for the adjusted one), which I think needs to be improved. Unfortunately, I cannot find a parameter in mvOutlier to set a threshold (say, one that raises the probability required for a point to count as an outlier, so that fewer points are flagged).
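One way to get an adjustable threshold is to compute the distances directly in base R, as a minimal sketch. It uses the classical (non-robust) mean and covariance, so it is not the same estimator that mvOutlier uses, and it relies on the assumption that the data are roughly multivariate normal, in which case the squared Mahalanobis distance is approximately chi-square distributed with as many degrees of freedom as there are columns. The helper name flag_by_prob and the two probabilities are illustrative choices, not part of any package.

```r
data(iris)
df <- iris[, 1:4]  # the four numeric fields

# Squared Mahalanobis distance of every row from the sample mean,
# using the classical (non-robust) mean and covariance
d2 <- mahalanobis(df, center = colMeans(df), cov = cov(df))

# Hypothetical helper: flag rows whose squared distance exceeds the
# chi-square quantile at probability `prob` (p = number of columns)
flag_by_prob <- function(d2, p, prob) {
  d2 > qchisq(prob, df = p)
}

flags_975 <- flag_by_prob(d2, ncol(df), prob = 0.975)  # looser cutoff
flags_999 <- flag_by_prob(d2, ncol(df), prob = 0.999)  # stricter cutoff

sum(flags_975)    # number of rows flagged at 0.975
sum(flags_999)    # number of rows flagged at 0.999
which(flags_999)  # row indices flagged at the stricter cutoff
```

Raising prob moves the cutoff further into the tail, so the stricter setting can never flag more rows than the looser one.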

Second, I used the outliers library. It is designed to find univariate outliers. My plan is therefore to find the outliers in each dimension of the data, and to treat the points that are outliers in all dimensions as outliers of the data set.

    library(outliers)
    result <- scores(df, type = "t", prob = 0.95)  # t test, probability is 0.95
    result <- subset(result, result$Sepal.Length == TRUE & result$Sepal.Width == TRUE &
                             result$Petal.Length == TRUE & result$Petal.Width == TRUE)

With this approach we can set the probability, but I do not think it can replace multivariate outlier detection.

Some other approaches I've tried:

  • library(mvoutlier): this only shows a plot. It is difficult to detect outliers automatically, and I don't know how to add a probability to it.
  • Cook's distance (link): the person said that he used Cook's distance, but I don't think there is strong academic evidence that this is sound.
r outliers multivariate-testing mahalanobis
1 answer

I will leave you with these two links: the first is an article about different multivariate outlier detection methods, and the second shows how to implement them in R.

Cook's distance is a valid way to look at the influence a data point has and, as such, helps to detect outlying points. Mahalanobis distance is also used regularly.
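As a rough sketch of the Cook's distance idea in base R: Cook's distance measures influence on a fitted regression, so one column has to play the role of the response. Using Sepal.Length as that response is an arbitrary assumption made here for illustration, and the 4/n cutoff is a common rule of thumb rather than a formal test.

```r
data(iris)
df <- iris[, 1:4]

# Assumption: regress Sepal.Length on the other three numeric columns
fit <- lm(Sepal.Length ~ ., data = df)

# Cook's distance for every observation in the fitted model
cooksd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, not a formal test): 4 / n
threshold <- 4 / nrow(df)
influential <- which(cooksd > threshold)

influential  # row indices with unusually high influence on the fit
```

A different choice of response variable can flag different rows, which is one reason this is better read as a diagnostic for a specific model than as a general multivariate outlier test.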

For your test example, the iris data set is not useful. It is used for classification problems because its classes are clearly separable. Excluding 50 data points would get rid of an entire species.
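If you still want to experiment with iris, one option is to compute the distances within each species rather than on the pooled data. The following is only a sketch under the same assumptions as any classical Mahalanobis check (non-robust mean and covariance, approximate multivariate normality within each group); the helper name flag_outliers and the 0.975 probability are illustrative choices.

```r
data(iris)

prob <- 0.975                    # assumed cutoff probability
cutoff <- qchisq(prob, df = 4)   # 4 numeric columns

# Hypothetical helper: flag rows of a numeric data frame whose squared
# Mahalanobis distance from the group mean exceeds the cutoff
flag_outliers <- function(x, cutoff) {
  d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
  d2 > cutoff
}

# Apply the helper separately to the rows of each species
flags <- logical(nrow(iris))
for (sp in levels(iris$Species)) {
  idx <- iris$Species == sp
  flags[idx] <- flag_outliers(iris[idx, 1:4], cutoff)
}

table(iris$Species, flags)  # flagged vs. unflagged rows per species
```

Each row is then compared with the center and spread of its own species, so a whole species is not pushed past the cutoff merely for lying far from the pooled mean.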

Outlier detection in multivariate data:

http://www.m-hikari.com/ams/ams-2015/ams-45-48-2015/13manojAMS45-48-2015-96.pdf

Implementation in R:

http://r-statistics.co/Outlier-Treatment-With-R.html
