I have searched everywhere for the best method for detecting multivariate outliers in R, but I have not found a convincing approach so far.
We can take the iris data as an example, since my data also contains several numeric fields:
data(iris)
df <- iris[, 1:4]
First, I used the Mahalanobis distance from the MVN library:
library(MVN)
result <- mvOutlier(df, qqplot = TRUE, method = "quan")      # non-adjusted Mahalanobis distance
result <- mvOutlier(df, qqplot = TRUE, method = "adj.quan")  # adjusted Mahalanobis distance
Both produced a large number of outliers (50 out of 150 for the non-adjusted method and 49 out of 150 for the adjusted one), which I think is too many. Unfortunately, I cannot find an argument in mvOutlier for setting a threshold (say, raising the probability required for a point to be declared an outlier, so that fewer points are flagged).
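To make the kind of threshold I have in mind concrete, here is a minimal sketch using only base R (`mahalanobis` and `qchisq` from stats). It is not what mvOutlier does internally: it uses the classical mean and covariance rather than robust estimates, and the `prob` variable is a hypothetical knob I introduced for illustration.

```r
# Base-R sketch: squared Mahalanobis distances with an explicit, adjustable cutoff.
data(iris)
df <- iris[, 1:4]

# Squared distance of each row from the column means, using the sample covariance.
d2 <- mahalanobis(df, center = colMeans(df), cov = cov(df))

# Hypothetical tuning knob: a higher probability gives a stricter cutoff,
# so fewer points are flagged. Under multivariate normality the squared
# distances are approximately chi-square with ncol(df) degrees of freedom.
prob <- 0.999
cutoff <- qchisq(prob, df = ncol(df))

# Row indices whose squared distance exceeds the cutoff.
outlier_rows <- which(d2 > cutoff)
length(outlier_rows)
```

Raising `prob` toward 1 can only shrink the set of flagged rows, which is the behaviour I am looking for in a packaged method.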
Second, I used the outliers library. It is designed to find univariate outliers, so my plan was to find the outliers in each dimension of the data and to treat the points that are outliers in every dimension as outliers of the data set.
library(outliers)
result <- scores(df, type = "t", prob = 0.95)
Here the probability can be set, but I do not think this can replace genuine multivariate outlier detection.
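The per-dimension plan can be sketched as follows. This is a base-R stand-in that uses plain z-scores instead of `outliers::scores`, so the flags will not match that function exactly; `prob` and the two-sided cutoff are assumptions made for the illustration.

```r
# Base-R sketch: flag values per column, then combine the flags per row.
data(iris)
df <- iris[, 1:4]

# Standardise each column to z-scores (mean 0, standard deviation 1).
z <- scale(df)

# Hypothetical two-sided cutoff on |z| for the chosen probability.
prob <- 0.95
cutoff <- qnorm(1 - (1 - prob) / 2)

# Logical matrix: TRUE where a value is extreme within its own column.
flag <- abs(z) > cutoff

# Rows that are extreme in every column, and rows extreme in at least one.
all_dims <- which(apply(flag, 1, all))
any_dim  <- which(apply(flag, 1, any))
```

Requiring every column to be extreme is very strict, and it also misses points that are unusual only in combination (ordinary on each axis but far from the bulk of the data jointly), which is why I doubt it can stand in for a multivariate method.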
Some other approaches I have tried:
- library(mvoutlier): this only produces plots, so it is difficult to detect outliers automatically, and I don't know how to add a probability threshold to it.
- Cook's distance (link): the poster said they used Cook's distance, but I don't think there is strong academic evidence that this is appropriate.
r outliers multivariate-testing mahalanobis
Duy bui