Large-scale outlier detection

I have data like this in a data frame x:

    Team Date     Score
    A    1-1-2012    80
    A    1-2-2012    90
    A    1-3-2012    50
    A    1-4-2012    40
    B    1-1-2012   100
    B    1-2-2012    60
    B    1-3-2012    30
    B    1-4-2012    70
    etc

I need to turn this data frame into a wide data frame (which I can do), with one row for each team, all of its observations, and the dates as the header:

xx

    Team 1-1-2012 1-2-2012 1-3-2012 1-4-2012
    A          80       90       50       40
    B         100       60       30       70

I also need to calculate the mean and sd for each row, which I can do:

xx

    Team 1-1-2012 1-2-2012 1-3-2012 1-4-2012 mean sd
    A          80       90       50       40   75 20
    B         100       60       30       70   55 10

Given that I have thousands of rows in the xx data frame, I would like to do the following calculation for each cell:

If abs(xx - mean) > 3 * sd, create a counter column and increment its value. The idea is to compare each observation with its team's mean and sd: whenever an observation satisfies abs(xx - mean) > 3 * sd, the counter for that team is incremented. After checking every cell, I would like to look at the counter for each team and get the ten teams with the highest counter values. I am mainly trying to detect the biggest outliers. Once I have the top 10 team names, I would like to plot their time series from data frame x.

Hopefully I am not making this more complicated than it needs to be. I am not sure whether R already has a function to perform calculations on each cell. Any ideas on how to do this are appreciated.

2 answers

A long-format data.table approach

    DT <- read.table('clipboard', header = TRUE)
    library(data.table)
    DT <- as.data.table(DT)

    DT[, mean.score := mean(Score), by = Team]
    ##    Team     Date Score mean.score
    ## 1:    A 1-1-2012    80         65
    ## 2:    A 1-2-2012    90         65
    ## 3:    A 1-3-2012    50         65
    ## 4:    A 1-4-2012    40         65
    ## 5:    B 1-1-2012   100         65
    ## 6:    B 1-2-2012    60         65
    ## 7:    B 1-3-2012    30         65
    ## 8:    B 1-4-2012    70         65

    DT[, sd.score := sd(Score), by = Team]
    ##    Team     Date Score mean.score sd.score
    ## 1:    A 1-1-2012    80         65 23.80476
    ## 2:    A 1-2-2012    90         65 23.80476
    ## 3:    A 1-3-2012    50         65 23.80476
    ## 4:    A 1-4-2012    40         65 23.80476
    ## 5:    B 1-1-2012   100         65 28.86751
    ## 6:    B 1-2-2012    60         65 28.86751
    ## 7:    B 1-3-2012    30         65 28.86751
    ## 8:    B 1-4-2012    70         65 28.86751

    DT[, outlier := abs(Score-mean.score) > 3 * sd.score, by = Team]
    ##    Team     Date Score mean.score sd.score outlier
    ## 1:    A 1-1-2012    80         65 23.80476   FALSE
    ## 2:    A 1-2-2012    90         65 23.80476   FALSE
    ## 3:    A 1-3-2012    50         65 23.80476   FALSE
    ## 4:    A 1-4-2012    40         65 23.80476   FALSE
    ## 5:    B 1-1-2012   100         65 28.86751   FALSE
    ## 6:    B 1-2-2012    60         65 28.86751   FALSE
    ## 7:    B 1-3-2012    30         65 28.86751   FALSE
    ## 8:    B 1-4-2012    70         65 28.86751   FALSE

Or in one step

    DT[, outlier := abs(Score - mean(Score)) > 3 * sd(Score), by = Team]

To add the number of outliers per team (summing a logical variable treats FALSE as 0 and TRUE as 1)

    DT[, sum.outlier := sum(outlier), by = Team]
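
The question also asks for the ten teams with the most outliers. A minimal sketch of that last step, assuming the same Team, Date and Score columns as above. The data here is made up purely for illustration (20 hypothetical teams with 30 dates each and three injected extreme scores), and `counts`, `top10` and `top.data` are illustrative names that are not part of the original answer:

```r
library(data.table)

# Made-up long-format data (an assumption for illustration only):
# 20 teams, 30 daily scores each, alternating between 60 and 70
DT <- data.table(
  Team  = rep(sprintf("T%02d", 1:20), each = 30),
  Date  = rep(seq(as.Date("2012-01-01"), by = "day", length.out = 30), times = 20),
  Score = rep(c(60, 70), times = 300)
)

# Inject extreme scores: two for team T07, one for team T03
DT[Team == "T07" & Date %in% as.Date(c("2012-01-05", "2012-01-20")), Score := 500]
DT[Team == "T03" & Date == as.Date("2012-01-10"), Score := 500]

# Flag observations more than 3 standard deviations from their team mean
DT[, outlier := abs(Score - mean(Score)) > 3 * sd(Score), by = Team]

# One row per team with its outlier count, largest count first
counts <- DT[, .(n.outlier = sum(outlier)), by = Team][order(-n.outlier)]
top10  <- head(counts, 10)

# Long-format rows of the top teams only, ready to be plotted as time series
top.data <- DT[Team %in% top10$Team]
```

Because the data stays in long format, `top.data` can be passed directly to whatever plotting tool is preferred, with one line per Team over Date.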

I would leave your data in long format and use plyr, data.table or any other split-apply-combine tool to calculate your statistics. This is how I would use plyr for the task:

    #Your data
    dat <- read.table(text = "Team Date Score
    A 1-1-2012 80
    A 1-2-2012 90
    A 1-3-2012 50
    A 1-4-2012 40
    B 1-1-2012 100
    B 1-2-2012 60
    B 1-3-2012 30
    B 1-4-2012 70", header = TRUE)

    library(plyr)
    #Compute mean and sd by team
    dat <- ddply(dat, .(Team), transform, mean = mean(Score), sd = sd(Score))
    #Your outlier threshold
    dat <- transform(dat, outlier = abs(Score - mean) > 3*sd)
    #Cumulative sum by team
    dat <- ddply(dat, .(Team), transform, cumsumOutlier = cumsum(outlier))

This gives the following result (which doesn't match your example, though presumably your real data will behave differently):

      Team     Date Score mean       sd outlier cumsumOutlier
    1    A 1-1-2012    80   65 23.80476   FALSE             0
    2    A 1-2-2012    90   65 23.80476   FALSE             0
    3    A 1-3-2012    50   65 23.80476   FALSE             0
    4    A 1-4-2012    40   65 23.80476   FALSE             0
    5    B 1-1-2012   100   65 28.86751   FALSE             0
    6    B 1-2-2012    60   65 28.86751   FALSE             0
    7    B 1-3-2012    30   65 28.86751   FALSE             0
    8    B 1-4-2012    70   65 28.86751   FALSE             0
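
One reason every row above is FALSE is worth spelling out: with only four scores per team, the 3 * sd rule cannot fire at all. For a sample of size n, the largest possible value of abs(x - mean(x)) / sd(x) is (n - 1) / sqrt(n) (Samuelson's inequality, with sd() using the n - 1 denominator). A small sketch to check this, where `max.z` and `min.n` are illustrative names and not part of the original answer:

```r
# Largest possible |x - mean(x)| / sd(x) in a sample of size n,
# where sd() uses the n - 1 denominator (Samuelson's inequality)
max.z <- function(n) (n - 1) / sqrt(n)

# With four scores per team the bound is 1.5, so nothing can exceed 3 sd
max.z(4)

# Smallest sample size for which the 3 * sd rule can flag anything
n <- 2:50
min.n <- min(n[max.z(n) > 3])

# The bound is reached when a single value differs and all others are equal
x <- c(rep(0, 10), 1)
max(abs(x - mean(x)) / sd(x))
```

Here `min.n` comes out as 11, so each team needs at least 11 observations before the 3 * sd threshold can flag anything. With shorter series a lower multiplier would be needed to detect any outliers.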

Source: https://habr.com/ru/post/927602/

