Get variance and standard deviation of two numbers in two different rows / columns with sqlite / PHP

I have a SQLite database with the following structure:

 rowid  ID                 startTimestamp  endTimestamp  subject
 1      00:50:c2:63:10:1a  1000            1090          entrance
 2      00:50:c2:63:10:1a  1100            1270          entrance
 3      00:50:c2:63:10:1a  1300            1310          door1
 4      00:50:c2:63:10:1a  1370            1400          entrance
 ...

I prepared an SQL Fiddle here: http://sqlfiddle.com/#!2/fe8c6/2

With the following query I can get the average gap between the endTimestamp of one row and the startTimestamp of the next row, filtered by subject and ID:

 SELECT id,
        ( MAX(endtimestamp) - MIN(startTimestamp)
          - SUM(endtimestamp - startTimestamp) ) / (COUNT(*) - 1) AS averageDifference
 FROM table1
 WHERE ID = '00:50:c2:63:10:1a'
   AND subject = 'entrance'
 GROUP BY id;

My problem: calculating the average with this query works fine. But how can I get the standard deviation and variance of these differences?

3 answers

First, find the differences of interest by joining the table to itself and grouping by ID. Then compute the average, the variance as V(x) = E(x^2) - (E(x))^2, and the standard deviation as sqrt(V):

 SELECT ID,
        AVG(diff) AS average,
        AVG(diff*diff) - AVG(diff)*AVG(diff) AS variance,
        SQRT(AVG(diff*diff) - AVG(diff)*AVG(diff)) AS stdev
 FROM (SELECT t1.id, t1.endTimestamp,
              min(t2.startTimeStamp) - t1.endTimestamp AS diff
       FROM table1 t1
       INNER JOIN table1 t2
         ON t2.ID = t1.ID
        AND t2.subject = t1.subject
        -- consider only later startTimestamps
        AND t2.startTimestamp > t1.startTimestamp
       WHERE t1.subject = 'entrance'
       GROUP BY t1.id, t1.endTimestamp) AS diffs
 GROUP BY ID
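To check the self-join query against the four sample rows from the question, here is a minimal sketch using Python's sqlite3 module (the question mentions PHP, so treat the host language as a stand-in). One assumption is labeled in the code: depending on how SQLite was built, SQRT may not be available, so the sketch registers its own.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: this SQLite build may lack SQRT, so register one explicitly.
conn.create_function("SQRT", 1, math.sqrt)

conn.execute(
    "CREATE TABLE table1 (ID TEXT, startTimestamp INTEGER, "
    "endTimestamp INTEGER, subject TEXT)"
)
# The four sample rows shown in the question.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        ("00:50:c2:63:10:1a", 1000, 1090, "entrance"),
        ("00:50:c2:63:10:1a", 1100, 1270, "entrance"),
        ("00:50:c2:63:10:1a", 1300, 1310, "door1"),
        ("00:50:c2:63:10:1a", 1370, 1400, "entrance"),
    ],
)

query = """
SELECT ID,
       AVG(diff) AS average,
       AVG(diff*diff) - AVG(diff)*AVG(diff) AS variance,
       SQRT(AVG(diff*diff) - AVG(diff)*AVG(diff)) AS stdev
FROM (SELECT t1.ID, t1.endTimestamp,
             MIN(t2.startTimestamp) - t1.endTimestamp AS diff
      FROM table1 t1
      INNER JOIN table1 t2
        ON t2.ID = t1.ID
       AND t2.subject = t1.subject
       AND t2.startTimestamp > t1.startTimestamp
      WHERE t1.subject = 'entrance'
      GROUP BY t1.ID, t1.endTimestamp) AS diffs
GROUP BY ID
"""
row = conn.execute(query).fetchone()
print(row)
```

For the sample data the two gaps are 10 and 100, so the average is 55, the variance 2025 and the standard deviation 45. Note that this is the population variance (divide by n), not the sample variance (divide by n - 1).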

For formulas that are more complex than a simple summation, you must calculate the actual difference value for each record by looking up the corresponding next start time, for example:

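 SELECT nextStartTimestamp - endTimestamp AS timeDifference
 FROM (SELECT endTimestamp,
              -- start of the next record for the same ID
              (SELECT MIN(startTimestamp)
               FROM table1 AS next
               WHERE next.startTimestamp > table1.startTimestamp
                 AND next.ID = '...'
              ) AS nextStartTimestamp
       FROM table1
       WHERE ID = '...')
 -- the last record has no successor
 WHERE nextStartTimestamp IS NOT NULL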

Then you can use all the difference values to perform the calculations:

 -- * 1.0 avoids integer division in SQLite
 SELECT SUM(timeDifference) * 1.0 / COUNT(*) AS average,
        AVG(timeDifference) AS moreEfficientAverage,
        SUM(timeDifference * timeDifference) * 1.0 / COUNT(*)
          - AVG(timeDifference) * AVG(timeDifference) AS variance
 FROM (SELECT nextStartTimestamp - endTimestamp AS timeDifference
       FROM (SELECT endTimestamp,
                    (SELECT MIN(startTimestamp)
                     FROM table1 AS next
                     WHERE next.startTimestamp > table1.startTimestamp
                       AND next.ID = '...'
                    ) AS nextStartTimestamp
             FROM table1
             WHERE ID = '...')
       WHERE nextStartTimestamp IS NOT NULL)
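Since the question also mentions PHP, another option is to fetch only the difference values and do the statistics in application code. The following is a rough sketch of that idea in Python rather than PHP. The `device_id` variable, the `nextStart` alias and the extra `subject` filter are my assumptions, added so the result matches the 'entrance' rows from the question.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (ID TEXT, startTimestamp INTEGER, "
    "endTimestamp INTEGER, subject TEXT)"
)
# The four sample rows shown in the question.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        ("00:50:c2:63:10:1a", 1000, 1090, "entrance"),
        ("00:50:c2:63:10:1a", 1100, 1270, "entrance"),
        ("00:50:c2:63:10:1a", 1300, 1310, "door1"),
        ("00:50:c2:63:10:1a", 1370, 1400, "entrance"),
    ],
)

device_id = "00:50:c2:63:10:1a"  # hypothetical variable holding the sample ID

# One row per record that has a successor: next start minus this end.
rows = conn.execute(
    """
    SELECT nextStart - endTimestamp
    FROM (SELECT endTimestamp,
                 (SELECT MIN(startTimestamp)
                  FROM table1 AS next
                  WHERE next.startTimestamp > table1.startTimestamp
                    AND next.ID = table1.ID
                    AND next.subject = table1.subject  -- assumption: same subject
                 ) AS nextStart
          FROM table1
          WHERE ID = ? AND subject = 'entrance')
    WHERE nextStart IS NOT NULL
    """,
    (device_id,),
).fetchall()

diffs = [r[0] for r in rows]
mean = statistics.mean(diffs)
variance = statistics.pvariance(diffs)  # population variance (divide by n)
stdev = statistics.pstdev(diffs)        # population standard deviation
print(diffs, mean, variance, stdev)
```

This avoids needing SQRT inside SQLite at all; the trade-off is that every difference value is transferred to the application.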

A few points:

  • Your formula for the average is incorrect; the correct formula is SUM(endtimestamp-starttimestamp)/COUNT(endtimestamp). I have no idea why you have the MIN/MAX terms. COUNT(*) also counts rows containing NULLs and would give an incorrect result.
  • SQLite has an AVG function that computes the average.
  • The variance formula is SUM((endtimestamp-starttimestamp)*(endtimestamp-starttimestamp))/COUNT(endtimestamp) - AVG(endtimestamp-starttimestamp)*AVG(endtimestamp-starttimestamp)
  • The standard deviation is the square root of the variance.
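As a quick sanity check of the points above, the arithmetic can be worked through by hand. This is only an illustrative sketch; the values in `gaps` are the two gaps between consecutive 'entrance' rows in the question's sample data (1100 - 1090 and 1370 - 1270).

```python
import statistics

gaps = [10, 100]  # sample gaps taken from the question's rows

n = len(gaps)
mean = sum(gaps) / n                           # E(x)
mean_of_squares = sum(g * g for g in gaps) / n  # E(x^2)
variance = mean_of_squares - mean * mean        # E(x^2) - (E(x))^2
stdev = variance ** 0.5

print(mean, variance, stdev)
# Cross-check against the standard library's population variance.
print(statistics.pvariance(gaps))
```

The sum of squares has to be divided by the count before the squared mean is subtracted; subtracting from the raw sum would give a much larger number that is not a variance.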

In response to the asker's comment: to calculate the variance of the gaps, each end time must be paired with the following start time via a self-join.

The lack of a row_number function in SQLite makes this a bit inelegant.

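 -- squares are written as products; ^ is not a power operator here
 SELECT id,
        AVG(startTimestamp - endTimestamp) AS mean,
        AVG((startTimestamp - endTimestamp) * (startTimestamp - endTimestamp))
          - AVG(startTimestamp - endTimestamp) * AVG(startTimestamp - endTimestamp) AS variance,
        SQRT(AVG((startTimestamp - endTimestamp) * (startTimestamp - endTimestamp))
          - AVG(startTimestamp - endTimestamp) * AVG(startTimestamp - endTimestamp)) AS stDev
 FROM (SELECT t1.id, t1.endTimestamp,
              MIN(t2.startTimestamp) AS startTimestamp  -- start of the next row
       FROM table1 t1
       INNER JOIN table1 t2
         ON t2.id = t1.id
        AND t1.endTimestamp <= t2.startTimestamp
       GROUP BY t1.id, t1.endTimestamp) t
 GROUP BY id;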
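On newer SQLite versions the self-join can be avoided: window functions were added in SQLite 3.25.0, so LEAD can fetch the next row's start time directly. The sketch below assumes such a version is in use and, as before, registers SQRT from the host language in case the build lacks it. The `subject = 'entrance'` filter is taken from the question, not from this answer's query.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: this SQLite build may lack SQRT, so register one explicitly.
conn.create_function("SQRT", 1, math.sqrt)

conn.execute(
    "CREATE TABLE table1 (ID TEXT, startTimestamp INTEGER, "
    "endTimestamp INTEGER, subject TEXT)"
)
# The four sample rows shown in the question.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        ("00:50:c2:63:10:1a", 1000, 1090, "entrance"),
        ("00:50:c2:63:10:1a", 1100, 1270, "entrance"),
        ("00:50:c2:63:10:1a", 1300, 1310, "door1"),
        ("00:50:c2:63:10:1a", 1370, 1400, "entrance"),
    ],
)

# Requires SQLite 3.25.0 or later (window functions).
query = """
SELECT ID,
       AVG(diff) AS mean,
       AVG(diff*diff) - AVG(diff)*AVG(diff) AS variance,
       SQRT(AVG(diff*diff) - AVG(diff)*AVG(diff)) AS stDev
FROM (SELECT ID,
             LEAD(startTimestamp) OVER (
                 PARTITION BY ID, subject ORDER BY startTimestamp
             ) - endTimestamp AS diff
      FROM table1
      WHERE subject = 'entrance')
WHERE diff IS NOT NULL  -- the last row per group has no successor
GROUP BY ID
"""
row = conn.execute(query).fetchone()
print(row)
```

With the sample rows this should give the same figures as the self-join: mean 55, variance 2025, standard deviation 45.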

See SQL Fiddle
