Aggregate (count) occurrences of values over an arbitrary timeframe

I have a CSV file with timestamps and the types of events that occurred at those times. I want to count the occurrences of each event type in 6-minute intervals.

Input data is as follows:

    date,type
    "Sep 22, 2011 12:54:53.081240000","2"
    "Sep 22, 2011 12:54:53.083493000","2"
    "Sep 22, 2011 12:54:53.084025000","2"
    "Sep 22, 2011 12:54:53.086493000","2"

I load and cure the data using this piece of code:

    > raw_data <- read.csv('input.csv')
    > cured_dates <- c(strptime(raw_data$date, '%b %d, %Y %H:%M:%S', tz="CEST"))
    > cured_data <- data.frame(cured_dates, c(raw_data$type))
    > colnames(cured_data) <- c('date', 'type')

After curing, the data looks like this:

    > head(cured_data)
                     date type
    1 2011-09-22 14:54:53    2
    2 2011-09-22 14:54:53    2
    3 2011-09-22 14:54:53    2
    4 2011-09-22 14:54:53    2
    5 2011-09-22 14:54:53    1
    6 2011-09-22 14:54:53    1

I have read a lot of examples for xts and zoo, but for some reason I can't get the hang of it. The output should look something like this:

    date                     type count
    2011-09-22 14:54:00 CEST    1    11
    2011-09-22 14:54:00 CEST    2    19
    2011-09-22 15:00:00 CEST    1     9
    2011-09-22 15:00:00 CEST    2    12
    2011-09-22 15:06:00 CEST    1    23
    2011-09-22 15:06:00 CEST    2    18

zoo's aggregate function looks promising. I found this code snippet:

    # aggregate POSIXct seconds data every 10 minutes
    tt <- seq(10, 2000, 10)
    x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
    aggregate(x, time(x) - as.numeric(time(x)) %% 600, mean)

Now I'm just wondering how I can apply this in my use case.

Naively, I tried:

    > zoo_data <- zoo(cured_data$type, structure(cured_data$time, class = c("POSIXt", "POSIXct")))
    > aggr_data = aggregate(zoo_data$type, time(zoo_data$time), - as.numeric(time(zoo_data$time)) %% 360, count)
    Error in `$.zoo`(zoo_data, type) : not possible for univariate zoo series

I must admit that I'm not very confident with R, but I'm trying. :-)

I'm a little lost. Can someone point me in the right direction?

Thanks a lot! Cheers, Alex.

Here's the dput output for a small subset of my data. The data itself is about 80 million rows.

    structure(list(date = structure(c(1316697885, 1316697885, 1316697885,
        1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
        1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
        1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
        1316697885, 1316697885, 1316697885, 1316697885, 1316697885),
        class = c("POSIXct", "POSIXt"), tzone = ""),
        type = c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
        1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L)),
        .Names = c("date", "type"), row.names = c(NA, -23L),
        class = "data.frame")
2 answers

We can read it with read.csv, convert the first column to a date-time binned into 6-minute intervals, and add a dummy column of 1s. Then re-read it using read.zoo, splitting on type and aggregating over the dummy column:

    # test data
    Lines <- 'date,type
    "Sep 22, 2011 12:54:53.081240000","2"
    "Sep 22, 2011 12:54:53.083493000","2"
    "Sep 22, 2011 12:54:53.084025000","2"
    "Sep 22, 2011 12:54:53.086493000","2"
    "Sep 22, 2011 12:54:53.081240000","3"
    "Sep 22, 2011 12:54:53.083493000","3"
    "Sep 22, 2011 12:54:53.084025000","3"
    "Sep 22, 2011 12:54:53.086493000","4"'

    library(zoo)
    library(chron)

    # convert to chron and bin into 6 minute bins using trunc
    # Also add a dummy column of 1
    # and remove any leading space (removing space not needed if there is none)
    DF <- read.csv(textConnection(Lines), as.is = TRUE)
    fmt <- '%b %d, %Y %H:%M:%S'
    DF <- transform(DF, dummy = 1,
        date = trunc(as.chron(sub("^ *", "", date), format = fmt), "00:06:00"))

    # split and aggregate
    z <- read.zoo(DF, split = 2, aggregate = length)

With the above test data, the result looks like this:

    > z
                        2 3 4
    (09/22/11 12:54:00) 4 3 1

Note that the above is in wide form, since that form is a time series, whereas the long form is not. There is one column for each type. In our test data we had types 2, 3, and 4, so there are three columns.
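To make the wide versus long distinction concrete, here is a minimal sketch in base R only. It deliberately swaps chron and read.zoo for plain POSIXct arithmetic and table(), so it is not the method used in this answer; the sample timestamps, the UTC time zone, and the names events, bin_start, wide, and long are all assumptions made for illustration.

```r
# Hypothetical sample: three events in one 6-minute bin, two in the next
events <- data.frame(
  date = as.POSIXct(c("2011-09-22 12:54:53", "2011-09-22 12:55:10",
                      "2011-09-22 12:56:00", "2011-09-22 13:01:00",
                      "2011-09-22 13:02:30"), tz = "UTC"),
  type = c(2L, 2L, 3L, 2L, 4L)
)

# Floor each timestamp to the start of its 6-minute (360 s) bin
bin_start <- as.POSIXct(floor(as.numeric(events$date) / 360) * 360,
                        origin = "1970-01-01", tz = "UTC")

# Wide form: one row per bin, one column per type
wide <- table(bin = format(bin_start, "%Y-%m-%d %H:%M:%S"),
              type = events$type)

# Long form: one row per (bin, type) pair, dropping empty combinations
long <- as.data.frame(wide, responseName = "count", stringsAsFactors = FALSE)
long <- long[long$count > 0, ]
```

Printing wide gives one column per type, as in the zoo result above, while long has the date, type, count layout asked for in the question.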

(We used chron here because its trunc method works well for binning into 6-minute groups. chron does not support time zones, which can be an advantage since you cannot make any of the many possible time zone errors. If you want POSIXct anyway, convert it at the end, e.g. time(z) <- as.POSIXct(paste(as.Date.dates(time(z)), times(time(z)) %% 1)). This expression is shown in a table in one of the R News 4/1 articles, except that we used as.Date.dates instead of just as.Date to work around a bug that seems to have been introduced since then. We could also use time(z) <- as.POSIXct(time(z)), but that would result in a different time zone.)

EDIT:

The original solution binned into dates, but then I noticed that you want to bin into 6-minute periods, so the solution was revised.

EDIT:

Revised based on comment.


You are almost all the way there. All you need to do is create a zoo-ish version of that data and map it to the aggregate.zoo code. Since you want to categorize by both time and type, your second argument to aggregate.zoo must be a bit more complex, and since you want counts rather than means, you should use length(). I don't think count is a base R or zoo function, and the only count function I see in my workspace comes from pkg:plyr, so I don't know how well it would play with aggregate.zoo. length works as most people expect for vectors, but it often surprises people when working with data.frames. If you do not get what you want with length, then you should see if NROW works instead (with your data layout they both succeed). With the new data object, the type argument has to come first. It also turns out that aggregate/zoo only handles single-category classifiers, so you need to put in as.vector to remove its zoo-ness:

    with(cured_data,
         aggregate(as.vector(x),
                   list(type = type,
                        interval = as.factor(time(x) - as.numeric(time(x)) %% 360)),
                   FUN = NROW)
    )
    #              interval  x
    #1 2011-09-22 09:24:00 12
    #2 2011-09-22 09:24:00 11
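The remark about length and NROW can be checked with a small base R sketch. The data frame df and the result names are hypothetical, chosen only for illustration.

```r
# Hypothetical data frame: 5 rows, 2 columns
df <- data.frame(a = 1:5, b = letters[1:5])

len_vec  <- length(df$a)  # 5: a vector's length is its number of elements
len_df   <- length(df)    # 2: a data.frame is a list of columns
nrow_vec <- NROW(df$a)    # 5: NROW treats a vector as a one-column matrix
nrow_df  <- NROW(df)      # 5: number of rows
```

Inside aggregate the function receives one vector per group, which is why both length and NROW give the same counts in this case.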

This is an example modified from where you got the code (an example here on SO by WizaRd Dirk): Aggregate (count) values over arbitrary timeframes

    tt <- seq(10, 2000, 10)
    x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
    aggregate(as.vector(x),
              by = list(cat = as.factor(x),
                        tms = as.factor(index(x) - as.numeric(index(x)) %% 600)),
              length)

       cat                 tms  x
    1    1 1969-12-31 19:00:00 26
    2    2 1969-12-31 19:00:00 22
    3    3 1969-12-31 19:00:00 11
    4    1 1969-12-31 19:10:00 17
    5    2 1969-12-31 19:10:00 28
    6    3 1969-12-31 19:10:00 15
    7    1 1969-12-31 19:20:00 17
    8    2 1969-12-31 19:20:00 16
    9    3 1969-12-31 19:20:00 27
    10   1 1969-12-31 19:30:00  8
    11   2 1969-12-31 19:30:00  4
    12   3 1969-12-31 19:30:00  9
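The output above has only three categories, which suggests the values being counted came from something like 1:3 rather than from tt itself. The following is a minimal, self-contained sketch of the same counting idea in base R without zoo, under that assumption. The seed, the sampled type vector, the UTC time zone, and the names stamps, interval, and counts are all illustrative and not part of the original answer.

```r
set.seed(1)                      # hypothetical seed, only for reproducibility
tt <- seq(10, 2000, 10)          # 200 timestamps, 10 seconds apart
stamps <- as.POSIXct(tt, origin = "1970-01-01", tz = "UTC")
type <- sample(1:3, length(tt), replace = TRUE)  # assumed: three event types

# Start of the 600-second interval each timestamp falls into
interval <- stamps - as.numeric(stamps) %% 600

# Count rows per (type, interval) pair; length does the counting
counts <- aggregate(type,
                    by = list(cat = type,
                              tms = format(interval, "%Y-%m-%d %H:%M:%S")),
                    FUN = length)
```

The result is in long form with columns cat, tms, and x, where x holds the count. For the 6-minute bins in the question, 600 would become 360.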