Splitting irregular time series into regular monthly averages - R

To establish seasonal effects on energy use, I need to align the energy use information I get from a billing database with monthly temperatures.

I am working with a billing dataset in which bills have varying lengths and varying start and end dates, and I would like to get the average amount per day for each account within each calendar month. For example, the billing database looks like this:

      acct amount      begin        end days
    1 2242  11349 2009-10-06 2009-11-04   29
    2 2242  12252 2009-11-04 2009-12-04   30
    3 2242  21774 2009-12-04 2010-01-08   35
    4 2242  18293 2010-01-08 2010-02-05   28
    5 2243  27217 2009-10-06 2009-11-04   29
    6 2243    117 2009-11-04 2009-12-04   30
    7 2243  14543 2009-12-04 2010-01-08   35

I would like to figure out how to convert these somewhat irregular time series (one per account) into the average amount per day within each calendar month that a bill spans, so that:

      acct amount      begin        end days avgamtpday
    1 2242  11349 2009-10-01 2009-10-31   31          X
    2 2242  12252 2009-11-01 2009-11-30   30          X
    3 2242  21774 2009-12-01 2009-12-31   31          X
    4 2242  18293 2010-01-01 2010-01-31   31          X
    4 2242  18293 2010-02-01 2010-02-28   28          X
    5 2243  27217 2009-10-01 2009-10-31   31          X
    6 2243    117 2009-11-01 2009-11-30   30          X
    7 2243  14543 2009-12-01 2009-12-31   31          X
    7 2243  14543 2010-01-01 2010-01-31   31          X

I am fairly agnostic about which tool does this, since I only need to do it once.

An added wrinkle is that the table has about 150,000 rows, which is not very large by most standards but is large enough to make a loop-based solution in R difficult. I have explored the zoo, xts and tempdisagg packages in R. I started writing a really ugly loop that would split up each bill, create one row for each month the bill covers, and then aggregate to sum by account and month, but honestly I could not figure out how to do this efficiently.

In MySQL, I tried this:

    create or replace view v3 as select 1 n union all select 1 union all select 1;
    create or replace view v as select 1 n from v3 a, v3 b union all select 1;
    set @n = 0;
    drop table if exists calendar;
    create table calendar (dt date primary key);
    insert into calendar
    select cast('2008-1-1' + interval @n:=@n+1 day as date) as dt
    from v a, v b, v c, v d, v e, v;

    select acct, amount, begin, end, billAmtPerDay,
           sum(billAmtPerDay) MonthAmt,
           count(*) Days,
           sum(billAmtPerDay)/count(*) AverageAmtPerDay,
           year(dt), month(dt)
    from (select *, amount/days billAmtPerDay
          from bills b
          inner join calendar c
            on dt between begin and end and begin <> dt) x
    group by acct, amount, begin, end, billAmtPerDay, year(dt), month(dt);

But for reasons I don't understand, my server does not like this query and hangs on the inner join, even when I stage the calculations separately. I am investigating whether it is hitting temporary memory limits.

Thanks!

2 answers

Here is a start using data.table:

    billdata <- read.table(text=" acct amount begin end days
    1 2242 11349 2009-10-06 2009-11-04 29
    2 2242 12252 2009-11-04 2009-12-04 30
    3 2242 21774 2009-12-04 2010-01-08 35
    4 2242 18293 2010-01-08 2010-02-05 28
    5 2243 27217 2009-10-06 2009-11-04 29
    6 2243 117 2009-11-04 2009-12-04 30
    7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)
    require(data.table)
    DT = as.data.table(billdata)

First change the type of the begin and end columns to dates. Unlike data.frame, this does not copy the entire data set.

    DT[,begin:=as.Date(begin)]
    DT[,end:=as.Date(end)]

Then find the overall time span, find the prevailing bill for each day, and aggregate.

    alldays = DT[,seq(min(begin),max(end),by="day")]
    setkey(DT, acct, begin)
    DT[CJ(unique(acct),alldays),
       mean(amount/days,na.rm=TRUE),
       by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]

        acct   month        V1
     1: 2242 2009-10 391.34483
     2: 2242 2009-11 406.69448
     3: 2242 2009-12 601.43226
     4: 2242 2010-01 646.27465
     5: 2242 2010-02 653.32143
     6: 2243 2009-10 938.51724
     7: 2243 2009-11  97.36172
     8: 2243 2009-12 375.68065
     9: 2243 2010-01 415.51429
    10: 2243 2010-02 415.51429
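To make the "prevailing bill" idea concrete, here is a minimal sketch of what `roll=TRUE` does, using two bills from the question. The `lookup` table and its three dates are made up for illustration, and the `on=` argument assumes a reasonably recent data.table (the answer itself relies on `setkey` instead).

```r
library(data.table)

# Two bills for one account, taken from the question's sample data.
bills <- data.table(
  acct   = c(2242L, 2242L),
  begin  = as.Date(c("2009-10-06", "2009-11-04")),
  amount = c(11349, 12252),
  days   = c(29, 30)
)

# Hypothetical lookup dates: inside the first bill, exactly on the
# boundary between the two bills, and inside the second bill.
lookup <- data.table(
  acct  = 2242L,
  begin = as.Date(c("2009-10-20", "2009-11-04", "2009-11-10"))
)

# roll = TRUE matches each lookup date to the latest bill that began
# on or before it, i.e. the bill prevailing on that day.
matched <- bills[lookup, on = c("acct", "begin"), roll = TRUE]
matched[, rate := amount / days]
print(matched)
```

Note that the boundary date 2009-11-04 is matched to the bill that begins on that day, not to the one that ends on it, so each day is counted once.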

I think you will find that this prevailing-bill join logic is quite cumbersome in SQL, and slower.

I call this a hint because it is not quite right. Notice that row 10 repeats the value of row 9, because account 2243 does not extend into 2010-02, unlike account 2242. To finish it off, you could rbind a final row for each account and use rolltolast instead of roll. Or perhaps create alldays per account rather than across all accounts.

See if the speed is acceptable on the above, and we can go from there.

You will probably hit a bug in 1.8.2 that has been fixed in 1.8.3. I am using v1.8.3.

"Internal" error message when combining a join containing missing groups and group by is fixed, #2162. For example: X[Y, .N, by=NonJoinColumn] where Y contains some rows that do not match X. This bug could also result in a segfault.

Let me know, and we can either work around it or you can upgrade to 1.8.3 from R-Forge.

Btw, good example data. This expedited the response.


Here is the full answer alluded to above. It is a bit tricky, I have to admit, as it combines several features of data.table. It should work in 1.8.2 as it happens, but I have only tested it in 1.8.3.

    DT[ setkey(DT[,seq(begin[1],last(end),by="day"),by=acct]),
        mean(amount/days,na.rm=TRUE),
        by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]

       acct   month        V1
    1: 2242 2009-10 391.34483
    2: 2242 2009-11 406.69448
    3: 2242 2009-12 601.43226
    4: 2242 2010-01 646.27465
    5: 2242 2010-02 653.32143
    6: 2243 2009-10 938.51724
    7: 2243 2009-11  97.36172
    8: 2243 2009-12 375.68065
    9: 2243 2010-01 415.51429

Here is one way to do this:

    billdata <- read.table(text=" acct amount begin end days
    1 2242 11349 2009-10-06 2009-11-04 29
    2 2242 12252 2009-11-04 2009-12-04 30
    3 2242 21774 2009-12-04 2010-01-08 35
    4 2242 18293 2010-01-08 2010-02-05 28
    5 2243 27217 2009-10-06 2009-11-04 29
    6 2243 117 2009-11-04 2009-12-04 30
    7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)

    # First, declare your columns "begin" and "end" as dates:
    strptime(billdata$begin, format="%Y-%m-%d") -> billdata$begin
    strptime(billdata$end, format="%Y-%m-%d") -> billdata$end

    # Then create a column with the amount per day on the billing period:
    billdata$avg_on_period<-billdata$amount/billdata$days

    # Then split it into days:
    temp <- data.frame(acct=c(),month=c(),day=c(), avg=c())
    for(i in 1:nrow(billdata)){
        X <- billdata[i,]
        seq(X$begin,X$end,by="day") -> list_day
        rbind(temp, data.frame(acct=rep(X$acct,length(list_day)),
                               month=format(list_day, "%Y-%m"),
                               day=format(list_day, "%d"),
                               avg=rep(X$avg_on_period, length(list_day)))) -> temp
    }

    # And finally merge the different days of the months together:
    output<-aggregate(temp$avg, by=list(temp$month,temp$acct), FUN=mean)
    colnames(output) <- c("Month","Account","Average per day")
    output

        Month Account Average per day
    1 2009-10    2242       391.34483
    2 2009-11    2242       406.69448
    3 2009-12    2242       595.40000
    4 2010-01    2242       645.51964
    5 2010-02    2242       653.32143
    6 2009-10    2243       938.51724
    7 2009-11    2243        97.36172
    8 2009-12    2243       364.06250
    9 2010-01    2243       415.51429
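The December figure for account 2242 here (595.40) differs from the data.table answer (601.43). A likely explanation, worked through below as plain arithmetic, is the treatment of boundary days: the loop expands each bill from begin to end inclusive, so 2009-12-04 appears in two bills, whereas the rolling join assigns each day to exactly one bill. The variable names are made up for this illustration.

```r
# Daily rates of the two bills that touch December 2009 for account 2242.
rate_nov_bill <- 12252 / 30   # bill from 2009-11-04 to 2009-12-04
rate_dec_bill <- 21774 / 35   # bill from 2009-12-04 to 2010-01-08

# Rolling join: Dec 1-3 take the earlier bill, Dec 4-31 the later one,
# giving 31 daily values.
roll_avg <- (3 * rate_nov_bill + 28 * rate_dec_bill) / 31

# Inclusive loop: Dec 4 is generated by both bills, giving 32 daily values.
loop_avg <- (4 * rate_nov_bill + 28 * rate_dec_bill) / 32

round(c(rolling_join = roll_avg, inclusive_loop = loop_avg), 5)
```

If each day should count once, one option in the loop version would be to expand each bill only up to the day before its end date, assuming that the end date really belongs to the next bill.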

Source: https://habr.com/ru/post/926072/

