I was happy to work with this code:
    z=lapply(filename_list, function(fname){
      read.zoo(file=fname,header=TRUE,sep = ",",tz = "")
      })
    xts( do.call(rbind,z) )
until some dirty data appeared at the end of one file:
                            Open     High      Low    Close Volume
    2011-09-20 21:00:00 1.370105 1.370105 1.370105 1.370105      1
and this is at the beginning of the following file:
                            Open     High      Low    Close Volume
    2011-09-20 21:00:00 1.370105 1.371045 1.369685   1.3702   2230
So rbind.zoo complains about a duplicate.
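For anyone who wants to reproduce the complaint without the files, here is a minimal sketch. The two series are a made-up miniature (Volume column only) of the data shown above, and the helper `idx` is a hypothetical convenience, not part of zoo.

```r
library(zoo)

# Hypothetical helper: seconds since the epoch -> POSIXct in UTC.
idx <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# The last bar of `a` and the first bar of `b` share one timestamp.
a <- zoo(c(5498, 3244, 1),
         order.by = idx(c(1316512800, 1316516400, 1316520000)))
b <- zoo(c(2230, 2909, 2782),
         order.by = idx(c(1316520000, 1316523600, 1316527200)))

# Capture the condition instead of letting it abort the script.
res <- tryCatch(rbind(a, b), error = function(e) e)
print(inherits(res, "error"))
```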
I cannot use something like:
    y <- x[ ! duplicated( index(x) ), ]
since the duplicates are in different zoo objects inside the list. And I cannot use aggregate, as suggested here, because it is a list of zoo objects, not one large zoo object. And I cannot get one large object because of the duplicates. Catch-22.
So, when the going gets tough, the tough hack together some loops (excuse the prints and the stop, as this still doesn't work):
    indexes <- do.call("c", unname(lapply(z, index)))
    dups=duplicated(indexes)
    if(any(dups)){
      duplicate_timestamps=indexes[dups]
      for(tix in 1:length(duplicate_timestamps)){
        t=duplicate_timestamps[tix]
        print("We have a duplicate:");print(t)
        for(zix in 1:length(z)){
          if(t %in% index(z[[zix]])){
            print(z[[zix]][t])
            if(z[[zix]][t]$Volume==1){
              print("-->Deleting this one");
              z[[zix]][t]=NULL  # this assignment is the part that fails
            }
          }
        }
      }
    }
The bit I am stuck on is that assigning NULL to a zoo row does not delete it (Error in NextMethod("[<-") : replacement has length zero). OK, so I make a filtered copy instead, without the offending element... but I cannot get that to work either:
    > z[[zix]][!t,]
    Error in Ops.POSIXt(t) : unary '!' not defined for "POSIXt" objects
    > z[[zix]][-t,]
    Error in `-.POSIXt`(t) : unary '-' is not defined for "POSIXt" objects
P.S. While higher-level solutions to my real problem of "duplicate rows across a list of zoo objects" are very welcome, the question here is specifically how to delete a row from a zoo object given a POSIXt index object.
A small bit of test data:
list(structure(c(1.36864, 1.367045, 1.370105, 1.36928, 1.37039, 1.370105, 1.36604, 1.36676, 1.370105, 1.367065, 1.37009, 1.370105, 5498, 3244, 1), .Dim = c(3L, 5L), .Dimnames = list(NULL, c("Open", "High", "Low", "Close", "Volume")), index = structure(c(1316512800, 1316516400, 1316520000), class = c("POSIXct", "POSIXt"), tzone = ""), class = "zoo"), structure(c(1.370105, 1.370115, 1.36913, 1.371045, 1.37023, 1.37075, 1.369685, 1.36847, 1.367885, 1.3702, 1.36917, 1.37061, 2230, 2909, 2782), .Dim = c(3L, 5L), .Dimnames = list(NULL, c("Open", "High", "Low", "Close", "Volume")), index = structure(c(1316520000, 1316523600, 1316527200), class = c("POSIXct", "POSIXt"), tzone = ""), class = "zoo"))
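For reference, one approach that avoids negating the timestamp itself is to compare the index against it and negate the resulting logical vector. The sketch below is a minimal illustration under assumptions: `zz` is a hypothetical two-column stand-in for one element of `z` (values taken from the test data above), and the index is built in UTC rather than the local time zone.

```r
library(zoo)

# Hypothetical stand-in for one element of the list `z`:
# three hourly bars, the last of which is the Volume == 1 row.
idx <- as.POSIXct(c(1316512800, 1316516400, 1316520000),
                  origin = "1970-01-01", tz = "UTC")
zz <- zoo(cbind(Open   = c(1.36864, 1.367045, 1.370105),
                Volume = c(5498, 3244, 1)),
          order.by = idx)

t <- idx[3]  # the timestamp to drop

# `!t` and `-t` fail because they try to negate the POSIXct value.
# Matching the index against `t` yields a plain logical vector,
# which can be negated and used to select the rows to keep.
keep <- !(index(zz) %in% t)
zz_clean <- zz[keep, ]

print(nrow(zz_clean))
```

Inside the loop above, the same idea would presumably read `z[[zix]] <- z[[zix]][!(index(z[[zix]]) %in% t), ]`. Because `%in%` accepts a vector on the right-hand side, several timestamps can be dropped in one step.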
UPDATE: Thanks to G. Grothendieck for the row-deletion solution. In the actual code, I followed the advice of Joshua and GSee and built a list of xts objects instead of a list of zoo objects. So my code has become:
    z=lapply(filename_list, function(fname){
      xts(read.zoo(file=fname,header=TRUE,sep = ",",tz = ""))
      })
    x=do.call.rbind(z)
(As an aside, note the call to do.call.rbind. This is because rbind.xts has some serious memory issues. See https://stackoverflow.com/a/364660/ ... )
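Since do.call.rbind is not a function exported by xts, it has to be defined somewhere. The following is only a sketch of what such a helper might look like, assuming it combines the list pairwise rather than in a single rbind call; the body is an assumption and not necessarily the code from the linked answer.

```r
library(xts)

# Hypothetical pairwise-merging helper. The name matches the call above,
# but the implementation is an assumption. Expects a non-empty list.
do.call.rbind <- function(lst) {
  while (length(lst) > 1) {
    starts <- seq(from = 1, to = length(lst), by = 2)
    lst <- lapply(starts, function(i) {
      if (i == length(lst)) {
        lst[[i]]                       # odd one out: carry forward unchanged
      } else {
        rbind(lst[[i]], lst[[i + 1]])  # merge a neighbouring pair
      }
    })
  }
  lst[[1]]
}

# Made-up example data: three small, non-overlapping xts pieces.
mk <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
parts <- list(
  xts(1:2, mk(c(0, 3600))),
  xts(3:4, mk(c(7200, 10800))),
  xts(5L,  mk(14400))
)
x_demo <- do.call.rbind(parts)
print(nrow(x_demo))
```

Each pass halves the length of the list, so no single rbind call ever receives more than two objects.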
Then I remove the duplicates as a post-processing step:
    dups=duplicated(index(x))
    if(any(dups)){
      duplicate_timestamps=index(x)[dups]
      to_delete=x[ (index(x) %in% duplicate_timestamps) & x$Volume<=1]
      if(nrow(to_delete)>0){