Search for a maximum length subvector containing a small fraction of 0

I have a vector that contains a sequence of 1 and 0. Suppose it has a length of 166 and it

y <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1, 1,1,1,1,1,0,1,1,0,1,0,1,0,0,0,0,0,1,0,0,0,1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,0,1,1,0,1,1,1,0,0,0,0,0,1,1,1,1) 

Now I want to extract the LONGEST POSSIBLE sub vector from the above vector so that it satisfies two properties

(1) The sub-vector must start at 1 and end at 1.

(2) It can contain up to 5% of zeros of the total length of a subvector.

I started with the rle function. It calculates 1 and 0 at each step. So it will be like

 z <- rle(y) d <- data.frame(z$values, z$lengths) colnames(d) <- c("value", "length") 

It gives me

 > d value length 1 1 22 2 0 1 3 1 13 4 0 1 5 1 2 6 0 1 7 1 1 8 0 1 9 1 1 10 0 5 11 1 1 12 0 3 13 1 2 14 0 1 15 1 1 16 0 1 17 1 74 18 0 2 19 1 17 20 0 1 21 1 2 22 0 1 23 1 3 24 0 5 25 1 4 

In this case, 74 + 2 + 17 + 1 + 2 + 3 = 99 is a necessary subsequence, since it contains 2 + 1 + 1 = 4 zeros, which is less than 5% of 99. If we move forward and the sequence is 99 + 5 + 4 = 108, and the zeros will be 4 + 5 = 9, which will be more than 5% of 108.

+6
source share
1 answer

I think you are very close when calculating the encoding of this format. It remains only to consider all pairs of runs 1 and select the pair that has the longest length and matches the rule of "not more than 5% zeros." This can be done in a completely vectorized way, using combn to calculate all pairs of runs 1 and cumsum to get the lengths of runs from the rle output:

 ones <- which(d$value == 1) # pairs holds pairs of rows in d that correspond to runs of 1's if (length(ones) >= 2) { pairs <- rbind(t(combn(ones, 2)), cbind(ones, ones)) } else if (length(ones) == 1) { pairs <- cbind(ones, ones) } # Taking cumulative sums of the run lengths enables vectorized computation of the lengths # of each run in the "pairs" matrix cs <- cumsum(d$length) pair.length <- cs[pairs[,2]] - cs[pairs[,1]] + d$length[pairs[,1]] cs0 <- cumsum(d$length * (d$value == 0)) pair.num0 <- cs0[pairs[,2]] - cs0[pairs[,1]] # Multiple the length of a pair by an indicator for whether it valid and take the max selected <- which.max(pair.length * ((pair.num0 / pair.length) <= 0.05)) d[pairs[selected,1]:pairs[selected,2],] # value length # 15 1 1 # 16 0 1 # 17 1 74 # 18 0 2 # 19 1 17 # 20 0 1 # 21 1 2 # 22 0 1 # 23 1 3 

In fact, we found a subvector that is slightly longer than the one found by the OP: it has 102 elements and five 0 (4.90%).

+4
source

All Articles