Find the longest sequence of nonannes, but allow a threshold

Is it possible to find nonzero values ​​of a vector, but also allow n the number of nans? For example, if I have the following data:

X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24]; %// input array thres = 1; % this is the number of nans to allow 

and I would like to keep only the longest sequence of values ​​with non nans, but allow nn the number of nans in the data. So, say I'm ready to save 1 nand, I will have a way out

 X_out = [8 10 11 nan 9 14 6 1 4 23 24]; %// output array 

Thats is, the two nans at the beginning were deleted due to the fact that they exceed the values ​​in 'thres' above, but the third nan by itself, so it can be stored in the data. I would like to develop a method where thres can be defined as any value.

I can find non nan values ​​with

 Y = ~isnan(X); %// convert to zeros and ones 

Any ideas?

+8
matlab
source share
4 answers

To find the longest sequence containing no more than threshold times NaN , we must find the beginning and end of the specified sequence.

To create all possible starting points, we can use hankel :

 H = hankel(X) H = 18 3 NaN NaN 8 10 11 NaN 9 14 6 1 4 23 24 3 NaN NaN 8 10 11 NaN 9 14 6 1 4 23 24 0 NaN NaN 8 10 11 NaN 9 14 6 1 4 23 24 0 0 NaN 8 10 11 NaN 9 14 6 1 4 23 24 0 0 0 8 10 11 NaN 9 14 6 1 4 23 24 0 0 0 0 10 11 NaN 9 14 6 1 4 23 24 0 0 0 0 0 11 NaN 9 14 6 1 4 23 24 0 0 0 0 0 0 NaN 9 14 6 1 4 23 24 0 0 0 0 0 0 0 9 14 6 1 4 23 24 0 0 0 0 0 0 0 0 14 6 1 4 23 24 0 0 0 0 0 0 0 0 0 6 1 4 23 24 0 0 0 0 0 0 0 0 0 0 1 4 23 24 0 0 0 0 0 0 0 0 0 0 0 4 23 24 0 0 0 0 0 0 0 0 0 0 0 0 23 24 0 0 0 0 0 0 0 0 0 0 0 0 0 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Now we need to find the last valid element in each row. For this we can use cumsum :

 C = cumsum(isnan(H),2) C = 0 0 1 2 2 2 2 3 3 3 3 3 3 3 3 0 1 2 2 2 2 3 3 3 3 3 3 3 3 3 1 2 2 2 2 3 3 3 3 3 3 3 3 3 3 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

The endpoint for each line is the one where the corresponding element in C at most threshold :

 threshold = 1; T = C<=threshold T = 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

The last valid item was found with:

 [~,idx]=sort(T,2); lastone=idx(:,end) lastone = 3 2 1 4 15 15 15 15 15 15 15 15 15 15 15 

We need to make sure that the actual length of each line is respected:

 lengths = length(X):-1:1; real_length = min(lastone,lengths); [max_length,max_idx] = max(real_length) max_length = 11 max_idx = 5 

If there are more sequences of equal maximum length, we just take the first and display it:

 selected_max_idx = max_idx(1); H(selected_max_idx, 1:max_length) ans = 8 10 11 NaN 9 14 6 1 4 23 24 

full script

 X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24]; H = hankel(X); C = cumsum(isnan(H),2); threshold = 1; T = C<=threshold; [~,idx]=sort(T,2); lastone=idx(:,end)'; lengths = length(X):-1:1; real_length = min(lastone,lengths); [max_length,max_idx] = max(real_length); selected_max_idx = max_idx(1); H(selected_max_idx, 1:max_length) 
+8
source share

Approach 1: Convolution

One possible approach is the convolution Y = double(~isnan(X)); with window n , where n decreases until an acceptable subsequence is found. “Acceptable” means that the subsequence contains at least n-thres units, i.e. Convolution gives at least n-thres .

 Y = double(~isnan(X)); for n = numel(Y):-1:1 %// try all possible sequence lengths w = find(conv(Y,ones(1,n),'valid')>=n-thres); %// is there any acceptable subsequence? if ~isempty(w) break end end result = X(w:w+n-1); 

Aproach 2: Total

Escorting Y with window n (as in approach 1) is equivalent to calculating the total sum of Y , and then taking into account the difference n . This is more efficient in terms of the number of operations.

 Y = double(~isnan(X)); Z = cumsum(Y); for n = numel(Y):-1:1 w = find([Z(n) Z(n+1:end)-Z(1:end-n)]>=n-thres); if ~isempty(w) break end end result = X(w:w+n-1); 

Approach 3: 2D convolution

This essentially computes all iterations of the loop in approach 1 at once.

 Y = double(~isnan(X)); z = conv2(Y, tril(ones(numel(Y)))); [nn, ww] = find(bsxfun(@ge, z, (1:numel(Y)).'-thres)); %' [n, ind] = max(nn); w = ww(ind)-n+1; result = X(w:w+n-1); 
+5
source share

Try my favorite tool: RLE. Matlab does not have a direct function, so use my "seqle" to exchange centerpieces. By default, Seqle returns the reverse encoding. So:

 >> foo = [ nan 1 2 3 nan nan 4 5 6 nan 5 5 5 ]; >> seqle(isnan(foo)) ans = run: [1 3 2 3 1 3] val: [1 0 1 0 1 0] 

"Run" indicates the length of the current run; "val" indicates the value. In this case, val==1 indicates that nan and val==0 indicate numeric values. You can see that it will be quite easy to extract the longest sequence of "run" values ​​satisfying the condition val==0 | run < 2 val==0 | run < 2 to get no more than one nan per line. Then just take the cumulative indices of this run and that subset of foo you want.

EDIT: Unfortunately, what is trivial to find by eye may not be so easy to extract from the code. I suspect that there is a much faster way to use indexes identified by longrun to get the desired subsequence.

 >> foo = [ nan 1 2 3 nan nan 4 5 6 nan nan 5 5 nan 5 nan 4 7 4 nan ]; >> sfoo= seqle(isnan(foo)) sfoo = run: [1 3 2 3 2 2 1 1 1 3 1] val: [1 0 1 0 1 0 1 0 1 0 1] >> longrun = sfoo.run<2 |sfoo.val==0 longlong = run: [2 1 1 1 6] val: [1 0 1 0 1] % longrun identifies which indices might be part of a run % longlong identifies the longest sequence of valid run >> longlong = seqle(longrun) >> lfoo = find(sfoo.run<2 |sfoo.val==0); >> sbar = seqle(lfoo,1); >> maxind=find(sbar.run==max(sbar.run),1,'first'); >> getlfoo = lfoo( sum(sbar.run(1:(maxind-1)))+1 ); % first value in longrun , which is part of max run % getbar finds end of run indices >> getbar = getlfoo:(getlfoo+sbar.run(maxind)-1); >> getsbar = sfoo.run(getbar); % retrieve indices of input vector >> startit = sum(sfoo.run(1:(getbar(1)-1))) +1; >> endit = startit+ ((sum(sfoo.run(getbar(1):getbar(end ) ) ) ) )-1; >> therun = foo( startit:endit ) therun = 5 5 NaN 5 NaN 4 7 4 NaN 
+1
source share

Hmmm, who doesn't like problems, my solution is not as good as mss, but it is an alternative.

 X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24]; %// input array thresh =1; X(isnan(X))= 0 ; for i = 1:thresh Y(i,:) = circshift(X',-i); %//circular shift end 

For some reason, the Matlab "'" conversion makes formatting weird.

 D = X + sum(Y,1); Discard = find(D==0)+thresh; %//give you the index of the part that needs to be discarded chunk = find(X==0); %//Segment the Vector into segments delimited by NaNs seriesOfZero = circshift(chunk',-1)' - chunk; bigchunk =[1 chunk( find(seriesOfZero ~= 1)) size(X,2)]; %//Convert series of NaNs into 1 chunk [values,DiscardChunk] = intersect(bigchunk,Discard); DiscardChunk = sort(DiscardChunk,'descend') for t = 1:size(DiscardChunk,2) X(bigchunk(DiscardChunk(t)-1):bigchunk(DiscardChunk(t))) = []; %//Discard the data end X(X == 0) = NaN %//End of Code 

8 10 11 NaN 9 14 6 1 4 23 24

When: X = [18 3 nan nan nan 8 10 11 nan nan 9 14 6 1 nan nan nan 4 23 24]; % // input array thresh = 2;

8 10 11 NaN 4 23 24

0
source share

All Articles