Generate random subscriptions from an ordered list that supports ordering

Question

Generate random subscriptions from an ordered list that supports ordering

Consider a problem in which a random subset of k elements, Y, must be selected from X, a list of n elements, where elements in Y should appear in the same order as in X. Selected elements in Y need not be different. One solution is as follows:

for i = 1 to k A[i] = floor(rand * n) + 1 Y[i] = X[A[i]] sort Y according to the ordering of A

However, this has an O (k log k) runtime due to the sort operation. To remove this, there is a temptation

 high_index = n for i = 1 to k index = floor(rand * high_index) + 1 Y[k - i + 1] = X[index] high_index = index

But this gives an explicit offset to the returned list due to uniform index selection. It seems that the solution O (k) is achievable if the indices in the second solution are distributed unevenly. Does anyone know if this is so, and if so, what distribution properties from which marginal indices are derived have?

+7

algorithm sublist

timxyz Apr 11 '12 at 11:37

source share

6 answers

The unrivaled O(n+k) solution is a trivial high-level pseudo-code.

create an empty histogram of size n [initialized by all elements as zeros]
fills it with k uniformly distributed variables in the range. (do k times histogram[inclusiveRand(1,n)]++ )
iterate the original list [A], reducing the elements in the histogram and adding elements to the list of results.

Explanation [edit]:

The idea is to select k elements from n in random order, with a uniform distribution for each and create a histogram .
This histogram now contains for each index i , how many times A[i] appears in the resulting list Y
Now iterate the list A in order, and for each element i insert A[i] into the resulting Y list histogram[i] times.
This ensures that you support the order because you insert items into the order and never return.
It also guarantees an unbiased solution, since for each i, j, K: P(histogram[i]=K) = P(histogram[j]=K) , therefore for each k each element has the same probability to appear in the resulting list k times.

I believe this can be done in O(k) using order statistics [X _(i) ], but I cannot figure it out: \

+1

amit Apr 11 '12 at 12:43

source share

According to your first algorithm, it is enough to sort k uniform random samples [0, 1] in a sorted order.

Let X1, ..., Xk be these samples. Given that Xk = x, the conditional distribution X1, ..., Xk-1 is equal to k - 1 uniform random samples [0, x) in sorted order, so it’s enough to try Xk and recursively.

What is the probability that Xk <X? Each of k independent samples [0, 1) must be less than x, so the answer (cumulative distribution function for Xk) is x ^ k. The sample according to cdf, all we need to do is invert it on a uniform random sample from [0, 1): pow(random(), 1.0 / k) .

Here is the (expected) O (k) algorithm, which I would really consider upon implementation. The idea is to dump samples into k bins, sort each container and concatenate. Here is some untested Python:

 def samples(n, k): bins = [[] for i in range(k)] for i in range(k): x = randrange(n) bins[(x * k) // n].append(x) result = [] for bin in bins: bin.sort() result.extend(bin) return result

Why is it effective on hold? Suppose we use insertion sorting in each box (each bit has the expected size O (1)!). In addition to operations that are O (k), we will pay in proportion to the number of sums of squares of the sizes of the bunker, which is mainly the number of collisions. Since the probability of collision of two samples is not more than 4 / k, and we have pairs of O (k ^ 2) samples, the expected number of collisions is O (k).

I suspect quite strongly that the guarantee of O (k) can be made with a high probability.

+1

oldboy Apr 11 '12 at 13:37

source share

You can use sort sort to sort Y and thus make the sort linear with respect to k. However, for this you need another array of length n. If we assume that you have already highlighted this, you can execute the code that you request arbitrarily many times with O (k) complexity.

The idea is the same as you describe, but I will use another cnt array of size n, which I assume is initialized to 0, and the other "stack" st, which I assume, is empty.

 for i = 1 to k A[i] = floor(rand * n) + 1 cnt[A[i]]+=1 if cnt[A[i]] == 1 // Needed to be able to traverse the inserted elements faster st.push(A[i]) for elem in st for i = 0 to cnt[elem] Y.add(X[elem]) for elem in st cnt[elem] = 0

EDIT: as oldboy mentioned, what I claim in the post is not. I still need to sort st, which might be a little better than the original sentence, but not too much. So this approach will only be good if k is comparable to n, and then we just go through cnt linearly and build Y that way. This st method is not needed:

 for i = 1 to k A[i] = floor(rand * n) + 1 cnt[A[i]]+=1 for i = 1 to k for j = 0 to cnt[i] Y.add(X[i]) cnt[i] =0

0

Ivaylo strandjev Apr 11 '12 at 12:48

source share

The original list X contains n elements. There are 2 ** n possible subscriptions, since each element will or will not appear in the received sublist: each element adds a little to the list of possible subscriptions. You can view this enumeration of a bitword of n bits.

Since you only need sublists with k elements, you are interested in broken words with exactly set k bits.

A practical algorithm can select (or choose not) the first element from X, and then list the rightmost n-1 substring X, taking into account the accumulated number of selected elements. Since list X is processed in order, list Y will also be in order.

0

wildplasser Apr 13 '12 at 10:44

source share

The original list X contains n elements. There are 2 ** n possible subscriptions, since each element will or will not appear in the sublist: each element adds a little to the list of possible subscriptions. You can view this enumeration of a bitword of n bits.

Since you only need sublists with k elements, you are interested in broken words with exactly set k bits. A practical algorithm can select (or choose not) the first element from X, and then list the rightmost n-1 substring X, taking into account the accumulated number of selected elements. Since list X is processed in order, list Y will also be in order.

 #include <stdio.h> #include <string.h> unsigned pick_k_from_n(char target[], char src[], unsigned k, unsigned n, unsigned done); unsigned pick_k_from_n(char target[], char src[] , unsigned k, unsigned n, unsigned done) { unsigned count=0; if (k>n) return 0; if (k==0) { target[done] = 0; puts(target); return 1; } if (n > 0) { count += pick_k_from_n(target, src+1, k, n-1, done); target[done] = *src; count += pick_k_from_n(target, src+1, k-1, n-1, done+1); } return count; } int main(int argc, char **argv) { char result[20]; char *domain = "OmgWtf!"; unsigned cnt ,len, want; want = 3; switch (argc) { default: case 3: domain = argv[2]; case 2: sscanf(argv[1], "%u", &want); case 1: break; } len = strlen(domain); cnt = pick_k_from_n(result, domain, want, len, 0); fprintf(stderr, "Count=%u\n", cnt); return 0; }

Removing recursion remains as an exercise for the reader. Some conclusion:

 plasser@pisbak :~/hiero/src$ ./a.out 3 ABBA BBA ABA ABA ABB Count=4 plasser@pisbak :~/hiero/src$

0

wildplasser Apr 13 '12 at 11:15

source share

timxyz · Accepted Answer · 2012-04-13T09:35:40+0000

For the first index in Y, the distribution of indices in X is given by the formula:

P (x; n, k) = binomial (n - x + k - 2, k - 1) / norm

where biomial denotes the calculation of the binomial coefficient, and the norm is the normalization coefficient equal to the total number of possible subcomputing configurations.

norm = biomial (n + k - 1, k)

So, for k = 5 and n = 10 we have:

norm = 2002
P (x = 0) = 0.357, P (x <= 0) = 0.357
P (x = 1) = 0.245, P (x <= 1) = 0.604
P (x = 2) = 0.165, P (x <= 2) = 0.769
P (x = 3) = 0.105, P (x <= 3) = 0.874
P (x = 4) = 0.063, P (x <= 4) = 0.937
... (we can continue this to x = 10)

We can try the index X of the first element in Y from this distribution (name it x1). Then, the distribution of the second index in Y can be selectively selected using P (x; (n - x1), (k - 1)), etc. For all subsequent indices.

Now I feel that the problem is not solvable in O (k), because in general we cannot try from the distribution described in constant time. If k = 2, then we can solve in constant time using a quadratic formula (since the probability function simplifies to 0.5 (x ^ 2 + x)), but I see no way to extend it to all k (my math isn 't excellent, though).

Generate random subscriptions from an ordered list that supports ordering

More articles: