Iterative or lazy sampling

I am pretty familiar with using reservoir sampling to sample from a sequence of indefinite length in a single pass over the data. One limitation of this approach, in my opinion, is that it still requires consuming the entire data set before any results can be returned. Conceptually this makes sense, since every element in the sequence must get a chance to replace previously selected elements in order to achieve a uniform sample.
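For reference, a minimal sketch of the one-pass reservoir sampling described above (Vitter's Algorithm R); the function name is my own:

```python
import random

def reservoir_sample(iterable, k):
    """Keep a uniform sample of k items from an iterable of unknown
    length; results are only available after the whole pass."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # item i replaces a reservoir slot with probability k/(i+1)
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1000), 5))  # five uniformly chosen elements
```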

Is there a way to yield some random results before the entire sequence has been evaluated? I am imagining a lazy approach that would compose well with Python's itertools library. Perhaps this could be done within some known error bound? I would appreciate any feedback on this idea!

To clarify the question a bit, this diagram summarizes my understanding of the memory and streaming trade-offs of various sampling methods. I want something that falls into the Sample Stream category, where we do not know the length of the population in advance.

[diagram: memory vs. streaming trade-offs of sampling methods]

There is an apparent contradiction in not knowing the length a priori while still drawing a uniform sample, since any sample returned early will most likely be biased toward the beginning of the population. Is there a way to quantify this bias? Are there any trade-offs? Does anyone have a clever algorithm to solve this problem?

+7
python algorithm random-sample
2 answers

If you know in advance the total number of elements that the iterable population will yield, you can emit elements of the sample as they arrive (not only after the whole population has been consumed). If you do not know the size of the population in advance, this is impossible, since the inclusion probability of any given element cannot be calculated.

Here is a quick generator that does this:

 import random

 def sample_given_size(population, population_size, sample_size):
     for item in population:
         if random.random() < sample_size / population_size:
             yield item
             sample_size -= 1
         population_size -= 1

Please note that the generator yields the elements in the order they appear in the population (not in random order, as random.sample or most reservoir sampling implementations do), so a slice of the sample will not itself be a random subsample!
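For example, a quick check of the behavior described above (the generator is redefined here so the snippet is self-contained; the seed only makes the run repeatable):

```python
import random

def sample_given_size(population, population_size, sample_size):
    # Selection sampling: each item is taken with probability
    # (remaining needed) / (remaining population), which guarantees
    # exactly sample_size items are yielded.
    for item in population:
        if random.random() < sample_size / population_size:
            yield item
            sample_size -= 1
        population_size -= 1

random.seed(0)
sample = list(sample_given_size(range(100), 100, 10))
print(len(sample))               # 10
print(sample == sorted(sample))  # True: stream order is preserved
```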

+6

If the population size is known in advance, can you just generate sample_size random "indexes" (into the stream) and use those to yield lazily? You would not need to read the entire stream.

For example, if population_size is 100 and sample_size is 3, you generate a random set of 3 integers between 1 and 100, say 10, 67 and 72.

Now you take the 10th, 67th and 72nd elements of the stream and ignore the rest.
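A hypothetical sketch of this idea (all names are mine, not from the answer): pre-draw sample_size distinct positions, then walk the stream, yielding matches and stopping as soon as the last wanted position has been seen.

```python
import random

def sample_by_indexes(stream, population_size, sample_size):
    # Assumes sample_size >= 1 and that the stream yields at least
    # population_size elements.
    targets = iter(sorted(random.sample(range(population_size), sample_size)))
    target = next(targets)
    for i, item in enumerate(stream):
        if i == target:
            yield item
            target = next(targets, None)
            if target is None:
                return  # the rest of the stream is never consumed

random.seed(3)
picked = list(sample_by_indexes(iter(range(1000)), 1000, 3))
print(picked)  # three distinct elements, in stream order
```

The yielded elements arrive in stream order, so as with the first answer, a slice of the result is not itself a random subsample.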

Perhaps I am misunderstanding the question, though.

0
