Fastest approximate counting algorithm

What is the fastest way to get a rough estimate of the number of lines in an input file or in a data stream on stdout? I assume this calls for a probabilistic algorithm; I cannot find many examples on the Internet.

The data is just one or two columns coming from an awk script or a CSV file. Suppose I want an approximate group-by on one of the columns. I would use a database GROUP BY, but there are more than 6-7 billion rows. I would like the first, rough estimate to take about 3 to 4 seconds, and then run the full job afterwards once I have decided based on that first pass. Any ideas for really rough initial groupings?

If you can provide an example algorithm in Python or Java, that would be very helpful.

2 answers

@Ben Allison's answer is a good way to go if you want to count the total number of lines. Since you mentioned Bayes and a prior, I will add an answer in that direction to estimate the percentage of the different groups. (See my comments on your question: if you already have a rough idea of the total count and you want to do a group-by, then estimating the percentage of the different groups makes more sense.)

Recursive Bayesian update:

I will start by assuming that you have only two groups, group1 and group2 (the extension to several groups is explained further below).

Let M(m,n) denote the event of having seen m group1 lines among the first n lines you have processed. You will then have seen n-m group2 lines, because we assume these are the only two possible groups. So the conditional probability of the event M(m,n), given the proportion s of group1, is given by a binomial distribution with n trials. We are trying to estimate s in a Bayesian way.
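For reference, written out (with C(n,m) denoting the binomial coefficient), this likelihood is just the standard binomial formula:

    P(M(m,n) | s) = C(n,m) * s^m * (1-s)^(n-m)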

The conjugate prior of the binomial is the beta distribution. So for simplicity we choose Beta(1,1) as the prior (you can of course choose your own values of alpha and beta here), which is the uniform distribution over (0,1). For this prior, alpha=1 and beta=1.

The recursive update for the binomial likelihood with a beta prior is then simply:

    if group == 'group1':
        alpha = alpha + 1
    else:
        beta = beta + 1

The posterior of s is then also a beta distribution:

                     s^(m+alpha-1) * (1-s)^(n-m+beta-1)
    p(s | M(m,n)) = ------------------------------------ = Beta(m+alpha, n-m+beta)
                          B(m+alpha, n-m+beta)

where B is the beta function. To report the estimate, you can use the mean and variance of this beta distribution:

    mean = alpha / (alpha + beta)
    var  = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))

Python code: groupby.py

So a few lines of Python that process your data from stdin and estimate the percentage of group1 look something like this:

    import sys

    # Beta(1,1) prior, i.e. uniform over (0,1)
    alpha = 1.
    beta = 1.

    for line in sys.stdin:
        data = line.strip()
        if data == 'group1':
            alpha += 1.
        elif data == 'group2':
            beta += 1.
        else:
            continue
        # The posterior is Beta(alpha, beta); report its mean and variance.
        mean = alpha / (alpha + beta)
        var = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
        print('mean = %.3f, var = %.3f' % (mean, var))

Sample Data

I pipe a few lines of data into the code:

    group1
    group1
    group1
    group1
    group2
    group2
    group2
    group1
    group1
    group1
    group2
    group1
    group1
    group1
    group2

Estimation Result

And here is what I get as a result:

    mean = 0.667, var = 0.056
    mean = 0.750, var = 0.037
    mean = 0.800, var = 0.027
    mean = 0.833, var = 0.020
    mean = 0.714, var = 0.026
    mean = 0.625, var = 0.026
    mean = 0.556, var = 0.025
    mean = 0.600, var = 0.022
    mean = 0.636, var = 0.019
    mean = 0.667, var = 0.017
    mean = 0.615, var = 0.017
    mean = 0.643, var = 0.015
    mean = 0.667, var = 0.014
    mean = 0.688, var = 0.013
    mean = 0.647, var = 0.013

The result shows that group1 is estimated at 64.7% after 15 rows have been processed (based on our Beta(1,1) prior). You may also notice that the variance keeps decreasing as we accumulate more and more observations.
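As a quick sanity check against the sample data above: the last output line corresponds to 10 group1 and 5 group2 observations, i.e. alpha = 11 and beta = 6, which indeed gives

    mean = 11 / (11 + 6)                          = 0.647
    var  = 11 * 6 / ((11 + 6)**2 * (11 + 6 + 1))  = 66 / 5202 ≈ 0.013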

Several groups

Now, if you have more than two groups, just change the underlying distribution from binomial to multinomial; the corresponding conjugate prior is then a Dirichlet distribution. Everything else changes in the analogous way.
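A minimal sketch of that multi-group version might look like the following, assuming a symmetric Dirichlet prior with pseudo-count 1 per group, that each input line is just the group label, and that only the groups actually seen in the stream are considered:

    import sys
    from collections import defaultdict

    # Symmetric Dirichlet prior: pseudo-count 1 for every group.
    prior = 1.0
    counts = defaultdict(float)
    total = 0.0

    for line in sys.stdin:
        group = line.strip()
        if not group:
            continue
        counts[group] += 1.0
        total += 1.0

    # The posterior is Dirichlet(prior + count_g for each group g);
    # its mean for group g is (prior + count_g) / (k*prior + total),
    # where k is the number of distinct groups observed.
    k = len(counts)
    for group, c in sorted(counts.items()):
        print('%s: mean = %.3f' % (group, (prior + c) / (k * prior + total)))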

Additional notes

You said you want a rough estimate in 3-4 seconds. In that case you just take a slice of your data and pipe it into the script above, for example:

    head -n100000 YOURDATA.txt | python groupby.py

That's it. Hope it helps.


If it is reasonable to assume that the data are i.i.d. (so there is no bias in particular parts of the stream, such as certain types of records clustering together), then just sample, and scale the counts up according to the approximate full size.

Take the first million records (this should be processable in a couple of seconds). Their size is x units (MB, characters, whatever suits you). The full stream has size y, where y >> x. Now compute whatever counts you are interested in over that sample of size x, and simply scale them by the factor y/x to approximate the full counts. Example: you want to know roughly how many records in the full stream have the value v in column 1. The first million records have a file size of 100 MB, and the total file size is 10 GB. Among the first million records, 150,000 have the value v in column 1. So you assume that in the full 10 GB of records you will see about 150,000 * (10 GB / 100 MB) = 15,000,000 records with that value. Any statistic you compute from the sample can simply be scaled by the same factor to get an estimate.
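A rough sketch of that scale-up, assuming a comma-separated file; the file name YOURDATA.txt, the target value v, and the sample size are placeholders to adapt:

    import os

    # Placeholders: adjust the path, the target value, and the sample size.
    path = 'YOURDATA.txt'
    target = b'v'
    sample_records = 1000000

    total_bytes = os.path.getsize(path)
    sample_bytes = 0
    matches = 0

    with open(path, 'rb') as f:
        for i, line in enumerate(f):
            if i >= sample_records:
                break
            sample_bytes += len(line)
            if line.split(b',')[0] == target:   # value of column 1
                matches += 1

    # Scale the sample count by the size ratio to estimate the full count.
    scale = float(total_bytes) / sample_bytes
    print('estimated matches in full file: %d' % int(matches * scale))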

If there is bias in the data, so that some records are more or less likely to appear in certain parts of the file, then you should draw random samples (or samples at evenly spaced intervals) from the whole set. This gives an unbiased, representative sample, but will likely cost a lot more I/O.
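A rough sketch of the evenly spaced variant: seek to evenly spaced byte offsets, discard the partial line you land in, and keep the next complete record (again, the file name and sample count are placeholders):

    import os

    path = 'YOURDATA.txt'   # placeholder
    num_samples = 10000

    size = os.path.getsize(path)
    samples = []

    with open(path, 'rb') as f:
        for k in range(num_samples):
            offset = k * size // num_samples
            f.seek(offset)
            if offset:
                f.readline()          # discard the partial line we landed in
            line = f.readline()
            if line:
                samples.append(line)

    # Compute your statistics on `samples` and scale them up as before.
    print('collected %d sample records' % len(samples))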
