@Ben Allison's answer is a good way to go if you just want to count common strings. But since you mentioned Bayes and the prior, I'll add an answer in that direction to estimate the percentage of the different groups. (See my comments on your question: if the strings you want to groupby have something in common, then estimating the percentage of the different groups makes more sense.)
Recursive Bayesian update:
I will start by assuming that you have only two groups, group1 and group2 (an extension to multiple groups is described later).
Denote by M(m,n) the event that you have seen m group1 lines out of the first n lines processed. Since we assume these are the only two possible groups, you will necessarily have seen n-m group2 lines. So the conditional probability of the event M(m,n), given the (unknown) percentage s of group1, follows a binomial distribution with n trials. We will estimate s in a Bayesian way.
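To make the likelihood concrete, here is a quick illustration, not part of the original derivation; the counts are hypothetical and SciPy is assumed to be available:

```python
from scipy.stats import binom

# Hypothetical example: m = 10 group1 lines seen out of n = 15 lines.
n, m = 15, 10
for s in (0.3, 0.5, 0.667):
    # P(M(m,n) | s) = C(n,m) * s**m * (1-s)**(n-m)
    print('s = %.3f  ->  P(M(m,n)|s) = %.4f' % (s, binom.pmf(m, n, s)))
```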
The conjugate prior of the binomial is the beta distribution. For simplicity we choose Beta(1,1) as the prior (of course, you can pick your own alpha and beta parameters here), which is the uniform distribution over (0,1). So for this prior, alpha=1 and beta=1.
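As a quick sanity check (my addition, assuming SciPy) that Beta(1,1) really is the uniform distribution:

```python
from scipy.stats import beta as beta_dist

prior = beta_dist(1, 1)  # the Beta(1,1) prior
# Its density is constant (= 1) everywhere on (0,1), i.e. uniform
print([prior.pdf(x) for x in (0.1, 0.5, 0.9)])  # [1.0, 1.0, 1.0]
```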
The recursive update for the binomial + beta pair is simply:
```python
if group == 'group1':
    alpha = alpha + 1
else:
    beta = beta + 1
```
The posterior of s is again a beta distribution:
```
               s^(m+alpha-1) * (1-s)^(n-m+beta-1)
p(s|M(m,n)) = ------------------------------------ = Beta(m+alpha, n-m+beta)
                      B(m+alpha, n-m+beta)
```
where B is the beta function. To report the estimation result, you can rely on the mean and variance of the beta distribution, where:
```python
mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
```
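These closed-form expressions can be cross-checked against SciPy's beta distribution (a sanity check I'm adding; alpha=11, beta=6 corresponds to 10 group1 and 5 group2 observations under the Beta(1,1) prior, matching the last line of the sample run below):

```python
from scipy.stats import beta as beta_dist

alpha, beta = 11., 6.  # e.g. after 10 group1 and 5 group2 observations
mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))

# scipy.stats.beta(a, b).stats(moments='mv') returns (mean, variance)
sp_mean, sp_var = beta_dist(alpha, beta).stats(moments='mv')
print('%.3f %.3f' % (mean, float(sp_mean)))  # 0.647 0.647
print('%.3f %.3f' % (var, float(sp_var)))    # 0.013 0.013
```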
Python code: groupby.py
So a few lines of Python that process your data from stdin and estimate the percentage of group1 would look something like this:
```python
import sys

# Beta(1,1) prior: alpha and beta both start at 1
alpha = 1.
beta = 1.

for line in sys.stdin:
    data = line.strip()
    if data == 'group1':
        alpha += 1.
    elif data == 'group2':
        beta += 1.
    else:
        continue

    # Posterior mean and variance of Beta(alpha, beta)
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
    print('mean = %.3f, var = %.3f' % (mean, var))
```
Sample Data
I pass a few lines of data to the code:
```
group1
group1
group1
group1
group2
group2
group2
group1
group1
group1
group2
group1
group1
group1
group2
```
Estimation Result
And here is what I get as a result:
```
mean = 0.667, var = 0.056
mean = 0.750, var = 0.037
mean = 0.800, var = 0.027
mean = 0.833, var = 0.020
mean = 0.714, var = 0.026
mean = 0.625, var = 0.026
mean = 0.556, var = 0.025
mean = 0.600, var = 0.022
mean = 0.636, var = 0.019
mean = 0.667, var = 0.017
mean = 0.615, var = 0.017
mean = 0.643, var = 0.015
mean = 0.667, var = 0.014
mean = 0.688, var = 0.013
mean = 0.647, var = 0.013
```
The result shows that group1 is estimated to account for 64.7% of the 15 lines processed (based on our Beta(1,1) prior). You may notice that the variance keeps shrinking, because we have more and more observation points.
Multiple groups
Now, if you have more than two groups, simply change the underlying distribution from binomial to multinomial; the corresponding conjugate prior is then the Dirichlet distribution. Everything else carries over with analogous changes.
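Since the code for this is not spelled out above, here is a minimal sketch of what it could look like, assuming a symmetric Dirichlet(1,...,1) prior and groups discovered on the fly:

```python
import sys
from collections import defaultdict

# A sketch of the multinomial + Dirichlet version. We assume a symmetric
# Dirichlet(1, ..., 1) prior, i.e. every group starts with pseudo-count 1.
# Here the set of groups is discovered on the fly; with a known, fixed set
# of K groups you would initialize all K pseudo-counts up front instead.
counts = defaultdict(lambda: 1.0)

for line in sys.stdin:
    data = line.strip()
    if not data:
        continue
    counts[data] += 1.0  # the Dirichlet update is again just counting
    total = sum(counts.values())
    # Posterior mean of each group's proportion is count_g / total;
    # its marginal variance is mean_g * (1 - mean_g) / (total + 1).
    means = {g: c / total for g, c in counts.items()}
    print(', '.join('%s = %.3f' % (g, means[g]) for g in sorted(means)))
```

Each marginal of a Dirichlet posterior is itself a beta distribution, so the two-group formulas above are exactly the special case of two groups.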
Additional notes
You said you want a rough estimate within 3-4 seconds. In that case, you can simply take a portion of your data and pipe it to the script above, e.g.:
```sh
head -n100000 YOURDATA.txt | python groupby.py
```
That's it. Hope it helps.