I am trying to recreate a function in Python to estimate the second moment of a data stream.
As stated in Ullman's book, "Mining of Massive Datasets," the second moment:
Is the sum of the squares of the m_i's. It is sometimes called the surprise number, since it measures how uneven the distribution of elements in the stream is.
Here m_i is the number of occurrences of the i-th distinct element in the stream.
For example, with this toy data stream:
a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
We calculate the second moment as follows:
5^2 + 4^2 + 3^2 + 3^2 = 59
(since "a" occurs 5 times in the data stream, "b" 4 times, etc.)
Since we cannot store the entire data stream in memory, we can use the second moment estimation algorithm:
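The Alon-Matias-Szegedy (AMS) algorithm, which estimates the second moment with this formula:
E(n * (2 * X.value - 1))
Here X is an element of the stream selected at random, and X.value is a counter that starts at 1 when X is selected and, as we read the stream, is incremented by 1
each time we meet another occurrence of the element x after the point where we selected it.
n is the length of the data stream, and "E" denotes the mean.
For example, with the previous data stream, suppose we select "a" at the 13th position, "d" at the 8th and "c" at the 3rd. We do not select "b".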
    a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
    1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
          x              x              x
With this selection we have:
X.element = "a" X.value = 2
X.element = "c" X.value = 3
X.element = "d" X.value = 2
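The estimate given by the AMS algorithm is:
(15*(2 * 2 - 1) + 15*(2 * 3 - 1) + 15*(2 * 2 - 1))/3 = 55
which is fairly close to the true value of the second moment calculated earlier (59).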
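As a sanity check on the arithmetic above, here is a small sketch that recomputes the example. It is not a streaming implementation: it assumes the whole toy stream fits in a list and uses the hand-picked positions 13, 8 and 3 instead of random ones. The variable names are only illustrative.

```python
# Toy stream from the example, one element per position.
stream = "a b c b d a c d a b d c a a b".split()
n = len(stream)

# 1-based positions selected in the example ("a" at 13, "d" at 8, "c" at 3).
positions = [13, 8, 3]

estimates = []
for pos in positions:
    element = stream[pos - 1]
    # X.value: occurrences of X.element from the selected position to the end.
    value = stream[pos - 1:].count(element)
    estimates.append(n * (2 * value - 1))

# Average of the per-variable estimates.
ams = sum(estimates) / len(estimates)
print(estimates, ams)  # [45, 45, 75] 55.0
```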
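Now, turning to my code: I wrote this function to compute the "true" second moment, simulating the data stream with a vector (1-D array) and a for loop: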
    def secondMoment(vector):
        # Count the occurrences of each distinct element.
        mydict = dict()
        for el in vector:
            if el not in mydict:
                mydict[el] = 1
            else:
                mydict[el] += 1
        # Second moment: sum of the squared counts.
        return (sum([pow(value, 2) for key, value in mydict.items()]))
and this AMS function, which computes the estimate of the second moment:
    import random

    def AMSestimate(vector):
        lenvect = len(vector)
        elements = dict()
        for el in vector:
            if el in elements:
                # Element already selected: update its X.value counter.
                elements[el] += 1
            elif random.choice(range(0, 10)) == 0:
                # Select a new X.element with probability 1/10.
                elements[el] = 1
        lendict = len(elements)
        estimateM2 = 0
        # Sum n * (2 * X.value - 1) over the selected elements.
        for key, value in elements.items():
            estimateM2 += lenvect * ((2 * value) - 1)
        print(lendict)
        if lendict > 0:
            return estimateM2/lendict
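The problem is that when I compute the estimate for a small toy problem (like the one above), the values are roughly correct, but when I extend the vector to, say, 10000 elements, the true second moment and the estimate differ considerably.
I think the problem is related to the way I generate the data stream and to the way I select X.element.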
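That is:
    [random.choice(string.ascii_letters) for x in range(size)]
for generating the random vector / data stream, and
    elif random.choice(range(0, 10)) == 0:
        elements[el] = 1
for selecting X.element (done in the code above, in the AMS function).
Regarding the generation of the random vector / data stream, I thought the problem might be the lack of "variability" in the vector (string.ascii_letters has only 52 elements).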
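For comparison, here is a minimal sketch of one reading of the textbook description. It assumes that the starting position of each variable is drawn uniformly at random over the whole stream (with replacement) and that the stream length n is known in advance. The names ams_from_positions, ams_estimate, num_vars and seed are hypothetical and do not come from the book.

```python
import random
from collections import Counter


def ams_from_positions(stream, positions):
    """Average n * (2 * X.value - 1) over variables started at 0-based positions."""
    n = len(stream)
    starts = Counter(positions)  # number of variables starting at each position
    counters = {}                # X.element -> list of X.value counters
    for pos, el in enumerate(stream):
        tracked = counters.get(el)
        if tracked is not None:
            # Another occurrence of a tracked element: bump its counters.
            for i in range(len(tracked)):
                tracked[i] += 1
        if starts[pos]:
            # Variables starting here begin with X.value = 1.
            counters.setdefault(el, []).extend([1] * starts[pos])
    values = [v for tracked in counters.values() for v in tracked]
    if not values:
        raise ValueError("no valid starting positions")
    return sum(n * (2 * v - 1) for v in values) / len(values)


def ams_estimate(stream, num_vars, seed=None):
    """Assumption: positions are uniform over the stream, drawn with replacement."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(stream)) for _ in range(num_vars)]
    return ams_from_positions(stream, positions)


toy = "a b c b d a c d a b d c a a b".split()
print(ams_from_positions(toy, [12, 7, 2]))  # 55.0, the worked example
print(ams_from_positions(toy, range(15)))   # 59.0
print(ams_estimate(toy, num_vars=5, seed=0))
```

Starting one variable at every position of the toy stream gives exactly 59, the true second moment, which is what an average over uniformly chosen positions should converge to.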