Implementation of the Alon-Matias-Szegedy algorithm for approximating the second moment

Question

I am trying to write a Python function that estimates the second moment of a data stream.

As stated in Ullman's book, "Mining of Massive Datasets," the second moment:

is the sum of the squares of the m_i's. It is sometimes called the surprise number, since it measures how uneven the distribution of elements in the stream is.

where m_i is the number of occurrences of the i-th distinct element in the stream.

For example, with this toy problem/data stream:

a, b, c, b, d, a, c, d, a, b, d, c, a, a, b

We calculate the second moment as follows:

5^2 + 4^2 + 3^2 + 3^2 = 59

(since "a" occurs 5 times in the data stream, "b" 4 times, etc.)

Since we cannot store the entire data stream in memory, we can use the second moment estimation algorithm:

the Alon-Matias-Szegedy algorithm (AMS algorithm), which estimates the second moment using this formula:

E(n *(2 * X.value − 1))

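where X is a distinct element of the stream, selected at random, and X.value is a counter that, as we read the stream, is incremented by 1 each time we encounter another occurrence of the element x after selecting it.

n is the length of the data stream, and "E" denotes the mean.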
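
One way to see why this expectation equals the second moment is the short derivation below. It is a sketch that assumes the position of X is chosen uniformly at random among the n positions of the stream. The symbol c_p is notation introduced here, not taken from the question: it is the number of occurrences of the element at position p, counted from p to the end of the stream.

```latex
\mathbb{E}\bigl[n\,(2\,X.\mathrm{value} - 1)\bigr]
  = \frac{1}{n}\sum_{p=1}^{n} n\,(2c_p - 1)
  = \sum_{i}\sum_{v=1}^{m_i}(2v - 1)
  = \sum_{i} m_i^2
```

The middle step groups positions by element: the m_i positions holding the i-th distinct element have c_p = m_i, m_i - 1, ..., 1, and the sum of the first m_i odd numbers is m_i^2.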

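Working through an example with the data stream above, suppose we select "a" at the 13th position of the stream, "d" at the 8th and "c" at the 3rd. We do not select "b".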

a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
      x              x              x

With these selections we have:

X.element = "a"   X.value = 2
X.element = "c"   X.value = 3
X.element = "d"   X.value = 2

The estimate computed by the AMS algorithm is:

(15*(2 * 2 - 1) + 15*(2 * 3 - 1) + 15*(2 * 2 - 1))/3 = 55

which is quite close to the true value of the second moment calculated earlier (59).
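
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `ams_from_positions` is hypothetical, positions are 1-based as in the diagram, and the whole stream is held in a list purely for illustration, which a real streaming setting would not allow.

```python
def ams_from_positions(stream, positions):
    """Average n * (2 * X.value - 1) over the given 1-based positions."""
    n = len(stream)
    estimates = []
    for pos in positions:
        element = stream[pos - 1]
        # X.value: occurrences of the element from the chosen position onward.
        value = stream[pos - 1:].count(element)
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)


stream = "a b c b d a c d a b d c a a b".split()
print(ams_from_positions(stream, [13, 8, 3]))  # 55.0
```

Passing every position, `range(1, 16)`, gives exactly 59.0 for this stream, which matches the true second moment.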

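Now, turning to my code: I wrote this function to compute the "true" second moment, simulating the data stream with a vector (1-d array) and a for loop: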

def secondMoment(vector):
    mydict = dict()
    for el in vector:
        if el not in mydict:
            mydict[el] = 1
        else:
            mydict[el] += 1
    return (sum([pow(value, 2) for key, value in mydict.items()]))

and this function to compute the AMS estimate of the second moment:

def AMSestimate(vector):
    lenvect = len(vector)
    elements = dict()
    for el in vector:
        if el in elements:
            elements[el] += 1
        elif random.choice(range(0, 10)) == 0:
            elements[el] = 1
    # E(n * (2 * x.value - 1))
    lendict = len(elements)
    estimateM2 = 0
    for key, value in elements.items():
        estimateM2 += lenvect * ((2 * value) - 1)
    print(lendict)
    if lendict > 0:
        return estimateM2/lendict

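The problem is that when I compute the estimate for a small toy problem (like the one above), the values are roughly correct, but when I extend the vector to, say, 10000 elements, the true second moment and the estimate are quite different.

I think the problem is related to the way I generate the data stream, and to the way I select X.element.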

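That is:

[random.choice(string.ascii_letters) for x in range(size)]

for generating the random vector/data stream, and

elif random.choice(range(0, 10)) == 0:
    elements[el] = 1

for selecting X.element (done in the code above, in the AMS function).

For the random vector/data stream generation, the problem could be a lack of "variability" in the vector (string.ascii_letters has only 52 elements).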


python random data-mining bigdata data-stream

Nikaidoh 20 . '16 13:06


Ami Tavory · Accepted Answer · 2016-03-20T15:20:13+0000

Say we start with:

import random
import string

size = 100000
seq = [random.choice(string.ascii_letters) for x in range(size)]

Then the first implementation is similar to yours (note the use of collections.Counter, though):

from collections import Counter

def secondMoment(seq):
    c = Counter(seq)
    return sum(v**2 for v in c.values())

>>> secondMoment(seq)
192436972

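The second implementation differs from yours more significantly, though. Note that the random indices are chosen first. Then an element is counted only from its first occurrence (if any) at one of those indices: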

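def AMSestimate(seq, num_samples=10):
    # Choose num_samples distinct random positions in the stream.
    inds = list(range(len(seq)))
    random.shuffle(inds)
    inds = sorted(inds[: num_samples])

    d = {}
    for i, c in enumerate(seq):
        # Start tracking an element when one of the chosen positions is reached.
        if i in inds and c not in d:
            d[c] = 0
        # Count occurrences of tracked elements from that position onward.
        if c in d:
            d[c] += 1
    # n times the mean of (2 * X.value - 1) over the tracked elements.
    return int(len(seq) / float(len(d)) * sum((2 * v - 1) for v in d.values()))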

>>> AMSestimate(seq)
171020000
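
For comparison, here is a sketch of a variant that keeps one counter per sampled position, so that two sampled positions holding the same element stay separate variables, as in the formula E(n * (2 * X.value - 1)). This is not the code above: the name `ams_estimate_per_position` and the `rng` parameter are assumptions made for illustration.

```python
import random


def ams_estimate_per_position(seq, num_samples=10, rng=random):
    """AMS estimate with one counter per sampled position (hypothetical helper)."""
    n = len(seq)
    # Sample distinct 0-based positions uniformly, then visit them in stream order.
    positions = sorted(rng.sample(range(n), num_samples))
    counters = []  # one [element, value] pair per position reached so far
    next_pos = 0
    for i, c in enumerate(seq):
        if next_pos < num_samples and positions[next_pos] == i:
            counters.append([c, 0])
            next_pos += 1
        # Every active counter for this element sees the new occurrence.
        for counter in counters:
            if counter[0] == c:
                counter[1] += 1
    return n * sum(2 * v - 1 for _, v in counters) / num_samples
```

A simple sanity check is to sample every position of the toy stream from the question: `ams_estimate_per_position(list("abcbdacdabdcaab"), num_samples=15)` returns exactly the true second moment, 59.0.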

Regarding the code in the question, consider your loop:

for el in vector:
    if el in elements:
        elements[el] += 1
    elif random.choice(range(0, 10)) == 0:
        elements[el] = 1

(Minor) The sampling is problematic: each new element is selected with a hard-coded probability of 0.1.
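
To illustrate how this selection rule behaves on a long stream, the sketch below compares it with the exact second moment. It is an illustration under assumptions, not a diagnosis taken from the answer: the helper name `question_style_estimate`, the seed and the stream length are arbitrary choices, and `rng.random() < p` stands in for `random.choice(range(0, 10)) == 0`, which has the same probability of 0.1.

```python
import random
import string
from collections import Counter


def question_style_estimate(seq, rng, p=0.1):
    """Selection as in the question: start tracking an untracked element with probability p."""
    tracked = {}
    for el in seq:
        if el in tracked:
            tracked[el] += 1
        elif rng.random() < p:
            tracked[el] = 1
    n = len(seq)
    return n * sum(2 * v - 1 for v in tracked.values()) / len(tracked)


rng = random.Random(0)  # fixed seed, chosen arbitrarily
seq = [rng.choice(string.ascii_letters) for _ in range(20000)]
true_m2 = sum(v * v for v in Counter(seq).values())
ratio = question_style_estimate(seq, rng) / true_m2
print(round(ratio, 2))
```

With only 52 distinct letters, each letter is almost always picked up at one of its early occurrences, so X.value is close to the full count m_i and the estimate comes out at roughly twice the true value. Choosing positions uniformly over the whole stream avoids this.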

Also consider:

    estimateM2 += lenvect * ((2 * value) - 1)

.
