What is wrong with this python function from Collective Intelligence Programming?

Question

What is wrong with this python function from Collective Intelligence Programming?

This is the feature in question. It calculates the Pearson correlation coefficient for p1 and p2, which should be a number from -1 to 1.

When I use this with real user data, it sometimes returns a number greater than 1, as in this example:

def sim_pearson(prefs,p1,p2):
    si={}
    for item in prefs[p1]: 
        if item in prefs[p2]: si[item]=1

    if len(si)==0: return 0

    n=len(si)

    sum1=sum([prefs[p1][it] for it in si])
    sum2=sum([prefs[p2][it] for it in si])

    sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
    sum2Sq=sum([pow(prefs[p2][it],2) for it in si]) 

    pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

    num=pSum-(sum1*sum2/n)
    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))

    if den==0: return 0

    r=num/den

    return r

critics = {
    'user1':{
        'item1': 3,
        'item2': 5,
        'item3': 5,
        },

    'user2':{
        'item1': 4,
        'item2': 5,
        'item3': 5,
        }
}

print sim_pearson(critics, 'user1', 'user2', )

1.15470053838

+4

python algorithm pearson

Hobhouse Sep 14 '09 at 19:46

source share

4 answers

, , , , , float,

+2

Nope 14 . '09 19:58

. , n :

n=float(len(si))

+2

chaos 14 . '09 20:02

, , , , :

from math import sqrt

def sim_pearson(p1,p2):
    keys = set(p1) | set(p2)
    n = len(keys)

    a1 = sum(p1[it] for it in keys) / n
    a2 = sum(p2[it] for it in keys) / n

#    print(a1, a2)

    sum1Sq = sum((p1[it] - a1) ** 2 for it in keys)
    sum2Sq = sum((p2[it] - a2) ** 2 for it in keys) 

    num = sum((p1[it] - a1) * (p2[it] - a2) for it in keys)
    den = sqrt(sum1Sq * sum2Sq)

#    print(sum1Sq, sum2Sq, num, den)
    return num / den

critics = {
    'user1':{
        'item1': 3,
        'item2': 5,
        'item3': 5,
        },

    'user2':{
        'item1': 4,
        'item2': 5,
        'item3': 5,
        }
}

assert 0.999 < sim_pearson(critics['user1'], critics['user1']) < 1.0001

print('Your example:', sim_pearson(critics['user1'], critics['user2']))
print('Another example:', sim_pearson({1: 1, 2: 2, 3: 3}, {1: 4, 2: 0, 3: 1}))

, 1.0, (-4/3, 2/3, 2/3) (-2/3, 1/3, 1/3) .

+1

ilya n. 14 . '09 20:12

Greg hewgill · Accepted Answer · 2009-09-14T19:56:51+0000

It looks like you might unexpectedly use integer division. I made the following change and your function returned 1.0:

num=pSum-(1.0*sum1*sum2/n)
den=sqrt((sum1Sq-1.0*pow(sum1,2)/n)*(sum2Sq-1.0*pow(sum2,2)/n))

For more information on the split operator in Python, see PEP 238 . Alternative way to fix your code:

from __future__ import division

What is wrong with this python function from Collective Intelligence Programming?

More articles: