What is wrong with the Pearson algorithm from "Programming Collective Intelligence"?

This function is taken from the book "Programming Collective Intelligence". It is supposed to calculate the Pearson correlation coefficient for p1 and p2, which should be a number between -1 and 1.

If two critics rate the items very similarly, the function should return 1 or a value close to 1.

With real user data, I sometimes get strange results. In the following example, the data in critics2 should return 1; instead, it returns 0.

Does anyone see a mistake?

(This is not a duplicate of "What is wrong with this Python function from Programming Collective Intelligence".)

from __future__ import division
from math import sqrt

def sim_pearson(prefs,p1,p2):
    si={}
    for item in prefs[p1]: 
        if item in prefs[p2]: si[item]=1
    if len(si)==0: return 0
    n=len(si)
    sum1=sum([prefs[p1][it] for it in si])
    sum2=sum([prefs[p2][it] for it in si])
    sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
    sum2Sq=sum([pow(prefs[p2][it],2) for it in si]) 
    pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])
    num=pSum-(sum1*sum2/n)
    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
    if den==0: return 0
    r=num/den
    return r

critics = {
    'user1':{
        'item1': 3,
        'item2': 5,
        'item3': 5,
        },
    'user2':{
        'item1': 4,
        'item2': 5,
        'item3': 5,
        }
}
critics2 = {
    'user1':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
        },
    'user2':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
        }
}
critics3 = {
    'user1':{
        'item1': 1,
        'item2': 3,
        'item3': 5,
        },
    'user2':{
        'item1': 5,
        'item2': 3,
        'item3': 1,
        }
}

print sim_pearson(critics, 'user1', 'user2')   # result: 1.0 (expected)
print sim_pearson(critics2, 'user1', 'user2')  # result: 0 (unexpected)
print sim_pearson(critics3, 'user1', 'user2')  # result: -1 (expected)

. 3 . , .. . , , (den ).
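For critics2 in the question, the denominator (den) can be worked through by hand. The sketch below repeats the arithmetic of the question's sim_pearson on the two rating lists; the names user1, user2, spread1 and spread2 are illustrative and do not appear in the original code. Because both users give every item the same rating, each variance term is zero, so den is zero and the function takes its `if den==0: return 0` branch.

```python
from __future__ import division
from math import sqrt

# Ratings from critics2 in the question: both users rate every item 5.
user1 = [5, 5, 5]
user2 = [5, 5, 5]
n = len(user1)

sum1, sum2 = sum(user1), sum(user2)
sum1_sq = sum(x * x for x in user1)
sum2_sq = sum(y * y for y in user2)
p_sum = sum(x * y for x, y in zip(user1, user2))

# Each factor is n times the variance of one user's ratings.
spread1 = sum1_sq - sum1 ** 2 / n   # 75 - 225/3
spread2 = sum2_sq - sum2 ** 2 / n   # 75 - 225/3
num = p_sum - sum1 * sum2 / n
den = sqrt(spread1 * spread2)

print(num)
print(den)
```

With zero spread in the ratings, the ratio num/den is 0/0, so the coefficient is mathematically undefined for this input, and the 0 returned by the function is a fallback value rather than a measured correlation.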


wikipedia, , . , , .

, :

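import math

def simplified_sim_pearson(p1, p2):
    # p1 and p2 are equal-length lists of ratings for the shared items
    n = len(p1)
    assert (n != 0)
    sum1 = sum(p1)
    sum2 = sum(p2)
    # means of each list of ratings
    m1 = float(sum1) / n
    m2 = float(sum2) / n
    # deviations of each rating from its mean
    p1mean = [(x - m1) for x in p1]
    p2mean = [(y - m2) for y in p2]
    numerator = sum(x * y for x, y in zip(p1mean, p2mean))
    denominator = math.sqrt(sum(x * x for x in p1mean) * sum(y * y for y in p2mean))
    # a zero denominator means at least one list has no variance
    return numerator / denominator if denominator else 0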

def sim_pearson(prefs,p1,p2):
    p1 = prefs[p1]
    p2 = prefs[p2]
    si = set(p1.keys()).intersection(set(p2.keys()))
    p1_x = [p1[k] for k in sorted(si)]
    p2_x = [p2[k] for k in sorted(si)]
    return simplified_sim_pearson(p1_x, p2_x)



critics = {
    'user1':{
        'item1': 3,
        'item2': 5,
        'item3': 5,
        },
    'user2':{
        'item1': 4,
        'item2': 5,
        'item3': 5,
        }
}
critics2 = {
    'user1':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
        },
    'user2':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
        }
}
critics3 = {
    'user1':{
        'item1': 1,
        'item2': 3,
        'item3': 5,
        },
    'user2':{
        'item1': 5,
        'item2': 3,
        'item3': 1,
        }
}

print sim_pearson(critics, 'user1', 'user2', )
print sim_pearson(critics2, 'user1', 'user2', )
print sim_pearson(critics3, 'user1', 'user2', )

, Excel - . correl.


. 0 , (, , , ).

( , ) -0.9 < x < 0.09 " ".


Correlation does not imply causation; that has to be said. You need to develop an understanding of correlation statistics. The correlation coefficient can be anywhere between -1 and 1, so a value of 0 falls in this range and is a perfectly reasonable result. A correlation of 0 means that there is no linear relationship between the two variables. Remember to be wary of statistics computed from fewer than 30 samples.
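As a small illustration of what a coefficient of 0 does and does not say, here is a minimal sketch. The function name pearson and the choice to return None for constant input are assumptions made for this example, not something from the book or from the code in the question. The first data set has y fully determined by x, yet its coefficient is 0 because the relationship is not linear. The second is the critics2 case, where the coefficient is undefined rather than 0.

```python
from __future__ import division
from math import sqrt

def pearson(xs, ys):
    """Pearson r for two equal-length lists; None when it is undefined."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    # den is zero when either list is constant, so r cannot be computed
    return num / den if den else None

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]   # y depends on x exactly, but not linearly

r_parabola = pearson(xs, ys)
r_constant = pearson([5, 5, 5], [5, 5, 5])

print(r_parabola)
print(r_constant)
```

Returning None keeps "no linear relationship" and "cannot be computed" apart. Whether a recommender should treat the second case as 0, 1, or skip the pair is a design decision this sketch leaves open.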
