Edit
Finally, I found a solution that does not need the loop:

```python
df.score.groupby(df.page_id).transform(lambda s: s.diff().combine_first(s)).cumsum()
```
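To see why this works, here is a small sketch on the sample data from below: within each `page_id` group, `diff()` gives the change from that page's previous score, and `combine_first(s)` fills each group's first `NaN` with the score itself, so every row holds the net change to the "sum of latest scores"; a global `cumsum()` then gives the running total.

```python
import pandas as pd
from io import StringIO

# Same sample data as in the loop version below.
txt = """date,page_id,score
2001-05-23 19:50:14,1,3
2001-05-25 11:53:55,2,4
2001-05-27 17:36:37,1,5
2001-05-28 19:36:37,1,1
2001-05-28 19:36:38,3,6
2001-05-28 19:36:39,3,9
"""
df = pd.read_csv(StringIO(txt), index_col=0)

# Per-row net change to "sum of the latest score of every page".
deltas = df.score.groupby(df.page_id).transform(lambda s: s.diff().combine_first(s))
print(deltas.cumsum().tolist())  # [3.0, 7.0, 9.0, 5.0, 11.0, 14.0]
```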
I think you need a for loop:

```python
import numpy as np
import pandas as pd
from io import StringIO

txt = """date,page_id,score
2001-05-23 19:50:14,1,3
2001-05-25 11:53:55,2,4
2001-05-27 17:36:37,1,5
2001-05-28 19:36:37,1,1
2001-05-28 19:36:38,3,6
2001-05-28 19:36:39,3,9
"""
df = pd.read_csv(StringIO(txt), index_col=0)

def score_sum_py(page_id, scores):
    # score_sum holds the current "sum of the latest score of every page";
    # last_score[pid] is the previous score seen for page pid.
    score_sum = 0
    last_score = [0] * (np.max(page_id) + 1)
    result = np.empty_like(scores.values)
    for i, (pid, score) in enumerate(zip(page_id, scores)):
        score_sum = score_sum - last_score[pid] + score
        last_score[pid] = score
        result[i] = score_sum
    return pd.Series(result, index=scores.index, name="score_sum")

print(score_sum_py(pd.factorize(df.page_id)[0], df.score))
```
output:
```
date
2001-05-23 19:50:14     3
2001-05-25 11:53:55     7
2001-05-27 17:36:37     9
2001-05-28 19:36:37     5
2001-05-28 19:36:38    11
2001-05-28 19:36:39    14
Name: score_sum
```
If the loop in Python is slow, you can try converting the two columns `page_id` and `scores` to plain Python lists first, iterating over them with a plain loop, and doing the arithmetic with native Python integers; this may be faster than looping over pandas objects.
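As a sketch of that approach (reusing the same sample data), pull both columns out as plain lists so the loop body touches only native ints:

```python
import pandas as pd
from io import StringIO

txt = """date,page_id,score
2001-05-23 19:50:14,1,3
2001-05-25 11:53:55,2,4
2001-05-27 17:36:37,1,5
2001-05-28 19:36:37,1,1
2001-05-28 19:36:38,3,6
2001-05-28 19:36:39,3,9
"""
df = pd.read_csv(StringIO(txt), index_col=0)

# Convert both columns to plain Python lists up front.
page_id = pd.factorize(df.page_id)[0].tolist()
scores = df.score.tolist()

result = []
score_sum = 0
last_score = [0] * (max(page_id) + 1)
for pid, score in zip(page_id, scores):
    score_sum += score - last_score[pid]
    last_score[pid] = score
    result.append(score_sum)
print(result)  # [3, 7, 9, 5, 11, 14]
```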
If speed is important, you can also try Cython:

```python
%%cython
cimport cython
cimport numpy as np
import numpy as np

@cython.wraparound(False)
@cython.boundscheck(False)
def score_sum(np.ndarray[int] page_id, np.ndarray[long long] scores):
    cdef int i
    cdef long long score_sum, pid, score
    cdef np.ndarray[long long] last_score, result
    score_sum = 0
    last_score = np.zeros(np.max(page_id) + 1, dtype=np.int64)
    result = np.empty_like(scores)
    for i in range(len(page_id)):
        pid = page_id[i]
        score = scores[i]
        score_sum = score_sum - last_score[pid] + score
        last_score[pid] = score
        result[i] = score_sum
    return result
```
Here I use `pandas.factorize()` to convert `page_id` to an array of integers in the range 0 to N-1, where N is the number of unique values in `page_id`. You can also use a `dict` to cache the `last_score` of each `page_id`, without using `pandas.factorize()`.
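A minimal sketch of the `dict` variant (again on the same sample data): `last_score` maps each raw `page_id` directly to its most recent score, so no factorization step is needed.

```python
import pandas as pd
from io import StringIO

txt = """date,page_id,score
2001-05-23 19:50:14,1,3
2001-05-25 11:53:55,2,4
2001-05-27 17:36:37,1,5
2001-05-28 19:36:37,1,1
2001-05-28 19:36:38,3,6
2001-05-28 19:36:39,3,9
"""
df = pd.read_csv(StringIO(txt), index_col=0)

result = []
score_sum = 0
last_score = {}  # raw page_id -> most recent score
for pid, score in zip(df.page_id, df.score):
    score_sum += score - last_score.get(pid, 0)
    last_score[pid] = score
    result.append(score_sum)
print(result)  # [3, 7, 9, 5, 11, 14]
```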