How to accumulate a unique sum of columns by pandas index

I have a pandas DataFrame, df, that I created using

df = pd.read_table('sorted_df_changes.txt', index_col=0, parse_dates=True, names=['date', 'rev_id', 'score']) 

which is structured like this:

                         page_id      score
    date
    2001-05-23 19:50:14     2430   7.632989
    2001-05-25 11:53:55  1814033  18.946234
    2001-05-27 17:36:37     2115   3.398154
    2001-08-04 21:00:51      311  19.386016
    2001-08-04 21:07:42      314  14.886722

date is the index and is of type DatetimeIndex.

Each page_id can appear on one or more dates (it is not unique), and there are roughly 1 million rows. All pages together make up a document.

I need to get a score for the entire document at every date in the index, counting only the latest score for any given page_id.

Example

Data examples

                         page_id  score
    date
    2001-05-23 19:50:14        1      3
    2001-05-25 11:53:55        2      4
    2001-05-27 17:36:37        1      5
    2001-05-28 19:36:37        1      1

Solution example

                         score
    date
    2001-05-23 19:50:14      3
    2001-05-25 11:53:55      7   (3 + 4)
    2001-05-27 17:36:37      9   (5 + 4)
    2001-05-28 19:36:37      5   (1 + 4)

The entry for page_id 2 keeps counting, since it never reappears, but each time page_id 1 reappears, its new score replaces its old one.
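Put another way, the computation I'm after is roughly this (just a minimal sketch on the example data; last_scores and running_total are names I'm making up for illustration):

    # Keep only the latest score per page_id and emit the running
    # document total at every date.
    last_scores = {}        # page_id -> most recent score
    running_total = 0
    doc_scores = []         # one running total per row

    for page_id, score in [(1, 3), (2, 4), (1, 5), (1, 1)]:
        running_total += score - last_scores.get(page_id, 0)
        last_scores[page_id] = score
        doc_scores.append(running_total)

    print(doc_scores)   # [3, 7, 9, 5]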

+4
4 answers

Edit

Finally, I found a solution that does not need the loop:

 df.score.groupby(df.page_id).transform(lambda s:s.diff().combine_first(s)).cumsum() 
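The transform turns each group's scores into the first score followed by the successive differences, so the running cumsum at any row equals the sum of the latest score of every page_id seen so far. A quick check on the sample data (the small DataFrame here is built inline just for illustration):

    import pandas as pd

    df = pd.DataFrame({'page_id': [1, 2, 1, 1], 'score': [3, 4, 5, 1]},
                      index=pd.to_datetime(['2001-05-23 19:50:14',
                                            '2001-05-25 11:53:55',
                                            '2001-05-27 17:36:37',
                                            '2001-05-28 19:36:37']))

    # Per group: [first score, diff, diff, ...]; cumsum over all rows then
    # gives "sum of the latest score of each page_id seen so far".
    per_group = df.score.groupby(df.page_id).transform(
        lambda s: s.diff().combine_first(s))
    print(per_group.cumsum().tolist())   # [3.0, 7.0, 9.0, 5.0]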

I think you need a for loop to do this:

    import numpy as np
    import pandas as pd
    from StringIO import StringIO
    from itertools import izip

    txt = """date,page_id,score
    2001-05-23 19:50:14, 1,3
    2001-05-25 11:53:55, 2,4
    2001-05-27 17:36:37, 1,5
    2001-05-28 19:36:37, 1,1
    2001-05-28 19:36:38, 3,6
    2001-05-28 19:36:39, 3,9
    """
    df = pd.read_csv(StringIO(txt), index_col=0)

    def score_sum_py(page_id, scores):
        # Running document total: subtract this page_id's previous score
        # and add the new one, so only the latest score per page counts.
        score_sum = 0
        last_score = [0] * (np.max(page_id) + 1)
        result = np.empty_like(scores)
        for i, (pid, score) in enumerate(izip(page_id, scores)):
            score_sum = score_sum - last_score[pid] + score
            last_score[pid] = score
            result[i] = score_sum
        result.name = "score_sum"
        return result

    print score_sum_py(pd.factorize(df.page_id)[0], df.score)

output:

    date
    2001-05-23 19:50:14     3
    2001-05-25 11:53:55     7
    2001-05-27 17:36:37     9
    2001-05-28 19:36:37     5
    2001-05-28 19:36:38    11
    2001-05-28 19:36:39    14
    Name: score_sum

If the loop in Python is too slow, you can try converting the two arrays, page_id and scores, to Python lists first, iterating over the lists, and doing the arithmetic with native Python integers; it may be faster.
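For example, a list-based variant might look roughly like this (score_sum_list is just an illustrative name, not something from the code above):

    def score_sum_list(page_id, scores):
        # Work on plain Python lists and native ints instead of numpy scalars.
        page_id = list(page_id)
        scores = list(scores)
        last_score = [0] * (max(page_id) + 1)
        score_sum = 0
        result = []
        for pid, score in zip(page_id, scores):
            score_sum = score_sum - last_score[pid] + score
            last_score[pid] = score
            result.append(score_sum)
        return result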

If speed is important, you can also try Cython:

    %%cython
    cimport cython
    cimport numpy as np
    import numpy as np

    @cython.wraparound(False)
    @cython.boundscheck(False)
    def score_sum(np.ndarray[int] page_id, np.ndarray[long long] scores):
        cdef int i
        cdef long long score_sum, pid, score
        cdef np.ndarray[long long] last_score, result
        score_sum = 0
        last_score = np.zeros(np.max(page_id)+1, dtype=np.int64)
        result = np.empty_like(scores)
        for i in range(len(page_id)):
            pid = page_id[i]
            score = scores[i]
            score_sum = score_sum - last_score[pid] + score
            last_score[pid] = score
            result[i] = score_sum
        return result

Here I use pandas.factorize() to convert page_id to an array of integers in the range 0 to N, where N is the number of unique values in page_id. You could also use a dict to cache the last_score of each page_id, without using pandas.factorize().
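A rough sketch of that dict-based variant (score_sum_dict is an illustrative name; it keys the cache by the raw page_id values, so no factorize step is needed):

    def score_sum_dict(page_ids, scores):
        last_score = {}        # page_id -> last score seen
        score_sum = 0
        result = []
        for pid, score in zip(page_ids, scores):
            score_sum = score_sum - last_score.get(pid, 0) + score
            last_score[pid] = score
            result.append(score_sum)
        return result

    # e.g. score_sum_dict(df.page_id, df.score) -> [3, 7, 9, 5, 11, 14]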

+3

An alternative data structure simplifies this calculation. The performance won't be as good as in the other answers, but I think it's worth mentioning (mainly because it uses my favourite pandas function...):

    In [11]: x = pd.get_dummies(df['page_id']).mul(df['score'], axis=0)

    In [12]: scores = x.where(x != 0, np.nan)

    In [13]: scores
    Out[13]:
                           1    2    3
    date
    2001-05-23 19:50:14    3  NaN  NaN
    2001-05-25 11:53:55  NaN    4  NaN
    2001-05-27 17:36:37    5  NaN  NaN
    2001-05-28 19:36:37    1  NaN  NaN
    2001-05-28 19:36:38  NaN  NaN    6
    2001-05-28 19:36:39  NaN  NaN    9

    In [14]: scores.ffill()
    Out[14]:
                         1    2    3
    date
    2001-05-23 19:50:14  3  NaN  NaN
    2001-05-25 11:53:55  3    4  NaN
    2001-05-27 17:36:37  5    4  NaN
    2001-05-28 19:36:37  1    4  NaN
    2001-05-28 19:36:38  1    4    6
    2001-05-28 19:36:39  1    4    9

    In [15]: scores.ffill().sum(axis=1)
    Out[15]:
    date
    2001-05-23 19:50:14     3
    2001-05-25 11:53:55     7
    2001-05-27 17:36:37     9
    2001-05-28 19:36:37     5
    2001-05-28 19:36:38    11
    2001-05-28 19:36:39    14
+2

Is this what you want? I think it's a clumsy solution, though.

    In [164]: df['result'] = [df[:i+1].groupby('page_id').last().sum()[0] for i in range(len(df))]

    In [165]: df
    Out[165]:
                         page_id  score  result
    date
    2001-05-23 19:50:14        1      3       3
    2001-05-25 11:53:55        2      4       7
    2001-05-27 17:36:37        1      5       9
    2001-05-28 19:36:37        1      1       5
+1

Here is an interim solution that I put together using the standard library. I would still like to see an elegant, efficient solution using pandas.

    import csv
    from collections import defaultdict

    page_scores = defaultdict(lambda: 0)
    date_scores = []  # [(date, score)]

    def get_and_update_score_diff(page_id, new_score):
        diff = new_score - page_scores[page_id]
        page_scores[page_id] = new_score
        return diff

    # Note: there are some duplicate dates and the file is sorted by date.
    # Format: 2001-05-23T19:50:14Z, 2430, 7.632989
    with open('sorted_df_changes.txt') as f:
        reader = csv.reader(f, delimiter='\t')

        first = reader.next()
        date_string, page_id, score = first[0], first[1], float(first[2])
        page_scores[page_id] = score
        date_scores.append((date_string, score))

        for date_string, page_id, score in reader:
            score = float(score)
            score_diff = get_and_update_score_diff(page_id, score)
            if date_scores[-1][0] == date_string:
                date_scores[-1] = (date_string, date_scores[-1][1] + score_diff)
            else:
                date_scores.append((date_string, date_scores[-1][1] + score_diff))
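If you want to compare against the pandas answers, the (date, score) pairs can be loaded back into a Series along these lines (a sketch, assuming date_scores has been filled as above):

    import pandas as pd

    dates, totals = zip(*date_scores)
    result = pd.Series(totals, index=pd.to_datetime(dates), name='score')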
0
