- Is there a faster, more Pythonic way to do this?
- What is generating this warning, and should I be concerned about it? `UserWarning: Boolean Series key will be reindexed to match DataFrame index.`
I have a csv file with three columns: org, month, person.

| org | month      | person |
| --- | ---------- | ------ |
| 1   | 2014-01-01 | 100    |
| 1   | 2014-01-01 | 200    |
| 1   | 2014-01-02 | 200    |
| 2   | 2014-01-01 | 300    |

I read it into a `pandas.core.frame.DataFrame` with:

    data = pd.read_csv('data_base.csv', names=['month', 'org', 'person'], skiprows=1)

The ultimate goal is to compare, for each org, the intersection of persons between two consecutive periods with the set of persons in the first period. For example, for org 1 and month 2014-01-01:

    len({100, 200} & {200}) / len({100, 200}) == 0.5

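To make the intended calculation concrete, here is a minimal sketch on the sample rows from the table above, using plain Python sets rather than pandas. The names `rows`, `people`, `months_by_org`, and `retention` are made up for illustration and are not part of my actual script.

```python
from collections import defaultdict

# Sample rows from the table above: (org, month, person).
rows = [
    (1, "2014-01-01", 100),
    (1, "2014-01-01", 200),
    (1, "2014-01-02", 200),
    (2, "2014-01-01", 300),
]

# Collect the set of persons seen for each (org, month) pair.
people = defaultdict(set)
for org, month, person in rows:
    people[(org, month)].add(person)

# List each org's months in chronological order
# (ISO-formatted dates sort correctly as strings).
months_by_org = defaultdict(list)
for org, month in sorted(people):
    months_by_org[org].append(month)

# For each pair of consecutive months, compute the share of
# first-period persons who also appear in the second period.
retention = {}
for org, months in months_by_org.items():
    for m1, m2 in zip(months, months[1:]):
        s1, s2 = people[(org, m1)], people[(org, m2)]
        retention[(org, m1)] = len(s1 & s2) / len(s1)

print(retention)  # {(1, '2014-01-01'): 0.5}
```

Org 2 has only one month in the sample, so it has no consecutive pair and produces no entry.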
Edit: I got it to work with:

    import pandas as pd
    import sys

    data = pd.read_csv('data_base.csv', names=['month', 'org', 'person'], skiprows=1)
    data.sort_values(by=['org', 'month', 'person'])

    results = {}
    for _org in set(data.org):
        results[_org] = {}
        months = sorted(list(set(data[data.org == _org].month)))
        for _m1, _m2 in zip(months, months[1:]):
            _s1 = set(data[data.org == _org][data.month == _m1].person)
            _s2 = set(data[data.org == _org][data.month == _m2].person)
            results[_org][_m1] = float(len(_s1 & _s2) / len(_s1))
            print(str(_org) + '\t' + str(_m1) + '\t' + str(_m2) + '\t' + str(round(results[_org][_m1], 2)))
            sys.stdout.flush()

This produces the following output:

    UserWarning: Boolean Series key will be reindexed to match DataFrame index.
      "DataFrame index.", UserWarning
    5640    2014-01-01    2014-02-01    0.75
    5640    2014-02-01    2014-03-01    0.36
    5640    2014-03-01    2014-04-01    0.6
    ...

But it is very slow and ugly. At the current rate, my back-of-the-envelope estimate is about 22 hours for a 2-year batch of data.