How should I structure and access a data table so that I can easily compare subsets in Python 3.5?

  • Is there a faster, more pythonic way to do this?
  • What is generating the warning UserWarning: Boolean Series key will be reindexed to match DataFrame index., and should I care about it?

I have a csv file with three columns: org, month, person.

 | org | month      | person |
 | --- | ---------- | ------ |
 | 1   | 2014-01-01 | 100    |
 | 1   | 2014-01-01 | 200    |
 | 1   | 2014-01-02 | 200    |
 | 2   | 2014-01-01 | 300    |

Which I read into a pandas.core.frame.DataFrame with:

 data = pd.read_csv('data_base.csv', names=['month', 'org', 'person'], skiprows=1) 

The ultimate goal is to compare the intersection of persons between two consecutive periods with the set of persons in the first period.

 org: 1, month: 2014-01-01: len(intersection({100, 200}, {200})) / len({100, 200}) == 0.5 
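In plain Python sets, that target computation for the example row looks like this:

```python
# Persons seen in org 1 in two consecutive months
s1 = {100, 200}  # month 2014-01-01
s2 = {200}       # month 2014-01-02

# Fraction of the first month's persons who reappear in the next month
retention = len(s1 & s2) / len(s1)
print(retention)  # 0.5
```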

Edit: I got it to work with:

 import pandas as pd
 import sys

 data = pd.read_csv('data_base.csv', names=['month', 'org', 'person'], skiprows=1)
 data = data.sort_values(by=['org', 'month', 'person'])  # sort_values returns a copy

 results = {}
 for _org in set(data.org):
     results[_org] = {}
     months = sorted(set(data[data.org == _org].month))
     for _m1, _m2 in zip(months, months[1:]):
         _s1 = set(data[data.org == _org][data.month == _m1].person)
         _s2 = set(data[data.org == _org][data.month == _m2].person)
         results[_org][_m1] = len(_s1 & _s2) / len(_s1)
         print(str(_org) + '\t' + str(_m1) + '\t' + str(_m2) + '\t' +
               str(round(results[_org][_m1], 2)))
         sys.stdout.flush()

Which produces the output as follows:

 UserWarning: Boolean Series key will be reindexed to match DataFrame index.
 5640    2014-01-01    2014-02-01    0.75
 5640    2014-02-01    2014-03-01    0.36
 5640    2014-03-01    2014-04-01    0.6
 ...

But it is very slow and ugly ... at the current rate, my back-of-the-envelope estimate puts it at about 22 hours for a two-year batch of data.
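As for the warning: it comes from the chained indexing data[data.org == _org][data.month == _m1]. The second boolean mask is built against the full frame but applied to the already-filtered one, so pandas has to reindex the boolean key. Combining both conditions into a single mask avoids it (and skips one intermediate copy). A sketch on a tiny made-up frame mirroring the CSV layout:

```python
import pandas as pd

# Hypothetical miniature of the CSV contents
data = pd.DataFrame({
    'org':    [1, 1, 1, 2],
    'month':  ['2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01'],
    'person': [100, 200, 200, 300],
})

# data[data.org == 1][data.month == '2014-01-01'] applies a full-frame
# mask to a filtered frame and triggers the UserWarning; one combined
# mask does the same selection without it:
subset = data[(data.org == 1) & (data.month == '2014-01-01')]
print(set(subset.person))  # {100, 200}
```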

+8
python pandas dataframe
2 answers

Admittedly, I have never used Pandas, so this may not be idiomatic. It just uses basic Python structures.

 import collections

 org_month_dict = collections.defaultdict(set)

 # put the data into a simple, indexed data structure
 for index, row in data.iterrows():
     org_month_dict[row['org'], row['month']].add(row['person'])

 orgs = set(data.org)
 months = sorted(set(data.month))

 for org in orgs:
     for mindex in range(len(months) - 1):
         m1 = months[mindex]
         m2 = months[mindex + 1]
         # persons in common between month 1 and month 2
         print(org_month_dict[org, m1] & org_month_dict[org, m2])

This builds a "cached" lookup table in org_month_dict, indexed by organization and month, which avoids the expensive data[data.org == _org][data.month == _m1] filtering in the inner loop. It should run much faster than the original code.
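For what it's worth, pandas can build the same (org, month) → set-of-persons lookup with a single groupby, which avoids the Python-level iterrows loop entirely. A sketch on a tiny made-up frame with the same column names:

```python
import pandas as pd

# Hypothetical miniature of the data
data = pd.DataFrame({
    'org':    [1, 1, 1, 2],
    'month':  ['2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01'],
    'person': [100, 200, 200, 300],
})

# One set of persons per (org, month) pair -- the pandas equivalent
# of org_month_dict above
org_month = data.groupby(['org', 'month'])['person'].apply(set)

print(org_month.loc[1, '2014-01-01'])  # {100, 200}
# persons in common between two consecutive months of org 1
print(org_month.loc[1, '2014-01-01'] & org_month.loc[1, '2014-01-02'])  # {200}
```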

+3

I would not necessarily write off pandas here; it depends on a few things. Pandas may not be a particularly compact way to store your data, although it has automatic compression and sparse storage options that mitigate this considerably. I would expect the speed to be quite reasonable, but you really need to test it on your data to say for sure.

It offers (in my opinion) a more convenient way to store your data, and also offers convenient ways to work with dates. And when you are done, you can output the results in tabular form.

First, I'm going to expand the data a bit to better demonstrate the problems.

     org       month  person
 0     1  2014-01-01     100
 1     1  2014-01-01     200
 2     1  2014-01-02     200
 3     1  2014-01-03     300
 4     1  2014-01-03     100
 5     1  2014-01-04     200
 6     1  2014-01-04     100
 7     1  2014-01-04     300
 8     2  2014-01-01     100
 9     2  2014-01-01     200
 10    2  2014-01-02     300
 11    2  2014-01-02     400
 12    2  2014-01-03     100
 13    2  2014-01-04     200
 14    2  2014-01-04     100

Then you can do something like this:

 df['one'] = 1
 df = df.set_index(['org', 'month', 'person']).unstack('person')

 numer = ((df == df.shift(-1)) & (df.notnull())).sum(axis=1)
 denom = df.notnull().sum(axis=1)

 df['numer'] = numer
 df['denom'] = denom
 df['ratio'] = numer / denom

                    one                numer  denom     ratio
 person             100  200  300  400
 org   month
 1     2014-01-01     1    1  NaN  NaN     1      2  0.500000
       2014-01-02   NaN    1  NaN  NaN     0      1  0.000000
       2014-01-03     1  NaN    1  NaN     2      2  1.000000
       2014-01-04     1    1    1  NaN     2      3  0.666667
 2     2014-01-01     1    1  NaN  NaN     0      2  0.000000
       2014-01-02   NaN  NaN    1    1     0      2  0.000000
       2014-01-03     1  NaN  NaN  NaN     1      1  1.000000
       2014-01-04     1    1  NaN  NaN     0      2  0.000000

I am glossing over some details here, like the boundary between org 1 and org 2 (a plain shift(-1) compares org 1's last month against org 2's first month), but you could add a groupby to handle that. Similarly, there are ways to handle days where a given person does not appear at all.
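That org-boundary groupby might look like the following. This is only a sketch on a small made-up frame, not the answerer's exact code: it shifts within each org so the last month of org 1 is never compared against the first month of org 2, and computes the ratio directly rather than writing it back into the wide frame.

```python
import pandas as pd

# Hypothetical miniature of the data: org 2 has only one month
df = pd.DataFrame({
    'org':    [1, 1, 1, 2, 2],
    'month':  ['2014-01-01', '2014-01-01', '2014-01-02',
               '2014-01-01', '2014-01-01'],
    'person': [100, 200, 200, 100, 200],
})
df['one'] = 1

# Wide frame: one row per (org, month), one column per person, 1 or NaN
wide = df.set_index(['org', 'month', 'person'])['one'].unstack('person')

# Shift within each org, so rows never bleed across the org boundary
nxt = wide.groupby(level='org').shift(-1)

numer = ((wide == 1) & (nxt == 1)).sum(axis=1)  # persons in both months
denom = wide.notnull().sum(axis=1)              # persons in first month
ratio = numer / denom

print(ratio.loc[1, '2014-01-01'])  # 0.5  ({100, 200} -> {200})
print(ratio.loc[2, '2014-01-01'])  # 0.0  (no next month for org 2)
```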

+1
