Pandas - SQL equivalent

NOTE: I am looking for help on an efficient way to do this, other than a massive join followed by computing the difference between dates.

I have table1 with a country identifier and a date (with no duplicate combinations of these values), and I want to summarize information from table2 (which has country, date, cluster_x and a count variable, where cluster_x is cluster_1, cluster_2, cluster_3) so that table1 gains, for each cluster identifier, the total count from table2 for rows whose date falls within 30 days before the date in table1.

This is easy for me in SQL; how do I do it in Pandas?

    select a.date, a.country,
           sum(case when a.date - b.date between 1 and 30 then b.cluster_1 else 0 end) as cluster1,
           sum(case when a.date - b.date between 1 and 30 then b.cluster_2 else 0 end) as cluster2,
           sum(case when a.date - b.date between 1 and 30 then b.cluster_3 else 0 end) as cluster3
    from table1 a
    left outer join table2 b on a.country = b.country
    group by a.date, a.country

EDIT:

Here is a slightly modified example. Let's say this is table1, an aggregate data set with date, country, cluster, and count columns. Below it is a query data set (table2). In this case we want to sum the count field from table1 for cluster 1, cluster 2, cluster 3 (in reality there are about 100 of them) for the matching country identifier, where the date field in table1 falls within the 30 days before the query date.

So, for example, the first row of the query dataset has the date 2/2/2015 and country 1. In table1 there is only one row within the preceding 30 days, and it is for cluster 2 with a count of 2.


Here is a dump of the two tables as CSV:

    date,country,cluster,count
    2014-01-30,1,1,1
    2015-02-03,1,1,3
    2015-01-30,1,2,2
    2015-04-15,1,2,5
    2015-03-01,2,1,6
    2015-07-01,2,2,4
    2015-01-31,2,3,8
    2015-01-21,2,1,2
    2015-01-21,2,1,3

and table2:

    date,country
    2015-02-01,1
    2015-04-21,1
    2015-02-21,2
2 answers

Edit: Oops - sorry, I didn't notice before submitting that the edit was about the join. No problem, I'll leave this up since it was fun practice. Criticism is welcome.

Assuming table1 and table2 live in the same directory as this script, as "table1.csv" and "table2.csv", this should work.

I did not get the same result with a strict 30-day window; I had to extend it to 31 days to reproduce your example, but I think the spirit is here:

    import pandas as pd
    import numpy as np

    table1_path = './table1.csv'
    table2_path = './table2.csv'

    with open(table1_path) as f:
        table1 = pd.read_csv(f)
    table1.date = pd.to_datetime(table1.date)

    with open(table2_path) as f:
        table2 = pd.read_csv(f)
    table2.date = pd.to_datetime(table2.date)

    # join every query (table2) row to all table1 rows for the same country
    joined = pd.merge(table2, table1, how='outer', on=['country'])
    joined['datediff'] = joined.date_x - joined.date_y

    # keep table1 rows that fall in the window before the query date
    # (31 rather than 30 days so the sample output matches the expected result)
    filtered = joined[(joined.datediff >= np.timedelta64(1, 'D')) &
                      (joined.datediff <= np.timedelta64(31, 'D'))]

    # sum the counts per query date, country and cluster, then spread clusters into columns
    gb_date_x = filtered.groupby(['date_x', 'country', 'cluster'])
    summed = pd.DataFrame(gb_date_x['count'].sum())
    result = summed.unstack()
    result.reset_index(inplace=True)
    result.fillna(0, inplace=True)

My test output:

    ipdb> table1
                     date  country  cluster  count
    0 2014-01-30 00:00:00        1        1      1
    1 2015-02-03 00:00:00        1        1      3
    2 2015-01-30 00:00:00        1        2      2
    3 2015-04-15 00:00:00        1        2      5
    4 2015-03-01 00:00:00        2        1      6
    5 2015-07-01 00:00:00        2        2      4
    6 2015-01-31 00:00:00        2        3      8
    7 2015-01-21 00:00:00        2        1      2
    8 2015-01-21 00:00:00        2        1      3
    ipdb> table2
                     date  country
    0 2015-02-01 00:00:00        1
    1 2015-04-21 00:00:00        1
    2 2015-02-21 00:00:00        2

...

    ipdb> result
                    date_x  country count
    cluster                             1  2  3
    0  2015-02-01 00:00:00        1    0  2  0
    1  2015-02-21 00:00:00        2    5  0  8
    2  2015-04-21 00:00:00        1    0  5  0

UPDATE:

I don't think it makes sense to use pandas to process data that cannot fit into your memory. There are tricks for dealing with this, but it is painful.
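
One such trick, as a minimal sketch only: assuming the large aggregate table lives in ./table1.csv while the query table (./table2.csv) fits in memory, you can stream the big file in chunks and accumulate partial sums. The paths, column suffixes, and chunk size here are illustrative assumptions, not from the original post.

    import pandas as pd

    table2 = pd.read_csv('table2.csv', parse_dates=['date'])  # small query table: date, country

    partials = []
    for chunk in pd.read_csv('table1.csv', parse_dates=['date'], chunksize=100000):
        # join the chunk to the query rows for the same country
        merged = table2.merge(chunk, on='country', suffixes=('_q', '_agg'))
        # keep aggregate rows dated 1-30 days before the query date
        window = (merged.date_q - merged.date_agg).dt.days.between(1, 30)
        partials.append(
            merged[window].groupby(['date_q', 'country', 'cluster'])['count'].sum()
        )

    # combine the per-chunk partial sums and spread clusters into columns
    result = pd.concat(partials).groupby(level=[0, 1, 2]).sum().unstack(fill_value=0)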

If you want to process your data efficiently, you should use a suitable tool for this.

I would recommend taking a closer look at Apache Spark SQL, where you can process data distributed across multiple cluster nodes, with far more memory, processing power, and IO available than a single-machine pandas approach can offer.
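
For illustration only, here is a minimal PySpark sketch of the same join-and-aggregate; the file paths, view names, and schema follow the edited example above and are assumptions, not code from the original post:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('cluster_rollup').getOrCreate()

    # register the two CSV dumps as temporary views
    spark.read.csv('table1.csv', header=True, inferSchema=True) \
         .createOrReplaceTempView('table1')   # aggregate data: date, country, cluster, count
    spark.read.csv('table2.csv', header=True, inferSchema=True) \
         .createOrReplaceTempView('table2')   # query dates: date, country

    # the question's SQL carries over almost unchanged to Spark SQL
    result = spark.sql("""
        select a.date, a.country, b.cluster,
               sum(case when datediff(a.date, b.date) between 1 and 30
                        then b.count else 0 end) as total
        from table2 a
        left outer join table1 b on a.country = b.country
        group by a.date, a.country, b.cluster
    """)
    result.show()

Spark distributes both the join and the aggregation across the cluster, so the data never has to fit on a single machine.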

Alternatively, you could use a DBMS such as Oracle DB (very expensive, especially the software licenses, and its free version is full of restrictions) or free alternatives such as PostgreSQL (I cannot say much about it, for lack of experience) or MySQL (not as powerful as Oracle; for example, there is no built-in or transparent solution for dynamic pivoting, which you will most likely want here, etc.).

OLD answer:

You can do it this way (explanations are in the comments in the code):

    import pandas as pd
    import numpy as np

    #
    # <setup>
    #
    dates1 = pd.date_range('2016-03-15', '2016-04-15')
    dates2 = ['2016-02-01', '2016-05-01', '2016-04-01', '2015-01-01', '2016-03-20']
    dates2 = [pd.to_datetime(d) for d in dates2]
    countries = ['c1', 'c2', 'c3']

    t1 = pd.DataFrame({
        'date': dates1,
        'country': np.random.choice(countries, len(dates1)),
        'cluster': np.random.randint(1, 4, len(dates1)),
        'count': np.random.randint(1, 10, len(dates1))
    })

    t2 = pd.DataFrame({'date': np.random.choice(dates2, 10),
                       'country': np.random.choice(countries, 10)})
    #
    # </setup>
    #

    # merge two DFs by `country`
    merged = pd.merge(t1.rename(columns={'date': 'date1'}), t2, on='country')

    # filter dates and drop 'date1' column
    merged = merged[(merged.date <= merged.date1 + pd.Timedelta('30days'))
                    &
                    (merged.date >= merged.date1)].drop(['date1'], axis=1)

    # group `merged` DF by ['country', 'date', 'cluster'],
    # sum up `count` for overlapping dates,
    # reset the index,
    # pivot: convert `cluster` values to columns,
    #        taking sum of `count` as values,
    #        NaN will be replaced with zeroes
    # and finally reset the index
    r = merged.groupby(['country', 'date', 'cluster'])\
        .sum()\
        .reset_index()\
        .pivot_table(index=['country', 'date'],
                     columns='cluster',
                     values='count',
                     aggfunc='sum',
                     fill_value=0)\
        .reset_index()

    # rename numeric columns to: 'cluster_N'
    rename_cluster_cols = {x: 'cluster_{0}'.format(x) for x in t1.cluster.unique()}
    r = r.rename(columns=rename_cluster_cols)

Output (for my datasets):

    In [124]: r
    Out[124]:
    cluster country       date  cluster_1  cluster_2  cluster_3
    0            c1 2016-04-01          8          0         11
    1            c2 2016-04-01          0         34         22
    2            c3 2016-05-01          4         18         36
