I wrote a program that reads a couple of CSV files (they are not large, a few thousand lines each), does some data cleaning and wrangling, and ends up with data frames like the following (the data is fake, for illustration purposes only):
import pandas as pd

data = [[112233, 'Rob', 99], [445566, 'John', 88]]
managers = pd.DataFrame(data)
managers.columns = ['ManagerId', 'ManagerName', 'ShopId']
print(managers)

   ManagerId ManagerName  ShopId
0     112233         Rob      99
1     445566        John      88

data = [[99, 'Shop1'], [88, 'Shop2']]
shops = pd.DataFrame(data)
shops.columns = ['ShopId', 'ShopName']
print(shops)

   ShopId ShopName
0      99    Shop1
1      88    Shop2

data = [[99, 2000, 3000, 4000], [88, 2500, 3500, 4500]]
sales = pd.DataFrame(data)
sales.columns = ['ShopId', 'Year2010', 'Year2011', 'Year2012']
print(sales)

   ShopId  Year2010  Year2011  Year2012
0      99      2000      3000      4000
1      88      2500      3500      4500
I then use the xlsxwriter and reportlab Python packages to build custom Excel worksheets and .pdf reports, iterating over the data frames. Everything looks great, and all of these packages work really well.
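For context, the Excel side boils down to something like this (a minimal sketch, assuming a pandas version where ExcelWriter.save() is still available; the file and sheet names are made up):

# Hypothetical sketch: dump each frame to its own worksheet
# via pandas' ExcelWriter with the xlsxwriter engine.
writer = pd.ExcelWriter('report.xlsx', engine='xlsxwriter')
managers.to_excel(writer, sheet_name='Managers', index=False)
shops.to_excel(writer, sheet_name='Shops', index=False)
sales.to_excel(writer, sheet_name='Sales', index=False)
writer.save()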
My concern is that my code is becoming difficult to maintain, because I need to access the same rows of data in several different places.
Let's say I need to get the names of the managers responsible for shops that had sales of over 1,500 in 2010. My code is full of calls like this:
managers[managers['ShopId'].isin(
    sales[sales['Year2010'] > 1500]['ShopId'])]['ManagerName'].values

>>> array(['Rob', 'John'], dtype=object)
It's hard to understand what is happening when reading this line of code. I could introduce intermediate variables, but that adds a few lines of code (see the sketch below).
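For example, the same query with intermediate variables might look like this (a sketch; the variable names are mine):

# ShopIds of shops that sold more than 1,500 in 2010
shops_over_1500 = sales[sales['Year2010'] > 1500]['ShopId']

# Managers responsible for those shops
is_responsible = managers['ShopId'].isin(shops_over_1500)
manager_names = managers[is_responsible]['ManagerName'].values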
How common is it to sacrifice database normalization and merge everything into a single data frame for the sake of more convenient code? Obviously a single data frame has drawbacks: it can get unwieldy when other data frames need to be merged in later, and the merge introduces data redundancy, since the same manager can be assigned to several shops.
df = (managers.merge(sales, how='left', on='ShopId')
              .merge(shops, how='left', on='ShopId'))
print(df)

   ManagerId ManagerName  ShopId  Year2010  Year2011  Year2012 ShopName
0     112233         Rob      99      2000      3000      4000    Shop1
1     445566        John      88      2500      3500      4500    Shop2
At least the call becomes shorter:
df[df['Year2010'] > 1500]['ManagerName'].values

>>> array(['Rob', 'John'], dtype=object)
Maybe pandas is the wrong tool for this kind of work?
The C# developers in the office frowned at me and suggested I use classes, but then I would end up with a bunch of methods like get_manager_sales(managerid) and so on. Iterating over class instances for reports also sounds unpleasant, since I would need to implement sorting and indexing myself (which I get for free with pandas).
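For illustration, the class-based design they have in mind would look roughly like this (a hypothetical sketch; all names are made up):

class Shop(object):
    def __init__(self, shop_id, name, sales_by_year):
        self.shop_id = shop_id
        self.name = name
        self.sales_by_year = sales_by_year  # e.g. {2010: 2000, 2011: 3000}

class Manager(object):
    def __init__(self, manager_id, name, shops):
        self.manager_id = manager_id
        self.name = name
        self.shops = shops  # list of Shop instances

    def get_manager_sales(self, year):
        # Sum one year's sales across this manager's shops
        return sum(shop.sales_by_year[year] for shop in self.shops)

Every ad-hoc question then becomes another hand-written loop, e.g. the 2010 query from above (assuming a managers_list of Manager instances):

[m.name for m in managers_list
 if any(s.sales_by_year[2010] > 1500 for s in m.shops)]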
Plain dictionaries would work, but they make it difficult to modify existing data, perform merges, and so on. The syntax is not much better either.
data_dict = df.to_dict('records')

[{'ManagerId': 112233L, 'ManagerName': 'Rob', 'ShopId': 99L,
  'ShopName': 'Shop1', 'Year2010': 2000L, 'Year2011': 3000L,
  'Year2012': 4000L},
 {'ManagerId': 445566L, 'ManagerName': 'John', 'ShopId': 88L,
  'ShopName': 'Shop2', 'Year2010': 2500L, 'Year2011': 3500L,
  'Year2012': 4500L}]
Getting the names of the managers responsible for shops that sold more than 1,500 in 2010:
[row['ManagerName'] for row in data_dict if row['Year2010'] > 1500]

>>> ['Rob', 'John']
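To back up the point about merges: reproducing even the simple left join from above by hand over record lists is noticeably clumsier (a sketch using the frames defined earlier):

# Hand-rolled left join of shop records onto manager records,
# keyed by ShopId -- what a single merge() call did above.
shops_by_id = {s['ShopId']: s for s in shops.to_dict('records')}
merged = []
for m in managers.to_dict('records'):
    row = dict(m)
    row.update(shops_by_id.get(m['ShopId'], {}))
    merged.append(row)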
In this particular case, with the data I'm working with, should I go all-in on pandas, or is there another way to write cleaner code while still using the power of pandas?