Choosing between pandas, OOP classes, and dicts (Python)

I wrote a program that reads a couple of CSV files (they are not large, several thousand lines each), does some data cleaning and wrangling, and ends up with the following final structure for each .csv file (the fake data is for illustration purposes only):

    import pandas as pd

    data = [[112233, 'Rob', 99], [445566, 'John', 88]]
    managers = pd.DataFrame(data)
    managers.columns = ['ManagerId', 'ManagerName', 'ShopId']
    print(managers)

       ManagerId ManagerName  ShopId
    0     112233         Rob      99
    1     445566        John      88

    data = [[99, 'Shop1'], [88, 'Shop2']]
    shops = pd.DataFrame(data)
    shops.columns = ['ShopId', 'ShopName']
    print(shops)

       ShopId ShopName
    0      99    Shop1
    1      88    Shop2

    data = [[99, 2000, 3000, 4000], [88, 2500, 3500, 4500]]
    sales = pd.DataFrame(data)
    sales.columns = ['ShopId', 'Year2010', 'Year2011', 'Year2012']
    print(sales)

       ShopId  Year2010  Year2011  Year2012
    0      99      2000      3000      4000
    1      88      2500      3500      4500

Then I use the xlsxwriter and reportlab Python packages to create custom Excel worksheets and .pdf reports, iterating over the data frames. Everything looks great, and all of these packages work really well.

My concern is that my code is becoming difficult to maintain, because I need to access the same rows of data in several different places.

Let's say I need to get the names of the managers who are responsible for shops that had sales of over 1,500 in 2010. My code is filled with calls like this:

    managers[managers['ShopId'].isin(
        sales[sales['Year2010'] > 1500]['ShopId'])]['ManagerName'].values
    >>> array(['Rob', 'John'], dtype=object)

I find it hard to understand what is happening when reading this line of code. I could create some intermediate variables, but that adds a few lines of code.
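For example, the intermediate-variable version might look something like this (the variable names are only illustrative):

    shops_over_1500 = sales[sales['Year2010'] > 1500]
    shop_ids = shops_over_1500['ShopId']
    is_responsible = managers['ShopId'].isin(shop_ids)
    manager_names = managers[is_responsible]['ManagerName'].values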

How common is it to sacrifice database-normalization principles and merge everything into a single data frame just to get more convenient code? Obviously a single data frame has drawbacks: the schema can become confusing when other data frames need to be merged in later, and the merge introduces data redundancy, since the same manager can be assigned to several shops.

    df = (managers.merge(sales, how='left', on='ShopId')
                  .merge(shops, how='left', on='ShopId'))
    print(df)

       ManagerId ManagerName  ShopId  Year2010  Year2011  Year2012 ShopName
    0     112233         Rob      99      2000      3000      4000    Shop1
    1     445566        John      88      2500      3500      4500    Shop2

At least this call gets shorter:

    df[df['Year2010'] > 1500]['ManagerName'].values
    >>> array(['Rob', 'John'], dtype=object)

Maybe pandas is the wrong tool for this kind of work?

The C# developers in the office frown at me and say I should use classes, but then I would end up with a bunch of methods like get_manager_sales(manager_id), etc. Iterating over class instances for the reports also sounds unpleasant, since I would need to implement sorting and indexing myself (which I get for free with pandas).
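For the record, here is roughly what I imagine that class-based approach would look like (everything below is hypothetical, not code I have actually written):

    class Manager(object):
        def __init__(self, manager_id, name, shop_id):
            self.manager_id = manager_id
            self.name = name
            self.shop_id = shop_id

    class SalesRegistry(object):
        def __init__(self, managers, sales_by_shop):
            self.managers = managers            # list of Manager instances
            self.sales_by_shop = sales_by_shop  # dict: shop_id -> {year: amount}

        def get_manager_sales(self, manager_id):
            # a manual lookup that pandas indexing gives me for free
            manager = next(m for m in self.managers
                           if m.manager_id == manager_id)
            return self.sales_by_shop[manager.shop_id]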

Dictionaries would work, but they also make it difficult to modify existing data, perform merges, and so on. The syntax is not much better either.

    data_dict = df.to_dict('records')

    [{'ManagerId': 112233, 'ManagerName': 'Rob', 'ShopId': 99,
      'ShopName': 'Shop1', 'Year2010': 2000, 'Year2011': 3000,
      'Year2012': 4000},
     {'ManagerId': 445566, 'ManagerName': 'John', 'ShopId': 88,
      'ShopName': 'Shop2', 'Year2010': 2500, 'Year2011': 3500,
      'Year2012': 4500}]

Getting the names of the managers responsible for shops that sold more than 1,500 in 2010:

    [row['ManagerName'] for row in data_dict if row['Year2010'] > 1500]
    >>> ['Rob', 'John']

In this particular case, with the data I'm working with, should I go all in on pandas, or is there another way to write cleaner code that still uses the power of pandas?

Tags: python, dictionary, oop, pandas, class
2 answers

I would choose pandas: it is much faster, it has a great and extremely rich API, and the resulting code usually looks much cleaner and reads better.

By the way, the following line can be easily rewritten:

    managers[managers['ShopId'].isin(
        sales[sales['Year2010'] > 1500]['ShopId'])]['ManagerName'].values

as:

    shop_ids = sales.loc[sales['Year2010'] > 1500, 'ShopId']
    managers.query('ShopId in @shop_ids')['ManagerName'].values

IMO this is pretty easy to read and understand.

PS: you can also store your data in a SQL database and query it with SQL, or store it in an HDF store and use the where parameter; in both cases you can index the columns you search on.
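For example, a minimal sketch of the HDF variant with the question's sales frame (the file and key names are just placeholders; writing with format='table' and data_columns is what enables the where parameter, and it requires PyTables):

    import pandas as pd

    # write the frame as a queryable table, indexing the column we filter on
    sales.to_hdf('store.h5', key='sales', format='table',
                 data_columns=['Year2010'])

    # read back only the rows matching the condition
    print(pd.read_hdf('store.h5', key='sales', where='Year2010 > 1500'))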


Creating classes that wrap data frames is not a good idea, because it hides the fact that you are using a data frame and opens the way to very bad decisions (for example, iterating over a data frame with a for loop).

Solution 1. Denormalize the data. You do not need to keep your data in normal form. Normal form is preferable when you have to keep records consistent across an entire database. This is not a database: you are not doing constant inserts, updates, and deletes. So just denormalize it and work with one large data frame, since that is clearly more convenient and better suited to your needs.

Solution 2. Use a database. You can dump your data into an SQLite database (pandas has a built-in function for this) and run all kinds of queries on it. In my personal experience, SQL queries are much more readable than the ones you posted. If you do this analysis regularly and the data structure stays the same, this may be the preferred solution. You can dump the data into the db and then use SQLAlchemy to work with it.
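A minimal sketch reusing the managers and sales frames from the question (the database file and table names are just placeholders):

    import sqlite3

    import pandas as pd

    conn = sqlite3.connect('reports.db')
    managers.to_sql('managers', conn, index=False, if_exists='replace')
    sales.to_sql('sales', conn, index=False, if_exists='replace')

    query = """
        SELECT m.ManagerName
        FROM managers AS m
        JOIN sales AS s ON s.ShopId = m.ShopId
        WHERE s.Year2010 > 1500
    """
    print(pd.read_sql_query(query, conn)['ManagerName'].values)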

Solution 3. Create your own data frame. You can inherit from pandas.DataFrame and add specialized methods to it. You will need to dig into the guts of pandas to see how to implement these methods properly. This way you can create, for example, custom accessor methods for certain parts of the data.
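A bare-bones sketch of that idea; the method name is mine, and it assumes the denormalized layout from the question:

    import pandas as pd

    class ReportFrame(pd.DataFrame):
        # keep the subclass through pandas operations that return new frames
        @property
        def _constructor(self):
            return ReportFrame

        def managers_with_sales_over(self, year, threshold):
            # assumes columns like 'Year2010' and 'ManagerName'
            return self[self['Year%d' % year] > threshold]['ManagerName'].values

    # usage: ReportFrame(df).managers_with_sales_over(2010, 1500)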

If you do not know pandas really well, I would go for solution 1 or 2. If you need more flexibility and the data manipulations are different each time, use 1. If you perform roughly the same analysis every time, use 2 (especially if your data analysis code is part of a larger application).

Also, I don't understand why “adding more lines of code” is bad. By breaking a huge one-liner into several statements, you do not increase the actual complexity, and you reduce the perceived complexity. Maybe all you need to do is refactor your code and extract some operations into reusable functions?
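For example (the function names are only illustrative):

    def shops_with_sales_over(sales, year, threshold):
        """ShopIds of the shops that sold more than `threshold` in `year`."""
        return sales.loc[sales['Year%d' % year] > threshold, 'ShopId']

    def names_of_managers(managers, shop_ids):
        """Names of the managers responsible for the given shops."""
        return managers[managers['ShopId'].isin(shop_ids)]['ManagerName'].values

    names_of_managers(managers, shops_with_sales_over(sales, 2010, 1500))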

