Import financial data into Python Pandas using read_csv

I have a .csv with the following structure:

    date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
    1/1/2008,1/31/2008,Name_1,1001,"123,456","$7,890,123.45","67,890"
    2/1/2008,2/29/2008,Name_1,1001,"43,210","$987,654.32","109,876"
    3/1/2008,3/31/2008,Name_1,1001,"485,079","$1,265,789,433.98","777,888"
    ...
    12/1/2008,12/31/2008,Name_1,1001,"87,543","$432,098,987","87,987"
    1/1/2008,1/31/2008,Name_2,1002,"268,456","$890,123.45","97,890"
    2/1/2008,2/29/2008,Name_2,1002,"53,210","$987,654.32","109,876"
    ... etc

I am trying to read it into pandas using the following code:

    import pandas as pd

    data = pd.read_csv('my_awesome_csv.csv', parse_dates=[[0, 1]],
                       infer_datetime_format=True)

This works fine, except that I would like to control the data type in each column. When I run the following code in the interpreter, I find that the numbers in quotation marks are not recognized as numbers, nor dollars, nor others.

    In [10]: data.dtypes
    Out[10]:
    date_begin_date_end    object
    name                   object
    name_code               int64
    active_accounts        object  # Problem, I want this to be a number
    transaction_amount     object  # Ditto, I want this to be a number (it's a dollar amount)
    transaction_count      object  # Still a number!
    dtype: object

I combed through the pandas read_csv documentation, but did not find a way to declare types for amounts that are stored in the csv as strings with commas and dollar signs. My ultimate goal here is to do some arithmetic operations on the values in these columns.

Any thoughts?

1 answer

You can use vectorized string methods to parse these columns after calling read_csv:

    import pandas as pd
    import decimal

    D = decimal.Decimal

    data = pd.read_csv('data', parse_dates=[[0, 1]], infer_datetime_format=True)

    # Drop the thousands separators, then cast to int.
    for col in ('active_accounts', 'transaction_count'):
        data[col] = data[col].str.replace(',', '').astype(int)

    # Strip everything except digits, signs and the decimal point,
    # then convert each remaining string to a Decimal.
    data['transaction_amount'] = (data['transaction_amount']
                                  .str.replace(r'[^-+\d.]', '', regex=True)
                                  .apply(D))

    print(data.dtypes)
    # date_begin_date_end    object
    # name                   object
    # name_code               int64
    # active_accounts         int64
    # transaction_amount     object
    # transaction_count       int64
    # dtype: object
    print(data)

gives

       date_begin_date_end    name  name_code  active_accounts  \
    0   1/1/2008 1/31/2008  Name_1       1001           123456
    1   2/1/2008 2/29/2008  Name_1       1001            43210
    2   3/1/2008 3/31/2008  Name_1       1001           485079
    3  12/1/2008 12/31/2008 Name_1       1001            87543
    4   1/1/2008 1/31/2008  Name_2       1002           268456
    5   2/1/2008 2/29/2008  Name_2       1002            53210

      transaction_amount  transaction_count
    0         7890123.45              67890
    1          987654.32             109876
    2      1265789433.98             777888
    3          432098987              87987
    4          890123.45              97890
    5          987654.32             109876
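As an aside (not in the original answer): for the purely numeric columns, read_csv can strip the thousands separators itself via its thousands parameter, so only the dollar column would need post-processing. A minimal sketch with inline sample data:

```python
import io
import pandas as pd

# Inline sample matching the question's layout (dollar column omitted,
# since thousands=',' alone cannot strip the '$' sign).
csv_text = """date_begin,date_end,name_code,active_accounts,transaction_count
1/1/2008,1/31/2008,1001,"123,456","67,890"
2/1/2008,2/29/2008,1001,"43,210","109,876"
"""

# thousands=',' makes pandas parse "123,456" as the integer 123456.
df = pd.read_csv(io.StringIO(csv_text), thousands=',')
print(df.dtypes)
print(df.loc[0, 'active_accounts'])  # 123456
```

This keeps the count columns as int64 straight out of the parser; the transaction_amount column would still need the string-stripping shown above.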

PS. read_csv has a converters parameter to which you can pass a function that parses the problem columns. These functions are called once per row, so if you have many rows this means many Python function calls. Processing the columns with vectorized string methods, as shown above, should be much faster.

    import pandas as pd
    import re
    import decimal

    D = decimal.Decimal

    def make_parser(cls):
        def parse_commas(text):
            # Strip everything except digits, signs and the decimal point.
            return cls(re.sub(r'[^-+\d.]', '', text))
        return parse_commas

    to_int = make_parser(int)
    to_decimal = make_parser(D)

    data = pd.read_csv('data', parse_dates=[[0, 1]], infer_datetime_format=True,
                       converters={4: to_int, 5: to_decimal, 6: to_int})
    print(data)

gives

       date_begin_date_end    name  name_code  active_accounts  \
    0   1/1/2008 1/31/2008  Name_1       1001           123456
    1   2/1/2008 2/29/2008  Name_1       1001            43210
    2   3/1/2008 3/31/2008  Name_1       1001           485079
    3  12/1/2008 12/31/2008 Name_1       1001            87543
    4   1/1/2008 1/31/2008  Name_2       1002           268456
    5   2/1/2008 2/29/2008  Name_2       1002            53210

      transaction_amount  transaction_count
    0         7890123.45              67890
    1          987654.32             109876
    2      1265789433.98             777888
    3          432098987              87987
    4          890123.45              97890
    5          987654.32             109876

and the values in the transaction_amount column are decimal.Decimal objects:

    In [64]: data.loc[0, 'transaction_amount']
    Out[64]: Decimal('7890123.45')
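To follow up on the questioner's stated goal of doing arithmetic on these columns: a column of Decimals stays at object dtype, but elementwise arithmetic and reductions such as sum still work and remain exact. A small standalone sketch (not using the answer's file):

```python
import decimal
import pandas as pd

D = decimal.Decimal

# An object-dtype Series of exact Decimal amounts, as produced above.
amounts = pd.Series([D('7890123.45'), D('987654.32'), D('0.01')])

total = amounts.sum()   # Decimal('8877777.78'), exact to the cent
doubled = amounts * 2   # elementwise; each value is still a Decimal
print(total, doubled[2])
```

Because Decimal arithmetic is done in Python object space rather than in NumPy, it is slower than float64, but it avoids binary floating-point rounding on monetary values.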
