Pandas read_csv out of memory

I am trying to manipulate a large CSV file using Pandas. When I write this:

df = pd.read_csv(strFileName,sep='\t',delimiter='\t') 

it raises "pandas.parser.CParserError: Error tokenizing data. C error: out of memory". wc -l indicates that there are 13822117 lines. I need to aggregate over this data frame; is there any way to handle this other than splitting the CSV into several files and writing code to combine the results? Any suggestions on how to do this? Thanks.

The input is as follows:

 columns=[ka, kb_1, kb_2, timeofEvent, timeInterval]
 0: '3M' '2345' '2345' '2014-10-5',  3000
 1: '3M' '2958' '2152' '2015-3-22',  5000
 2: 'GE' '2183' '2183' '2012-12-31', 515
 3: '3M' '2958' '2958' '2015-3-10',  395
 4: 'GE' '2183' '2285' '2015-4-19',  1925
 5: 'GE' '2598' '2598' '2015-3-17',  1915

And the desired output looks like this:

 columns=[ka, kb, errorNum, errorRate, totalNum of records]
 '3M', '2345', 0, 0%,  1
 '3M', '2958', 1, 50%, 2
 'GE', '2183', 1, 50%, 2
 'GE', '2598', 0, 0%,  1

If the data set were small, the code below could be used (as provided by another answer):

 df2 = df.groupby(['ka','kb_1'])['isError'].agg({'errorNum': 'sum', 'recordNum': 'count'})
 df2['errorRate'] = df2['errorNum'] / df2['recordNum']

          recordNum  errorNum  errorRate
 ka kb_1
 3M 2345          1         0        0.0
    2958          2         1        0.5
 GE 2183          2         1        0.5
    2598          1         0        0.0

(Error record definition: when kb_1 != kb_2, the corresponding record is considered an error record.)
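The isError column used in the snippet above is not part of the input; under this definition it could be derived, for example, like this (a sketch, assuming df already holds the columns shown above):

 # hypothetical derivation of the error flag used in the groupby snippet above:
 # a record counts as an error when kb_1 != kb_2
 df['isError'] = (df['kb_1'] != df['kb_2']).astype(int)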

3 answers

Based on your snippet, you can avoid the memory error by reading the csv file line by line instead of loading it all at once.

I assume kb_2 is the error indicator:

 groups = {}
 with open("data/petaJoined.csv", "r") as large_file:
     for line in large_file:
         # assuming this structure: ka, kb_1, kb_2, timeofEvent, timeInterval
         arr = line.split('\t')
         k = arr[0] + ',' + arr[1]
         if k not in groups:
             groups[k] = {'record_count': 0, 'error_sum': 0}
         groups[k]['record_count'] = groups[k]['record_count'] + 1
         groups[k]['error_sum'] = groups[k]['error_sum'] + float(arr[2])

 for k, v in groups.items():
     print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))

This piece of code stores all the groups in a dictionary and calculates the error rate for each group after reading the entire file.

It will hit an out-of-memory exception only if there are too many combinations of groups.
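If the error definition from the question is used instead (a record is an error when kb_1 != kb_2), the same streaming approach could be adapted as in the sketch below. This is only a sketch: it assumes a tab-separated file with a header row, and reuses the hypothetical path from the code above.

 groups = {}
 with open("data/petaJoined.csv", "r") as large_file:
     next(large_file)  # skip the header row, if the file has one
     for line in large_file:
         # structure from the question: ka, kb_1, kb_2, timeofEvent, timeInterval
         ka, kb_1, kb_2, timeofEvent, timeInterval = line.rstrip('\n').split('\t')
         k = (ka, kb_1)
         if k not in groups:
             groups[k] = {'record_count': 0, 'error_sum': 0}
         groups[k]['record_count'] += 1
         groups[k]['error_sum'] += int(kb_1 != kb_2)  # 1 for an error record, 0 otherwise

 for k, v in groups.items():
     print('{}: {}'.format(k, v['error_sum'] / v['record_count']))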


You did not specify what your intended aggregation will be, but if it is simply a sum and a count, you can aggregate in chunks:

 import pandas as pd

 dfs = pd.DataFrame()
 reader = pd.read_table(strFileName, chunksize=16*1024)  # choose a chunk size as appropriate
 for chunk in reader:
     temp = chunk.agg(...)    # your logic here
     dfs = dfs.append(temp)   # append returns a new frame; it is not in-place
 df = dfs.agg(...)            # redo your logic here
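Applied to this question, where the aggregation per (ka, kb_1) group is a sum of error flags and a count of records, a chunked version might look like the sketch below. This is an illustrative sketch, not the answer's exact code: strFileName is the tab-separated file from the question, a header row is assumed, and the per-chunk sums and counts are simply re-summed at the end.

 import pandas as pd

 pieces = []
 for chunk in pd.read_csv(strFileName, sep='\t', chunksize=16 * 1024):
     # error flag: kb_1 != kb_2, as defined in the question
     chunk['isError'] = (chunk['kb_1'] != chunk['kb_2']).astype(int)
     # per-chunk partial sums and counts are safe to combine later
     pieces.append(chunk.groupby(['ka', 'kb_1'])['isError'].agg(['sum', 'count']))

 partial = pd.concat(pieces)
 df2 = partial.groupby(level=['ka', 'kb_1']).sum()  # re-aggregate: sum of sums, sum of counts
 df2.columns = ['errorNum', 'recordNum']
 df2['errorRate'] = df2['errorNum'] / df2['recordNum']
 print(df2)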

What @chrisaycock suggested is the preferred method if you need a sum or a count.

If you need an average, this will not work: avg(a, b, c) is not, in general, equal to avg(avg(a, b), avg(c)), so averaging the chunk averages is only correct when every chunk has the same size.
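A quick numeric illustration (not part of the original answer):

 a, b, c = 1.0, 2.0, 6.0
 print((a + b + c) / 3)        # 3.0  -- the true average of all values
 print(((a + b) / 2 + c) / 2)  # 3.75 -- the average of two unequal-sized chunk averages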

I suggest using a map-reduce-like approach with streaming.

create a file called map-col.py

 import sys

 col = 4  # index of the column to extract; e.g. 4 is timeInterval in the question's layout
 for line in sys.stdin:
     print(line.rstrip('\n').split('\t')[col])

And a file called reduce-avg.py

 import sys

 s = 0
 n = 0
 for line in sys.stdin:
     s = s + float(line)
     n = n + 1
 print(s / n)

And in order to run it all:

 cat strFileName | python map-col.py | python reduce-avg.py > output.txt

This method will work regardless of the file size and will not run out of memory.

