I am trying to manipulate a large CSV file using pandas. When I wrote this:

    df = pd.read_csv(strFileName, sep='\t')

it raised:

    pandas.parser.CParserError: Error tokenizing data. C error: out of memory

wc -l indicates that there are 13,822,117 lines. I need to aggregate the CSV file into this data frame. Is there any way to handle this other than splitting the CSV into multiple files and writing code to combine the results? Any suggestions on how to do this? Thanks.
The input is as follows:
columns = [ka, kb_1, kb_2, timeofEvent, timeInterval]

       ka  kb_1  kb_2  timeofEvent  timeInterval
    0  3M  2345  2345  2014-10-5    3000
    1  3M  2958  2152  2015-3-22    5000
    2  GE  2183  2183  2012-12-31   515
    3  3M  2958  2958  2015-3-10    395
    4  GE  2183  2285  2015-4-19    1925
    5  GE  2598  2598  2015-3-17    1915
And the desired output looks like this:
columns = [ka, kb, errorNum, errorRate, totalNum of records]

    ka    kb    errorNum  errorRate  totalNum
    3M    2345  0         0%         1
    3M    2958  1         50%        2
    GE    2183  1         50%        2
    GE    2598  0         0%         1
If the data set were small, the code below (provided in another answer) could be used:

    df2 = df.groupby(['ka', 'kb_1'])['isError'].agg(errorNum='sum', recordNum='count')
    df2['errorRate'] = df2['errorNum'] / df2['recordNum']

             recordNum  errorNum  errorRate
    ka  kb_1
    3M  2345         1         0        0.0
        2958         2         1        0.5
    GE  2183         2         1        0.5
        2598         1         0        0.0
(Error record definition: when kb_1 != kb_2, the corresponding record is considered an abnormal record.)
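For context, one common way to avoid loading all 13.8 million rows at once is to pass `chunksize` to `pd.read_csv` and combine per-chunk partial sums and counts at the end. The sketch below assumes the column names from the sample data (`ka`, `kb_1`, `kb_2`) and a tab-separated file at `path`; it is only an illustration of the chunked-aggregation idea, not a tested answer for the exact file in question.

```python
import pandas as pd

def aggregate_errors(path, chunksize=1_000_000):
    """Chunked version of the small-data groupby: sums and counts are
    computed per chunk, then combined, so the full file never needs to
    fit in memory at once."""
    partials = []
    for chunk in pd.read_csv(path, sep='\t', chunksize=chunksize):
        # A record is an error when kb_1 != kb_2 (per the definition above).
        chunk['isError'] = (chunk['kb_1'] != chunk['kb_2']).astype(int)
        partials.append(
            chunk.groupby(['ka', 'kb_1'])['isError'].agg(['sum', 'count'])
        )
    # Partial sums of sums (and of counts) are still correct totals,
    # so a second groupby over the concatenated partials finishes the job.
    combined = pd.concat(partials).groupby(level=['ka', 'kb_1']).sum()
    combined.columns = ['errorNum', 'recordNum']
    combined['errorRate'] = combined['errorNum'] / combined['recordNum']
    return combined
```

This works because `sum` and `count` are both decomposable: combining the per-chunk results gives the same totals as a single pass. A non-decomposable statistic (e.g. a median) would need a different approach.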