I am processing a 2.5 GB CSV file. The table looks like this:
```
columns = [ka, kb_1, kb_2, timeofEvent, timeInterval]
0: '3M' '2345' '2345' '2014-10-5'   3000
1: '3M' '2958' '2152' '2015-3-22'   5000
2: 'GE' '2183' '2183' '2012-12-31'  515
3: '3M' '2958' '2958' '2015-3-10'   395
4: 'GE' '2183' '2285' '2015-4-19'   1925
5: 'GE' '2598' '2598' '2015-3-17'   1915
```
And I want to group by ka and kb_1 to get a result like this:
```
columns = [ka, kb, errorNum, errorRate, totalNum of records]
'3M', '2345', 0, 0%,  1
'3M', '2958', 1, 50%, 2
'GE', '2183', 1, 50%, 2
'GE', '2598', 0, 0%,  1
```
(Error record definition: when kb_1 != kb_2, the corresponding record is considered abnormal.)
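Ignoring the memory issue for a moment, the aggregation itself can be sketched in pandas like this, using the sample rows above (the derived flag name `isError` is my own choice):

```python
import pandas as pd

# Sample rows from the table above
df = pd.DataFrame({
    'ka':           ['3M', '3M', 'GE', '3M', 'GE', 'GE'],
    'kb_1':         ['2345', '2958', '2183', '2958', '2183', '2598'],
    'kb_2':         ['2345', '2152', '2183', '2958', '2285', '2598'],
    'timeofEvent':  ['2014-10-5', '2015-3-22', '2012-12-31',
                     '2015-3-10', '2015-4-19', '2015-3-17'],
    'timeInterval': [3000, 5000, 515, 395, 1925, 1915],
})

# A record is abnormal when kb_1 != kb_2
df['isError'] = (df['kb_1'] != df['kb_2']).astype(int)

# Group by (ka, kb_1): sum of flags = errorNum, row count = totalNum
agg = df.groupby(['ka', 'kb_1'])['isError'].agg(['sum', 'count'])
agg.columns = ['errorNum', 'totalNum']
agg['errorRate'] = agg['errorNum'] / agg['totalNum']
print(agg)
```

This reproduces the four-group table shown above (e.g. ('3M', '2958') gives errorNum 1 out of 2 records, i.e. 50%).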
My computer runs Ubuntu 12.04 and has 16 GB of memory; free -m returns:

```
             total       used       free     shared    buffers     cached
Mem:        112809      14476      98333          0        128      10823
-/+ buffers/cache:       3524     109285
Swap:            0          0          0
```
My Python file is called bigData.py and starts with:

```
import pandas as pd
import numpy as np
import sys, traceback, os

cksize = 98333
```
```
ipdb> pd.__version__
'0.16.0'
```
I use the following command to track memory usage:
```
top
ps -C python -o %cpu,%mem,cmd
```
Since it takes about 2 seconds to crash, I can see that memory usage reaches 90% for a while and CPU reaches 100%.
When I execute python bigData.py, the following error occurs:
```
/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module dateutil was already imported from /usr/local/lib/python2.7/dist-packages/dateutil/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import resource_stream
/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module pytz was already imported from /usr/local/lib/python2.7/dist-packages/pytz/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import resource_stream
Traceback (most recent call last):
  File "bigData.py", line 10, in <module>
    for chunk in reader:
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 691, in __iter__
    yield self.read(self.chunksize)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 715, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1164, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 758, in pandas.parser.TextReader.read (pandas/parser.c:7411)
  File "pandas/parser.pyx", line 792, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7819)
  File "pandas/parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas/parser.c:8268)
  File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
  File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
CParserError: Error tokenizing data. C error: out of memory
Segmentation fault (core dumped)
```
or
```
/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module dateutil was already imported from /usr/local/lib/python2.7/dist-packages/dateutil/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import resource_stream
/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module pytz was already imported from /usr/local/lib/python2.7/dist-packages/pytz/__init__.pyc, but /usr/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import resource_stream
Traceback (most recent call last):
  File "bigData.py", line 10, in <module>
    for chunk in reader:
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 691, in __iter__
    yield self.read(self.chunksize)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 715, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1164, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 758, in pandas.parser.TextReader.read (pandas/parser.c:7411)
  File "pandas/parser.pyx", line 792, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7819)
  File "pandas/parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas/parser.c:8268)
  File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
  File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
CParserError: Error tokenizing data. C error: out of memory
*** glibc detected *** python: free(): invalid pointer: 0x00007f750d2a4c0e ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7db26)[0x7f7511529b26]
/usr/local/lib/python2.7/dist-packages/pandas/parser.so(+0x4d5a1)[0x7f750d29d5a1]
/usr/local/lib/python2.7/dist-packages/pandas/parser.so(parser_cleanup+0x15)[0x7f750d29de45]
/usr/local/lib/python2.7/dist-packages/pandas/parser.so(parser_free+0x9)[0x7f750d29e039]
/usr/local/lib/python2.7/dist-packages/pandas/parser.so(+0xb43e)[0x7f750d25b43e]
....
python(PyDict_SetItem+0x49)[0x577749]
python(_PyModule_Clear+0x149)[0x4cafb9]
python(PyImport_Cleanup+0x477)[0x4cb4f7]
python(Py_Finalize+0x18e)[0x549f0e]
python(Py_Main+0x3bc)[0x56b56c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f75114cd76d]
python[0x41bb11]
======= Memory map: ========
00400000-00670000 r-xp 00000000 08:01 26612 /usr/bin/python2.7
0086f000-00870000 r
```
With the code below there is no memory problem, but how can code like this perform the grouping and data aggregation?
with open("data/petaJoined.csv", "r") as content: for line in content:
Does anyone know what is going on?
Actually, I want to achieve the result shown in Pandas read csv from memory. Maybe there is a solution along those lines?
Note: I already read the CSV in chunks, but I still get a memory error.
Then I resized the chunks, so another version of bigData.py is:

```
import pandas as pd
import numpy as np
import sys, traceback, os
import etl2
```
However, after running for some time, it still crashes with a segmentation fault.
```
def tb_createTopRankTable(df):
    try:
        key = 'name1'
        key2 = 'name2'
        df2 = df.groupby([key, key2])['isError'].agg({'errorNum': 'sum',
                                                      'totalParcel': 'count'})
        df2['errorRate'] = df2['errorNum'] / df2['totalParcel']
        return df2
    except Exception:
        traceback.print_exc()
```
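For what it's worth, per-chunk results from a groupby like the one above can be combined across chunks, because errorNum and totalParcel are both additive; only errorRate has to wait until the end. A sketch under those assumptions (the chunksize value and the function name are illustrative, not my actual settings):

```python
import pandas as pd


def chunked_rank_table(path, chunksize=100000):
    partials = []
    # keep the key columns as strings so groups match across chunks
    reader = pd.read_csv(path, chunksize=chunksize,
                         dtype={'kb_1': str, 'kb_2': str})
    for chunk in reader:
        chunk['isError'] = (chunk['kb_1'] != chunk['kb_2']).astype(int)
        g = chunk.groupby(['ka', 'kb_1'])['isError'].agg(['sum', 'count'])
        g.columns = ['errorNum', 'totalParcel']
        partials.append(g)
    # errorNum and totalParcel are additive, so summing per-chunk
    # partial results gives the exact full-file answer
    total = pd.concat(partials).groupby(level=[0, 1]).sum()
    total['errorRate'] = total['errorNum'] / total['totalParcel']
    return total
```

Each partial frame holds one row per group seen in that chunk, so peak memory stays bounded by chunk size plus the number of distinct groups.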