Python - Using pandas structures with large csv (iteration and chunksize)

I have a large csv file, about 600 MB with 11 million rows, and I want to create statistics from it, such as pivots, histograms and charts. Obviously, trying to just read it in the usual way:

df = pd.read_csv('Check400_900.csv', sep='\t') 

does not work, so I found iterator and chunksize in a similar post, and used:

 df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) 

Well, I can for example print df.get_chunk(5) and iterate over the entire file with just:

 for chunk in df:
     print(chunk)

My problem: I don't know how to use things like the ones below on the whole df, and not just on a single chunk:

 plt.plot()
 print(df.head())
 print(df.describe())
 print(df.dtypes)
 customer_group3 = df.groupby('UserID')
 y3 = customer_group3.size()

I hope my question is not too confusing.

+8
python pandas csv dataframe bigdata
2 answers

I think you need to concat the chunks into a DataFrame, because the output of:

 df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) 

is not a DataFrame, but a pandas.io.parsers.TextFileReader:

 import pandas as pd

 tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
 print(tp)
 # <pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
 df = pd.concat(tp, ignore_index=True)

I think it is necessary to pass the parameter ignore_index=True to the concat function, to avoid duplicate index values across the chunks.
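For completeness, a minimal sketch of the full round trip (the file name and the UserID column are taken from the question; plotting the group sizes as a bar chart is just an assumed choice):

 import pandas as pd
 import matplotlib.pyplot as plt

 tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
 df = pd.concat(tp, ignore_index=True)  # df is now a regular DataFrame

 # the operations from the question now work on the whole file
 print(df.head())
 print(df.describe())
 print(df.dtypes)

 customer_group3 = df.groupby('UserID')
 y3 = customer_group3.size()
 y3.plot(kind='bar')  # assumed: show group sizes as a bar chart
 plt.show()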

+13

You need to concatenate the chunks. For example:

 df2 = pd.concat([chunk for chunk in df]) 

And then run your commands on df2.
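If even the concatenated df2 does not fit into memory, one alternative (not part of this answer, just a hedged sketch) is to aggregate each chunk separately and combine the partial results, e.g. for the groupby size from the question:

 import pandas as pd

 # sum per-chunk group sizes instead of building one big DataFrame;
 # the file name and the 'UserID' column are taken from the question
 sizes = None
 for chunk in pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000):
     part = chunk.groupby('UserID').size()
     sizes = part if sizes is None else sizes.add(part, fill_value=0)
 print(sizes)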

+3
