Python - Using pandas structures with large csv (iteration and chunksize)

I have a large csv file, about 600 MB with 11 million rows, and I want to create statistics from it, such as pivots, histograms and charts. Obviously, trying to just read it in the usual way:

df = pd.read_csv('Check400_900.csv', sep='\t') 

does not work, so I found iterator and chunksize in a similar post, and used:

 df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) 

Well, I can for example print df.get_chunk(5) and iterate over the entire file with just:

 for chunk in df:
     print(chunk)

My problem: I don't know how to use things like the ones below on the whole df, and not just on a single chunk:

 plt.plot()
 print(df.head())
 print(df.describe())
 print(df.dtypes)
 customer_group3 = df.groupby('UserID')
 y3 = customer_group3.size()

I hope my question is not too confusing.

+8
python pandas csv dataframe bigdata
2 answers

I think you need to concat the chunks into a DataFrame, because the output of:

 df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) 

is not a DataFrame, but a pandas.io.parsers.TextFileReader:

 import pandas as pd

 tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
 print(tp)
 # <pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
 df = pd.concat(tp, ignore_index=True)

I think it is necessary to pass the parameter ignore_index=True to the concat function, to avoid duplicate index values across the chunks.
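For completeness, a minimal sketch of the full round trip (the file name and the UserID column are taken from the question; plotting the group sizes as a bar chart is just an assumed choice):

 import pandas as pd
 import matplotlib.pyplot as plt

 tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
 df = pd.concat(tp, ignore_index=True)  # df is now a regular DataFrame

 # the operations from the question now work on the whole file
 print(df.head())
 print(df.describe())
 print(df.dtypes)

 customer_group3 = df.groupby('UserID')
 y3 = customer_group3.size()
 y3.plot(kind='bar')  # assumed: show group sizes as a bar chart
 plt.show()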

+13

You need to concatenate the chunks. For example:

 df2 = pd.concat([chunk for chunk in df]) 

And then run your commands on df2.
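If even the concatenated df2 does not fit into memory, one alternative (not part of this answer, just a hedged sketch) is to aggregate each chunk separately and combine the partial results, e.g. for the groupby size from the question:

 import pandas as pd

 # sum per-chunk group sizes instead of building one big DataFrame;
 # the file name and the 'UserID' column are taken from the question
 sizes = None
 for chunk in pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000):
     part = chunk.groupby('UserID').size()
     sizes = part if sizes is None else sizes.add(part, fill_value=0)
 print(sizes)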

+3
