Loading a very large CSV dataset in Python and R: pandas vs. data.table

I am loading a huge CSV file (18 GB) into memory and noticing very large differences between R and Python. This is on an AWS EC2 r4.8xlarge, which has 244 GB of memory. Obviously this is an extreme example, but the same principle holds for smaller files on more modest machines.

Using pd.read_csv, my file took ~30 minutes to load and used 174 GB of memory, essentially so much that I can't do anything with it afterwards. In contrast, R's fread() from the data.table package took ~7 minutes and only ~55 GB of memory.

Why does the pandas object take up so much more memory than the data.table object? And why, in principle, is the pandas object nearly 10 times larger than the text file on disk? It's not as if CSV is a particularly efficient way to store data in the first place.
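As a side note, one way to see where the memory actually goes is pandas' own per-column accounting. This is only a sketch; "data.csv" stands in for the real file:

import pandas as pd

df = pd.read_csv("data.csv")            # placeholder path for the 18 GB file
print(df.dtypes)                        # which dtypes pandas inferred per column
print(df.memory_usage(deep=True))       # bytes per column; deep=True counts string payloads too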

+6
2 answers

You won't be able to beat fread's speed, but as far as memory usage goes, my guess is that you have integers that are being read in as 64-bit integers in Python.

Assuming your file looks like this:

a,b
1234567890123456789,12345

In R you get:

sapply(fread('test.txt'), class)
#          a          b
#"integer64"  "integer"

While in python (on a 64-bit machine):

pandas.read_csv('test.txt').dtypes
#a   int64
#b   int64

So you end up using more memory in Python. As a workaround, you can force the column types with read_csv's dtype argument:

pandas.read_csv('test.txt', dtype={'b': numpy.int32}).dtypes
#a   int64
#b   int32

As for why both the R and the Python objects are larger than the CSV on disk: CSV stores numbers as text, so the value "1" takes only 2 bytes in the CSV (the character plus a delimiter), while in memory an integer occupies 4 or 8 bytes.
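A quick back-of-the-envelope check of that claim (a sketch, not from the original answer; the values and sizes are illustrative):

import numpy as np
import pandas as pd

s = pd.Series(np.ones(1_000_000, dtype=np.int64))    # one million copies of the value 1

print(s.memory_usage(index=False))                   # ~8,000,000 bytes as int64
print(s.astype(np.int32).memory_usage(index=False))  # ~4,000,000 bytes as int32

# The same million values written as CSV ("1" plus a newline per row) are ~2,000,000 bytes of text.
print(len(s.to_csv(index=False, header=False)))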

+12

Try dask. A dask DataFrame mimics the pandas DataFrame interface, but processes the data lazily in chunks, so it can handle datasets that do not fit in memory.
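A minimal sketch of that approach ("data.csv" is a placeholder, and column "b" is borrowed from the toy example in the other answer):

import dask.dataframe as dd

# Partitions the CSV lazily instead of loading it all at once; dtypes can be forced here too.
df = dd.read_csv("data.csv", dtype={"b": "int32"})

# Operations build a task graph; .compute() materializes only the (much smaller) result.
print(df["b"].sum().compute())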

-1
