Pandas concat / append with "large" DataFrames

Problem: I have data stored in csv files with the columns date / ID / Value. There are 15 files, each containing about 10-20 million rows. Each csv file covers a distinct period, so the time indices do not overlap, but the columns do change over time (new identifiers appear, old ones disappear). I initially ran the script without the pivot call, but then I ran into memory problems on my local machine (only 8 GB). Since there is a lot of redundancy in each file, pivoting first seemed like a good option (about 2/3 less data), but now the concatenation never finishes. If I run the following script, the concat step runs "forever" (I always interrupt it manually after a while, more than 2 hours). Does concat/append have size limits (I have approximately 10,000-20,000 columns), or am I missing something? Any suggestions?

    import pandas as pd

    path = 'D:\\'
    data = pd.DataFrame()

    # loop through list of raw file names
    for file in raw_files:
        data_tmp = pd.read_csv(path + file, engine='c',
                               compression='gzip',
                               low_memory=False,
                               usecols=['date', 'Value', 'ID'])
        data_tmp = data_tmp.pivot(index='date', columns='ID', values='Value')
        data = pd.concat([data, data_tmp])
        del data_tmp

EDIT I: To clarify, each csv file has about 10-20 million rows and three columns; after the pivot is applied, this reduces to about 2,000 rows but roughly 10,000 columns.

I can work around the memory problem by simply splitting the complete set of identifiers into subsets and running the necessary calculations for each subset separately, since they are independent per identifier. I know this forces me to reload the same files n times, where n is the number of subsets used, but this is still reasonably fast. I am still wondering why the concat/append step never completes.
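For reference, a minimal sketch of that subset workaround, assuming the full list of identifiers is known up front; the function name process_in_subsets and its parameters are only illustrative:

    import pandas as pd

    def process_in_subsets(raw_files, all_ids, path='D:\\', n_subsets=4):
        """Read each file n_subsets times, keeping only one slice of IDs per pass."""
        results = []
        for i in range(n_subsets):
            subset = set(all_ids[i::n_subsets])          # every n-th identifier
            frames = []
            for file in raw_files:
                tmp = pd.read_csv(path + file, engine='c', compression='gzip',
                                  usecols=['date', 'Value', 'ID'])
                tmp = tmp[tmp['ID'].isin(subset)]        # drop rows outside this subset
                frames.append(tmp.pivot(index='date', columns='ID', values='Value'))
            # one concat per subset keeps the working set small
            results.append(pd.concat(frames))
        return results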

EDIT II: I tried to recreate the file structure with a simulation that is as close as possible to the actual data structure. I hope it is clear; I did not spend much time minimizing the simulation time, but it runs quickly on my machine.

    import string
    import random
    import pandas as pd
    import numpy as np
    import math

    # Settings :-------------------------------
    num_ids = 20000
    start_ids = 4000
    num_files = 10
    id_interval = int((num_ids-start_ids)/num_files)
    len_ids = 9
    start_date = '1960-01-01'
    end_date = '2014-12-31'
    run_to_file = 2
    # ------------------------------------------

    # Simulate column IDs
    id_list = []
    # ensure unique elements are of size >num_ids
    for x in range(num_ids + round(num_ids*0.1)):
        id_list.append(''.join(
            random.choice(string.ascii_uppercase + string.digits)
            for _ in range(len_ids)))
    id_list = set(id_list)
    id_list = list(id_list)[:num_ids]

    time_index = pd.bdate_range(start_date, end_date, freq='D')
    chunk_size = math.ceil(len(time_index)/num_files)

    data = []
    # Simulate files
    for file in range(0, run_to_file):
        tmp_time = time_index[file * chunk_size:(file + 1) * chunk_size]
        # TODO not all cases covered, make sure ints are obtained
        tmp_ids = id_list[file * id_interval:
                          start_ids + (file + 1) * id_interval]
        tmp_data = pd.DataFrame(np.random.standard_normal(
            (len(tmp_time), len(tmp_ids))), index=tmp_time, columns=tmp_ids)
        tmp_file = tmp_data.stack().sortlevel(1).reset_index()
        # final simulated data structure of the parsed csv file
        tmp_file = tmp_file.rename(columns={'level_0': 'Date', 'level_1': 'ID',
                                            0: 'Value'})
        # comment/uncomment if pivot takes place on aggregate level or not
        tmp_file = tmp_file.pivot(index='Date', columns='ID', values='Value')
        data.append(tmp_file)
    data = pd.concat(data)
    # comment/uncomment if pivot takes place on aggregate level or not
    # data = data.pivot(index='Date', columns='ID', values='Value')
+4
3 answers

Using your reproducible sample code, I can indeed confirm that concatenating only two of these frames takes a very long time. However, if you align them first (make the column names identical), then concatenating is very fast:

    In [94]: df1, df2 = data[0], data[1]

    In [95]: %timeit pd.concat([df1, df2])
    1 loops, best of 3: 18min 8s per loop

    In [99]: %%timeit
       ....: df1b, df2b = df1.align(df2, axis=1)
       ....: pd.concat([df1b, df2b])
       ....:
    1 loops, best of 3: 686 ms per loop

The result of both approaches is the same.
Alignment is equivalent to:

    common_columns = df1.columns.union(df2.columns)
    df1b = df1.reindex(columns=common_columns)
    df2b = df2.reindex(columns=common_columns)

So this is probably the easier approach to use when working with the full list of DataFrames.
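For example, applied to the whole list of pivoted frames, that could look roughly like this (the helper name concat_aligned is illustrative, not a pandas function):

    import pandas as pd

    def concat_aligned(frames):
        """Reindex every frame to the union of all column names, then concat once."""
        common_columns = frames[0].columns
        for df in frames[1:]:
            common_columns = common_columns.union(df.columns)
        aligned = [df.reindex(columns=common_columns) for df in frames]
        return pd.concat(aligned)

    # e.g. with the simulated frames from the question:
    # data = concat_aligned(data)   # where data is the list of pivoted DataFrames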

The reason pd.concat is slower is that it does more. For example, when the column names are not equal, it checks for every column whether the dtype has to be upcast to hold the NaN values (which get introduced by aligning the column names). By aligning yourself, you skip this check. And in this case, where you are sure everything has the same dtype, this is no problem.
The fact that it is that much slower still surprises me, but I will look into it.
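To make the upcasting concrete, here is a small standalone example (not from the original answer) showing that concatenating frames with non-matching column names turns int64 into float64, while matching names keep the dtype:

    import pandas as pd

    a = pd.DataFrame({'x': [1, 2]})          # int64 column
    b = pd.DataFrame({'y': [3, 4]})          # int64 column

    # non-matching column names: alignment introduces NaN, so int64 is upcast to float64
    print(pd.concat([a, b]).dtypes)          # x: float64, y: float64

    # matching column names: no NaN is introduced, so the int64 dtype is kept
    print(pd.concat([a, a]).dtypes)          # x: int64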

+11

In summary, three key performance recommendations based on your setup:

1) Make sure the data types are the same when concatenating two DataFrames

2) If possible, use integer-based column names (a sketch follows this list)

3) When using string column names, make sure to use the align method before concat is called, as suggested by joris
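As an illustration of point 2, a minimal sketch of converting the string IDs to integer codes before pivoting; the helper name pivot_with_int_columns and the global id_to_code mapping are assumptions for illustration, not part of the answer:

    import pandas as pd

    def pivot_with_int_columns(df, id_to_code):
        """Replace string IDs with integer codes so the pivoted frame has integer columns."""
        df = df.assign(ID=df['ID'].map(id_to_code))
        return df.pivot(index='date', columns='ID', values='Value')

    # id_to_code would be built once from the complete list of identifiers, e.g.
    # id_to_code = {identifier: i for i, identifier in enumerate(sorted(all_ids))}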

+2

As @joris mentioned, you should append all the pivoted tables to a list and then concatenate them all in one go. Here is a suggested modification to your code:

    dfs = []
    for file in raw_files:
        data_tmp = pd.read_csv(path + file, engine='c',
                               compression='gzip',
                               low_memory=False,
                               usecols=['date', 'Value', 'ID'])
        data_tmp = data_tmp.pivot(index='date', columns='ID', values='Value')
        dfs.append(data_tmp)
        del data_tmp
    data = pd.concat(dfs)
+1
