Python outer join for thousands of large tables

So, I have about 4,000 CSV files that I need to join together. Each file has two columns (a string and a float) and between 10,000 and 1,000,000 rows, and I want to join on the first column (i.e. the string variable).

I tried numpy.lib.recfunctions.join_by, but it was very slow. I then switched to pandas.merge, which was much faster, but still too slow given the number (and size) of tables I have. It also seems to be very memory intensive, to the point where the machine becomes unusable once the merged table has hundreds of thousands of rows (I am mostly using a MacBook Pro, 2.4 GHz, 4 GB).

So now I'm looking for alternatives. Are there any other potential solutions I am missing? What other implementations or external libraries exist for Python? Is there a document or website somewhere that discusses and compares the time complexity of each implementation? Would it be more efficient to simply load everything into, say, sqlite3, and let sqlite3 do the join? Is the string key the problem? Would it be faster if I could use a numeric key instead?
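To make the sqlite3 question concrete, here is roughly the kind of approach I have in mind (just a rough sketch with two hypothetical tables, t1 and t2, and made-up column names; as far as I can tell sqlite3 has no FULL OUTER JOIN, so it would have to be emulated with LEFT JOINs and a UNION):

    import sqlite3

    conn = sqlite3.connect('merged.db')
    cur = conn.cursor()
    cur.execute('CREATE TABLE t1 (mykey TEXT PRIMARY KEY, val1 REAL)')
    cur.execute('CREATE TABLE t2 (mykey TEXT PRIMARY KEY, val2 REAL)')
    # ... INSERT the rows from two of the CSV files here ...

    # emulate a full outer join of t1 and t2 on mykey
    cur.execute('''
        SELECT t1.mykey, t1.val1, t2.val2
        FROM t1 LEFT JOIN t2 USING (mykey)
        UNION
        SELECT t2.mykey, t1.val1, t2.val2
        FROM t2 LEFT JOIN t1 USING (mykey)
    ''')
    rows = cur.fetchall()
    conn.close()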

In case it helps to give a more concrete idea of what I'm trying to achieve, here is my code using pandas.merge:

    import os
    import pandas as pd

    def load_and_merge(file_names, path_to_files, columns):
        ''' seq, str, dict -> pandas.DataFrame '''
        output = pd.DataFrame(columns = ['mykey'])  # initialize output DataFrame
        for file in file_names:
            # load new data
            new_data = pd.read_csv(path_to_files + file,
                                   usecols = [col for col in columns.keys()],
                                   dtype = columns,
                                   names = ['mykey', file.replace('.csv', '')],
                                   header = None)
            # merge with previous data
            output = pd.merge(output, new_data, on = 'mykey', how = 'outer')
            output = output.fillna(0)  # kill NaNs
        return output

    path = '/Users/username/data/'
    files_list = [file for file in os.listdir(path) if '.csv' in file]
    merged_table = load_and_merge(files_list, path, {0: 'S30', 1: 'float'})

(Mac OS X 10.6.8 and Python 2.7.5; Ubuntu 12.04 and Python 2.7.3)

+7
python numpy pandas
2 answers

Here is how I would approach this problem.

Do not merge iteratively. You are combining a small frame (call it the "mergee") with a larger frame (call it the "merger"), and then repeating this, so the "merger" keeps getting bigger and acquiring more rows.

Instead, perform repeated hierarchical merges. Say you number the mergees 1 to 4000.

Merge 1 and 2 to form 1_2, then 3 and 4 to form 3_4, and so on.

Then repeat at the next level, merging 1_2 and 3_4 to form 1_2_3_4, and so on until a single frame remains.

This does not change the total amount of work you do, but it makes each individual merge much simpler, which lowers the memory barrier as well as the time spent (each merge still has to go through the Cartesian product of the keys, of course). It may make sense to randomize the merge order.
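In code, the idea looks roughly like this (a minimal in-memory sketch, assuming each frame has a 'mykey' column as in the question; in practice you would work off files rather than a list of frames):

    import pandas as pd

    def hierarchical_merge(frames):
        """Outer-join a list of DataFrames pairwise until a single frame remains."""
        while len(frames) > 1:
            next_level = []
            # merge frames two by two; a leftover odd frame is carried forward
            for i in range(0, len(frames) - 1, 2):
                next_level.append(pd.merge(frames[i], frames[i + 1],
                                           on='mykey', how='outer'))
            if len(frames) % 2 == 1:
                next_level.append(frames[-1])
            frames = next_level
        return frames[0]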

In addition, this is completely parallelizable, since independent processes can work on the problem and generate the intermediate merges. This essentially becomes a map-reduce type problem.

You can also store the intermediate merges in HDF5 files (using HDFStore), which makes the storage quite efficient. Note that these must be SEPARATE files, to avoid writing to the same file from multiple processes (which is not supported by HDF5).
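For example, each intermediate merge could be written to its own file, roughly like this (a sketch only; the file paths and the 'data' key are made up):

    import pandas as pd

    def merge_pair(left_file, right_file, out_file):
        """Outer-join the frames stored in two HDF5 files and write the result
        to a SEPARATE output file, so parallel workers never share a file."""
        left_store = pd.HDFStore(left_file, mode='r')
        right_store = pd.HDFStore(right_file, mode='r')
        merged = pd.merge(left_store['data'], right_store['data'],
                          on='mykey', how='outer')
        left_store.close()
        right_store.close()
        out_store = pd.HDFStore(out_file, mode='w')
        out_store['data'] = merged
        out_store.close()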

+7

Well, here is a partial implementation of Jeff's approach (see his answer above), if I understood it correctly. I'm posting it in case someone else is trying to do something similar, and also in case someone can help improve or "prettify" this code (right now it is one long, ugly stream of code... I suppose I should modularize it somehow).

It is a partial implementation because I did not parallelize the merges. I tried, using Python's multiprocessing module, but apparently I don't have enough memory for that: two simultaneous processes were enough to freeze everything (or maybe I just did something completely idiotic, which is entirely possible, since I had never used multiprocessing before). But the rest is here: hierarchical merges and HDF5 (for storing the intermediate files).

    #!/usr/bin/env python

    import os
    import sys
    import pandas as pd

    def merge_csv(folder, cols_info):
        '''
        str, dict -> pandas.DataFrame

        This function outer joins all CSV files in the specified folder. It
        joins them hierarchically and stores the intermediate files in an
        HDF5 container. The first parameter is the path to the folder and
        the second parameter is a dictionary mapping each column to the
        corresponding data type. You can only specify two columns.

        Example: merge_csv('/Users/username/data/', {0: 'S30', 2: 'float'})

        Dependencies:

        - HDF5
        - PyTables
        - pandas
        '''

        # ensure that user is specifying only two columns
        if len(cols_info) != 2:
            sys.exit('Error: You can only specify two columns.')

        # ensure that path to input folder ends with '/'
        folder = folder + '/' if folder[-1] != '/' else folder

        # create HDF5 file to store intermediate files
        store = pd.HDFStore(folder + 'store.h5', mode = 'w')

        # load CSV files and write data to HDF5 file
        flist = [file for file in os.listdir(folder) if file[-4:] == '.csv']
        if len(flist) == 0:
            sys.exit('There are no CSV files in the specified folder.')
        for file in flist:
            case = file.replace('.csv', '')
            store[case] = pd.read_csv(folder + file,
                                      usecols = [col for col in cols_info],
                                      names = ['key', case],
                                      dtype = cols_info)
            store.flush()

        # start merge routine
        flist = store.keys()
        counter = 0
        while len(flist) > 1:
            counter += 1

            # merge current set of files, two by two
            length = (len(flist) - 1) if (len(flist) % 2 == 1) else len(flist)
            for i in range(length)[0:length:2]:
                merged_data = pd.merge(store[flist[i]], store[flist[i + 1]],
                                       on = 'key',
                                       how = 'outer',
                                       sort = False)
                outputfile = 'file' + str(i) + str(i + 1)

                # if number of files is odd, make last pair a trio
                if (i == len(flist) - 3) and (len(flist) % 2 == 1):
                    merged_data = pd.merge(merged_data, store[flist[i + 2]],
                                           on = 'key',
                                           how = 'outer',
                                           sort = False)
                    outputfile += str(i + 2)

                # save merged pair (or trio) to HDF5 file
                merged_data = merged_data.fillna(0)
                store.put('/tmp' + str(counter) + '/' + outputfile, merged_data)
                store.flush()

            # clean up
            to_remove = [file for file in store.keys()
                         if 'tmp' + str(counter) + '/' not in file]
            for file in to_remove:
                store.remove(file)

            # move on to next set of intermediate files
            flist = store.keys()

        # wrap up
        store.close()
        return merged_data
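For what it's worth, this is roughly how I imagine the parallel version of the merge step would look with multiprocessing.Pool (just a sketch with made-up file names, and not something I managed to get working, since memory was the limiting factor on my machine):

    import multiprocessing as mp
    import pandas as pd

    def merge_pair(args):
        """Outer-join the frames stored in two HDF5 files; write to a third file."""
        left_file, right_file, out_file = args
        merged = pd.merge(pd.read_hdf(left_file, 'data'),
                          pd.read_hdf(right_file, 'data'),
                          on = 'key', how = 'outer')
        merged.fillna(0).to_hdf(out_file, 'data', mode = 'w')
        return out_file

    if __name__ == '__main__':
        # one job per pair of intermediate files (file names are made up)
        files = ['part0.h5', 'part1.h5', 'part2.h5', 'part3.h5']
        jobs = [(files[i], files[i + 1], 'merged_%d.h5' % (i // 2))
                for i in range(0, len(files), 2)]
        pool = mp.Pool(processes = 2)
        merged_files = pool.map(merge_pair, jobs)
        pool.close()
        pool.join()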

EDIT

Still no good: the intermediate matrices eventually get too big and exceed the computer's memory, and the code crashes ( tables.exceptions.HDF5ExtError: Problems creating the Array. ). I tried sqlite3, but that didn't work either, so I guess I'll just have to keep looking.

0
