So, I have about 4,000 CSV files, and I need to join them all together. Each file has two columns (a string and a float) and between 10,000 and 1,000,000 rows, and I want to join on the first column (i.e. the string variable).
I tried numpy.lib.recfunctions.join_by, but it was very slow. I switched to pandas.merge, which was much faster but still too slow given the number (and size) of tables I have. It also seems to be very memory intensive, to the point where the machine becomes unusable once the merged table reaches hundreds of thousands of rows (I'm mostly working on a MacBook Pro, 2.4 GHz, 4 GB of RAM).
So now I'm looking for alternatives. Are there other potential solutions I'm missing? What other external implementations exist for Python? Is there a document or website somewhere that discusses and compares the time complexity of the various implementations? Would it be more efficient to load everything into, say, sqlite3 from Python and let sqlite3 do the join? Is the string key the problem, i.e. would it be faster if I could use a numeric key instead?
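To make the sqlite3 question concrete, this is roughly the kind of thing I have in mind (just a sketch, not benchmarked; the paths, table and column names are invented, and instead of an explicit 4,000-way join it keeps everything in one long key/file/value table and lets SQLite do the lookups):

```python
import csv
import os
import sqlite3

# Sketch of the sqlite3 idea: load every two-column CSV into a single long
# (mykey, file_id, value) table, index the string key, and query from there.
path_to_files = 'csv_dir'                      # hypothetical directory
file_names = sorted(os.listdir(path_to_files))

conn = sqlite3.connect('merged.db')
conn.execute('CREATE TABLE data (mykey TEXT, file_id INTEGER, value REAL)')

for file_id, file_name in enumerate(file_names):
    with open(os.path.join(path_to_files, file_name)) as f:
        rows = ((key, file_id, float(value)) for key, value in csv.reader(f))
        conn.executemany('INSERT INTO data VALUES (?, ?, ?)', rows)
conn.commit()

# Index after the bulk load so the inserts stay fast; queries on the string
# key then avoid full table scans.
conn.execute('CREATE INDEX idx_mykey ON data (mykey)')

# Example query: all values recorded for one key, across every file.
for row in conn.execute('SELECT file_id, value FROM data WHERE mykey = ?',
                        ('some_key',)):
    print(row)
```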
If it helps to give a more concrete idea of what I'm trying to achieve, here is my code using pandas.merge:
```python
import os
import pandas as pd

def load_and_merge(file_names, path_to_files, columns):
    ''' seq, str, dict -> pandas.DataFrame '''
    output = pd.DataFrame(columns = ['mykey'])
```
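The rest of the function is essentially a loop that reads each file and folds it into output via pandas.merge on 'mykey'; the gist (with the per-file value-column naming simplified) is:

```python
# Gist of the remainder of load_and_merge (value-column naming simplified):
for file_name in file_names:
    df = pd.read_csv(os.path.join(path_to_files, file_name),
                     header=None, names=['mykey', columns[file_name]])
    # each step re-joins the growing output against the new file's key column
    output = pd.merge(output, df, on='mykey', how='outer')
```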
(Mac OS X 10.6.8 and Python 2.7.5; Ubuntu 12.04 and Python 2.7.3)