What is the fastest way to read data from a text file and extract it into a DataFrame?

I want to create a multi-index DataFrame by reading a text file. Is it quicker to create the MultiIndex first and then insert data from the text file with df.loc[[],[]], or to append rows to a DataFrame and set the index at the end? Or is it faster still to collect the data in a list or dict while reading the file and build the DataFrame from that afterwards? Is there a more Pythonic or faster option?

Example text file:

 A = 1
 B = 1
 C data
 0 1
 1 2
 A = 1
 B = 2
 C data
 1 3
 2 4
 A = 2
 B = 1
 C data
 0 5
 2 6

Output data format:

        data
 A B C
 1 1 0     1
     1     2
   2 1     3
     2     4
 2 1 0     5
     2     6
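As a point of comparison for the approaches asked about above, here is a minimal parser sketch that collects plain rows in a list while scanning the file and builds the indexed DataFrame once at the end. The exact line layout and the `parse_blocks` helper name are assumptions based on the sample file shown above.

```python
import pandas as pd

def parse_blocks(lines):
    """Accumulate (A, B, C, data) rows in a list; build the DataFrame once at the end."""
    rows = []
    a = b = None
    for line in lines:
        line = line.strip()
        if line.startswith('A ='):
            a = int(line.split('=')[1])
        elif line.startswith('B ='):
            b = int(line.split('=')[1])
        elif line.startswith('C'):
            continue  # column-header line "C data"
        elif line:
            c, value = map(int, line.split())
            rows.append((a, b, c, value))
    return pd.DataFrame(rows, columns=['A', 'B', 'C', 'data']).set_index(['A', 'B', 'C'])

text = """\
A = 1
B = 1
C data
0 1
1 2
A = 1
B = 2
C data
1 3
2 4
A = 2
B = 1
C data
0 5
2 6
"""
df = parse_blocks(text.splitlines())
```

The same function works on a real file via `parse_blocks(open(path))`, since it only iterates over lines.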

Update (January 18th): this is a follow-up to How to parse complex text files using Python? I also wrote a blog article explaining how to parse complex files, aimed at beginners.

performance python pandas dataframe
2 answers

Element-wise lookup and insertion in pandas is an expensive operation, because each access has to be aligned by index. It is usually much faster to read everything into arrays, create a DataFrame from the values, and then set the hierarchical index directly, avoiding per-element appends or lookups entirely.

Here is an example of the result, assuming you have a 2-D array `dataset` with everything collected in it:

 In [106]: dataset
 Out[106]:
 array([[1, 1, 0, 1],
        [1, 1, 1, 2],
        [1, 2, 1, 3],
        [1, 2, 2, 4],
        [2, 1, 0, 5],
        [2, 1, 2, 6]])

 In [107]: pd.DataFrame(dataset, columns=['A', 'B', 'C', 'data']).set_index(['A', 'B', 'C'])
 Out[107]:
        data
 A B C
 1 1 0     1
     1     2
   2 1     3
     2     4
 2 1 0     5
     2     6

 In [108]: data_values = dataset[:, 3]
      ...: data_index = pd.MultiIndex.from_arrays(dataset[:, :3].T, names=list('ABC'))
      ...: pd.DataFrame(data_values, columns=['data'], index=data_index)
 Out[108]:
        data
 A B C
 1 1 0     1
     1     2
   2 1     3
     2     4
 2 1 0     5
     2     6

 In [109]: %timeit pd.DataFrame(dataset, columns=['A', 'B', 'C', 'data']).set_index(['A', 'B', 'C'])
 1000 loops, best of 3: 1.75 ms per loop

 In [110]: %%timeit
      ...: data_values = dataset[:, 3]
      ...: data_index = pd.MultiIndex.from_arrays(dataset[:, :3].T, names=list('ABC'))
      ...: pd.DataFrame(data_values, columns=['data'], index=data_index)
 1000 loops, best of 3: 642 µs per loop
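Outside an IPython session, the two constructions can be sketched as a self-contained script; both produce the same DataFrame, they only differ in whether the index is set afterwards or built up front.

```python
import numpy as np
import pandas as pd

dataset = np.array([[1, 1, 0, 1],
                    [1, 1, 1, 2],
                    [1, 2, 1, 3],
                    [1, 2, 2, 4],
                    [2, 1, 0, 5],
                    [2, 1, 2, 6]])

# Approach 1: build a flat DataFrame, then move the key columns into the index.
df1 = pd.DataFrame(dataset, columns=['A', 'B', 'C', 'data']).set_index(['A', 'B', 'C'])

# Approach 2: build the MultiIndex directly from the key columns,
# then wrap only the data column in a DataFrame.
data_index = pd.MultiIndex.from_arrays(dataset[:, :3].T, names=list('ABC'))
df2 = pd.DataFrame(dataset[:, 3], columns=['data'], index=data_index)

assert df1.equals(df2)
```

Approach 2 is faster in the timings above because `set_index` has to copy and reshuffle columns, while `MultiIndex.from_arrays` consumes the key arrays directly.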

Parsing the text file is likely to dominate your total processing time.

If speed is a major issue, I suggest using pickle or shelve to store the DataFrame in a binary file, ready for use.
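A minimal sketch of that caching pattern with pickle: parse the text file once, serialize the resulting DataFrame, and have later runs load the binary file instead of re-parsing (the file name is illustrative).

```python
import os
import tempfile

import pandas as pd

# A small stand-in for the DataFrame produced by the text-file parser.
df = pd.DataFrame(
    {'data': [1, 2, 3]},
    index=pd.MultiIndex.from_tuples([(1, 1, 0), (1, 1, 1), (1, 2, 1)],
                                    names=list('ABC')))

path = os.path.join(tempfile.gettempdir(), 'parsed.pkl')

# One-time conversion: serialize the parsed DataFrame to a binary file...
df.to_pickle(path)

# ...so later runs skip the slow text parsing entirely.
df_cached = pd.read_pickle(path)
```

`shelve` works similarly but stores several objects under string keys in one file, which is handy if you cache many DataFrames.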

If you need to keep the text file for any reason, a separate module can be written to translate between the two formats.

