The cardinality of datetime strings is usually not large. For example, the number of distinct time strings in the format %H:%M:%S is only 24 * 60 * 60 = 86400. If your dataset has far more rows than that, or contains many repeated timestamps, adding a cache to the parsing step can speed it up significantly.
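As a quick sanity check before adding a cache, you can compare the number of rows with the number of unique date and time strings (a minimal sketch; the file and column names match the sample data further down):

    import pandas as pd

    # caching pays off when the unique counts are much smaller than the row count
    df = pd.read_csv('test.csv')
    print(len(df), df['date'].nunique(), df['time'].nunique())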
For those who don't have Cython, here is an alternative solution in pure Python:
    import numpy as np
    import pandas as pd
    from datetime import datetime


    def parse_datetime(dt_array, cache=None):
        if cache is None:
            cache = {}
        date_time = np.empty(dt_array.shape[0], dtype=object)
        for i, (d_str, t_str) in enumerate(dt_array):
            # look up each date/time string in the cache; parse it only on a miss
            try:
                year, month, day = cache[d_str]
            except KeyError:
                year, month, day = [int(item) for item in d_str[:10].split('-')]
                cache[d_str] = year, month, day
            try:
                hour, minute, sec = cache[t_str]
            except KeyError:
                hour, minute, sec = [int(item) for item in t_str.split(':')]
                cache[t_str] = hour, minute, sec
            date_time[i] = datetime(year, month, day, hour, minute, sec)
        return pd.to_datetime(date_time)


    def read_csv(filename, cache=None):
        df = pd.read_csv(filename)
        df['date_time'] = parse_datetime(df.loc[:, ['date', 'time']].values, cache=cache)
        return df.set_index('date_time')
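Because the cache is passed in explicitly, it can also be shared across several files so that strings already seen in one file are not re-parsed in the next (the file names here are just placeholders):

    shared_cache = {}
    daily_frames = [read_csv(name, cache=shared_cache)
                    for name in ('day1.csv', 'day2.csv', 'day3.csv')]
    combined = pd.concat(daily_frames)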
On the dataset below, the speedup is more than 150x:
    $ ls -lh test.csv
    -rw-r--r-- 1 blurrcat blurrcat 1.2M Apr 8 12:06 test.csv
    $ head -n 4 data/test.csv
    user_id,provider,date,time,steps
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,17:47:00,25
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,17:09:00,4
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,19:10:00,67
In IPython:
    In [1]: %timeit pd.read_csv('test.csv', parse_dates=[['date', 'time']])
    1 loops, best of 3: 10.3 s per loop

    In [2]: %timeit read_csv('test.csv', cache={})
    1 loops, best of 3: 62.6 ms per loop
To limit memory usage, simply replace the plain dict cache with something like an LRU cache.
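A minimal sketch of that idea, assuming the per-string parsing is pulled out into helper functions so that functools.lru_cache can bound the number of cached entries (the function names below are hypothetical):

    from datetime import datetime
    from functools import lru_cache

    import numpy as np
    import pandas as pd


    @lru_cache(maxsize=4096)       # bounded replacement for the unbounded dict
    def _parse_date(d_str):
        return tuple(int(item) for item in d_str[:10].split('-'))


    @lru_cache(maxsize=86400)      # at most one entry per second of the day
    def _parse_time(t_str):
        return tuple(int(item) for item in t_str.split(':'))


    def parse_datetime_lru(dt_array):
        date_time = np.empty(dt_array.shape[0], dtype=object)
        for i, (d_str, t_str) in enumerate(dt_array):
            date_time[i] = datetime(*_parse_date(d_str), *_parse_time(t_str))
        return pd.to_datetime(date_time)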