The cardinality of datetime strings is usually not large. For example, the number of distinct time strings in the format %H:%M:%S is only 24 * 60 * 60 = 86400. If your dataset has far more rows than that, or contains many repeated timestamps, adding a cache to the parsing step can speed it up significantly.
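As a quick sanity check before adding a cache, you can compare the number of rows with the number of unique date and time strings (a minimal sketch; the file and column names match the sample data further down):

    import pandas as pd

    # caching pays off when the unique counts are much smaller than the row count
    df = pd.read_csv('test.csv')
    print(len(df), df['date'].nunique(), df['time'].nunique())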
For those who don't have Cython, here is an alternative solution in pure Python:
    import numpy as np
    import pandas as pd
    from datetime import datetime


    def parse_datetime(dt_array, cache=None):
        if cache is None:
            cache = {}
        date_time = np.empty(dt_array.shape[0], dtype=object)
        for i, (d_str, t_str) in enumerate(dt_array):
            # look up each date/time string in the cache; parse it only on a miss
            try:
                year, month, day = cache[d_str]
            except KeyError:
                year, month, day = [int(item) for item in d_str[:10].split('-')]
                cache[d_str] = year, month, day
            try:
                hour, minute, sec = cache[t_str]
            except KeyError:
                hour, minute, sec = [int(item) for item in t_str.split(':')]
                cache[t_str] = hour, minute, sec
            date_time[i] = datetime(year, month, day, hour, minute, sec)
        return pd.to_datetime(date_time)


    def read_csv(filename, cache=None):
        df = pd.read_csv(filename)
        df['date_time'] = parse_datetime(df.loc[:, ['date', 'time']].values, cache=cache)
        return df.set_index('date_time')
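Because the cache is passed in explicitly, it can also be shared across several files so that strings already seen in one file are not re-parsed in the next (the file names here are just placeholders):

    shared_cache = {}
    daily_frames = [read_csv(name, cache=shared_cache)
                    for name in ('day1.csv', 'day2.csv', 'day3.csv')]
    combined = pd.concat(daily_frames)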
On the dataset below, the speedup is more than 150x:
    $ ls -lh test.csv
    -rw-r--r-- 1 blurrcat blurrcat 1.2M Apr 8 12:06 test.csv
    $ head -n 4 data/test.csv
    user_id,provider,date,time,steps
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,17:47:00,25
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,17:09:00,4
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,19:10:00,67
In IPython:
    In [1]: %timeit pd.read_csv('test.csv', parse_dates=[['date', 'time']])
    1 loops, best of 3: 10.3 s per loop

    In [2]: %timeit read_csv('test.csv', cache={})
    1 loops, best of 3: 62.6 ms per loop
To limit memory usage, simply replace the plain dict cache with something like an LRU cache.
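A minimal sketch of that idea, assuming the per-string parsing is pulled out into helper functions so that functools.lru_cache can bound the number of cached entries (the function names below are hypothetical):

    from datetime import datetime
    from functools import lru_cache

    import numpy as np
    import pandas as pd


    @lru_cache(maxsize=4096)       # bounded replacement for the unbounded dict
    def _parse_date(d_str):
        return tuple(int(item) for item in d_str[:10].split('-'))


    @lru_cache(maxsize=86400)      # at most one entry per second of the day
    def _parse_time(t_str):
        return tuple(int(item) for item in t_str.split(':'))


    def parse_datetime_lru(dt_array):
        date_time = np.empty(dt_array.shape[0], dtype=object)
        for i, (d_str, t_str) in enumerate(dt_array):
            date_time[i] = datetime(*_parse_date(d_str), *_parse_time(t_str))
        return pd.to_datetime(date_time)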