I created an HDF5 store with hierarchical keys. It has the following structure:
<class 'pandas.io.pytables.HDFStore'>
File path: path-analysis/data/store.h5
/attribution/attr_000000 frame (shape->[1,5])
/attribution/attr_000001 frame (shape->[1,5])
/attribution/attr_000002 frame (shape->[1,5])
/attribution/attr_000003 frame (shape->[1,5])
.....
/impression/imp_000000 frame (shape->[1,5])
/impression/imp_000001 frame (shape->[1,5])
/impression/imp_000002 frame (shape->[1,5])
/impression/imp_000003 frame (shape->[1,5])
.....
From what I read in the documentation, I should be able to select impression and attribution as follows:
store.select('impression')
store.select('attribution')
However, I get an error: TypeError: cannot create a storer if the object is not existing nor a value are passed
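For comparison, here is a minimal sketch of reading everything under one group by walking the keys and concatenating the frames. The helper names `keys_under` and `select_group` are hypothetical, not part of pandas; the sketch assumes only that the store exposes `keys()` (returning paths such as `/impression/imp_000000`) and `select(key)`, as `HDFStore` does. It is a possible workaround, not necessarily the intended way to query a group.

```python
import pandas as pd


def keys_under(store, group):
    """Return the sorted store keys that live directly or indirectly below `group`."""
    # HDFStore.keys() returns absolute paths with a leading slash.
    prefix = '/' + group.strip('/') + '/'
    return sorted(k for k in store.keys() if k.startswith(prefix))


def select_group(store, group):
    """Read every frame stored below `group` and stack them into one DataFrame."""
    keys = keys_under(store, group)
    if not keys:
        raise KeyError('no frames stored under %r' % group)
    # One read per key; the original row indexes are discarded.
    return pd.concat([store.select(k) for k in keys], ignore_index=True)
```

With the store above this would be called as `select_group(store, 'impression')`. Every frame is loaded into memory, so it only helps if the combined data fits in RAM.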
To add data to the store, I iterated over my data frames:
store.put('impression/imp_' + name, df)
Initially I used the append API to build a single impression table, but that took 80 seconds per frame. With almost 200 files to process, append turned out to be too slow.
By comparison, put takes less than a second per frame, but it does not let me select the data afterwards.
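One variant I have not timed on this data is calling put with format='table', which writes each frame in a single call but in the queryable table format. A minimal sketch, where the helper name `put_frames_as_tables` and the `frames` mapping are hypothetical; whether this is actually faster than append here is an open assumption:

```python
import pandas as pd


def put_frames_as_tables(store, group, frames):
    """Write each frame in `frames` (a dict of name -> DataFrame) under `group`.

    format='table' stores the data in the PyTables table format, which is
    what allows select() with a `where` clause later on.
    """
    written = []
    for name in sorted(frames):
        key = group.strip('/') + '/' + name
        store.put(key, frames[name], format='table')
        written.append(key)
    return written
```

A frame written this way can then be read back with, for example, `store.select('impression/imp_000000', where='index < 2')`.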
df info
<class 'pandas.core.frame.DataFrame'>
Int64Index: 251756 entries, 0 to 257114
Data columns (total 5 columns):
pmessage_type 251756 non-null object
channel 251756 non-null object
source_timestamp 251756 non-null object
winning_price 251756 non-null int64
ipaddress 251756 non-null object
dtypes: int64(1), object(4)None
pmessage_type, , source_timestamp, WINNING_PRICE, IPAddress
0, , , 1400792099000,1800,99.34.198.9
1, , , 1401587896000,200,99.60.68.61
2, , , 1400873220000,735,65.96.72.183
3, , , 1400768556000,5550,73.182.225.30
4, , , 1401255378000,2099,65.96.72.183
5, , , 1400992770000,88,73.182.225.30
6, , , 1400709948000,290,162.228.58.98
7, , , 1400634607000,1720,162.228.58.98
8, , , 1399201568000,710,108.206.240.138
The sample above is the output of df.to_csv(...).
The data frame is read as follows:
data = pd.read_csv(events_csv_file,
                   delimiter='\x01',
                   header=None,
                   names=my_columns.keys(),
                   dtype=my_columns,
                   usecols=my_subset_columns,
                   iterator=True,
                   chunksize=1e6)
df = pd.concat(data)
where my_columns is:
{'attribution_strategy': object,
'channel': object,
'flight_uid': object,
'ipaddress': object,
'pixel_id': object,
'pmessage_type': object,
'source_timestamp': object,
'source_unique_id': object,
'unique_id': object,
'user_id': object,
'winning_price': numpy.int64}
The pandas version I am using:
>>> pandas.__version__
'0.14.0'
>>>
=====================
A minimal example that reproduces the problem:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar',
                         'foo', 'foo', 'foo'],
                   'B': ['one', 'one', 'one', 'two',
                         'one', 'one', 'one', 'two',
                         'two', 'two', 'one'],
                   'C': ['dull', 'dull', 'shiny', 'dull',
                         'dull', 'shiny', 'shiny', 'dull',
                         'shiny', 'shiny', 'shiny'],
                   'D': np.random.randn(11),
                   'E': np.random.randn(11),
                   'F': np.random.randn(11)})
store = pd.HDFStore('mystore.h5')
store.put('data/01', df)
store.put('data/02', df)
print store
<class 'pandas.io.pytables.HDFStore'>
File path: mystore.h5
/data/01 frame (shape->[11,6])
/data/02 frame (shape->[11,6])
store.select('data')
This raises:
TypeError Traceback (most recent call last)
<ipython-input-33-60360d11cde5> in <module>()
/Users/sshegheva/anaconda/envs/numba/lib/python2.7/site-packages/pandas/io/pytables.pyc in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
650
651 where = _ensure_term(where, scope_level=1)
653 s.infer_axes()
654
/Users/sshegheva/anaconda/envs/numba/lib/python2.7/site-packages/pandas/io/pytables.pyc in _create_storer(self, group, format, value, append, **kwargs)
1157 else:
1158 raise TypeError(
-> 1159 "cannot create a storer if the object is not existing "
1160 "nor a value are passed")
1161 else:
TypeError: cannot create a storer if the object is not existing nor a value are passed
store.remove('data')