How to eliminate the HDFStore exception: cannot find the correct atom type

I am looking for some general guidelines on what data types may cause this exception. I tried massaging my data in various ways, but to no avail.

I have been hitting this exception for several days; there have been several discussions on the Google group, but no way has been found to debug the HDFStore exception: cannot find the correct atom type. I am reading a simple CSV file of mixed data types:

    Int64Index: 401125 entries, 0 to 401124
    Data columns:
    SalesID                     401125  non-null values
    SalePrice                   401125  non-null values
    MachineID                   401125  non-null values
    ModelID                     401125  non-null values
    datasource                  401125  non-null values
    auctioneerID                380989  non-null values
    YearMade                    401125  non-null values
    MachineHoursCurrentMeter    142765  non-null values
    UsageBand                   401125  non-null values
    saledate                    401125  non-null values
    fiModelDesc                 401125  non-null values
    Enclosure_Type              401125  non-null values
    ...................................................
    Stick_Length                401125  non-null values
    Thumb                       401125  non-null values
    Pattern_Changer             401125  non-null values
    Grouser_Type                401125  non-null values
    Backhoe_Mounting            401125  non-null values
    Blade_Type                  401125  non-null values
    Travel_Controls             401125  non-null values
    Differential_Type           401125  non-null values
    Steering_Controls           401125  non-null values
    dtypes: float64(2), int64(6), object(45)

Code for storing the data frame:

    In [30]: store = pd.HDFStore('test0.h5','w')

    In [31]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
       ....:     store.append('df', chunk, index=False)

Note that if I import the whole file in one shot and use store.put on the resulting data frame, I can save it successfully, albeit slowly (I believe this is due to pickling of the object dtypes, even though the objects are just string data).
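For contrast, a minimal sketch of that one-shot put path (only the output file name test_put.h5 is made up here):

    import pandas as pd

    # One-shot load: read the whole file, then write the frame with put.
    df = pd.read_csv('Train.csv')

    store = pd.HDFStore('test_put.h5', 'w')
    store.put('df', df)   # fixed-format store; slow for me, but no atom-type error
    store.close()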

Are there any NaN value considerations that may cause this exception?

The exception:

    Exception: cannot find the correct atom type -> [dtype->object,items->Index([UsageBand, saledate, fiModelDesc, fiBaseModel, fiSecondaryDesc, fiModelSeries, fiModelDescriptor, ProductSize, fiProductClassDesc, state, ProductGroup, ProductGroupDesc, Drive_System, Enclosure, Forks, Pad_Type, Ride_Control, Stick, Transmission, Turbocharged, Blade_Extension, Blade_Width, Enclosure_Type, Engine_Horsepower, Hydraulics, Pushblock, Ripper, Scarifier, Tip_Control, Tire_Size, Coupler, Coupler_System, Grouser_Tracks, Hydraulics_Flow, Track_Type, Undercarriage_Pad_Width, Stick_Length, Thumb, Pattern_Changer, Grouser_Type, Backhoe_Mounting, Blade_Type, Travel_Controls, Differential_Type, Steering_Controls], dtype=object)] list index out of range

UPDATE 1

Jeff's advice about lists being stored in the dataframe led me to examine embedded commas. pandas.read_csv parses the file correctly, and some fields do contain embedded commas inside double quotes. So these fields are not Python lists as such, but they do have commas in the text. Here are some examples:

    3     Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons
    6     Hydraulic Excavator, Track - 21.0 to 24.0 Metric Tons
    8     Hydraulic Excavator, Track - 3.0 to 4.0 Metric Tons
    11    Track Type Tractor, Dozer - 20.0 to 75.0 Horsepower
    12    Hydraulic Excavator, Track - 19.0 to 21.0 Metric Tons
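To double-check that quoted commas are not the culprit on their own, here is a small self-contained sketch (the sample row is made up from the values above) showing that read_csv keeps a quoted field with an embedded comma as one string:

    import io
    import pandas as pd

    sample = ('SalesID,fiProductClassDesc\n'
              '3,"Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons"\n')

    df = pd.read_csv(io.StringIO(sample))
    print(df.shape)                       # (1, 2): the quoted comma did not split the field
    print(df.fiProductClassDesc.iloc[0])  # one string, comma preserved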

However, when I drop this column from the pd.read_csv chunks and append the rest to my HDFStore, I still get the same exception. When I try to append each column individually, I get the following new exception:

    In [6]: for chunk in pd.read_csv('Train.csv', header=0, chunksize=50000):
       ...:     for col in chunk.columns:
       ...:         store.append(col, chunk[col], data_columns=True)

    Exception: cannot properly create the storer for: [_TABLE_MAP] [group->/SalesID (Group) '',value-><class 'pandas.core.series.Series'>,table->True,append->True,kwargs->{'data_columns': True}]

I will continue troubleshooting. Here is a link to several hundred entries:

https://docs.google.com/spreadsheet/ccc?key=0AutqBaUiJLbPdHFvaWNEMk5hZ1NTNlVyUVduYTZTeEE&usp=sharing

UPDATE 2

Ok, I tried the following on my work machine and got this result:

    In [4]: store = pd.HDFStore('test0.h5','w')

    In [5]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
       ...:     store.append('df', chunk, index=False, data_columns=True)
       ...:

    Exception: cannot find the correct atom type -> [dtype->object,items->Index([fiBaseModel], dtype=object)] [fiBaseModel] column has a min_itemsize of [13] but itemsize [9] is required!

I think I know what's going on here. If I take the maximum length of the fiBaseModel field for the first chunk, I get the following:

    In [16]: lens = df.fiBaseModel.apply(lambda x: len(x))

    In [17]: max(lens[:10000])
    Out[17]: 9

And for the second chunk:

    In [18]: max(lens[10001:20000])
    Out[18]: 13

So the store's table is created with 9 bytes for this column, because that is the maximum of the first chunk. When it encounters a longer text field in a subsequent chunk, it throws the exception.

My suggestion would be to either truncate the data in subsequent chunks (with a warning), or let the user specify a maximum storage size for the column and truncate anything that exceeds it. Maybe pandas can do this already; I haven't had time to dive deeply into HDFStore yet.
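For what it's worth, append already accepts a min_itemsize argument that pre-sizes string columns when the table is first created, which is one way around the width problem. A sketch (the 200-byte reservation and the file name test_width.h5 are arbitrary choices):

    import pandas as pd

    store = pd.HDFStore('test_width.h5', 'w')

    for chunk in pd.read_csv('Train.csv', chunksize=10000):
        # Reserve 200 bytes per string value in the shared values blocks so that
        # longer strings in later chunks still fit; per-column sizes also work
        # when the column is a data_column.
        store.append('df', chunk, index=False, min_itemsize={'values': 200})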

UPDATE 3

Trying to import the CSV dataset using pd.read_csv, I pass a dictionary mapping every column to 'object' via the dtype parameter. Then I iterate over the file and append each chunk to the HDFStore, passing a large value for min_itemsize. I get the following exception:

 AttributeError: 'NoneType' object has no attribute 'itemsize' 

My simple code is:

    store = pd.HDFStore('test0.h5','w')
    objects = dict((col, 'object') for col in header)

    for chunk in pd.read_csv('Train.csv', header=0, dtype=objects,
                             chunksize=10000, na_filter=False):
        store.append('df', chunk, min_itemsize=200)

I tried debugging and inspected the items in the stack trace. This is what the table looks like at the point of the exception:

    ipdb> self.table
    /df/table (Table(10000,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": StringCol(itemsize=200, shape=(53,), dflt='', pos=1)}
      byteorder := 'little'
      chunkshape := (24,)
      autoIndex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

UPDATE 4

Now I am trying to iteratively determine the length of the longest string in each object column of my dataframe. Here is how I do it:

    def f(x):
        if x.dtype != 'object':
            return
        else:
            return len(max(x.fillna(''), key=lambda x: len(str(x))))

    lengths = pd.DataFrame([chunk.apply(f) for chunk in
                            pd.read_csv('Train.csv', chunksize=50000)])
    lens = lengths.max().dropna().to_dict()

    In [255]: lens
    Out[255]:
    {'Backhoe_Mounting': 19.0,
     'Blade_Extension': 19.0,
     'Blade_Type': 19.0,
     'Blade_Width': 19.0,
     'Coupler': 19.0,
     'Coupler_System': 19.0,
     'Differential_Type': 12.0
     ... etc... }
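An equivalent scan using the vectorised .str accessor would be a bit more direct (a sketch, assuming the object columns hold strings or NaN):

    import pandas as pd

    maxlens = {}
    for chunk in pd.read_csv('Train.csv', chunksize=50000):
        for col in chunk.select_dtypes(include=['object']).columns:
            # Longest string seen so far in this column across all chunks.
            cur = chunk[col].fillna('').astype(str).str.len().max()
            maxlens[col] = max(maxlens.get(col, 0), int(cur))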

Once I have the maximum string length per column, I try to pass the dictionary to append via the min_itemsize argument:

    In [262]: for chunk in pd.read_csv('Train.csv', chunksize=50000, dtype=types):
       .....:     store.append('df', chunk, min_itemsize=lens)

    Exception: cannot find the correct atom type -> [dtype->object,items->Index([UsageBand, saledate, fiModelDesc, fiBaseModel, fiSecondaryDesc, fiModelSeries, fiModelDescriptor, ProductSize, fiProductClassDesc, state, ProductGroup, ProductGroupDesc, Drive_System, Enclosure, Forks, Pad_Type, Ride_Control, Stick, Transmission, Turbocharged, Blade_Extension, Blade_Width, Enclosure_Type, Engine_Horsepower, Hydraulics, Pushblock, Ripper, Scarifier, Tip_Control, Tire_Size, Coupler, Coupler_System, Grouser_Tracks, Hydraulics_Flow, Track_Type, Undercarriage_Pad_Width, Stick_Length, Thumb, Pattern_Changer, Grouser_Type, Backhoe_Mounting, Blade_Type, Travel_Controls, Differential_Type, Steering_Controls], dtype=object)] [values_block_2] column has a min_itemsize of [64] but itemsize [58] is required!

The offending column was passed a min_itemsize of 64, yet the exception says an itemsize of 58 is required. Could this be a bug?

    In [266]: pd.__version__
    Out[266]: '0.11.0.dev-eb07c5a'

1 answer

The link you provided worked fine for storing the frame. Column by column just means specifying data_columns=True. It will process the columns individually and raise on the offending one.

Diagnose

    store = pd.HDFStore('test0.h5','w')

    In [31]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
       ....:     store.append('df', chunk, index=False, data_columns=True)

In production you probably want to restrict data_columns to the columns you actually want to query on (it can also be None, in which case you can only query on the index/columns).
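Something like this, where only a few columns are made individually queryable (the chosen columns and the string form of the where clause are only illustrative; older pandas versions used Term objects for queries):

    import pandas as pd

    store = pd.HDFStore('test0.h5', 'w')

    for chunk in pd.read_csv('Train.csv', chunksize=10000):
        # Only these columns get their own queryable columns in the table;
        # the rest are packed into shared value blocks.
        store.append('df', chunk, index=False,
                     data_columns=['SalePrice', 'YearMade', 'state'])

    # Later, query on one of the data_columns:
    expensive = store.select('df', 'SalePrice > 100000')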

Update:

You may run into another issue. read_csv converts dtypes based on what it sees in each chunk, which is why the append failed with a chunksize of 10,000: chunks 1 and 2 had integer-looking data in some columns, then chunk 3 had some NaN, so those columns were upcast to float. Either specify the dtypes up front, use a larger chunksize, or run over your data twice to guarantee consistent dtypes between chunks.
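A sketch of the "specify dtypes up front" option, forcing the columns that can contain NaN to float so every chunk arrives with the same dtypes (the column choices are just taken from the info listing above):

    import numpy as np
    import pandas as pd

    dtypes = {'auctioneerID': np.float64,              # has missing values
              'MachineHoursCurrentMeter': np.float64,  # has missing values
              'UsageBand': object}

    store = pd.HDFStore('test0.h5', 'w')
    for chunk in pd.read_csv('Train.csv', chunksize=10000, dtype=dtypes):
        store.append('df', chunk, index=False, data_columns=True)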

I updated pytables.py to raise a more helpful exception in this case (and to report when a column has incompatible data).

thanks for the report!
