Why is querying a table much slower after sorting it?

I have a Python program that uses Pytables and queries the table as follows:

    def get_element(table, somevar):
        rows = table.where("colname == somevar")
        row = next(rows, None)
        if row:
            return elem_from_row(row)

To reduce the query time, I decided to try sorting the table with table.copy(sortby='colname') . It did improve the query time (spent in where ), but it increased the time spent in the next() built-in function by several orders of magnitude! What could be the reason?

This slowdown only occurs when there is another column in the table, and it grows with the size of that other column's elements.

To help me understand the problem, and to make sure it was not related to anything else in my program, I made this minimal working example reproducing it:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import tables
    import time
    import sys

    def create_set(sort, withdata):
        #Table description with or without data
        tabledesc = {
            'id': tables.UIntCol()
        }
        if withdata:
            tabledesc['data'] = tables.Float32Col(2000)

        #Create table with CSI'ed id
        fp = tables.open_file('tmp.h5', mode='w')
        table = fp.create_table('/', 'myset', tabledesc)
        table.cols.id.create_csindex()

        #Fill the table with sorted ids
        row = table.row
        for i in xrange(500):
            row['id'] = i
            row.append()

        #Force a sort if asked for
        if sort:
            newtable = table.copy(newname='sortedset', sortby='id')
            table.remove()
            newtable.rename('myset')

        fp.flush()
        return fp

    def get_element(table, i):
        #By construction, i always exists in the table
        rows = table.where('id == i')
        row = next(rows, None)
        if row:
            return {'id': row['id']}
        return None

    sort = sys.argv[1] == 'sort'
    withdata = sys.argv[2] == 'withdata'

    fp = create_set(sort, withdata)
    start_time = time.time()
    table = fp.root.myset
    for i in xrange(500):
        get_element(table, i)
    print("Queried the set in %.3fs" % (time.time() - start_time))
    fp.close()

And here is what the console output shows:

    $ ./timedset.py nosort nodata
    Queried the set in 0.718s
    $ ./timedset.py sort nodata
    Queried the set in 0.003s
    $ ./timedset.py nosort withdata
    Queried the set in 0.597s
    $ ./timedset.py sort withdata
    Queried the set in 5.846s

Some notes:

  • Rows are actually sorted in all cases, so the issue seems related to the table being aware of the sort, not just to the data happening to be sorted.
  • If instead of creating the file I read it from disk, I get the same results.
  • The problem only occurs when a data column is present, even though I never write or read it. I noticed that the time difference increases "in stages" as the size of the column (the number of floats) grows, so the slowdown must come from internal data movements or I/O: [graph: query time as a function of the data column size]
  • If instead of the next function I use a for row in rows loop and rely on there being only one result, the slowdown still happens (see the sketch after this list).
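
For reference, the loop-based variant from the last note could look like this (a sketch following the MWE above; get_element_loop is an illustrative name):

    def get_element_loop(table, i):
        #Equivalent to get_element(), but iterating instead of calling next();
        #the table is built so that each id occurs exactly once
        for row in table.where('id == i'):
            return {'id': row['id']}
        return None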

Accessing an element of a table by some kind of identifier (sorted or not) sounds like a basic feature, so I must be missing the typical way of doing it with pytables. What is it? And why such a terrible slowdown? Is this a bug I should report?

optimization python iterator pytables
1 answer

I finally realized what was going on.

In short

The main cause was a bug, and it was on my side: I did not flush the data before making the copy in the sorted case. As a result, the copy was based on data that had not been completely written, and so was the new sorted table. This caused the slowdown, and flushing where needed led to a much less surprising result:

    ...
    #Fill the table with sorted ids
    row = table.row
    for i in xrange(500):
        row['id'] = i
        row.append()
    fp.flush() # <--

    #Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id')
        table.remove()
        newtable.rename('myset')
        fp.flush() # <--

    return fp
    ...

But why?

I realized my mistake when I decided to inspect and compare the structure and data of the "unsorted" and "sorted" tables. I noticed that in the sorted case the table had fewer rows. The number varied, seemingly randomly, from 0 to 450 depending on the size of the data column. Moreover, in the sorted table, the id of all rows was set to 0. I guess that when creating a table, pytables initializes the columns and may or may not pre-create some of the rows with some initial value. That "may or may not" probably depends on the size of the row and the computed chunksize .
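
A quick way to observe this yourself (a diagnostic sketch; the file and node names follow the MWE above):

    import tables

    #Inspect what actually landed on disk after the unflushed sorted copy:
    #the row count and the id values reveal the truncated, zero-filled table
    with tables.open_file('tmp.h5', mode='r') as fp:
        table = fp.root.myset
        print(table.nrows)       #fewer than the 500 appended rows
        print(table[:5]['id'])   #ids read back as 0 in the broken case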

As a result, when querying the sorted table, all queries except the one with id == 0 had no result. I initially thought that raising and catching the StopIteration was what slowed things down, but that would not explain why the slowdown depends on the size of the data column.
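
As a sanity check on that first hypothesis: next() with a default argument never lets StopIteration escape, so exception handling alone could not explain a slowdown that scales with the data column size.

    #next(iterator, default) catches StopIteration internally and returns
    #the default, so no exception propagates on a no-match query
    empty = iter([])
    assert next(empty, None) is None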

After reading some pytables code (notably table.py and tableextension.pyx ), I think the following happens: when a column is indexed, pytables first tries to use the index to speed up the search. If some matching rows are found, only those rows are read. But if the index reports that no row matches the query, for some reason pytables falls back to an in-kernel search, which iterates over and reads all the rows. This requires reading the complete rows from disk in multiple I/O operations, which is why the size of the data column matters. It is also why, below a certain size of that column, pytables had not "pre-created" some of the rows on disk, leaving the sorted table with no rows at all: on the graph, queries are very fast when the column size is below 525 because iterating over 0 rows does not take much time.
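
If you want to check whether a given condition can go through the index at all, the Table API exposes will_query_use_indexing (this only reports the planned path, not the no-match fallback described above; treat the exact return value as version-dependent):

    #Ask pytables which column indexes the condition could use;
    #an empty result means the query would run as an in-kernel scan
    i = 42
    print(table.will_query_use_indexing('id == i', condvars={'i': i}))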

I don't understand why the iterator falls back to an in-kernel search. If the id you are looking for is clearly beyond the index's bounds, I see no reason to search for it anyway... Edit: after a closer look at the code, it turns out this is due to a bug. It is present in the version I am using (3.1.1), but was fixed in 3.2.0 .

Irony

What really makes me cry is that I forgot to flush before copying only in the example for this question; in my actual program, this bug is not present! What I also did not know, and found out while investigating the issue, is that by default pytables does not propagate indexes when copying. This must be requested explicitly with propindexes=True . That is why the search was slower after sorting in my application...
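
In code, the index-preserving sorted copy could look like this (a sketch reusing the MWE's names; propindexes is a documented keyword of Table.copy):

    #Sort the table AND re-create its indexes on the copy; without
    #propindexes=True the sorted copy has no index on 'id' at all
    newtable = table.copy(newname='sortedset', sortby='id', propindexes=True)
    table.remove()
    newtable.rename('myset')
    fp.flush()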

So, the moral of this story:

  • Indexing is good: use it
  • But don't forget to propagate them ( propindexes=True ) when sorting a table
  • Make sure your data is on disk before reading it...
