I finally realized what was going on.
In short
The root cause was a bug, and it was on my side: I was not flushing the data before making the copy in the sorted case. As a result, the copy was based on incomplete data, and so was the new sorted table. This is what caused the slowdown, and flushing where appropriate led to a less surprising result:
...
But why?
I realized my mistake when I decided to inspect and compare the structure and data of the “unsorted” and “sorted” tables. I noticed that in the sorted case the table had fewer rows. The number varied, seemingly randomly, from 0 to 450 depending on the size of the data column. Moreover, in the sorted table, the id of every row was set to 0. I assume that when creating a table, pytables initializes the columns and may or may not pre-create some of the rows with an initial value. This “may or may not” probably depends on the size of the row and the computed chunksize.
As a result, when querying the sorted table, every query except the one with id == 0 returned no result. Initially, I thought that raising and catching the StopIteration error was the reason for the slowdown, but that would not explain why the slowdown depends on the size of the data column.
After reading some of the pytables code (especially table.py and tableextension.pyx), I think the following happens: when a column is indexed, pytables first tries to use the index to speed up the search. If matching rows are found, only those rows are read. But if the index indicates that no row matches the query, for some reason pytables falls back to an “in-kernel” search, which iterates over and reads all the rows. This requires reading the full rows from disk in multiple I/O operations, which is why the size of the data column matters. Also, below a certain size of this column, pytables did not “pre-create” any rows on disk, resulting in a sorted table without any rows at all. This is why, on the graph, the search is very fast when the column size is less than 525: iterating over an empty table does not take much time.
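To make the indexed lookup described above concrete, here is a minimal sketch, assuming PyTables 3.x. The `Record` description, the temporary file path and the table name are made up for illustration and are not taken from the question. It only shows how to ask whether a condition can use an index and how a matching and a non-matching query behave on a fully flushed table; it does not reproduce the 3.1.1 fallback behaviour.

```python
import os
import tempfile

import tables


class Record(tables.IsDescription):
    # Hypothetical layout: an integer id plus a fixed-size payload column.
    id = tables.Int64Col(pos=0)
    data = tables.StringCol(64, pos=1)


path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")

with tables.open_file(path, "w") as h5:
    table = h5.create_table("/", "records", Record)
    table.append([(i, b"x") for i in range(1000)])
    table.flush()                 # make sure every row is on disk
    table.cols.id.create_index()  # index the column used in queries

    # Names of the columns whose index the query could use (may be empty).
    usable = table.will_query_use_indexing("id == 42")

    # One id that exists and one that lies outside the indexed range.
    hits = [row["id"] for row in table.where("id == 42")]
    misses = [row["id"] for row in table.where("id == 5000")]

print(usable, hits, misses)
```

If `usable` comes back empty, the query would be evaluated by scanning the rows rather than through the index.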
I don’t understand why the iterator falls back to an “in-kernel” search. If the id being searched for is clearly outside the index bounds, I see no reason to look for it anyway ... Edit: After a closer look at the code, this turns out to be due to a bug. It is present in the version I'm using (3.1.1), but has been fixed in 3.2.0.
Irony
What really makes me cry is that I forgot to flush before copying only in the example from the question. In my real program this bug is not present! What I also did not know, but found out while investigating the issue, is that by default pytables does not propagate indexes. This must be requested explicitly with propindexes=True. This is why the search was slower after sorting in my application ...
So, the moral of this story:
- Indexing is good: use it
- But don't forget to propagate the indexes when sorting the table
- Before reading, make sure your data is on disk ...
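
The three points above can be put together in a short sketch, assuming PyTables 3.x. The `Record` description, the file path and the table names are hypothetical, chosen only for illustration; this is not the code from the question.

```python
import os
import tempfile

import tables


class Record(tables.IsDescription):
    # Hypothetical layout: an integer id plus a fixed-size payload column.
    id = tables.Int64Col(pos=0)
    data = tables.StringCol(16, pos=1)


path = os.path.join(tempfile.mkdtemp(), "sorted_copy_demo.h5")

with tables.open_file(path, "w") as h5:
    table = h5.create_table("/", "unsorted", Record)

    # Fill the table in descending id order through the buffered row API.
    row = table.row
    for i in range(999, -1, -1):
        row["id"] = i
        row["data"] = b"payload"
        row.append()

    # Flush BEFORE copying, so the copy sees every row.
    table.flush()

    # A completely sorted index (CSI) on the column used for sorting.
    table.cols.id.create_csindex()

    # Sorted copy; propindexes=True asks for the index to be recreated
    # on the new table (it is not propagated by default).
    sorted_table = table.copy(newname="sorted", sortby="id", propindexes=True)

    same_row_count = sorted_table.nrows == table.nrows
    first_ids = sorted_table.cols.id[:5].tolist()
    index_propagated = sorted_table.cols.id.is_indexed

print(same_row_count, first_ids, index_propagated)
```

Comparing `nrows` between the two tables is a cheap way to catch a copy made from unflushed data.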