So, years later, I still have the same question, but with the ability to index and query the table this problem is only mildly painful, depending on the size of your table. Using readWhere or getWhereList, I think the problem is approximately O(n).
Here's what I did... 1. I created a table with two indexed columns. You can use multiple indexes in PyTables:
http://pytables.github.com/usersguide/optimization.html#indexed-searches
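For reference, here is a minimal sketch (mine, not part of the original answer) of how a table with those two indexed columns and LZO compression might be created, using the same PyTables 2.x API as the code below; the column names and types are assumptions based on the query further down:

import tables

class Record(tables.IsDescription):
    date = tables.Int64Col()       # assumed: timestamp stored as an integer
    userID = tables.StringCol(32)  # assumed: fixed-width string key

filters = tables.Filters(complib='lzo', complevel=1)  # LZO compression, as mentioned below
h5f = tables.openFile('filename.h5', mode='w')
grp = h5f.createGroup('/', 'data')
tbl = h5f.createTable(grp, 'data_table', Record, filters=filters)
tbl.cols.date.createIndex()    # index both columns used in the duplicate query
tbl.cols.userID.createIndex()
h5f.close()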
Once your table is indexed (I also use LZO compression), you can do the following:
import tables

h5f = tables.openFile('filename.h5')
tbl = h5f.getNode('/data', 'data_table')  # assumes group /data and table data_table
counter = 0
for row in tbl:
    ts = row['date']      # timestamp (ts) or date
    uid = row['userID']
    query = '(date == %d) & (userID == "%s")' % (ts, uid)
    result = tbl.readWhere(query)
    if len(result) > 1:
        # Do something here: the (ts, uid) key occurs more than once
        pass
    counter += 1
    if counter % 1000 == 0:
        print '%d rows processed' % counter
Now the code that I wrote here is actually slow. I'm sure there is some PyTables guru who can give you a better answer. But here are my thoughts on performance:
If you know up front that you are starting with clean data (i.e. no duplicates), then all you need to do is query the table once for each key you are interested in, which means you only need:
ts = row['date']
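As a rough sketch (my own, not from the original answer), that single lookup for a new key before appending it might look like this; the helper name and the new_ts / new_uid variables are hypothetical:

def already_present(tbl, new_ts, new_uid):
    # Single indexed lookup for one (date, userID) key
    query = '(date == %d) & (userID == "%s")' % (new_ts, new_uid)
    return len(tbl.readWhere(query)) > 0

# Usage sketch: only append rows whose key is not already in the table
# if not already_present(tbl, new_ts, new_uid):
#     row = tbl.row
#     row['date'] = new_ts
#     row['userID'] = new_uid
#     row.append()
#     tbl.flush()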
If you have a lot of idle time in which to check for duplicates, you can create a background process that scans the directory containing your files and looks for duplicates. A sketch of what that might look like follows below.
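This is a sketch under assumptions, not the answer's own code: it reuses the /data/data_table layout from above, and check_for_duplicates and the directory path are hypothetical.

import os
import tables

def check_for_duplicates(tbl):
    # Count rows whose (date, userID) key occurs more than once
    dupes = 0
    for row in tbl:
        query = '(date == %d) & (userID == "%s")' % (row['date'], row['userID'])
        if len(tbl.readWhere(query)) > 1:
            dupes += 1
    return dupes

for root, dirs, files in os.walk('/path/to/h5/files'):  # assumed location of the .h5 files
    for name in files:
        if name.endswith('.h5'):
            h5f = tables.openFile(os.path.join(root, name))
            tbl = h5f.getNode('/data', 'data_table')
            print('%s: %d duplicate rows' % (name, check_for_duplicates(tbl)))
            h5f.close()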
I hope this helps someone else.
aquil.abdullah