Efficient way to verify that entries are unique in Python / PyTables

I have a PyTables table with ~50 million records. The combination of two fields (specifically, the user ID and the date) should be unique, i.e. a user should have at most one record per day, but I need to verify that this is actually the case.

As an illustration, my table looks like this:

userID | date
A      | 1
A      | 2
B      | 1
B      | 2
B      | 2   <- bad! Problem with the data!

Additional Information:

  • Currently, the table is mostly sorted.
  • I can pull a single column into memory as a NumPy array, but I can't fit two columns in memory at the same time.
  • Both user IDs and dates are integers.
3 answers

It seems that the indexes in PyTables are limited to individual columns.

I would suggest adding a hash column and putting an index on it. Your unique key is the concatenation of the other columns, with a separator between them; the separator ensures that no two different rows can produce the same concatenated string. The hash column could simply store that concatenated string, but if your data is long you will want to hash it instead. A fast hash function like md5 or sha1 is fine for this application; it does not need to be cryptographically strong.

Compute the hash of the incoming data and check whether it is already in the database. If it is, you know you have hit a duplicate; if not, you can safely add the record.
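A minimal sketch of this approach, using the same PyTables 2.x API as the other answers (the table layout, node names, and the append_if_new helper are illustrative, not from the original post):

import hashlib
import tables

class Record(tables.IsDescription):
    userID  = tables.Int64Col()
    date    = tables.Int64Col()
    keyhash = tables.StringCol(40)   # a sha1 hex digest is 40 characters

def key_hash(user_id, date):
    # The '|' separator keeps e.g. (12, 3) and (1, 23) from both hashing as '123'
    return hashlib.sha1('%d|%d' % (user_id, date)).hexdigest()

h5f = tables.openFile('filename.h5', 'a')
grp = h5f.createGroup('/', 'data')
tbl = h5f.createTable(grp, 'data_table', Record)
tbl.cols.keyhash.createIndex()       # indexed lookups on the hash column

def append_if_new(user_id, date):
    h = key_hash(user_id, date)
    if len(tbl.getWhereList('keyhash == "%s"' % h)) > 0:
        return False                 # duplicate key, reject
    row = tbl.row
    row['userID'], row['date'], row['keyhash'] = user_id, date, h
    row.append()
    return True                      # remember to tbl.flush() when done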


So, years later, I still have this same question, but with indexing and querying this problem is only mildly painful, depending on the size of your table. Using readWhere or getWhereList, I think the problem is roughly O(n).

Here's what I did: I created a table with two indexes, as sketched below. You can use multiple indexes in PyTables:

http://pytables.github.com/usersguide/optimization.html#indexed-searches
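Creating the two indexes is just (a sketch, using the same 2.x column API):

# Index both lookup columns; PyTables allows one index per column,
# and several indexed columns per table.
tbl.cols.userID.createIndex()
tbl.cols.date.createIndex()
h5f.flush()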

Once your table is indexed (I also use LZO compression), you can do the following:

import tables

h5f = tables.openFile('filename.h5')
tbl = h5f.getNode('/data', 'data_table')   # assumes group /data and table data_table

counter = 0
for row in tbl:
    ts = row['date']                       # timestamp (ts) or date
    uid = row['userID']
    query = '(date == %d) & (userID == %d)' % (ts, uid)   # both fields are integers
    result = tbl.readWhere(query)
    if len(result) > 1:
        # more than one row matches this (userID, date) pair: duplicate found
        pass
    counter += 1
    if counter % 1000 == 0:
        print '%d rows processed' % counter

Now, the code I wrote here is admittedly slow. I'm sure some PyTables guru could give you a better answer, but here are my thoughts on performance:

If you know you are starting with clean data (i.e. no duplicates), then you only need to query the table once per key pair you are about to insert, which means you only need:

ts = row['date']    # timestamp (ts) or date
uid = row['userID']
query = '(date == %d) & (userID == %d)' % (ts, uid)
result = tbl.getWhereList(query)
if len(result) == 0:
    # key pair is not in the table: do what you were going to do
    pass
elif len(result) > 1:
    # duplicates already exist: do something here, like get a handle
    # to the row and update it instead of appending
    pass
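For the update-instead-of-append case, a sketch might look like this (the 'value' payload column is hypothetical; the original post only names userID and date):

# Update the matching row in place rather than appending a duplicate
# ('value' stands in for whatever payload field you were about to write).
for r in tbl.where(query):
    r['value'] = new_value
    r.update()
tbl.flush()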

If you have a lot of time to check for duplicates, create a background process that scans the directory with your files and looks for duplicates.

I hope this helps someone else.


I don't know much about PyTables, but I would try this approach:

  • For each user ID, fetch all of that user's (userID, date) pairs into a list rows
  • assert len(rows)==len(set(rows)): this assertion holds exactly when all the (userID, date) tuples in rows are unique (see the sketch below)
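A minimal sketch of this per-user check, reusing the PyTables 2.x API and node names from the answer above. It stays within the memory constraint because only one user's dates are loaded at a time:

import numpy as np
import tables

h5f = tables.openFile('filename.h5')
tbl = h5f.getNode('/data', 'data_table')

user_ids = np.unique(tbl.col('userID'))   # one full column fits in memory
for uid in user_ids:
    # with an index on userID this query is fast; without one, each pass scans
    rows = tbl.readWhere('userID == %d' % uid)
    dates = rows['date']
    assert len(dates) == len(set(dates)), 'duplicate dates for user %d' % uid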
