Keyboard-based dictionary comparisons

Question

Keyboard-based dictionary comparisons

I have a list of "entries" like this

data = [ {'id':1, 'name': 'A', 'price': 10, 'url': 'foo'}, {'id':2, 'name': 'A', 'price': 20, 'url': 'bar'}, {'id':3, 'name': 'A', 'price': 30, 'url': 'baz'}, {'id':4, 'name': 'A', 'price': 10, 'url': 'baz'}, {'id':5, 'name': 'A', 'price': 20, 'url': 'bar'}, {'id':6, 'name': 'A', 'price': 30, 'url': 'foo'}, {'id':7, 'name': 'A', 'price': 99, 'url': 'quu'}, {'id':8, 'name': 'B', 'price': 10, 'url': 'foo'}, ]

I want to delete entries that are "duplicates", where equality is determined by a list of logical conditions. Each item in the list is an OR clause, and all AND items are together. For example:

 filters = [ ['name'], ['price', 'url'] ]

means that two entries are considered equal if their name AND (their price OR URL) are equal. For the above example:

 For item 1 the duplicates are 4 (by name and price) and 6 (name+url) For item 2 - 5 (name+price, name+url) For item 3 - 4 (name+url) and 6 (name+price) For item 7 there are no duplicates (neither price nor url match) For item 8 there are no duplicates (name doesn't match)

Thus, the resulting list should contain elements 1, 2, 3, 7, and 8.

Please note that

there may be more conditions AND: ['name'], ['price', 'url'], ['weight'], ['size'], ...
OR groups in a condition list can be longer than two elements, for example. ['name'], ['price', 'url', 'weight']...
the list of sources is very long, O(n^2) alogirthm is out of the question

+7

python dictionary algorithm

georg Dec 16 '13 at 21:11

source share

2 answers

To avoid this in O(n^2) , you need to create an index for each query that you want to make. After the machine will request any value for a constant time, your O(n^2) turns into O(n) trivially. And you can also build all indexes in O(n) .

Assuming each of your values has the same fields, it will look like this:

 indices = defaultdict(lambda: defaultdict(set)) for i, row in enumerate(data): for field in 'id', 'name', 'price', 'url': key = row[field] indices[field][key].add(i)

Now, to find a specific value, it is simple:

 def search(field, key): return (data[index] for index in indices[field][key])

To find a group of or ed values together, simply search individually and set.union them together, for example:

 def search_disj(factors): sets = (indices[field][key] for field, key in factors) return (data[index] for index in reduce(set.union, sets))

And to find the disjunction group and ed together, do the same for each, and then set.intersection all the results together.

Depending on your data, it may be more efficient to simply look at the first index and then linearly search for results for other factors. You can optimize this by changing the fields so that you first look for the smallest len(indices[field]) . (Or, in this case, with the smallest sum (len (indices [field]) for the field in disj).)

If you can arbitrarily embed conjunctions of disjunctions of conjunctions ... until you move to single elements, you simply call either other mutually recursive ones (with a base case for flat elements). You can even expand it to a completely logical search (although you will also need the not - universe - indices[field][key] operation, where universe = set(range(len(data))) is for this).

If the data is very large, you cannot store all indexes in memory.

Or, even if you can store all indexes in memory, caching and even page skipping can make the hash table less ideal, in which case you probably want to consider something based on the B-tree (e.g. blist.sorteddict ) instead of dict. It also gives you the advantage that you can search for ranges of values, order results, etc. The disadvantage is that all those n times become n log n , but if you need functionality or you get two orders of magnitude - profitable terrain in exchange for the cost of log(n, base) , which turns out to be only 7, it may be worth it.

Or, conversely, use some kind of disk storage with disk support, for example anydbm .

However, really, what you are building is a relational database with a single relation (table). In many cases, you would be better off using only off-the-shelf relational databases such as sqlite3 , which Python comes with built-in. Then the code to create the index is as follows:

 db.execute('CREATE INDEX id_idx ON data (id)')

... and you can just make queries, and they magically use the right indexes in the best way:

 curs = db.execute('SELECT * FROM data WHERE name = ? AND (price = ? OR url = ?)', filters)

+8

abarnert Dec 16 '13 at 21:32

source share

georg · Accepted Answer · 2013-12-17T11:39:13+0000

Based on the idea of Tim Pitzker, the following works for me:

We start by converting the CNF condition of type a&(b|c) to DNF: (a&b)|(a&c) . Using list notation, as in the question, i.e. [ [a], [b, c] ] , DNF will be [ [a, b], [a, c] ] . In python, it is as simple as itertools.product(*filters) .

Then we iterate over the list and for each conjunction in DNF create a composite key:

 ( (a, rec[a]), (b, rec[b]) )

and check if any of the keys has already been detected. If not, we consider the entry to be unique and add its keys to the set seen :

The code:

 seen = set() dnf = list(itertools.product(*filters)) for item in data: keys = set( tuple((field, item.get(field, None)) for field in conjunct) for conjunct in dnf) if keys.isdisjoint(seen): seen |= keys print item # unique

Kudos to Tim for giving me an idea. If anyone sees any problems with this solution, let me know.

Keyboard-based dictionary comparisons

More articles: