SQL to calculate Tanimoto coefficient of multiple vectors

I think my problem is easiest to explain with an example.

I have a table with the ingredients used in recipes, and I have implemented a function to calculate the Tanimoto coefficient between two ingredients. Computing the coefficient for a single pair is fast enough (it takes 3 SQL queries), but it does not scale well: computing the coefficient for every possible pair of ingredients takes N + N * (N-1) / 2 queries, or 500,500 queries for just 1,000 ingredients. Is there a faster way to do this? Here is what I have so far:

    import sqlite3 as sqlite  # assumed binding; the original only shows `sqlite.connect`

    class Filtering():
        def __init__(self):
            self._connection = sqlite.connect('database.db')

        def n_recipes(self, ingredient_id):
            # Count the recipes that use the given ingredient (1 query).
            cursor = self._connection.cursor()
            cursor.execute('''select count(recipe_id)
                              from recipe_ingredient
                              where ingredient_id = ?''', (ingredient_id,))
            return cursor.fetchone()[0]

        def n_recipes_intersection(self, ingredient_a, ingredient_b):
            # Count the recipes that use both ingredients (1 query).
            cursor = self._connection.cursor()
            cursor.execute('''select count(recipe_id)
                              from recipe_ingredient
                              where ingredient_id = ?
                              and recipe_id in (
                                  select recipe_id
                                  from recipe_ingredient
                                  where ingredient_id = ?)''',
                           (ingredient_a, ingredient_b))
            return cursor.fetchone()[0]

        def tanimoto(self, ingredient_a, ingredient_b):
            # T(a, b) = |A & B| / (|A| + |B| - |A & B|)
            n_a, n_b = map(self.n_recipes, (ingredient_a, ingredient_b))
            n_ab = self.n_recipes_intersection(ingredient_a, ingredient_b)
            return float(n_ab) / (n_a + n_b - n_ab)
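For reference, this is how it gets used; the ingredient ids below are made up, and every pair costs two count queries plus one intersection query:

    f = Filtering()
    # Hypothetical ingredient ids; each call issues 3 queries, so all
    # pairs of 1,000 ingredients need 500,500 queries in total.
    print f.tanimoto(1, 2)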
4 answers

Why don't you just load all the recipes into memory and then calculate the Tanimoto coefficients in memory?

It is simpler and much, much faster.
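A minimal sketch of what that looks like for one pair, with two hypothetical sets of recipe ids already loaded from the database:

    # Hypothetical recipe-id sets for two ingredients.
    recipes_a = set([1, 2, 3, 5])
    recipes_b = set([2, 3, 4])

    n_ab = len(recipes_a & recipes_b)    # recipes containing both: 2
    score = float(n_ab) / (len(recipes_a) + len(recipes_b) - n_ab)
    print score                          # 2 / (4 + 3 - 2) = 0.4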


In case anyone is interested, this is the code I came up with after Alex's and S.Lott's suggestions. Thanks, guys.

    class Filtering():
        def __init__(self):
            self._connection = sqlite.connect('database.db')
            self._counts = None
            self._intersections = {}

        def inc_intersections(self, ingredients):
            # For one recipe's ingredient list, bump the co-occurrence
            # counter of every unordered pair (keyed larger id first).
            ingredients.sort()
            length = len(ingredients)
            for i in xrange(1, length):
                a = ingredients[i]
                for j in xrange(0, i):
                    b = ingredients[j]
                    if a not in self._intersections:
                        self._intersections[a] = {b: 1}
                    elif b not in self._intersections[a]:
                        self._intersections[a][b] = 1
                    else:
                        self._intersections[a][b] += 1

        def precompute_tanimoto(self):
            # One query over the whole table, grouped by recipe.
            counts = {}
            self._intersections = {}
            cursor = self._connection.cursor()
            cursor.execute('''select recipe_id, ingredient_id
                              from recipe_ingredient
                              order by recipe_id, ingredient_id''')
            rows = cursor.fetchall()
            print len(rows)  # debug output
            last_recipe = None
            for recipe, ingredient in rows:
                if recipe != last_recipe:
                    if last_recipe is not None:
                        self.inc_intersections(ingredients)
                    last_recipe = recipe
                    ingredients = [ingredient]
                else:
                    ingredients.append(ingredient)
                if ingredient not in counts:
                    counts[ingredient] = 1
                else:
                    counts[ingredient] += 1
            self.inc_intersections(ingredients)
            self._counts = counts

        def tanimoto(self, ingredient_a, ingredient_b):
            if self._counts is None:
                self.precompute_tanimoto()
            if ingredient_b > ingredient_a:
                ingredient_b, ingredient_a = ingredient_a, ingredient_b
            n_a, n_b = self._counts[ingredient_a], self._counts[ingredient_b]
            n_ab = self._intersections[ingredient_a][ingredient_b]
            print n_a, n_b, n_ab  # debug output
            return float(n_ab) / (n_a + n_b - n_ab)
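Usage is unchanged from the original class; the first call pays for the single pass over recipe_ingredient, and every later call is just dictionary lookups (ids below are hypothetical):

    f = Filtering()
    print f.tanimoto(1, 2)   # triggers precompute_tanimoto(): one query total
    print f.tanimoto(3, 7)   # pure in-memory lookup from here on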

If you have 1,000 ingredients, 1,000 queries are enough to map each ingredient to the set of recipes it appears in. If (say) a typical ingredient appears in about 100 recipes, each set takes a few kilobytes, so the whole dict takes only a few MB: no problem at all keeping it in memory (and still not a serious memory problem if the average number of recipes per ingredient grows by an order of magnitude).

    def recipes_by_ingredient(cursor, all_ingredient_ids):
        # One query per ingredient, building ingredient_id -> set of
        # recipe_ids (the function name here is just illustrative).
        result = dict()
        for ing_id in all_ingredient_ids:
            cursor.execute('''select recipe_id from recipe_ingredient
                              where ingredient_id = ?''', (ing_id,))
            result[ing_id] = set(r[0] for r in cursor.fetchall())
        return result

After these 1,000 queries, each of the roughly 500,000 pairwise Tanimoto coefficients is computed purely in memory. As a further speedup, you can precompute the length of each set (for these 0/1 membership vectors, the squared norm is just the set's length) and keep those in a separate dict; the dot-product component for each pair is, of course, the length of the intersection of the two sets.
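Putting both halves together, a sketch of the in-memory pass over all pairs, assuming `sets` is the ingredient-to-recipes dict built by the loop above (names are illustrative):

    from itertools import combinations

    def all_pairs_tanimoto(sets):
        # Precompute the set sizes once, as suggested above.
        sizes = dict((ing, len(rec)) for ing, rec in sets.items())
        scores = {}
        for a, b in combinations(sets, 2):
            n_ab = len(sets[a] & sets[b])        # the "dot product" term
            union = sizes[a] + sizes[b] - n_ab
            scores[(a, b)] = float(n_ab) / union if union else 0.0
        return scores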


I think this gets you down to a single query per pair for the intersection; the two per-ingredient count queries can be cached and reused across pairs. You cannot get away from O(N^2) as long as you want all pairs: N * (N-1) / 2 is simply the number of pairs.

    def n_recipes_intersection(self, ingredient_a, ingredient_b):
        # Self-join counts recipes containing both ingredients in one query.
        cursor = self._cur  # assumes a cursor kept on the instance
        cursor.execute('''select count(recipe_id)
                          from recipe_ingredient as A
                          join recipe_ingredient as B using (recipe_id)
                          where A.ingredient_id = ?
                            and B.ingredient_id = ?;''',
                       (ingredient_a, ingredient_b))
        return cursor.fetchone()[0]
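If you want to keep the per-pair SQL approach, the count queries can be cached so that only the join query above repeats per pair; a sketch subclassing the question's Filtering class:

    class CachedFiltering(Filtering):
        def __init__(self):
            Filtering.__init__(self)
            self._count_cache = {}

        def n_recipes(self, ingredient_id):
            # Query each ingredient's recipe count once, reuse for all pairs.
            if ingredient_id not in self._count_cache:
                self._count_cache[ingredient_id] = \
                    Filtering.n_recipes(self, ingredient_id)
            return self._count_cache[ingredient_id]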
