How to effectively compare and search a list of integers?

I have a database with 1 million objects. Each object has a field "tags" - a set of integers.

For instance:

object1: tags(1,3,4) object2: tags(2) object3: tags(3,4) object4: tags(5) 

etc.

The query parameter is a set for integers, allows you to try q (3,4,5)

 object1 does not match ('1' not in '3,4,5') object2 does not match ('2' not in '3,4,5') object3 matches ('3 and 4' in '3,4,5' ) object4 matches ('5' in '3,4,5' ) 

How to effectively select suitable objects?

+4
source share
7 answers

You make a common mistake in database design by keeping a comma-delimited list of tag identifiers. Not surprisingly, making effective requests against this is a blocker for you.

You need to model the mapping between objects and tags in a separate table.

 CREATE TABLE Tagged ( object_id INT NOT NULL, tag_id INT NOT NULL, PRIMARY KEY (object_id, tag_id), FOREIGN KEY (object_id) REFERENCES Objects(object_id), FOREIGN KEY (tag_id) REFERENCES Tags(tag_id) ); 

Insert one line for each pairing of objects / tags. Of course, this means that you have several lines for each object_id , but that’s good.

You can request all objects with tags 3,4,5:

 SELECT DISTINCT object_id FROM Tagged WHERE tag_id IN (3, 4, 5); 

But this corresponds to object1 that you do not want. You want to exclude objects that have other tags not in 3,4,5.

 SELECT DISTINCT t1.object_id FROM Tagged t1 LEFT OUTER JOIN Tagged t2 ON (t1.object_id = t2.object_id AND t2.tag_id NOT IN (3, 4, 5)) WHERE t1.tag_id IN (3, 4, 5) AND t2.object_id IS NULL; 
+3
source

Given that you are using PostgreSQL, you can use its array data type and its contain / overlap statements.

Of course, this will strongly tie your application to PostgreSQL, which may be undesirable. On the other hand, it can save your coding when you really need it (i.e. when you finally have to transfer it to another database)

Although, given that in Python you have to set the data type for this exact group of operations, using PostgreSQL can be overwhelming (depending on performance requirements)

 >>> a = set([1,2,3]) >>> a set([1, 2, 3]) >>> 1 in a True >>> set([1,2]) in a False >>> set([2,3]) & a set([2, 3]) >>> set([8,9]) & a set([]) >>> set([1,3]) & a set([1, 3]) >>> 
+3
source

If I understood correctly, this is something like:

Post-> posttags <-tags

view of the circuit.

I wonder why you do that?

This is the problem you are running into because you are using an ORM that retrieves data in objects and other lazy loaded related objects.

Like the Post and Tag classes in SQLAlchemy, there is a Post agent with a property called "tags" that can load many Tag objects for a given Post object.

If so, such operations are usually very expensive in ORM and should be performed with SQL Statement ORM support or using direct dbapi like psycopg2. Again, if the number of objects loaded from the request is huge (meaning your 1 million), you need a machine with lots of resources (or maybe nothing at all - a clean ORM is not recommended).

If its not ORM and your tags are stored as (installs), then I think that something is wrong with this scheme.

posttags is a many-to-many relationship, as I see it, and its separate table in itself (which is easily requested), and not a β€œset” in the message table.

+1
source

You did not specify the weather you want to use SQL, or if you are reading data in the application before. Of the sounds of the things you are looking for a code based solution?

In .NET, you must make a class that implements the ICompare interface and write your own method for comparing two values ​​that either return 0 or 1.

0
source

This is the basic set theory. Cross the two sets, and if the result is the same as the original, then the result will "match." Otherwise it is not.

This principle can be applied in many languages. Most of them have libraries for working with sets. You can even do this with SQL.

0
source

It seems that the issubset() sets method is what you are looking for:

 tags(1, 2, 3).issubset(q(1, 2, 3, 4)) 

If both tags and q are subclasses of set . But I agree with other answers that solving this in a database would be a better solution.

0
source

Sorry. It seemed like it was hard for me to explain the problem well :)

The postgresql tag is much more meaningful here than python. A stand-alone TAG table with an IS NULL clause is what I really need.

SQLalchemy is also recommended.

Thanks to everyone.

0
source

All Articles