Optimizing a tag-exclusion query in Postgres

I am building a basic read-only database containing 300,000 documents with approximately 50,000 distinct tags; each document has 15 tags on average. For now, the only query I care about is selecting all documents that have none of the tags from a given set. I'm only interested in the document_id column (no other columns in the result).

My schema is essentially:

 CREATE TABLE documents (
     document_id SERIAL PRIMARY KEY,
     title       TEXT
 );

 CREATE TABLE tags (
     tag_id SERIAL PRIMARY KEY,
     name   TEXT UNIQUE
 );

 CREATE TABLE documents_tags (
     document_id INTEGER REFERENCES documents,
     tag_id      INTEGER REFERENCES tags,
     PRIMARY KEY (document_id, tag_id)
 );

I can do this in Python by precomputing the set of documents for each tag, which reduces the problem to a couple of fast set operations:

 In [17]: %timeit all_docs - (tags_to_docs[12345] | tags_to_docs[7654])
 100 loops, best of 3: 13.7 ms per loop
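For context, the precomputed mapping used above can be built from the documents_tags rows in a single pass. A minimal sketch with made-up sample rows (the names tags_to_docs and all_docs mirror the snippet above; the data is purely illustrative):

```python
from collections import defaultdict

# Hypothetical (document_id, tag_id) rows, e.g. fetched once from documents_tags.
rows = [(1, 12345), (1, 99), (2, 7654), (3, 99), (4, 12345)]

tags_to_docs = defaultdict(set)
all_docs = set()
for document_id, tag_id in rows:
    tags_to_docs[tag_id].add(document_id)
    all_docs.add(document_id)

# Documents carrying none of the excluded tags: set difference against the union.
excluded = tags_to_docs[12345] | tags_to_docs[7654]
result = all_docs - excluded
print(sorted(result))  # -> [3]
```

The per-query work is then just one set union per excluded tag plus one difference, which is why the timing above is so low.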

A direct translation of those operations to Postgres is not nearly as fast:

 stuff=# SELECT document_id AS id FROM documents WHERE document_id NOT IN (
 stuff(#     SELECT documents_tags.document_id AS id FROM documents_tags
 stuff(#     WHERE documents_tags.tag_id IN (12345, 7654)
 stuff(# );
  document_id
 ---------------
 ...
 Time: 201.476 ms
  • Replacing NOT IN with EXCEPT makes it even slower.
  • I have btree indexes on document_id and tag_id in all three tables, plus another one on (document_id, tag_id).
  • The default memory limits for the Postgres process have been raised significantly, so I don't think Postgres is misconfigured.

How can I speed up this query? Is there a way to precompute the mapping the way I do in Python, or am I thinking about this wrong?


Here is the result of EXPLAIN ANALYZE:

 EXPLAIN ANALYZE SELECT document_id AS id FROM documents WHERE document_id NOT IN (
     SELECT documents_tags.document_id AS id FROM documents_tags
     WHERE documents_tags.tag_id IN (12345, 7654)
 );
                                                                      QUERY PLAN
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Seq Scan on documents  (cost=20280.27..38267.57 rows=83212 width=4) (actual time=176.760..300.214 rows=20036 loops=1)
    Filter: (NOT (hashed SubPlan 1))
    Rows Removed by Filter: 146388
    SubPlan 1
      ->  Bitmap Heap Scan on documents_tags  (cost=5344.61..19661.00 rows=247711 width=4) (actual time=32.964..89.514 rows=235093 loops=1)
            Recheck Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
            Heap Blocks: exact=3300
            ->  Bitmap Index Scan on documents_tags__tag_id_index  (cost=0.00..5282.68 rows=247711 width=0) (actual time=32.320..32.320 rows=243230 loops=1)
                  Index Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
  Planning time: 0.117 ms
  Execution time: 303.289 ms
 (11 rows)

 Time: 303.790 ms

The only settings I changed from the default configuration were as follows:

 shared_buffers       = 5GB
 temp_buffers         = 128MB
 work_mem             = 512MB
 effective_cache_size = 16GB

Running Postgres 9.4.5 on a server with 64 GB of RAM.

2 answers

Optimize tuning for read performance

Your memory settings seem reasonable for a 64 GB server, with the possible exception of work_mem = 512MB. That is high. Your queries are not particularly complex and your tables are not that big.

4.5 million rows (300k x 15) in the simple junction table documents_tags should occupy ~156 MB, and the PK another 96 MB. For your query, you typically do not even need to read the whole table, just small parts of the index. For a "mostly read-only" case like yours, you should see index-only scans on the index of the PK. You do not need nearly as much work_mem, which may not matter much, except where you have many concurrent queries. To quote the manual:

... several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value.

Setting work_mem too high can actually hurt performance:

I suggest reducing work_mem to 128 MB or less to avoid possible memory starvation, unless you have other common queries that require more. You can always set it locally for special queries.
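For instance, a per-transaction override with SET LOCAL (the 256MB value here is just an illustration) could look like this:

```sql
BEGIN;
SET LOCAL work_mem = '256MB';  -- only affects the current transaction

SELECT document_id
FROM   documents
WHERE  document_id NOT IN (
   SELECT document_id FROM documents_tags
   WHERE  tag_id IN (12345, 7654)
   );
COMMIT;
```

SET LOCAL reverts automatically at commit or rollback, so the global setting stays untouched.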

There are several other angles for optimizing reading performance:

Main issue: leading index column

All of this may help a little. But the main problem is this:

 PRIMARY KEY (document_id, tag_id)

300k documents, 2 tags to exclude. Ideally, you'd have an index with tag_id as the leading column and document_id as the second. With an index on just (tag_id), you cannot get index-only scans. If this query is your only use case, change your PK as demonstrated below.

Or, possibly even better: create an additional plain index on (tag_id, document_id) if you need both column orders, and drop the other two indexes on documents_tags on just (tag_id) and just (document_id). They offer nothing over the two multicolumn indexes. The remaining 2 indexes (as opposed to 3 before) are smaller and superior in every way.
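A sketch of that alternative. The first DROP uses the index name visible in your EXPLAIN output; the other names are assumptions, so check your actual index names (e.g. with \di in psql) first:

```sql
CREATE INDEX documents_tags_tag_id_document_id_idx
    ON documents_tags (tag_id, document_id);

DROP INDEX IF EXISTS documents_tags__tag_id_index;       -- name taken from your EXPLAIN output
DROP INDEX IF EXISTS documents_tags__document_id_index;  -- hypothetical name
```

This keeps the existing PK on (document_id, tag_id) for lookups by document, while the new index serves lookups by tag.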

While being at it, I also suggest you CLUSTER the table using the new PK, all in one transaction, possibly with some extra maintenance_work_mem locally:

 BEGIN;
 SET LOCAL maintenance_work_mem = '256MB';

 ALTER TABLE documents_tags
     DROP CONSTRAINT documents_tags_pkey
   , ADD  PRIMARY KEY (tag_id, document_id);  -- tag_id first.

 CLUSTER documents_tags USING documents_tags_pkey;
 COMMIT;

Do not forget:

 ANALYZE documents_tags; 

Queries

The query itself is run-of-the-mill. Here are four standard techniques:

NOT IN is, to quote:

Only good for small sets without NULL values

Your use case fits exactly: all columns involved are NOT NULL, and the list of excluded items is very short. Your original query is a hot contender.
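To illustrate why NULL values matter: under SQL's three-valued logic, NOT IN yields no rows as soon as the subquery returns a NULL, because `x NOT IN (..., NULL)` evaluates to UNKNOWN rather than TRUE. A small sketch using Python's built-in sqlite3, which follows the same standard semantics as Postgres here:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# 1 NOT IN (2, 3) is TRUE, so the row is returned.
ok = con.execute("SELECT 1 WHERE 1 NOT IN (2, 3)").fetchall()
print(ok)    # [(1,)]

# With a NULL in the set, 1 NOT IN (2, NULL) is UNKNOWN, so no row is returned.
trap = con.execute("SELECT 1 WHERE 1 NOT IN (2, NULL)").fetchall()
print(trap)  # []
```

Since document_id and tag_id are both NOT NULL in the schema above, this trap cannot bite here.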

NOT EXISTS and LEFT JOIN / IS NULL are always hot contenders. Both were suggested in other answers. A LEFT JOIN must be an actual LEFT [OUTER] JOIN, though.

EXCEPT ALL is shortest, but often not as fast.

1. NOT IN
 SELECT document_id
 FROM   documents d
 WHERE  document_id NOT IN (
    SELECT document_id  -- no need for a column alias, only the value is relevant
    FROM   documents_tags
    WHERE  tag_id IN (12345, 7654)
    );
2. NOT EXISTS
 SELECT document_id
 FROM   documents d
 WHERE  NOT EXISTS (
    SELECT 1
    FROM   documents_tags
    WHERE  document_id = d.document_id
    AND    tag_id IN (12345, 7654)
    );
3. LEFT JOIN / IS NULL
 SELECT d.document_id
 FROM   documents d
 LEFT   JOIN documents_tags dt ON dt.document_id = d.document_id
                              AND dt.tag_id IN (12345, 7654)
 WHERE  dt.document_id IS NULL;
4. EXCEPT ALL
 SELECT document_id
 FROM   documents
 EXCEPT ALL  -- ALL, to keep duplicate rows and make it faster
 SELECT document_id
 FROM   documents_tags
 WHERE  tag_id IN (12345, 7654);


Benchmark

I ran a quick test on my old laptop with 4 GB of RAM and Postgres 9.5.3 to test my theories:

Test setup

 SET random_page_cost = 1.1;
 SET work_mem = '128MB';

 CREATE SCHEMA temp;
 SET search_path = temp, public;

 CREATE TABLE documents (
     document_id serial PRIMARY KEY,
     title       text
 );

 -- CREATE TABLE tags ( ...  -- actually irrelevant for this query

 CREATE TABLE documents_tags (
     document_id integer REFERENCES documents,
     tag_id      integer  -- REFERENCES tags  -- irrelevant for the test
     -- no PK yet, to test the seq scan
     -- it's also faster to create the PK after filling the big table
 );

 INSERT INTO documents (title)
 SELECT 'some dummy title ' || g
 FROM   generate_series(1, 300000) g;

 INSERT INTO documents_tags (document_id, tag_id)
 SELECT i.*
 FROM   documents d
 CROSS  JOIN LATERAL (
    SELECT DISTINCT d.document_id, ceil(random() * 50000)::int
    FROM   generate_series(1, 15)) i;

 ALTER TABLE documents_tags ADD PRIMARY KEY (document_id, tag_id);  -- your current index

 ANALYZE documents_tags;
 ANALYZE documents;

Note that the rows in documents_tags end up physically clustered by document_id due to the way I populated the table, which is likely also the case in your current situation.
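If you want to verify this on your own table, the planner's statistics expose the physical correlation per column (values near 1.0 mean the rows are stored close to the index order; run ANALYZE on the table first):

```sql
SELECT attname, correlation
FROM   pg_stats
WHERE  tablename = 'documents_tags';
```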

Test

3 test runs with each of the 4 queries, best of 5 every time, to exclude caching effects.

Test 1: With documents_tags_pkey as you have it. The index and the physical order of rows are bad for our query.
Test 2: Recreate the PK on (tag_id, document_id) as suggested.
Test 3: CLUSTER on the new PK. Execution times of EXPLAIN ANALYZE in ms:

  time in ms    | Test 1 | Test 2 | Test 3
  1. NOT IN     |    654 |     70 |     71  -- winner!
  2. NOT EXISTS |    684 |    103 |     97
  3. LEFT JOIN  |    685 |     98 |     99
  4. EXCEPT ALL |    833 |    255 |    250

Conclusions

  • The key element is the right index with leading tag_id, for queries with few tag_id and many document_id.
    To be precise, it does not matter that there are more distinct document_id than tag_id values. It could be the other way around, too. B-tree indexes basically perform the same with any order of columns. It is the fact that the most selective predicate in your query filters on tag_id, and that is faster on the leading index column(s).

  • The winning query for few tag_id to exclude is your original with NOT IN.

  • NOT EXISTS and LEFT JOIN / IS NULL result in the same query plan. For more than a couple dozen IDs, I would expect them to scale better than NOT IN, though.

  • In a read-only situation you will see index-only scans exclusively, so the physical order of rows in the table becomes irrelevant. That is why test 3 did not bring any additional improvement.

  • If writes to the table happen and autovacuum cannot keep up, you will see (bitmap) index scans instead. Physical clustering matters for those.


Use an outer join with the tag condition in the join clause, and keep only the rows with no join match; that returns exactly the documents where none of the given tags matched:

 SELECT d.document_id
 FROM   documents d
 LEFT   JOIN documents_tags t ON t.document_id = d.document_id
                             AND t.tag_id IN (12345, 7654)
 WHERE  t.document_id IS NULL;
