PostgreSQL full-text search performance unacceptable when ordering by ts_rank_cd

In my PostgreSQL 9.3 database, I have a table called articles. It looks something like this:

    +------------+--------------------------------------------------------------+
    | Name       | Information                                                  |
    +------------+--------------------------------------------------------------+
    | id         | Auto-increment integer ID                                    |
    | title      | text                                                         |
    | category   | character varying(255) with index                            |
    | keywords   | String with title and extra words used for indexing          |
    | tsv        | Trigger updates w/ tsvector_update_trigger based on keywords |
    +------------+--------------------------------------------------------------+

There are more columns in the table, but I don't think they are critical to the question. The table's total size is 94 GB, and it holds about 29 million rows.
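
For reference, tsv is indexed and kept up to date roughly like this (a simplified sketch; the trigger name and the text search configuration are assumptions, and the index could be GiST rather than GIN):

    -- article_search_idx is the index name from the EXPLAIN output below;
    -- GIN is assumed here, but it could equally be GiST
    CREATE INDEX article_search_idx ON articles USING gin (tsv);

    -- keep tsv in sync with keywords (trigger name and 'english'
    -- configuration are assumptions)
    CREATE TRIGGER articles_tsv_update
        BEFORE INSERT OR UPDATE ON articles
        FOR EACH ROW
        EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', keywords);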

I am trying to run a keyword search query over a subset of 23M article rows. For this I use the following query:

    SELECT title, id
    FROM articles, plainto_tsquery('dog') AS q
    WHERE (tsv @@ q) AND category = 'animal'
    ORDER BY ts_rank_cd(tsv, q) DESC
    LIMIT 5

The problem is that it appears to run ts_rank_cd on every matching row before sorting, so this query is very slow, about 2-3 minutes. I did a lot of reading and found suggestions to wrap the search in an outer query so that ranking is only applied to the rows that were actually found, like so:

    SELECT * FROM (
        SELECT title, id, tsv
        FROM articles, plainto_tsquery('dog') AS q
        WHERE (tsv @@ q) AND category = 'animal'
    ) AS t1
    ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('dog')) DESC
    LIMIT 5;

However, since the search term is so short, there are about 450K matches in the subset, so this still takes a long time. It may be a little faster, but I need it to be near-instantaneous.

Question: is there anything I can do to speed up this search while keeping it inside PostgreSQL?

I like having this logic in the database, since it means I don't need additional servers or configuration for something like Solr or Elasticsearch. For example, would moving to a larger database instance help? Or is that not cost-effective compared to porting this logic to a dedicated Elasticsearch instance?


The EXPLAIN output for the first query is as follows:

    Limit  (cost=567539.41..567539.42 rows=5 width=465)
      ->  Sort  (cost=567539.41..567853.33 rows=125568 width=465)
            Sort Key: (ts_rank_cd(articles.tsv, q.q))
            ->  Nested Loop  (cost=1769.27..565453.77 rows=125568 width=465)
                  ->  Function Scan on plainto_tsquery q  (cost=0.00..0.01 rows=1 width=32)
                  ->  Bitmap Heap Scan on articles  (cost=1769.27..563884.17 rows=125567 width=433)
                        Recheck Cond: (tsv @@ q.q)
                        Filter: ((category)::text = 'animal'::text)
                        ->  Bitmap Index Scan on article_search_idx  (cost=0.00..1737.87 rows=163983 width=0)
                              Index Cond: (tsv @@ q.q)

And for the second query:

    Aggregate  (cost=565453.77..565453.78 rows=1 width=0)
      ->  Nested Loop  (cost=1769.27..565139.85 rows=125568 width=0)
            ->  Function Scan on plainto_tsquery q  (cost=0.00..0.01 rows=1 width=32)
            ->  Bitmap Heap Scan on articles  (cost=1769.27..563884.17 rows=125567 width=351)
                  Recheck Cond: (tsv @@ q.q)
                  Filter: ((category)::text = 'animal'::text)
                  ->  Bitmap Index Scan on article_search_idx  (cost=0.00..1737.87 rows=163983 width=0)
                        Index Cond: (tsv @@ q.q)
Tags: sql, postgresql, full-text-search
3 answers

You simply cannot use an index for ts_rank_cd, because the rank value it produces depends on your search query. Therefore, the rank of every row in the result set has to be computed each time you run the query, before the result set can be sorted and limited by that value.

If your search logic allows it, you can avoid this bottleneck: precompute a relevance value for each record once, create an index on it, and use it as your sort column instead of computing the rank on every query.
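
A minimal sketch of that approach, assuming a query-independent relevance score (popularity, recency, or similar) is acceptable for your use case; the static_rank column and index name are hypothetical:

    -- precomputed, query-independent relevance score
    ALTER TABLE articles ADD COLUMN static_rank real DEFAULT 0;
    CREATE INDEX articles_static_rank_idx ON articles (static_rank DESC);

    -- the full-text index still narrows the candidate set; the sort now
    -- reads the precomputed column instead of calling ts_rank_cd per row
    SELECT title, id
    FROM articles, plainto_tsquery('dog') AS q
    WHERE tsv @@ q
      AND category = 'animal'
    ORDER BY static_rank DESC
    LIMIT 5;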

Even though you said you would rather not, I suggest looking into a dedicated search engine that can work alongside PostgreSQL, such as Sphinx. Its default BM25 ranking should work fine, and you can also set per-field weights if you need to (http://sphinxsearch.com/docs/current.html#api-func-setfieldweights).

Update: this is also indicated in the documentation:

"Ranking can be expensive, as it requires consultation with TSvector of each relevant document, which may be related to I / O and therefore slow. Unfortunately, it is almost impossible to avoid, because practical queries often lead to a lot of matches."

See http://www.postgresql.org/docs/8.3/static/textsearch-controls.html


Maybe... A few things to try:

- your category filter can be optimized with a HASH index;
- your tsv match can be optimized with a GIN index;
- if category is a (rather small) fixed set of values, consider using an enum for it instead (or at least something other than varchar).

(I also wonder whether the weights really matter in your case.) For example:

    SELECT *
    FROM (
        SELECT *, ts_rank_cd(sub.tsv, plainto_tsquery('dog')) AS rank
        FROM (
            SELECT title, id, tsv
            FROM articles
            WHERE category = 'animal'
        ) AS sub
    ) AS ranked, plainto_tsquery('dog') AS q
    WHERE (tsv @@ q)
    ORDER BY rank DESC
    LIMIT 5;
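
The HASH and GIN indexes suggested above could be created along these lines (a sketch; the index names and the extra enum value are hypothetical). Note that hash indexes are not WAL-logged before PostgreSQL 10, so they are risky on 9.3:

    -- hash index for the equality filter on category
    CREATE INDEX articles_category_hash_idx ON articles USING hash (category);

    -- GIN index for the tsv @@ tsquery match
    CREATE INDEX articles_tsv_gin_idx ON articles USING gin (tsv);

    -- enum alternative for a small, fixed category set
    -- ('plant' is a made-up example value)
    CREATE TYPE article_category AS ENUM ('animal', 'plant');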

You should index the category column, and you can try increasing the working memory for this particular query to help the bitmap heap scan, if it is not the category filter that is slowing things down:

SET LOCAL work_mem = '64MB';
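
Note that SET LOCAL only takes effect inside a transaction block, so it would wrap the original query roughly like this:

    BEGIN;
    SET LOCAL work_mem = '64MB';  -- applies only until COMMIT/ROLLBACK
    SELECT title, id
    FROM articles, plainto_tsquery('dog') AS q
    WHERE tsv @@ q AND category = 'animal'
    ORDER BY ts_rank_cd(tsv, q) DESC
    LIMIT 5;
    COMMIT;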

Keep in mind that this can significantly increase memory usage if many instances of the query run concurrently.

