In my PostgreSQL 9.3 database, I have a table called articles. It looks something like this:
+------------+--------------------------------------------------------------+
| Name       | Information                                                  |
+------------+--------------------------------------------------------------+
| id         | Auto increment integer ID                                    |
| title      | text                                                         |
| category   | character varying(255) with index                            |
| keywords   | String with title and extra words used for indexing          |
| tsv        | Trigger updates w/ tsvector_update_trigger based on keywords |
+------------+--------------------------------------------------------------+
There are more columns in the table, but I don't think they are critical to the question. The table is 94 GB in total and has about 29M rows.
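Roughly, the setup corresponds to something like the following. This is only a sketch reconstructed from the description above: the column types for id, keywords and tsv, the GIN index type, the names articles_category_idx and articles_tsv_update, and the 'pg_catalog.english' configuration are all assumptions for illustration. Only article_search_idx and tsvector_update_trigger are names from the real setup.

```sql
-- Sketch of the table as described; other columns omitted.
CREATE TABLE articles (
    id       serial PRIMARY KEY,       -- assumed: auto-increment integer
    title    text,
    category character varying(255),
    keywords text,                     -- assumed type: title plus extra words
    tsv      tsvector                  -- maintained by the trigger below
);

-- Index on category (hypothetical name).
CREATE INDEX articles_category_idx ON articles (category);

-- Full-text index; the name appears in the EXPLAIN output, GIN is assumed.
CREATE INDEX article_search_idx ON articles USING gin (tsv);

-- Keep tsv in sync with keywords (hypothetical trigger name,
-- assumed text search configuration).
CREATE TRIGGER articles_tsv_update
    BEFORE INSERT OR UPDATE ON articles
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', keywords);
```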
I am trying to run a keyword search on a subset of 23M article rows. For this I use the following query:
SELECT title, id
FROM articles, plainto_tsquery('dog') AS q
WHERE (tsv @@ q)
  AND category = 'animal'
ORDER BY ts_rank_cd(tsv, q) DESC
LIMIT 5
The problem is that it appears to run ts_rank_cd on every matching row before sorting them, so the query is very slow: about 2-3 minutes. I read a lot to try to find a solution, and one suggestion was to wrap the search query in another query so that the ranking is applied only to the rows found, as follows:
SELECT *
FROM (
    SELECT title, id, tsv
    FROM articles, plainto_tsquery('dog') AS q
    WHERE (tsv @@ q)
      AND category = 'animal'
) AS t1
ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('dog')) DESC
LIMIT 5;
However, because the search term is so short, there are 450K matching rows in the subset. So it still takes a long time. It may be a little faster, but I need it to be almost instantaneous.
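To check where the time actually goes (fetching the matching rows from the heap versus computing ts_rank_cd and sorting), something like the following could be run. It is only a diagnostic sketch that reuses the first query above. EXPLAIN (ANALYZE, BUFFERS) executes the query and reports real timings and buffer usage per plan node, and the count confirms how many rows have to be ranked.

```sql
-- Actual per-node timings and buffer hits/reads for the ranked search.
-- Note: ANALYZE really executes the query, so this takes as long as the query itself.
EXPLAIN (ANALYZE, BUFFERS)
SELECT title, id
FROM articles, plainto_tsquery('dog') AS q
WHERE (tsv @@ q)
  AND category = 'animal'
ORDER BY ts_rank_cd(tsv, q) DESC
LIMIT 5;

-- Number of rows that match the term within the category,
-- i.e. how many rows ts_rank_cd has to be evaluated on.
SELECT count(*)
FROM articles, plainto_tsquery('dog') AS q
WHERE (tsv @@ q)
  AND category = 'animal';
```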
Question: is there anything I can do to keep this search functionality in PostgreSQL?
I like having this logic in the database, since it means I don't need additional servers or configuration for something like Solr or Elasticsearch. For example, would increasing the size of the database instance help? Or would that not be cost-effective compared to moving this logic to a dedicated Elasticsearch instance?
The EXPLAIN output for the first query is as follows:
Limit  (cost=567539.41..567539.42 rows=5 width=465)
  ->  Sort  (cost=567539.41..567853.33 rows=125568 width=465)
        Sort Key: (ts_rank_cd(articles.tsv, q.q))
        ->  Nested Loop  (cost=1769.27..565453.77 rows=125568 width=465)
              ->  Function Scan on plainto_tsquery q  (cost=0.00..0.01 rows=1 width=32)
              ->  Bitmap Heap Scan on articles  (cost=1769.27..563884.17 rows=125567 width=433)
                    Recheck Cond: (tsv @@ q.q)
                    Filter: ((category)::text = 'animal'::text)
                    ->  Bitmap Index Scan on article_search_idx  (cost=0.00..1737.87 rows=163983 width=0)
                          Index Cond: (tsv @@ q.q)
And for the second query:
Aggregate  (cost=565453.77..565453.78 rows=1 width=0)
  ->  Nested Loop  (cost=1769.27..565139.85 rows=125568 width=0)
        ->  Function Scan on plainto_tsquery q  (cost=0.00..0.01 rows=1 width=32)
        ->  Bitmap Heap Scan on articles  (cost=1769.27..563884.17 rows=125567 width=351)
              Recheck Cond: (tsv @@ q.q)
              Filter: ((category)::text = 'animal'::text)
              ->  Bitmap Index Scan on article_search_idx  (cost=0.00..1737.87 rows=163983 width=0)
                    Index Cond: (tsv @@ q.q)