How to index a column of a string array for pg_trgm `` term '% ANY (array_column) `query?

Question

How to index a column of a string array for pg_trgm `` term '% ANY (array_column) `query?

I tried the regular Postgres gin index, as well as the pg_trgm gin_trgm_ops and gist_trgm_ops (using this workaround: https://stackoverflow.com/a/146609/ ).

However, EXPLAIN in my query 'term' % ANY (array_column) shows a sequential scan even after running set enable_seqscan = off; .

(For my use case, I need partial matches, and pg_trgm seems much better than full-text search because my data is not linguistic. The quality of my pg_trgm results is very good.)

My use case is rows with an array column containing a combination of first names and full names (delimited by delimiters). The search term can be the first, last or full name (delimited by delimiters). The results of the pg_trgm% operator are case insensitive and appear to have weight matches at the beginning and end of the names in the column of the array, which is great for full names because it finds matching names and surnames, but not necessarily middle names.

https://github.com/theirix/parray_gin is promising, but old and does not claim to support Postgres newer than 9.2.

+7

postgresql

Gabe Kopley Sep 13 '16 at 23:52

source share

2 answers

I created a test table and a function f that converts only text.

 CREATE OR REPLACE FUNCTION getNArray(el text[], count int) RETURNS text[] AS $$ SELECT array_agg(el[random()*(array_length(el,1)-1)+1]) FROM generate_series(1,count) g(i) $$ VOLATILE LANGUAGE SQL; DROP TABLE testGin; CREATE TABLE testGin(id serial PRIMARY KEY, array_column text[]); WITH t(ray) AS( SELECT (string_to_array(pg_read_file('words.list')::text,E'\n')) ) INSERT INTO testGin(array_column) SELECT getNArray(T.ray, 4) FROM T, generate_series(1,100000);

Broadcast Function:

 CREATE OR REPLACE FUNCTION f(arr text[]) RETURNS text AS $$ SELECT arr::text LANGUAGE SQL IMMUTABLE; CREATE INDEX ON testGin USING GIN(f(array_column) gin_trgm_ops);

Use with ILIKE:

 postgres=# EXPLAIN SELECT id FROM testgin WHERE f(array_column) ilike '%test%'; QUERY PLAN ------------------------------------------------------------------------------- Bitmap Heap Scan on testgin (cost=34.82..1669.63 rows=880 width=4) Recheck Cond: (f(array_column) ~~* '%test%'::text) -> Bitmap Index Scan on testgin_f_idx (cost=0.00..34.60 rows=880 width=0) Index Cond: (f(array_column) ~~* '%test%'::text) (4 rows)

If you want a more accurate search by including the% operator, you can do this below. This will scan the index and then apply your filter:

 postgres=# explain SELECT id, array_column FROM testgin WHERE 'response' % ANY (array_column) and f(array_column) ~ 'response'; QUERY PLAN ------------------------------------------------------------------------------ Bitmap Heap Scan on testgin (cost=76.08..120.38 rows=1 width=85) Recheck Cond: (f(array_column) ~ 'response'::text) Filter: ('response'::text % ANY (array_column)) -> Bitmap Index Scan on testgin_f_idx (cost=0.00..76.08 rows=11 width=0) Index Cond: (f(array_column) ~ 'response'::text) (5 rows)

+3

3manuek 20 sept '16 at 1:06

source share

klin · Accepted Answer · 2016-09-19T22:22:14+0000

Why it does not work

The index type (i.e. the operator class) gin_trgm_ops based on the % operator, which works with two text arguments:

 CREATE OPERATOR trgm.%( PROCEDURE = trgm.similarity_op, LEFTARG = text, RIGHTARG = text, COMMUTATOR = %, RESTRICT = contsel, JOIN = contjoinsel);

You cannot use gin_trgm_ops for arrays. The index defined for an array column will never work with any(array[...]) , because the individual elements of the arrays are not indexed. To index the array, you need another type of index, namely the index of the gin array.

Fortunately, the gin_trgm_ops index was so thoughtful that it works with like and ilike , which can be used as an alternative solution (an example is described below).

Test table

It has two columns (id serial primary key, names text[]) and contains 100,000 Latin sentences, divided into array elements.

 select count(*), sum(cardinality(names))::int words from test; count | words --------+--------- 100000 | 1799389 select * from test limit 1; id | names ----+--------------------------------------------------------------------------------------------------------------- 1 | {fugiat,odio,aut,quis,dolorem,exercitationem,fugiat,voluptates,facere,error,debitis,ut,nam,et,voluptatem,eum}

A search for the word praesent gives 7051 lines in 2400 ms:

 explain analyse select count(*) from test where 'praesent' % any(names); QUERY PLAN --------------------------------------------------------------------------------------------------------------- Aggregate (cost=5479.49..5479.50 rows=1 width=0) (actual time=2400.866..2400.866 rows=1 loops=1) -> Seq Scan on test (cost=0.00..5477.00 rows=996 width=0) (actual time=1.464..2400.271 rows=7051 loops=1) Filter: ('praesent'::text % ANY (names)) Rows Removed by Filter: 92949 Planning time: 1.038 ms Execution time: 2400.916 ms

Materialized view

One solution is to normalize the model associated with creating a new table with one name in one row. Such a restructuring can be difficult to implement, and sometimes impossible due to existing queries, views, functions, or other dependencies. A similar effect can be achieved without changing the structure of the table using a materialized view.

 create materialized view test_names as select id, name, name_id from test cross join unnest(names) with ordinality u(name, name_id) with data;

With ordinality not required, but it can be useful when aggregating names in the same order as in the main table. The test_names gives the same results as the main table, at the same time.

After creating the runtime, the index decreases several times:

 create index on test_names using gin (name gin_trgm_ops); explain analyse select count(distinct id) from test_names where 'praesent' % name QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=4888.89..4888.90 rows=1 width=4) (actual time=56.045..56.045 rows=1 loops=1) -> Bitmap Heap Scan on test_names (cost=141.95..4884.39 rows=1799 width=4) (actual time=10.513..54.987 rows=7230 loops=1) Recheck Cond: ('praesent'::text % name) Rows Removed by Index Recheck: 7219 Heap Blocks: exact=8122 -> Bitmap Index Scan on test_names_name_idx (cost=0.00..141.50 rows=1799 width=0) (actual time=9.512..9.512 rows=14449 loops=1) Index Cond: ('praesent'::text % name) Planning time: 2.990 ms Execution time: 56.521 ms

The solution has several disadvantages. Because the view is materialized, the data is stored twice in the database. Remember to update the view after changes to the main table. And queries can be more complex due to the need to join the view to the main table.

Using `ilike`

We can use ilike on arrays represented as text. We need an immutable function to create the index in the array as a whole:

 create function text(text[]) returns text language sql immutable as $$ select $1::text $$ create index on test using gin (text(names) gin_trgm_ops);

and use the function in the queries:

 explain analyse select count(*) from test where text(names) ilike '%praesent%' QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=117.06..117.07 rows=1 width=0) (actual time=60.585..60.585 rows=1 loops=1) -> Bitmap Heap Scan on test (cost=76.08..117.03 rows=10 width=0) (actual time=2.560..60.161 rows=7051 loops=1) Recheck Cond: (text(names) ~~* '%praesent%'::text) Heap Blocks: exact=2899 -> Bitmap Index Scan on test_text_idx (cost=0.00..76.08 rows=10 width=0) (actual time=2.160..2.160 rows=7051 loops=1) Index Cond: (text(names) ~~* '%praesent%'::text) Planning time: 3.301 ms Execution time: 60.876 ms

60 vs 2400 ms, a pretty nice result without the need to create additional relationships.

This solution seems simpler and requires less work, provided that ilike , which is a less accurate tool than the trgm % operator, is sufficient.

Why should we use ilike instead of % for whole arrays as text? The similarity largely depends on the length of the texts. It is very difficult to choose a suitable limit for searching a word in long texts of various lengths. For example. with limit = 0.3 we have the results:

 with data(txt) as ( values ('praesentium,distinctio,modi,nulla,commodi,tempore'), ('praesentium,distinctio,modi,nulla,commodi'), ('praesentium,distinctio,modi,nulla'), ('praesentium,distinctio,modi'), ('praesentium,distinctio'), ('praesentium') ) select length(txt), similarity('praesent', txt), 'praesent' % txt "matched?" from data; length | similarity | matched? --------+------------+---------- 49 | 0.166667 | f <--! 41 | 0.2 | f <--! 33 | 0.228571 | f <--! 27 | 0.275862 | f <--! 22 | 0.333333 | t 11 | 0.615385 | t (6 rows)

How to index a column of a string array for pg_trgm `` term '% ANY (array_column) `query?

Why it does not work

Test table

Materialized view

Using ilike

More articles:

Using `ilike`