Indexing and querying high-dimensional data in PostgreSQL

I want to index data with a high number of dimensions (say, 128-dimensional vectors of integers in the range [0, 254]):

| id | vector            |
|----|-------------------|
| 1  | { 1, 0, ..., 254} |
| 2  | { 2, 128, ..., 1} |
| .  | { 1, 0, ..., 252} |
| n  | { 1, 2, ..., 251} |

I saw that PostGIS implements R-trees. Can I use those trees in PostGIS to index and query multidimensional vectors in Postgres?

I also saw that there is an index implementation for int arrays.

Now I have some questions about how to implement this.
Can I do a KNN search and a radius search on an integer array? Perhaps I would also have to define my own distance function. Is that possible? I want to use the Manhattan distance (taxicab / city-block distance) for my queries.

I could also represent my vector as a string with the pattern v1;v2;...;vn. Would that help with searching?

For example, if I had these two rows:

    1;2;1;1
    1;3;2;2

The resulting distance between these two rows should be 3 (|1-1| + |2-3| + |1-2| + |1-2| = 0 + 1 + 1 + 1 = 3).
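
For illustration (this is my own sketch, not part of the original question), such a Manhattan-distance helper over integer arrays could look like this; the name manhattan_distance is hypothetical and the function assumes both arrays have the same length:

    -- Hypothetical helper: Manhattan (taxicab) distance between two
    -- equal-length integer arrays, i.e. the sum of absolute differences.
    create function manhattan_distance(a int[], b int[]) returns int
    language sql immutable as $$
        select sum(abs(x - y))::int
        from unnest(a, b) as t(x, y);
    $$;

    -- The example above: returns 3.
    select manhattan_distance(ARRAY[1,2,1,1], ARRAY[1,3,2,2]);

A plain function like this is evaluated row by row and cannot use an index, which is why the answers below reach for the cube extension.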

2 answers

Probably the best choice would be the cube extension, since your unit of interest is not an individual integer but the complete vector.

Cube supports GiST indexing, and Postgres 9.6 also adds KNN indexing on cubes, supporting the Euclidean, taxicab (aka Manhattan) and Chebyshev distances.
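
As a quick illustration of those operators (a sketch of mine based on the cube documentation, not part of the original answer):

    -- Distance operators of the cube extension in 9.6:
    -- <-> Euclidean, <#> taxicab (Manhattan), <=> Chebyshev
    select '(1,2,1,1)'::cube <-> '(1,3,2,2)'::cube as euclidean,  -- sqrt(3)
           '(1,2,1,1)'::cube <#> '(1,3,2,2)'::cube as taxicab,    -- 3
           '(1,2,1,1)'::cube <=> '(1,3,2,2)'::cube as chebyshev;  -- 1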

It is a bit annoying that 9.6 is still in development, but backporting the cube extension patch to 9.5 is no problem, and I say that from experience.

Hopefully 128 dimensions will still allow meaningful results.

How to do it?

First, create an example table:

    create extension cube;

    create table vectors (id serial, vector cube);

Fill the table with sample data:

    insert into vectors
    select id, cube(ARRAY[round(random()*1000),
                          round(random()*1000),
                          round(random()*1000),
                          round(random()*1000),
                          round(random()*1000),
                          round(random()*1000),
                          round(random()*1000),
                          round(random()*1000)])
    from generate_series(1, 2000000) id;
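
The question is about 128 dimensions, not 8. Constructing such a cube does not require spelling out every element; here is a sketch of mine that builds a single 128-dimensional cube with random coordinates in [0, 254] (it only works once the 100-dimension limit discussed in the second answer has been raised):

    -- Sketch: one 128-dimensional cube with random coordinates in [0, 254].
    -- Requires the cube extension's dimension limit to be raised beyond the
    -- default 100 (see the second answer below).
    select cube(ARRAY(select round(random() * 254)
                      from generate_series(1, 128)));

The same expression can be used wherever a cube value is needed, for example as the query point in an ORDER BY ... <#> ... clause.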

Then try selecting:

    explain analyze SELECT * from vectors
    order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc
    limit 10;

                                                               QUERY PLAN
    --------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=123352.07..123352.09 rows=10 width=76) (actual time=1705.499..1705.501 rows=10 loops=1)
       ->  Sort  (cost=123352.07..129852.07 rows=2600000 width=76) (actual time=1705.496..1705.497 rows=10 loops=1)
             Sort Key: (('(966, 82, 765, 343, 600, 718, 338, 505)'::cube <#> vector))
             Sort Method: top-N heapsort  Memory: 26kB
             ->  Seq Scan on vectors  (cost=0.00..67167.00 rows=2600000 width=76) (actual time=0.038..998.864 rows=2600000 loops=1)
     Planning time: 0.172 ms
     Execution time: 1705.541 ms
    (7 rows)

We need to create an index:

    create index vectors_vector_idx on vectors using gist (vector);

Does it help?

    explain analyze SELECT * from vectors
    order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc
    limit 10;

    --------------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=0.41..1.93 rows=10 width=76) (actual time=41.339..143.915 rows=10 loops=1)
       ->  Index Scan using vectors_vector_idx on vectors  (cost=0.41..393704.41 rows=2600000 width=76) (actual time=41.336..143.902 rows=10 loops=1)
             Order By: (vector <#> '(966, 82, 765, 343, 600, 718, 338, 505)'::cube)
     Planning time: 0.146 ms
     Execution time: 145.474 ms
    (5 rows)

With 8 dimensions, it helps.
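
The question also asked about a radius search. The original answer does not cover that, but one possible approach (a sketch of mine, with an example radius of 50) is to combine cube_enlarge, which grows the query point into a bounding box the GiST index can use via the containment operator <@, with an exact taxicab-distance filter:

    -- Sketch: all vectors within taxicab distance 50 of a query point.
    -- The bounding box from cube_enlarge lets the GiST index prune candidates;
    -- the <#> condition then applies the exact distance cut-off.
    SELECT *
    from vectors
    where vector <@ cube_enlarge(cube(ARRAY[966,82,765,343,600,718,338,505]), 50, 8)
      and cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector <= 50;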


(Adding to the accepted answer)

For people who want more than 100 dimensions, be careful: the cube extension is limited to 100 dimensions.

The tricky part is that Postgres lets you create cubes with more than 100 dimensions just fine; it is restoring a backup that fails, so you find out at the worst possible moment.

As suggested in the documentation, I patched the cube extension to support more dimensions (the limit is the CUBE_MAX_DIM constant in cubedata.h). I made a Docker image for it, and you can look at the Dockerfile in the GitHub repo to see how to do it yourself.

