Why is Solr so much faster than Postgres?

I recently switched from Postgres to Solr and saw a ~50x speedup in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Find all vehicles with mileage < 50,000, $5,000 < price < $10,000, make = Mazda..."
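
In SQL terms, the queries look something like this (column names simplified):

 -- Typical search: several range conditions plus an equality test.
 SELECT *
   FROM vehicles
  WHERE mileage < 50000
    AND price > 5000 AND price < 10000
    AND make = 'Mazda';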

I created indexes on all the relevant columns in Postgres, so it should be a fairly fair comparison. Looking at the query plan in Postgres, though, it still just used a single index and then scanned (I assume because it couldn't make use of all the different indexes).

As I understand it, Postgres and Solr use vaguely similar data structures (B-trees), and both cache data in memory. So I'm wondering where such a large performance difference comes from.

What differences in architecture would explain this?

+59
performance postgresql rdbms lucene solr
Apr 7 '12 at 8:40
5 answers

First, Solr does not use B-trees. A Lucene index (Lucene being the underlying library that Solr uses) is made of read-only segments. For each segment, Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted. Looking up a term in this dictionary is done with a binary search, so the cost of a single-term lookup is O(log(t)), where t is the number of terms. By contrast, the same operation with a standard RDBMS index costs O(log(d)), where d is the number of documents. When many documents share the same value for a given field, this can be a big win.

Moreover, Lucene committer Uwe Schindler added support for very performant numeric range queries a few years ago. For every value of a numeric field, Lucene stores several values at different precisions, which allows it to run range queries very efficiently. Since your use case seems to lean heavily on numeric range queries, this may explain why Solr is so much faster. (For more information, read the javadocs, which are very interesting and give links to relevant research papers.)

But Solr can only do this because it does not have all the constraints that an RDBMS has. For example, Solr is very bad at updating a single document at a time (it prefers batch updates).

+120
Apr 7 '12

You didn't say much about what you did to tune your PostgreSQL instance or your queries. It's not unusual to see a 50x speedup on a PostgreSQL query through tuning and/or restating the query in a form that optimizes better.

Just this week there was a report at work which someone had written using Java and multiple queries in a way that, based on how far it had gotten in four hours, was going to take roughly a month to complete. (It needed to hit five different tables, each with hundreds of millions of rows.) I rewrote it using several CTEs and a window function so that it ran in less than ten minutes and produced the desired results straight from the query. That's a 4400x speedup.
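
I can't share the actual report, but the general shape of the rewrite was to stage the expensive scans once in CTEs and replace repeated per-row queries with a window function. A made-up sketch of that pattern (table and column names are hypothetical):

 -- Hypothetical sketch of the pattern, not the actual report query.
 WITH recent_events AS (
     SELECT account_id, event_time, amount
       FROM events              -- big table scanned once here,
      WHERE event_time >= DATE '2012-01-01'  -- not once per row
 ), ranked AS (
     SELECT account_id, event_time, amount,
            row_number() OVER (PARTITION BY account_id
                               ORDER BY event_time DESC) AS rn
       FROM recent_events
 )
 SELECT account_id, event_time, amount
   FROM ranked
  WHERE rn = 1;                 -- latest event per account, no correlated subquery

The point is that each big table is read once, and the ranking work happens in a single pass instead of a query per row.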

Perhaps the best answer to your question has nothing to do with the technical details of how searches can be done in each product, but more to do with ease of use for your particular use case. Clearly you were able to find a fast way to search with Solr with less trouble than with PostgreSQL, and it may not come down to anything more than that.

I'm including a short example of how a text search for multiple criteria might be done in PostgreSQL, and how a couple of small tweaks can make a big performance difference. To keep it quick and simple I'm just loading War and Peace in text form into a test database, with each "document" being a single text line. Similar techniques can be used for arbitrary fields with hstore or JSON columns, if the data must be loosely defined. Where there are separate columns with their own indexes, the benefits of using indexes tend to be much greater.
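
(As an aside, the loosely-defined variant might look something like the following hstore sketch; the listings table and its keys are invented for illustration, so treat this as a pattern rather than something I benchmarked.)

 -- Hypothetical: arbitrary key/value attributes with a GIN index.
 CREATE EXTENSION IF NOT EXISTS hstore;

 CREATE TABLE listings (
     id    serial PRIMARY KEY,
     attrs hstore NOT NULL
 );

 CREATE INDEX listings_attrs ON listings USING gin (attrs);

 -- Containment queries like this one can use the GIN index.
 SELECT * FROM listings WHERE attrs @> 'make=>Mazda'::hstore;

Now, the setup for the main example: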

 -- Create the table.
 -- In reality, I would probably make tsv NOT NULL,
 -- but I'm keeping the example simple...
 CREATE TABLE war_and_peace (
     lineno serial PRIMARY KEY,
     linetext text NOT NULL,
     tsv tsvector
 );

 -- Load from downloaded data into database.
 COPY war_and_peace (linetext)
   FROM '/home/kgrittn/Downloads/war-and-peace.txt';

 -- "Digest" data to lexemes.
 UPDATE war_and_peace
   SET tsv = to_tsvector('english', linetext);

 -- Index the lexemes using GiST.
 -- To use GIN just replace "gist" below with "gin".
 CREATE INDEX war_and_peace_tsv ON war_and_peace
   USING gist (tsv);

 -- Make sure the database has statistics.
 VACUUM ANALYZE war_and_peace;

After setting up the indexing, I show a few searches with row counts and timings under both index types:

 -- Find lines with "gentlemen".
 EXPLAIN ANALYZE
 SELECT * FROM war_and_peace
   WHERE tsv @@ to_tsquery('english', 'gentlemen');

84 rows, gist: 2.006 ms, gin: 0.194 ms

 -- Find lines with "ladies".
 EXPLAIN ANALYZE
 SELECT * FROM war_and_peace
   WHERE tsv @@ to_tsquery('english', 'ladies');

184 rows, gist: 3.549 ms, gin: 0.328 ms

 -- Find lines with "ladies" and "gentlemen".
 EXPLAIN ANALYZE
 SELECT * FROM war_and_peace
   WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen');

1 row, gist: 0.971 ms, gin: 0.104 ms

Now, since the GIN index was about 10 times faster than the GiST index, you may wonder why anybody would use GiST for indexing text data. The answer is that GiST is generally faster to maintain; so if your text data is highly volatile, the GiST index might win on overall load, while the GIN index wins if you are only interested in search time or in a read-mostly workload.
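
(As a side note, GIN also has a fastupdate storage parameter, which buffers index changes in a pending list to make maintenance cheaper at some cost to search speed. I didn't benchmark it here, so check the documentation for your release before relying on it.)

 -- GIN maintenance/search tradeoff knob (assumes a release with
 -- fastupdate support; "on" buffers changes in a pending list).
 CREATE INDEX war_and_peace_tsv_gin ON war_and_peace
   USING gin (tsv) WITH (fastupdate = on);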

Without an index, the above queries range from 17.943 ms to 23.397 ms, since they must scan the entire table and check for a match on each row.

The GIN-indexed search for rows with both "ladies" and "gentlemen" is more than 172 times faster than the table scan in exactly the same database. Obviously the benefits of indexing would be more dramatic with bigger documents than were used for this test.

The digesting is, of course, a one-time thing. With a trigger to maintain the tsv column, any changes made would be instantly searchable without redoing any of the setup.
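
For example, the tsvector_update_trigger function that ships with PostgreSQL's text search support can do that maintenance against the example table:

 -- Keep tsv current on every insert or update using the built-in
 -- trigger function; column names match the example table above.
 CREATE TRIGGER war_and_peace_tsv_update
   BEFORE INSERT OR UPDATE ON war_and_peace
   FOR EACH ROW EXECUTE PROCEDURE
     tsvector_update_trigger(tsv, 'pg_catalog.english', linetext);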

With a slow PostgreSQL query, if you show the table structure (including indexes), the problem query, and the output of running EXPLAIN ANALYZE on the query, someone can almost always spot the problem and suggest how to make it run faster.




UPDATE (December 9 '16)

I didn't note above which version I used to get the earlier timings, but based on the date it was probably the 9.2 major release. I just happened across this old thread and tried it again on the same hardware using version 9.6.1, to see whether any of the intervening performance tuning helps with this example. The queries for a single argument only gained about 2% in performance, but searching for lines with both "ladies" and "gentlemen" roughly doubled in speed, to 0.053 ms (i.e., 53 microseconds), when using the GIN (inverted) index.

+35
Apr 7 '12

The biggest difference is that a Lucene/Solr index is like a single-table database without any support for relational queries (JOINs). Remember that an index usually exists only to support search and is not the authoritative source of the data. So your database may be in third normal form, but the index will be completely denormalized and contain mostly just the data that needs to be searched.
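
In RDBMS terms, it's as if you had flattened your normalized schema into one wide, search-only structure. A hypothetical sketch in SQL (the tables are invented, and materialized views need PostgreSQL 9.3 or later; a plain table refreshed in batches amounts to the same thing):

 -- Hypothetical: a denormalized, search-only copy of normalized data,
 -- roughly analogous to what a Lucene/Solr index stores.
 CREATE MATERIALIZED VIEW vehicle_search AS
 SELECT v.id, v.mileage, v.price,
        mk.name AS make, d.city AS dealer_city
   FROM vehicles v
   JOIN makes   mk ON mk.id = v.make_id
   JOIN dealers d  ON d.id  = v.dealer_id;

 CREATE INDEX vehicle_search_price ON vehicle_search (price);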

Another possible reason is that databases generally suffer from internal fragmentation; they have to perform too many semi-random I/O operations to satisfy huge queries.

What this means is that, given the index architecture of a database, a query leads to the indexes, which in turn lead to the data. If the data to be retrieved is widely scattered, the result takes longer, and that seems to be what happens in databases.

+6
Apr 7 2018-12-12T00:

Solr is designed primarily for searching data, not for storing it. This enables it to discard much of the functionality required of an RDBMS. As a result it (or rather Lucene) concentrates on purely indexing data.

As you've no doubt discovered, Solr lets you both search and retrieve data from its index. It's the latter (optional) capability that leads to the natural question... "Can I use Solr as a database?"

The answer is a qualified yes, and I refer you to the following:

My personal opinion is that Solr is best thought of as a searchable cache between my application and the data mastered in my database. That way I get the best of both worlds.

+5
Apr 07 '12 at 9:30

Please read this and this.

Solr (Lucene) builds an inverted index, which is why retrieving data is quite fast. I've read that PostgreSQL also has similar capabilities, but I'm not sure whether you were using them.

The performance differences you observed could also come down to "what was being searched for?" and "what do the user queries look like?"

+1
Apr 7 '12


