How do you create a multi-column search database?

I am creating a property search from RETS data using MySQL, but this is a general question. When you have many columns that you would like the user to be able to filter their search result, how do you optimize it?

For example, http://www.charlestonrealestateguide.com/listings.php has 16 or so additional filters. Of course, it has only up to 11,000 records (I have the same data), but I do not think that the search is performed only with the giant WHERE AND AND AND ... . Or is it usually done with a single gigantic multi-column index?

Newegg, Amazon and many others also have cool and fast filtering systems for large amounts of data. How do they do it? And is there a reason to optimize the database for the tendency to provide ranges instead of empty inputs or just for the convenience of the user?

+4
source share
5 answers

MySQL Editing

Some RDBMSs seem to have some ability in this regard.

In Mysql there is some kind of index "joins" according to the documentation .

[Before MySQL5] MySQL was able to use no more than one index for each reference table

But at 5, it supports limited index merging.

You really need to understand how indexes work and when they are useful. What percentage of rows does a full table scan make more sense than an index? Do you think that in some scenarios FTS is cheaper than scanning an index that returns 2% of rows? If the histogram of your bedroom looks like this: 1 = 25%, 2 = 50%, 3 = 20%,> 3 = 5% ... The only time the index in this column is useful is to find more than 3 bedrooms, t use it then due to binding variables and clustering factors.

think of it like that. Suppose my bedroom percentage is correct. Let's say you have 8k pages (dunno what Mysql uses) and each row is 80 bytes long. Ignoring the overhead, you have 100 lines (lists) per disk page. Since the houses are added in random order (random as the bedrooms go) on each page, you will have 50 2-bedroom houses, 25 1-bedroom houses, 20 3-bedroom houses and maybe 4 or 5 or so on this page Each page will have at least one 1-bedroom house, so you will read EVERY page for BEDROOMS = 1, the same for 2, the same for 3. This can help for 5-bedroom houses ... but if MySQL will bind the variable operation as Oracle, then it will not change plans for the given value of the Bedroom.

As you can see, there is much to understand ... More than John Skeet pointed out.

Original post

Most DBMSs cannot combine indexes in one table. If you have a table with columns A, B and C with indexes with one column on A, B and C. and you search, where A = a and B = b and C = c. He will choose the most selective index and will use only one.

If you create one, multi-column index in A, B, C, then this index will not work unless you include A = a in WHERE. If your B = b and C = c, then this index is ignored - in most DBMSs.

This is why Oracle invented the Bitmap index. The Bitmap index on A, B, and C can be combined with bitwise AND bitwise OR operations. Until the final set of Rowids and the selected selected columns are determined.

The bitmap index in the REGION column is displayed in the last four columns.

  Row Region North East West South 1 North 1 0 0 0 2 East 0 1 0 0 3 West 0 0 1 0 4 West 0 0 1 0 5 South 0 0 0 1 6 North 1 0 0 0 

So, if you say that you want a house where the region is located (north, east). You are beaten by the index of the North and East indices and end in rows 1, 2, 6

If you have another bedroom column, for example

  Row Bedrooms 1BR 2BR 1 1 1 0 2 2 0 1 3 1 1 0 4 1 1 0 5 2 0 1 6 2 0 1 

if you And Bedrooms = 2, this index will return 2, 5, 6 and when Bitwise AND'ed in the Region column will result in rows 2 and 6.

But since you could not mention the DBMS, I may have completely lost my time. Oh, OK.

+2
source

I believe this post. Explain advanced solves your question. This is a long and detailed example of many examples. I cut / paste his resume to wet your appetite:

In some cases, a range predicate (for example, “less”, “greater”, or “between”) can be rewritten as an IN predicate against a list of values ​​that can satisfy the range state.

Depending on the type of column, check the restrictions and statistics, the list may consist of all possible values ​​defined by the column domains; all possible values ​​determined by the minimum columns and the maximum value or all actual values ​​contained in the table. c In the latter case, an attenuated index scan could have retrieved a list of such values.

Since the condition of equality is applied to each value in the list, more access and union methods can be used to build a simple query, including a range of conditions on the secondary columns of the index, hash search, etc.

Whenever the optimizer builds a plan for a query containing a predicate range, it should consider rewriting the range condition as an IN predicate and use the latter method if it is more efficient.

+3
source

Would it be a query WHERE x='y' AND a='b' etc?

I would think that it should have several separate indexes - nothing special is needed.

+1
source

I assume that your search criteria are discrete rather than free forms, that is, you filter what you can quantify as the number of bedrooms, plot size, etc., regardless of whether it is in a “sunny” place "In this case, I would suggest that you want to build the query dynamically so that the query only considers the columns of interest in the database. Probably the single-column indexes are adequate, especially considering that you don't have a lot of data. If you find that l Because people always specify a pair of columns — for example, the number of bedrooms and the number of bathrooms — adding a composite index for this combination of columns can be useful, however, I would allow statistics and performance to manage these decisions.

If you request only one table, it will select the best index to use, if applicable. From this point of view, you want to select columns that are good discriminators and are likely to be used in the filter. Limiting the number of indexes can be good if you know that some columns either quickly limit the number of returned results, or, conversely, that a particular column is not a good discriminator. If, for example, 90% of the houses you listed have a plot size smaller than an acre, and most people look for plots smaller than acres (or don't care), then an index scan based on this index usually does not have a better scan than a table scan, and no index needed. Indexes are worth what you need to calculate, although for a small database like yours, with infrequent inserts, which are probably not a problem.

@Jon is right, I think you probably want to combine the filter properties with AND rather than with OR. That is, people usually look for a house with 3 bedrooms and 2 bathrooms, rather than 3 bedrooms or 2 bathrooms. If you have a filter that allows several options, then you can use IN-say PropertyType IN ('Ranch','SplitLevel',...) instead of explaining OR (it works the same, but is more readable). Note that most likely you are using the foreign key in the PropertyTypes table, not in the text here, but I used the values ​​for illustration only.

+1
source

You need a full-text search engine. Amazon and others use the same thing. Take a look at http://lucene.apache.org/ , and if your platform is based on Java, then a much higher level of abstractions can be www.elasticsearch.com and Search in sleep mode.

0
source

Source: https://habr.com/ru/post/1310982/


All Articles