Indexing and alternatives for low selectivity columns

What is the range of tactics for selecting records on columns with a low degree of selectivity?

An example is an orders table where, over the years, you build up a large number of completed orders but often need to select the active ones. An order might go through a life cycle such as placed, stock allocated, picked from the warehouse, shipped to the customer, invoiced and paid. An order might also be cancelled, put on hold, and so on. Most records will end up in the final state (for example, paid), but you frequently need to select, say, picked orders. In this case a sequential read will be slow.

Similar questions: MySQL: low cardinality/selectivity columns = how to index?
Do indexes suck in SQL?
What are indexes and how can I use them to optimize queries in my database? Defining indexes: Which Columns, and Performance Impact? There are many other related questions.

The approaches I've read about (in stackoverflow and elsewhere) include

  • Use a bitmap index
  • Use a partial index ( create index x on t(c2) where c1='a' )
  • Use a clustered index?
  • Do not index low selectivity columns; use sequential reads
  • Partition the data (for example, into several tables with the same layout)
  • Use an extra table (e.g. active_customers(customer_id))

My current DBMS does not support the first three options listed above, and the rest seem problematic. Are there any other widely used approaches?

Update: I have also seen this advice: index the low selectivity column anyway, but only ever query it for the highly selective values.

+8
sql indexing relational-database database-design
4 answers

I agree with Unreason's answer. However, there is something more you need to know about this case.

This is called data skew. It is an ideal use for a partial index, in which you would exclude the 95% of invoices that are paid and index only the more interesting and selective statuses. But you don't have that. You can horizontally partition the rows into separate tables / partitions, but then you have to consider row migration (moving from one status to another), and that is expensive. The DBMS has to perform an update, a delete and an insert to change the status. On a high-volume system, that will hurt.

Forget the question of whether to index based on selectivity, because putting an index on a fast-changing column is also a bad idea. Your index will have hot blocks: one where all the step 1 entries are being deleted, another where all the step 2 entries are being inserted, and, by the way, some step 2 entries are being deleted at the same time as they move to step 3. This will not scale well.

I would recommend vertically partitioning the status out into a separate table.

The invoice table will hold the PK and all columns except the status.

The status table will have a PK that is also an FK to the invoice table, the status, and a timestamp of when the invoice entered that status. You can handle this table in two ways. The best option is a horizontally partitioned status table, with one partition per status. Then, when you look up all or some of the "Placed" statuses, partition pruning kicks in and only the partition that needs to be read is read, which is a very small number of blocks. Because the row is so narrow, you can fit something like 400 invoice statuses in one block. Finding the status of any single invoice is easy because there is a global index on the PK.
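As a rough sketch of the vertical split, here is a version that runs on Python's built-in sqlite3 module. The table and column names (invoice, invoice_status, entered_at, customer, amount) are assumptions made up for illustration. SQLite has no table partitioning, so the per-status partitions and the pruning described above are not shown; only the narrow status table and the two lookups are.

```python
import sqlite3

def build_db():
    """Create the wide invoice table and the narrow status table (illustrative names)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE invoice (
            invoice_id INTEGER PRIMARY KEY,
            customer   TEXT NOT NULL,
            amount     NUMERIC NOT NULL
        );
        -- One narrow row per invoice; the PK is also the FK to invoice.
        CREATE TABLE invoice_status (
            invoice_id INTEGER PRIMARY KEY REFERENCES invoice(invoice_id),
            status     TEXT NOT NULL,
            entered_at TEXT NOT NULL
        );
    """)
    return conn

def invoices_in_status(conn, status):
    """Return the ids of invoices currently in the given status, in id order."""
    rows = conn.execute(
        """SELECT i.invoice_id
           FROM invoice_status s
           JOIN invoice i ON i.invoice_id = s.invoice_id
           WHERE s.status = ?
           ORDER BY i.invoice_id""",
        (status,),
    )
    return [r[0] for r in rows]

def status_of(conn, invoice_id):
    """Look up one invoice's status through the status table's primary key."""
    row = conn.execute(
        "SELECT status FROM invoice_status WHERE invoice_id = ?", (invoice_id,)
    ).fetchone()
    return row[0] if row else None
```

On a DBMS with list partitioning, invoice_status would be partitioned by status, so the first query reads a single small partition.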

If your RDBMS does not support partitioning with row migration, you need to manage these partitions as separate tables, deleting from one and inserting into another. Wrap these moves in a transaction inside a procedure to keep the data clean. Each invoice is in one and only one status table. The harder part is querying by invoice ID: you have to check each table to see which one the invoice is in.

You have one more choice: you can either store paid statuses or not. If it is a partitioned table, you can simply delete the invoice from the invoice status table when it moves to paid. (Of course, you will still record the paid status in the history table mentioned in the bonus material.) Then you do an outer join to the status table, and NULL means paid. If you almost never query for paid status, there is really no reason to make that query fast.
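A minimal sketch of managing the partitions by hand, again on Python's sqlite3. The status names, the status_<name> table naming and the helper functions are all hypothetical; a real system would put the move into a stored procedure as described above.

```python
import sqlite3

STATUSES = ("placed", "picked", "shipped")  # hypothetical status names

def build_db():
    """Create one narrow table per status."""
    conn = sqlite3.connect(":memory:")
    for s in STATUSES:
        conn.execute(
            f"CREATE TABLE status_{s} ("
            "invoice_id INTEGER PRIMARY KEY, entered_at TEXT NOT NULL)"
        )
    return conn

def place(conn, invoice_id, when):
    """Put a new invoice into the first status table."""
    with conn:
        conn.execute(
            f"INSERT INTO status_{STATUSES[0]} VALUES (?, ?)", (invoice_id, when)
        )

def move(conn, invoice_id, old, new, when):
    """Delete from one status table and insert into another in one transaction."""
    if old not in STATUSES or new not in STATUSES:
        raise ValueError("unknown status")
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            f"DELETE FROM status_{old} WHERE invoice_id = ?", (invoice_id,)
        )
        if cur.rowcount != 1:
            raise LookupError(f"invoice {invoice_id} is not in status {old}")
        conn.execute(
            f"INSERT INTO status_{new} VALUES (?, ?)", (invoice_id, when)
        )

def find_status(conn, invoice_id):
    """The awkward part: probe every status table to find the invoice."""
    for s in STATUSES:
        hit = conn.execute(
            f"SELECT 1 FROM status_{s} WHERE invoice_id = ?", (invoice_id,)
        ).fetchone()
        if hit:
            return s
    return None
```

The lookup by invoice ID costs one primary-key probe per status table, which is tolerable only while the number of statuses stays small.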
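The "do not store paid" variant could look something like this on Python's sqlite3. Table and column names are assumptions, and the write to the history table is left out to keep the sketch short.

```python
import sqlite3

def build_db():
    """Invoice table plus a status table that only holds unpaid invoices."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE invoice (
            invoice_id INTEGER PRIMARY KEY,
            amount     NUMERIC NOT NULL
        );
        CREATE TABLE invoice_status (
            invoice_id INTEGER PRIMARY KEY REFERENCES invoice(invoice_id),
            status     TEXT NOT NULL
        );
    """)
    return conn

def mark_paid(conn, invoice_id):
    """Paid invoices simply leave the status table."""
    with conn:
        conn.execute(
            "DELETE FROM invoice_status WHERE invoice_id = ?", (invoice_id,)
        )

def statuses(conn):
    """Outer join: a missing status row (NULL) is read as 'Paid'."""
    rows = conn.execute("""
        SELECT i.invoice_id, COALESCE(s.status, 'Paid')
        FROM invoice i
        LEFT OUTER JOIN invoice_status s ON s.invoice_id = i.invoice_id
        ORDER BY i.invoice_id
    """)
    return rows.fetchall()
```

The status table then stays roughly the size of the active set, no matter how many paid invoices pile up in the invoice table.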

Bonus Material

In any case, you want to track these movements in a history table. Each time you update a status, write it to the history table. Eventually you will want to analyze what I call transit time. What is the average time from fulfilled to paid, by month? Is it increasing as a result of a poor economy? What is the transit time from placed to fulfilled, by month? Do the summer months take longer because staff are away on vacation? You get the idea. By updating this column in place you lose those answers, so you need to build this history logging into your procedures.
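Here is a small sketch of such a history table and one transit-time query, using Python's sqlite3. The table name status_history, its columns, and the assumption that an invoice enters each status at most once are mine, not part of the design above.

```python
import sqlite3

def build_db():
    """History table: one row per (invoice, status) transition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE status_history (
            invoice_id INTEGER NOT NULL,
            status     TEXT NOT NULL,
            entered_at TEXT NOT NULL,          -- ISO date, e.g. 2024-04-21
            PRIMARY KEY (invoice_id, status)   -- assumes one entry per status
        )
    """)
    return conn

def record(conn, invoice_id, status, entered_at):
    """Call this from the same procedure that changes the status."""
    with conn:
        conn.execute(
            "INSERT INTO status_history VALUES (?, ?, ?)",
            (invoice_id, status, entered_at),
        )

def avg_transit_days_by_month(conn, from_status, to_status):
    """Average days between two statuses, grouped by the month the first was entered."""
    rows = conn.execute("""
        SELECT strftime('%Y-%m', a.entered_at) AS month,
               AVG(julianday(b.entered_at) - julianday(a.entered_at)) AS avg_days
        FROM status_history a
        JOIN status_history b ON b.invoice_id = a.invoice_id
        WHERE a.status = ? AND b.status = ?
        GROUP BY month
        ORDER BY month
    """, (from_status, to_status))
    return rows.fetchall()
```

Calling avg_transit_days_by_month(conn, "Fulfilled", "Paid") returns a list of (month, average days) pairs, which is the kind of answer an in-place status update throws away.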

+3

Of all the approaches you listed, only one (use sequential reads) really has anything to do with low selectivity (well, a clustered index might qualify too).

If a column has low selectivity, a scan will perform better than a seek.

An index can be used for

  • index seeks - look up the pointer in the index, fetch the record, repeat
  • index scans - scan the index and get the values directly from the index

otherwise it is not very useful.

If selectivity is low, most of the index will be read and, if seeks are used, most of the data will then be read in more or less random order. That is inefficient when you cover a significant percentage of the base table, so a sequential read is the best option (even though it is also slow).

So, if selectivity is low, there is not much you can do (clustering might help).
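To put rough numbers on that, here is a toy model in Python. It assumes the matching rows are spread uniformly at random over the table's blocks, which is my simplification and not how any particular optimizer estimates cost; the function name and parameters are made up for the example.

```python
def blocks_touched(total_rows, rows_per_block, fraction_matched):
    """Expected number of distinct table blocks hit by index lookups.

    Toy model: each matching row lands in a uniformly random block, so a
    given block is missed by all k matching rows with probability
    (1 - 1/B)**k, where B is the number of blocks.
    """
    total_blocks = total_rows / rows_per_block
    matching_rows = total_rows * fraction_matched
    return total_blocks * (1 - (1 - 1 / total_blocks) ** matching_rows)

# 1,000,000 rows at 100 rows per block gives 10,000 blocks in total.
half = blocks_touched(1_000_000, 100, 0.5)      # predicate matches half the rows
rare = blocks_touched(1_000_000, 100, 0.001)    # predicate matches 0.1% of rows
```

Under this model a predicate matching half the rows touches essentially every block, in random order, so the sequential scan wins. A predicate matching 0.1% of the rows touches fewer than a tenth of the blocks, which is where the index pays off.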

However, I'm not sure you realize that in your example you do not have low selectivity. As you say, most entries will be paid and there will be very few picked entries. Those (picked) entries will have high selectivity, especially if there are additional conditions and a composite index that includes those conditions.

So you may be banging your head against a problem that does not exist.

That said, you can still improve performance by partitioning the data or using an extra table (if you need to).
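A quick way to see this on your own data is to measure the fraction of rows per value. The sketch below uses Python's sqlite3 with invented data (950 paid, 30 placed, 20 picked) and a hypothetical composite index on (status, customer_id); the plan text it returns is SQLite's and will differ on other engines.

```python
import sqlite3

def build_orders():
    """Skewed demo table: 95% of the rows are in the final 'paid' state."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            status      TEXT NOT NULL
        )
    """)
    conn.execute(
        "CREATE INDEX idx_orders_status_customer ON orders (status, customer_id)"
    )
    rows = [(i % 50, "paid") for i in range(950)]
    rows += [(i % 50, "placed") for i in range(30)]
    rows += [(i % 50, "picked") for i in range(20)]
    with conn:
        conn.executemany(
            "INSERT INTO orders (customer_id, status) VALUES (?, ?)", rows
        )
    return conn

def fraction_by_status(conn):
    """Fraction of the table matched by each status value."""
    total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    rows = conn.execute("SELECT status, COUNT(*) FROM orders GROUP BY status")
    return {status: n / total for status, n in rows}

def query_plan(conn):
    """Plan detail lines for a lookup on a rare status plus an extra condition."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT order_id FROM orders WHERE status = ? AND customer_id = ?",
        ("picked", 3),
    ).fetchall()
    return [r[-1] for r in rows]
```

The column as a whole has only three distinct values, but a predicate on 'picked' matches 2% of the rows, and adding customer_id narrows it further.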

+3

Partitioning is an approach that stores a single table in separate areas based on the data. SQL developers do not need to access separate tables.

I think it is ideal for the problem described - you can find more about it on Informix here: http://www.dbmag.intelligententerprise.com/blog/main/archives/2008/09/data_partitioni.html

+1

If you can relax the normalization of the database, and the number of possible states is small (for example, fewer than 5), you can add one nullable column per state and put indexes on those columns. Many engines skip rows with null or missing values and index only rows with actual data (for example, MongoDB sparse indexes omit documents that lack the indexed field). For example:

 Invoice#  Date    State      IsPlaced  IsPaid  IsFulfilled
 1         Apr-20  Fulfilled  (null)    (null)  yes
 2         Apr-20  Fulfilled  (null)    (null)  yes
 3         Apr-20  Fulfilled  (null)    (null)  yes
 4         Apr-21  Fulfilled  (null)    (null)  yes
 5         Apr-21  Fulfilled  (null)    (null)  yes
 6         Apr-21  Paid       (null)    yes     (null)
 7         Apr-21  Placed     yes       (null)  (null)
 8         Apr-22  Placed     yes       (null)  (null)
 9         Apr-22  Paid       (null)    yes     (null)
 10        Apr-22  Placed     yes       (null)  (null)

You can store this information in a separate table and perhaps maintain it with triggers, or at least validate it with constraints.

This is not a universal solution, and in fact it scales poorly, but it leaves you free to partition on a column that makes more sense, for example the invoice date.
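One way the constraint check might look, sketched on Python's sqlite3. The column names mirror the table above but are otherwise assumptions, as is the set_state helper. Note that SQLite's ordinary indexes do store NULL entries, so this shows only the consistency constraints, not the sparse-index saving.

```python
import sqlite3

def build_db():
    """Flag-per-state table with CHECK constraints tying each flag to the state."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE invoice (
            invoice_id   INTEGER PRIMARY KEY,
            state        TEXT NOT NULL
                         CHECK (state IN ('Placed', 'Paid', 'Fulfilled')),
            is_placed    TEXT CHECK (is_placed = 'yes'),
            is_paid      TEXT CHECK (is_paid = 'yes'),
            is_fulfilled TEXT CHECK (is_fulfilled = 'yes'),
            -- Each flag is set exactly when the state says so.
            CHECK ((state = 'Placed')    = (is_placed IS NOT NULL)),
            CHECK ((state = 'Paid')      = (is_paid IS NOT NULL)),
            CHECK ((state = 'Fulfilled') = (is_fulfilled IS NOT NULL))
        );
        CREATE INDEX idx_invoice_placed    ON invoice (is_placed);
        CREATE INDEX idx_invoice_paid      ON invoice (is_paid);
        CREATE INDEX idx_invoice_fulfilled ON invoice (is_fulfilled);
    """)
    return conn

def set_state(conn, invoice_id, state):
    """Change the state and all three flags in one statement."""
    with conn:
        conn.execute("""
            UPDATE invoice
            SET state        = :s,
                is_placed    = CASE WHEN :s = 'Placed'    THEN 'yes' END,
                is_paid      = CASE WHEN :s = 'Paid'      THEN 'yes' END,
                is_fulfilled = CASE WHEN :s = 'Fulfilled' THEN 'yes' END
            WHERE invoice_id = :id
        """, {"s": state, "id": invoice_id})
```

A row whose flags disagree with its state, such as state 'Paid' with is_placed set, is rejected with sqlite3.IntegrityError.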

This type of trick is often used in data warehouse projects, where the efficiency of processing large data sets is more important than data normalization.

0
