Optimum performance for joining a range of values

I have a really big table that contains integer representations of IP addresses, and a second table with start and end integer ranges of IP addresses. The second table is used to return the country, following the approach in https://stackoverflow.com/a/412929/ Although this returns the required results, the performance is pretty poor. Is there a better alternative for joining on a range? The following sample code shows how the join works:

    CREATE TABLE #BaseTable (
        SomeIntegerValue INT PRIMARY KEY);

    INSERT INTO #BaseTable (SomeIntegerValue)
    SELECT SomeIntegerValue
    FROM (VALUES (123), (456), (789)) Data (SomeIntegerValue);

    CREATE TABLE #RangeLookupTable (
        RangeStartValue INT PRIMARY KEY
      , RangeEndValue INT NOT NULL);

    INSERT INTO #RangeLookupTable (RangeStartValue, RangeEndValue)
    SELECT RangeStartValue, RangeEndValue
    FROM (VALUES (0, 100), (101, 200), (201, 300)
               , (301, 400), (401, 500), (501, 600)
               , (701, 800), (901, 1000)) Data (RangeStartValue, RangeEndValue);

    SELECT *
    FROM #BaseTable bt
    JOIN #RangeLookupTable rlt
        ON bt.SomeIntegerValue BETWEEN rlt.RangeStartValue AND rlt.RangeEndValue;
+7
4 answers

If the specific situation allows storing denormalized data in a table and querying that table instead of joining against the normalized base table, very fast lookup times can be achieved. The query execution plan shows a 2x gain on the SELECT, even though the data in this example is only 3 rows.

Such an approach is viable in scenarios with relatively few write operations and many reads. The JOIN only needs to be performed when the data is updated; testing with real data will show how much improvement (if any!) is actually achieved across the whole (UPDATE + SELECT) workload.

Sample code, as well as a screenshot of the execution plans for the SELECT statements, is given below.

    CREATE TABLE #BaseTable (
        SomeIntegerValue INT PRIMARY KEY);

    INSERT INTO #BaseTable (SomeIntegerValue)
    SELECT SomeIntegerValue
    FROM (VALUES (123), (456), (789)) Data (SomeIntegerValue);

    CREATE TABLE #RangeLookupTable (
        RangeStartValue INT PRIMARY KEY
      , RangeEndValue INT NOT NULL);

    INSERT INTO #RangeLookupTable (RangeStartValue, RangeEndValue)
    SELECT RangeStartValue, RangeEndValue
    FROM (VALUES (0, 100), (101, 200), (201, 300)
               , (301, 400), (401, 500), (501, 600)
               , (701, 800), (901, 1000)) Data (RangeStartValue, RangeEndValue);

    -- Alternative approach: denormalized base table
    CREATE TABLE #BaseTable2 (
        SomeIntegerValue INT PRIMARY KEY,
        RangeStartValue INT NULL,
        RangeEndValue INT NULL);

    INSERT INTO #BaseTable2 (SomeIntegerValue)
    SELECT SomeIntegerValue
    FROM (VALUES (123), (456), (789)) Data (SomeIntegerValue);

    -- Update the target through its alias so the UPDATE and the FROM clause
    -- refer to the same table instance
    UPDATE bt2
    SET RangeStartValue = rlt.RangeStartValue,
        RangeEndValue = rlt.RangeEndValue
    FROM #BaseTable2 bt2
    JOIN #RangeLookupTable rlt
        ON bt2.SomeIntegerValue BETWEEN rlt.RangeStartValue AND rlt.RangeEndValue;

    -- The original: SELECT with a JOIN
    SELECT *
    FROM #BaseTable bt
    JOIN #RangeLookupTable rlt
        ON bt.SomeIntegerValue BETWEEN rlt.RangeStartValue AND rlt.RangeEndValue;

    -- The alternative: SELECT from the denormalized base table
    SELECT * FROM #BaseTable2;
    GO

Query execution plans for the JOINed vs. denormalized SELECT:

Query Execution with a JOIN vs. a denormalized table

+1

This is almost certainly an indexing issue. You currently have an index on RangeStartValue (via the primary key), but not on RangeEndValue , so SQL Server may have to scan on the second column after narrowing on the first. Try indexing RangeEndValue and see how that affects things.

I am not very well versed in the performance characteristics of the BETWEEN clause, but you can rule it out as the problem by rewriting the comparison as two explicit inequality checks.
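As a concrete sketch of both suggestions, using the table and column names from the sample script (the index name is an assumption):

```sql
-- Sketch only: index the upper bound as the answer suggests, so the
-- engine is not forced to scan on RangeEndValue after seeking on
-- RangeStartValue.
CREATE NONCLUSTERED INDEX IX_RangeLookupTable_End
    ON #RangeLookupTable (RangeEndValue);

-- BETWEEN rewritten as two explicit inequality checks, which is
-- semantically identical:
SELECT *
FROM #BaseTable bt
JOIN #RangeLookupTable rlt
    ON bt.SomeIntegerValue >= rlt.RangeStartValue
   AND bt.SomeIntegerValue <= rlt.RangeEndValue;
```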


Also, in this test script you select every row in the base table, which I assume you don't do in your production database?

0

The problem is that your lookup table contains non-overlapping ranges, but SQL Server probably does not recognize this. So when you ask for ipaddress between A and B , it may scan the index from the beginning all the way up to A.

I do not know whether there is a way to tell the optimizer about this property so that it seeks directly to the first matching record in the index. Perhaps something like this will work:

    SELECT bt.*
         , (SELECT TOP 1 RangeEndValue
            FROM #RangeLookupTable rlt
            WHERE rlt.RangeStartValue <= bt.SomeIntegerValue
            ORDER BY rlt.RangeStartValue DESC)
    FROM #BaseTable bt;

This may trick the optimizer into seeking just one entry in the index. The data in your example is too small to tell whether this actually helps performance.

Another approach is to use an equi-join to stop the scan. To each table, add the class A portion of the address (the first byte). This can either be redundant (if the second field holds the full address), or you can put the remaining three bytes in the next field. Any IP range that spans multiple class A blocks has to be split into separate rows.

Make this field the first column in the index, with the address (or the rest of the address) as the second part of the primary key. You can use a constraint to create a primary key with multiple columns.
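A sketch of what the restructured lookup table might look like; the table, the TypeA column, and the constraint name are assumptions for illustration, not part of the original script:

```sql
-- Hypothetical lookup table with the class A byte leading the key.
CREATE TABLE #RangeLookupTable2 (
    TypeA           INT NOT NULL,  -- first byte of the address (0-255)
    RangeStartValue INT NOT NULL,  -- range start within that class A block
    RangeEndValue   INT NOT NULL,
    -- A constraint is how a multi-column primary key is declared:
    CONSTRAINT PK_RangeLookupTable2
        PRIMARY KEY (TypeA, RangeStartValue)
);
```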

The query would then look like this:

    SELECT *
    FROM #BaseTable bt
    JOIN #RangeLookupTable rlt
        ON bt.TypeA = rlt.TypeA
       AND bt.SomeIntegerValue BETWEEN rlt.RangeStartValue AND rlt.RangeEndValue;

Then the index scan is limited to values with the same first byte. Of course, you could also extend this to the first two bytes (class B).

0

I tested 15 separate approaches that I thought might work, and this solution performed best:

    SELECT bt.*
         , RangeStartValue = (SELECT TOP 1 rlt.RangeStartValue
                              FROM #RangeLookupTable rlt
                              WHERE bt.SomeIntegerValue >= rlt.RangeStartValue
                              -- DESC: take the greatest start value that does
                              -- not exceed the lookup value
                              ORDER BY rlt.RangeStartValue DESC)
    FROM #BaseTable bt;

This produces a clustered index seek on the lookup table and can process millions of records in seconds. Note that I adapted this solution from the code on this blog .
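One caveat: the sample ranges contain gaps (601-700 and 801-900 are missing), and a seek on RangeStartValue alone still returns the nearest preceding start value for a lookup value inside a gap. A possible variant (my own sketch, not one of the tested solutions) that keeps the same seek pattern but also returns RangeEndValue so such rows can be filtered out:

```sql
-- Sketch: OUTER APPLY returns both bounds from the nearest preceding range,
-- so values that fall into a gap between ranges can be recognized.
SELECT bt.*, r.RangeStartValue, r.RangeEndValue
FROM #BaseTable bt
OUTER APPLY (SELECT TOP 1 rlt.RangeStartValue, rlt.RangeEndValue
             FROM #RangeLookupTable rlt
             WHERE rlt.RangeStartValue <= bt.SomeIntegerValue
             ORDER BY rlt.RangeStartValue DESC) r
WHERE bt.SomeIntegerValue <= r.RangeEndValue;  -- drops values landing in a gap
```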

0
