Hierarchical data structure design (nested sets)

I am working on the design of a hierarchical database structure that models a catalogue containing products. The database platform is SQL Server 2005, and the catalogue is quite large (750,000 products, 8,500 catalogue sections across 4 levels), but relatively static (it is rebuilt once a day), so we are only concerned about READ performance.

The general catalogue hierarchy structure is:

  • Level 1 Section
    • Level 2 Section
      • Level 3 Section
        • Level 4 Section (products attach at this level)

We use the Nested Sets pattern to store the hierarchy levels, and we store the products that exist at each level in a separate related table. A simplified version of the database structure would be:

CREATE TABLE CatalogueSection
(
  SectionID INTEGER,
  ParentID INTEGER,
  LeftExtent INTEGER,
  RightExtent INTEGER
)

CREATE TABLE CatalogueProduct
(
  ProductID INTEGER,
  SectionID INTEGER
)
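
For readers unfamiliar with nested sets: a "section and all of its descendants" lookup is a single range test on the extents, along the lines of the sketch below (assuming LeftExtent/RightExtent are maintained in the usual way; @SectionID is just an illustrative parameter).

DECLARE @SectionID INTEGER
SET @SectionID = 42  -- illustrative section

-- All sections at or beneath the chosen section: one range test on the extents.
SELECT child.SectionID
FROM CatalogueSection AS parent
JOIN CatalogueSection AS child
  ON child.LeftExtent BETWEEN parent.LeftExtent AND parent.RightExtent
WHERE parent.SectionID = @SectionID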

We have an additional complication: there are around 1,000 separate customer groups who may or may not be able to see all of the products in the catalogue. Because of this, we need to maintain a separate "version" of the catalogue hierarchy for each customer group, so that when they browse the catalogue they see only their own products and do not see any sections that would be empty.

To facilitate this, we maintain a table of product counts at each level of the hierarchy, "rolled up" from the sections below. So although products are related directly only to the lowest-level sections, they are counted all the way up the tree. The structure of this table is:

CREATE TABLE CatalogueSectionCount
(
  SectionID INTEGER,
  CustomerGroupID INTEGER,
  SubSectionCount INTEGER,
  ProductCount INTEGER
)
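
For illustration only (this is a guess at how the counts table is used, not the actual code): a customer group's visible, non-empty sections would be fetched with something like:

DECLARE @CustomerGroupID INTEGER
SET @CustomerGroupID = 7  -- illustrative group

-- Sections that should appear for this group: anything with at least one
-- visible product somewhere beneath it, thanks to the rolled-up counts.
SELECT s.SectionID
FROM CatalogueSection AS s
JOIN CatalogueSectionCount AS c
  ON c.SectionID = s.SectionID
WHERE c.CustomerGroupID = @CustomerGroupID
  AND c.ProductCount > 0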

So, to the problem: performance is very poor at the top levels of the hierarchy. A typical query showing the "top 10" products in a selected catalogue section (and all of its child sections) takes up to a minute to complete. At lower sections of the hierarchy it is faster, but still not good enough.
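
For context, the problem query presumably has roughly the shape below under nested sets: near the top of the tree the range join fans out to thousands of sections. This is purely illustrative; SalesRank is a made-up ranking column, since the real ranking criteria are not shown here.

DECLARE @SectionID INTEGER
SET @SectionID = 42  -- illustrative section

-- Illustrative only: "top 10" products in a section and all of its children.
-- SalesRank is a hypothetical column standing in for the real ranking criteria.
SELECT TOP 10 p.ProductID
FROM CatalogueSection AS parent
JOIN CatalogueSection AS child
  ON child.LeftExtent BETWEEN parent.LeftExtent AND parent.RightExtent
JOIN CatalogueProduct AS p
  ON p.SectionID = child.SectionID
WHERE parent.SectionID = @SectionID
ORDER BY p.SalesRank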

I have placed indexes (including covering indexes where applicable) on all the key tables and run the queries through the query analyzer, the Index Tuning Wizard, etc., but I still cannot get it fast enough.

I am wondering whether the design is fundamentally flawed, or whether it is just that we have such a large dataset. We have a reasonable development server (3.8 GHz Xeon, 4 GB RAM), but it just is not coping :)

Thanks for any help

James

+4
3 answers

Use a closure table. If your basic structure is parent-child with ID and ParentID fields, the closure table has AncestorID and DescendantID fields. In other words, the closure table is an ancestor-descendant table in which every possible ancestor is associated with all of its descendants. You can include a LevelsBetween field if you need it. Closure table implementations usually include self-referencing entries, i.e. ID 1 is an ancestor of descendant ID 1 with zero levels between.
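
For concreteness, such a table might be declared as below (the table and column names are just illustrative):

CREATE TABLE SectionClosure
(
  AncestorID INTEGER NOT NULL,    -- SectionID of the ancestor (includes self)
  DescendantID INTEGER NOT NULL,  -- SectionID of the descendant (includes self)
  LevelsBetween INTEGER NOT NULL  -- 0 for the self-referencing row
)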

Example: Parent/Child
ParentID - ID
1 - 2
1 - 3
3 - 4
3 - 5
4 - 6

Ancestor/Descendant
AncestorID - DescendantID - LevelsBetween

1 - 1 - 0
1 - 2 - 1
1 - 3 - 1
1 - 4 - 2
1 - 5 - 2
1 - 6 - 3
2 - 2 - 0
3 - 3 - 0
3 - 4 - 1
3 - 5 - 1
3 - 6 - 2
4 - 4 - 0
4 - 6 - 1
5 - 5 - 0
6 - 6 - 0

The point of the closure table is to eliminate recursive joins. You push the cost of the recursive join into the ETL cycle you run when you load the data once a day, which shifts it away from query time.
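
As a sketch of that daily ETL step (assuming the illustrative SectionClosure table above, and noting that SQL Server 2005 supports recursive CTEs), the closure table can be rebuilt from the ParentID data like this:

-- Rebuild the closure table from the adjacency (ParentID) data.
TRUNCATE TABLE SectionClosure;

WITH Closure (AncestorID, DescendantID, LevelsBetween) AS
(
    -- Anchor: every section is its own ancestor at depth 0.
    SELECT SectionID, SectionID, 0
    FROM CatalogueSection

    UNION ALL

    -- Recurse: an ancestor of a section is also an ancestor of that section's children.
    SELECT c.AncestorID, s.SectionID, c.LevelsBetween + 1
    FROM Closure AS c
    JOIN CatalogueSection AS s
      ON s.ParentID = c.DescendantID
)
INSERT INTO SectionClosure (AncestorID, DescendantID, LevelsBetween)
SELECT AncestorID, DescendantID, LevelsBetween
FROM Closure;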

In addition, it allows variable-depth hierarchies. You are not stuck at 4 levels.

Finally, it allows products to be attached to non-leaf nodes. Many catalogues create "miscellaneous" buckets at higher levels of the hierarchy just to have a leaf node to attach products to. You do not need to do that, because the intermediate nodes are included in the closure.

As for indexing, I would put a clustered index on AncestorID/DescendantID.
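
In T-SQL terms (again using the illustrative SectionClosure name), that would be something like:

CREATE UNIQUE CLUSTERED INDEX IX_SectionClosure
  ON SectionClosure (AncestorID, DescendantID)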

Now for your query performance. This gets you part of the way, but not all of it. You mentioned a "top 10", which implies ranking over some set of facts you have not described; we would need those details to help tune that part. Also, the closure table only gets you the leaf-level sections, not the products. At the very least, you should have an index on CatalogueProduct ordered by SectionID/ProductID. I would force the section-to-product join to be a loop join, based on the cardinalities you have given. A report on a catalogue section would go to the closure table to get the descendants (using a clustered index seek). That descendant list would then be used to retrieve products from CatalogueProduct via the index, with a loop join. Then, with those products, you would join to the facts needed for the ranking.
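
Putting that together, the query would have roughly the shape below. SalesRank is a placeholder for whatever facts actually drive the ranking, and SectionClosure is the illustrative closure table from above:

DECLARE @SectionID INTEGER
SET @SectionID = 42  -- illustrative section

-- Top 10 products in a section and everything beneath it, via the closure table.
SELECT TOP 10 p.ProductID
FROM SectionClosure AS sc              -- clustered index seek on AncestorID
JOIN CatalogueProduct AS p
  ON p.SectionID = sc.DescendantID     -- supported by an index on (SectionID, ProductID)
WHERE sc.AncestorID = @SectionID
ORDER BY p.SalesRank                   -- hypothetical ranking column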

+6

You could solve the customer-group problem with roles and a TreeID, but you would need to show us the query.

0

Could ProductCount and SubSectionCount be calculated after the daily load?
If the data only changes once a day, then it is surely worth pre-calculating these numbers, even if some denormalization is required.
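
A sketch of that post-load calculation, using the original nested-sets extents to roll counts up the tree. CustomerGroupProduct (CustomerGroupID, ProductID) is a hypothetical table standing in for however per-group product visibility is actually stored:

-- Recompute rolled-up counts per section per customer group after the daily load.
-- CustomerGroupProduct is a hypothetical mapping of which products a group may see.
TRUNCATE TABLE CatalogueSectionCount;

INSERT INTO CatalogueSectionCount (SectionID, CustomerGroupID, SubSectionCount, ProductCount)
SELECT parent.SectionID,
       gp.CustomerGroupID,
       -- one possible definition: descendant sections that contain visible products
       COUNT(DISTINCT CASE WHEN child.SectionID <> parent.SectionID
                           THEN child.SectionID END),
       COUNT(DISTINCT p.ProductID)
FROM CatalogueSection AS parent
JOIN CatalogueSection AS child
  ON child.LeftExtent BETWEEN parent.LeftExtent AND parent.RightExtent
JOIN CatalogueProduct AS p
  ON p.SectionID = child.SectionID
JOIN CustomerGroupProduct AS gp
  ON gp.ProductID = p.ProductID
GROUP BY parent.SectionID, gp.CustomerGroupID;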

0
