I am working on the design of a hierarchical database structure that models a product catalogue. The platform is SQL Server 2005, and the catalogue is large (750,000 products, 8,500 catalogue sections across 4 levels) but relatively static (it is rebuilt once a day), so we only care about READ performance.
General catalogue hierarchy structure:

- Level 1 Section
  - Level 2 Section
    - Level 3 Section
      - Level 4 Section (products are attached at this level)
We use the Nested Sets pattern to store the hierarchy, and store the products that exist at a given level in a separate related table. A simplified version of the schema is:
```sql
CREATE TABLE CatalogueSection
(
    SectionID   INTEGER,
    ParentID    INTEGER,
    LeftExtent  INTEGER,
    RightExtent INTEGER
)

CREATE TABLE CatalogueProduct
(
    ProductID INTEGER,
    SectionID INTEGER
)
```
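For reference, the usual way to fetch everything under a section with this pattern is a range predicate on the left/right extents. This is only a sketch of that shape (`@SectionID` is a placeholder, not our exact query):

```sql
-- All products in a section and its descendants:
-- a child is inside the parent's [LeftExtent, RightExtent] range.
SELECT p.ProductID
FROM CatalogueSection AS parent
JOIN CatalogueSection AS child
    ON child.LeftExtent BETWEEN parent.LeftExtent AND parent.RightExtent
JOIN CatalogueProduct AS p
    ON p.SectionID = child.SectionID
WHERE parent.SectionID = @SectionID
```

Near the root of the tree the BETWEEN range covers almost every section, so this join touches most of the product table.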
A further complication is that we have around 1,000 separate customer groups, each of which may or may not be allowed to see every product in the catalogue. Because of this we need to maintain a separate "view" of the catalogue hierarchy for each customer group, so that when they browse the catalogue they see only their own products and do not see any sections that are empty.
To support this, we maintain a table of product counts at each level of the hierarchy, rolled up from the sections below. So although products are related directly to the lowest level, their counts propagate up the tree. The structure of this table is:
```sql
CREATE TABLE CatalogueSectionCount
(
    SectionID       INTEGER,
    CustomerGroupID INTEGER,
    SubSectionCount INTEGER,
    ProductCount    INTEGER
)
```
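Hiding empty sections for a group then becomes a filter on the rolled-up counts. Roughly like this (a sketch only; `@CustomerGroupID` is a placeholder and our real query differs):

```sql
-- Sections a customer group should see: those with products
-- directly attached, or with non-empty sub-sections beneath them
-- (the counts have already been rolled up the tree).
SELECT s.SectionID
FROM CatalogueSection AS s
JOIN CatalogueSectionCount AS c
    ON c.SectionID = s.SectionID
WHERE c.CustomerGroupID = @CustomerGroupID
  AND (c.ProductCount > 0 OR c.SubSectionCount > 0)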
So, to the problem: performance is very poor at the top levels of the hierarchy. A typical query showing the "top 10" products in a selected catalogue section (and all of its child sections) takes up to a minute to complete. It is faster for sections lower in the hierarchy, but still not good enough.
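The slow query is essentially of this shape (again a sketch; `@SectionID` and `@CustomerGroupID` are placeholders, and I have omitted the actual ranking column we order by):

```sql
-- "Top 10" products in a section and all of its descendants,
-- restricted to one customer group. At the top of the hierarchy
-- the range join fans out over nearly all 750,000 products
-- before TOP 10 is applied.
SELECT TOP 10 p.ProductID
FROM CatalogueSection AS parent
JOIN CatalogueSection AS child
    ON child.LeftExtent BETWEEN parent.LeftExtent AND parent.RightExtent
JOIN CatalogueProduct AS p
    ON p.SectionID = child.SectionID
WHERE parent.SectionID = @SectionID
ORDER BY p.ProductID  -- real query orders by a ranking column
```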
I have put indexes (including covering indexes where applicable) on all the key tables, and run the queries through Query Analyzer, the Index Tuning Wizard, etc., but I still cannot get it fast enough.
I am wondering whether the design is fundamentally flawed, or whether it is just because we have such a large data set. We have a reasonable development server (3.8 GHz Xeon, 4 GB RAM), but it just isn't happening :)
Thanks for any help
James