Exclude items selectively from the Sitecore Lucene search index - Works when rebuilding with IndexViewer, but not when using Sitecore built-in tools

On a site running Sitecore 6.2, I need to give the user the ability to selectively exclude items from the search results.

To do this, I added an "Include in search results" checkbox field to my items, and I created a custom database crawler that checks the field's value:

~\App_Config\Include\Search Indexes\Website.config:

    <search>
      <configuration type="Sitecore.Search.SearchConfiguration, Sitecore.Kernel" singleInstance="true">
        <indexes hint="list:AddIndex">
          <index id="website" singleInstance="true" type="Sitecore.Search.Index, Sitecore.Kernel">
            ...
            <locations hint="list:AddCrawler">
              <master type="MyProject.Lib.Search.Indexing.CustomCrawler, MyProject">
                ...
              </master>
              <!-- Similar entry for web database. -->
            </locations>
          </index>
        </indexes>
      </configuration>
    </search>

~\Lib\Search\Indexing\CustomCrawler.cs:

    using Lucene.Net.Documents;
    using Sitecore.Search.Crawlers;
    using Sitecore.Data.Items;

    namespace MyProject.Lib.Search.Indexing
    {
        public class CustomCrawler : DatabaseCrawler
        {
            /// <summary>
            /// Determines if the item should be included in the index.
            /// </summary>
            /// <param name="item"></param>
            /// <returns></returns>
            protected override bool IsMatch(Item item)
            {
                if (item["include in search results"] != "1")
                {
                    return false;
                }

                return base.IsMatch(item);
            }
        }
    }

Interestingly, if I rebuild the index using the Index Viewer application, everything behaves as expected: items whose "Include in search results" box is not checked are left out of the search index.

However, when I use the search index rebuilder in Sitecore's Control Panel application, or when the IndexingManager automatically updates the search index, all items are included regardless of whether their "Include in search results" box is checked.

I also set many breakpoints in my custom crawler class, and the application never hits any of them when I rebuild the search index using the built-in indexer. When I use the Index Viewer, it hits all the breakpoints that I set.

How do I make Sitecore's built-in indexing processes respect my "Include in search results" checkbox?

+7
3 answers

I spoke with Alex Shyba yesterday, and we were able to figure out what was going on. There were a couple of problems with my configuration that prevented everything from working:

  • As Seth pointed out, Sitecore has two different search APIs, and my configuration files were using both of them. To use the newer API, only the sitecore/search/configuration section needs to be set up. (In addition to what I posted in the question, I had also added indexes to sitecore/indexes and sitecore/databases/database/indexes, which was incorrect.)

  • Instead of overriding IsMatch(), I had to override AddItem(). Because of how Lucene works, you cannot update a document in place; instead, you must first remove it and then add the updated version.

    When Sitecore.Search.Crawlers.DatabaseCrawler.UpdateItem() is executed, it checks IsMatch() to see if the item should be deleted and re-added. If IsMatch() returns false, the item will not be removed from the index, even if it should not be there in the first place.

    By overriding AddItem(), I was able to tell the crawler whether the item should be added to the index after its existing documents had already been removed. Here's what the updated class looks like:

    ~\Lib\Search\Indexing\CustomCrawler.cs:

        using Sitecore.Data.Items;
        using Sitecore.Search;
        using Sitecore.Search.Crawlers;

        namespace MyProject.Lib.Search.Indexing
        {
            public class CustomCrawler : DatabaseCrawler
            {
                protected override void AddItem(Item item, IndexUpdateContext context)
                {
                    if (item["include in search results"] == "1")
                    {
                        base.AddItem(item, context);
                    }
                }
            }
        }
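
To see why filtering in IsMatch() can leave stale documents behind, here is a toy model of the update flow described above. It is only a sketch: ToyItem and ToyCrawler are hypothetical stand-ins that do not reference Sitecore or Lucene, and the model assumes the update logic is simply "if IsMatch() is true, delete and then re-add".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a content item; not a Sitecore type.
class ToyItem
{
    public string Id;
    public bool IncludeInSearch;
}

// Hypothetical stand-in for a crawler; the "index" is just a set of item IDs.
class ToyCrawler
{
    public readonly HashSet<string> Index = new HashSet<string>();
    private readonly bool filterInIsMatch;

    public ToyCrawler(bool filterInIsMatch)
    {
        this.filterInIsMatch = filterInIsMatch;
    }

    // Strategy A filters here; strategy B accepts every item.
    bool IsMatch(ToyItem item)
    {
        return !filterInIsMatch || item.IncludeInSearch;
    }

    // Strategy A adds unconditionally; strategy B filters here.
    void AddItem(ToyItem item)
    {
        if (filterInIsMatch || item.IncludeInSearch)
        {
            Index.Add(item.Id);
        }
    }

    // Assumed update flow: delete and re-add only when IsMatch() is true.
    public void UpdateItem(ToyItem item)
    {
        if (!IsMatch(item))
        {
            return; // the existing entry is never removed
        }

        Index.Remove(item.Id);
        AddItem(item);
    }
}

static class Demo
{
    static void Main()
    {
        foreach (bool filterInIsMatch in new[] { true, false })
        {
            var item = new ToyItem { Id = "home", IncludeInSearch = true };
            var crawler = new ToyCrawler(filterInIsMatch);

            crawler.UpdateItem(item);      // indexed while the box is checked
            item.IncludeInSearch = false;  // an editor unchecks the box
            crawler.UpdateItem(item);      // the next update runs

            Console.WriteLine("filter in IsMatch: {0}, still indexed: {1}",
                filterInIsMatch, crawler.Index.Contains(item.Id));
        }
    }
}
```

In this model, the crawler that filters in IsMatch() keeps the old entry after the box is unchecked, while the crawler that filters in AddItem() drops it, because the delete step always runs before the add step is skipped.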

Alex also noted that some of my scalability settings were incorrect. In particular:

  • The InstanceName setting was empty, which can cause problems on ephemeral (cloud) instances, where the machine name may change between runs. We changed this setting so that each instance has a constant, distinct value (for example, CMS and CD).

  • The Indexing.ServerSpecificProperties setting must be true so that each instance keeps its own record of when it last updated its search index.

  • The EnableEventQueues setting must be true to prevent race conditions between the search indexing and cache-clearing processes.

  • During development, set Indexing.UpdateInterval to a relatively small value (for example, 00:00:15). This is not well suited to production environments, but it cuts down on the waiting you have to do when troubleshooting search indexing problems.

  • Make sure the history engine is enabled for each web database, including remote publishing targets:

        <database id="production">
          <Engines.HistoryEngine.Storage>
            <obj type="Sitecore.Data.$(database).$(database)HistoryStorage, Sitecore.Kernel">
              <param connectionStringName="$(id)" />
              <EntryLifeTime>30.00:00:00</EntryLifeTime>
            </obj>
          </Engines.HistoryEngine.Storage>
          <Engines.HistoryEngine.SaveDotNetCallStack>false</Engines.HistoryEngine.SaveDotNetCallStack>
        </database>
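
For reference, the four settings named in the list above might look roughly like this inside the sitecore/settings section. This is a sketch, not a copy of my actual configuration: the surrounding elements are abbreviated, the values shown are the development-time examples mentioned above, and InstanceName must differ on each instance.

```xml
<sitecore>
  <settings>
    <!-- Constant, distinct name per instance (for example, CMS on the authoring server, CD on delivery). -->
    <setting name="InstanceName" value="CMS" />

    <!-- Each instance tracks its own "last index update" timestamp. -->
    <setting name="Indexing.ServerSpecificProperties" value="true" />

    <!-- Needed so indexing and cache handling are coordinated through the event queue. -->
    <setting name="EnableEventQueues" value="true" />

    <!-- Short interval for development only; use a larger value in production. -->
    <setting name="Indexing.UpdateInterval" value="00:00:15" />
  </settings>
</sitecore>
```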

Because the Sitecore backend is not accessible on CD instances, I also installed RebuildDatabaseCrawlers.aspx (from this article) so that I can rebuild their search indexes manually.

+4

I think I have figured out a halfway solution.

Here's an interesting snippet from Sitecore.Shell.Applications.Search.RebuildSearchIndex.RebuildSearchIndexForm.Builder.Build(), which is what the search index rebuilder in the Control Panel application calls:

    for (int i = 0; i < database.Indexes.Count; i++)
    {
        database.Indexes[i].Rebuild(database);
        ...
    }

database.Indexes contains a set of Sitecore.Data.Indexing.Index objects, which do not use the database crawler to rebuild the index!

In other words, the built-in search index rebuilder uses a completely different class to rebuild the search index, one that ignores the search configuration settings in web.config entirely.

To get around this, I changed the following files:

~\App_Config\Include\Search Indexes\Website.config:

    <indexes>
      <index id="website" ... type="MyProject.Lib.Search.Indexing.CustomIndex, MyProject">
        ...
      </index>
      ...
    </indexes>

~\Lib\Search\Indexing\CustomIndex.cs:

    using Sitecore.Data;
    using Sitecore.Data.Indexing;
    using Sitecore.Diagnostics;

    namespace MyProject.Lib.Search.Indexing
    {
        public class CustomIndex : Index
        {
            public CustomIndex(string name)
                : base(name)
            {
            }

            public override void Rebuild(Database database)
            {
                Sitecore.Search.Index index = Sitecore.Search.SearchManager.GetIndex(Name);
                if (index != null)
                {
                    index.Rebuild();
                }
            }
        }
    }

The only caveat to this approach is that it rebuilds the index for every database, not just the selected one (which, I assume, is why Sitecore has two completely different methods for rebuilding indexes).

+1

Sitecore 6.2 uses both the old and the newer search API, which I believe explains the differences in how the index gets built. CMS 6.5 (coming soon) uses only the new API (Sitecore.Search).

+1
