Web Crawler Engine Used by Kentico 10

Is there more information about the web crawler technology / engine used by Kentico 10, as described in Setting Up Page Crawler Indexes?

I'm asking because I would like to consider it for a project with a custom crawler that can sit outside Kentico and still be inherently compatible with the Kentico platform.

+7
web-crawler kentico
3 answers

As far as I can tell from the Kentico 10 source code, the crawler used by Kentico SmartSearch is entirely custom, proprietary code. It does not use a third-party library.

It loads the content of a page using System.Net.HttpWebRequest. The full content is returned to the SmartSearch index as a string. After that, the content goes through text extraction and is fed to Lucene for indexing.
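To make the fetch-then-extract pipeline described above concrete, here is a minimal sketch in Python. It is not Kentico's code: the names `fetch_html`, `extract_text`, and `TextExtractor` are hypothetical, and the extraction rule (drop tags, skip script and style content) is an assumption about what "text extraction" roughly means here. The resulting string is what would then be handed to the indexer.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def fetch_html(url, timeout=10):
    """Download a page and return its full content as a string."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_text(html):
    """Reduce an HTML string to the plain text an indexer would receive."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

A caller would run `extract_text(fetch_html(url))` once per page. The HTTP round trip for every page is the part that makes this approach slow compared with reading content straight from the database.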

It will not be easy to use Kentico SmartSearch for an external crawler. We usually stay away from the crawler because it is quite expensive to execute compared to the standard index, which retrieves data directly from the database.

Kentico supports running some scheduled tasks in the Windows service, but not search tasks.

Please note that Kentico SmartSearch does not actually crawl the site by following links. It uses the content tree to determine which content it should index. If you want to index other content, for example from a system you are integrating with, you need to implement a custom search service, as described here.

One approach that will work is to have the external process crawl whatever content you want to index and store the raw HTML in a repository. Then write a custom SmartSearch index that reads the data from that repository so Kentico can index it. If you are indexing Kentico-driven content, you can take this a step further by hooking into document events, which should let you crawl pages only when they are updated.
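The external-crawler idea above can be sketched as a small store that keeps the raw HTML per URL and reports whether a page actually changed, so that only changed pages need re-indexing. This is an illustration under stated assumptions, not a Kentico API: the SQLite table `crawled_pages` and the functions `open_store` and `store_page` are hypothetical names, and the content hash stands in for whatever change signal (such as document events) is available.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the hypothetical repository for crawled HTML."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crawled_pages ("
        "url TEXT PRIMARY KEY, "
        "content_hash TEXT NOT NULL, "
        "html TEXT NOT NULL)"
    )
    return conn


def store_page(conn, url, html):
    """Save raw HTML for a URL.

    Returns True when the page is new or its content changed, which is
    the signal that a custom index would need to re-index it.
    """
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM crawled_pages WHERE url = ?", (url,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return False  # unchanged since the last crawl, nothing to do
    conn.execute(
        "INSERT OR REPLACE INTO crawled_pages (url, content_hash, html) "
        "VALUES (?, ?, ?)",
        (url, digest, html),
    )
    conn.commit()
    return True
```

A custom index on the Kentico side would then read rows from this store instead of fetching pages itself, which keeps the expensive crawling work out of the indexing process.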

+1

Kentico uses Lucene.NET. It is a great solution for stand-alone projects. I have used it to power a custom web API hosted in Azure.

Mike

-1

Lucene is often paired with Nutch (http://nutch.apache.org/), an open source web crawler for indexing web content. Nutch is built on top of Lucene and started out as part of the Apache Lucene project.

-1