Web Crawler Engine Used by Kentico 10

Question


Is there more information about the web crawler technology / engine used by Kentico 10, as described in "Setting up page crawler indexes"?

I'm asking because I would like to consider it for a project with a custom crawler that sits outside Kentico while still being inherently compatible with the Kentico platform.


web-crawler kentico

John k Aug 31 '17 at 16:14


3 answers

Marnix van valen · Answer 1 · 2017-12-13T21:18:01+0000

As far as I can tell from the Kentico 10 source code, the crawler used by Kentico SmartSearch is fully custom. It does not use a third-party library.

It loads the content of a page using System.Net.HttpWebRequest. The full content is returned to the SmartSearch indexer as a string. After that, it goes through text extraction and the result is fed to Lucene for indexing.
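The fetch, extract, index flow described above can be sketched roughly as follows. This is a minimal illustration in Python, not Kentico's code: `fetch_page`, `extract_text` and `index_text` are hypothetical names, and a plain dictionary of term sets stands in for Lucene.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re
import urllib.request


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def fetch_page(url):
    """Download a page and return its full content as one string."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_text(html):
    """Reduce raw HTML to its visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def index_text(index, doc_id, text):
    """Toy stand-in for the indexing step: map each term to document ids."""
    for term in re.findall(r"\w+", text.lower()):
        index[term].add(doc_id)
```

Usage would look like `index = defaultdict(set)` followed by `index_text(index, url, extract_text(fetch_page(url)))`. Each page costs a full HTTP round trip plus HTML parsing, which illustrates why this kind of index is more expensive than one that reads fields straight from the database.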

It will not be easy to use Kentico SmartSearch for an external crawler. We usually stay away from the crawler because it is quite expensive to execute compared to the standard index, which retrieves data directly from the database.

Kentico supports some scheduled tasks in the Windows service , but not search tasks.

Please note that Kentico SmartSearch does not actually crawl the site by following links. It uses the content tree to find out which content it should index. If you want to index other content, for example from a system you are integrating with, you need to implement a custom search service, as described here.
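To make the difference from link-following concrete, here is a small sketch of tree-driven page discovery. The `PageNode` type and its `searchable` flag are hypothetical and only illustrate the idea; they are not Kentico's data model.

```python
from dataclasses import dataclass, field


@dataclass
class PageNode:
    """Hypothetical content tree node; not a Kentico type."""
    path: str
    children: list = field(default_factory=list)
    searchable: bool = True  # assumed per-page flag, for illustration only


def pages_to_index(root):
    """Walk the content tree depth-first and collect the paths to index.

    No HTML is parsed for links here: the set of pages comes entirely
    from the tree structure.
    """
    stack = [root]
    paths = []
    while stack:
        node = stack.pop()
        if node.searchable:
            paths.append(node.path)
        # Reverse so children are visited in their original order.
        stack.extend(reversed(node.children))
    return paths
```

A consequence of this approach is that pages not present in the tree, such as content served by another system, are never discovered, no matter how many links point to them.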

One thing that will work is to have an external process crawl whatever content you want to index and dump the raw HTML into a store. Then write a custom SmartSearch index that retrieves the data from that store so Kentico can index it. If you are indexing Kentico-driven content, you can take this to the next level by hooking into the document events. That should allow you to crawl pages only when they are updated.
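A minimal sketch of such an intermediate store, assuming SQLite as the storage and a content hash to detect changes. The table and function names are made up for illustration. The hash check is only a stand-in for the document events mentioned above; with real events, the crawler would be told which page changed instead of comparing content.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the raw HTML store used by crawler and indexer."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_pages ("
        "url TEXT PRIMARY KEY, "
        "html TEXT NOT NULL, "
        "content_hash TEXT NOT NULL, "
        "indexed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def save_page(conn, url, html):
    """Crawler side: store raw HTML. Returns True if it is new or changed."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM raw_pages WHERE url = ?", (url,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return False  # unchanged, nothing to re-index
    conn.execute(
        "INSERT OR REPLACE INTO raw_pages (url, html, content_hash, indexed) "
        "VALUES (?, ?, ?, 0)",
        (url, html, digest),
    )
    conn.commit()
    return True


def pending_pages(conn):
    """Indexer side: pages that still need to be (re)indexed."""
    return conn.execute(
        "SELECT url, html FROM raw_pages WHERE indexed = 0 ORDER BY url"
    ).fetchall()


def mark_indexed(conn, url):
    """Indexer side: record that a page has been picked up."""
    conn.execute("UPDATE raw_pages SET indexed = 1 WHERE url = ?", (url,))
    conn.commit()
```

The crawler only calls `save_page`, while the custom index only reads `pending_pages` and calls `mark_indexed`, so the two processes can run on different machines and schedules.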

Mike wills · Answer 2 · 2017-08-31T21:50:16+0000

Kentico uses Lucene.NET. This is a great solution for stand-alone projects. I used it to manage a custom web API hosted in Azure.

Mike

Chetan sharma · Answer 3 · 2017-09-01T04:33:28+0000

Apache Nutch (http://nutch.apache.org/) is an open-source web crawler that builds on Lucene to index web content. It is part of the wider stack that has grown around Lucene.
