Search Engine Analyzer Flowchart

Question


Do you know where I can find a design scheme for a search engine parser? I need to understand how it handles user input: which functions and algorithms are used, under what conditions, and so on.

It should not be Google.

Updated search parser question

+6

search-engine

forme Jan 9 '10 at 5:10


2 answers

"The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page: http://infolab.stanford.edu/~backrub/google.html

+2

Sri Jan 9 '10 at 5:16


Lothar · Accepted Answer · 2010-01-09T10:00:43+0000

First you need a better understanding of how search engines are built. Usually they consist of:

1) A web crawler that fetches the documents you want to add to the search space. This usually lies completely outside what you would call a "search engine."

2) An analyzer (parser) that takes a document and breaks it into indexable text fragments. It usually has to work with different file formats and human languages, and it pre-processes the text into some fixed records and stream text. Linguistic algorithms are also used here (for example stemmers; search for the Porter stemmer to find a simple one).

3) An indexer, which can be as simple as an inverted list of words per document, or as complex as you like if you try to be as smart as Google. Building the index is truly the magic part of a successful search engine. Usually several ranking algorithms are combined.

4) An interface with an optional query language. This is where Google is really bad, but as Google's success shows, it may not be that important for 98% of people. I really miss it, though.
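To make steps 2 and 3 concrete, here is a minimal Python sketch of an analyzer feeding an inverted index. All names (`crude_stem`, `tokenize`, `build_inverted_index`) are made up for illustration, and the suffix stripping is only a toy stand-in, not the Porter stemmer.

```python
import re
from collections import defaultdict

def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and stem each token."""
    return [crude_stem(tok) for tok in re.findall(r"[a-z0-9]+", text.lower())]

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical two-document collection.
docs = {
    1: "Crawlers fetch documents",
    2: "The indexer indexes parsed documents",
}
index = build_inverted_index(docs)
```

Looking up `index["document"]` then gives the ids of both documents. A real engine would also store positions and term frequencies so that ranking has something to work with.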

I think you are asking about (3), the indexer. Basically there are two different types of algorithms that you will find in the classic literature on information retrieval: the vector space model and Boolean search. The latter is easy: just check whether the search words occur in the document and return a boolean. Each search term can be assigned a relevance probability, and for several search terms you can use Bayesian probability to combine the matches and return the highest-ranked documents. The vector space model treats a document as a vector of all its words; you can compute the scalar product between documents to judge whether they are close to each other. This is a much more complex theory. The father of IR (information retrieval) was Gerard Salton; you will find a lot of literature under his name.
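As a rough illustration of the two classic models, here is a small Python sketch. The function names and the sample documents are hypothetical, the vectors are raw term counts without any weighting, and the scalar product is normalized by vector length (cosine similarity), which is one common choice rather than the only one.

```python
import math
from collections import Counter

def terms(text):
    """Very naive analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def boolean_and(query, docs):
    """Boolean search: ids of documents that contain every query term."""
    wanted = set(terms(query))
    return [doc_id for doc_id, text in docs.items() if wanted <= set(terms(text))]

def cosine(text_a, text_b):
    """Vector space model: normalized scalar product of term-count vectors."""
    a, b = Counter(terms(text_a)), Counter(terms(text_b))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Order document ids by descending similarity to the query."""
    return sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)

# Hypothetical three-document collection.
docs = {
    1: "usenet news search engine",
    2: "web search engine ranking",
    3: "cooking recipes",
}
matches = boolean_and("search engine", docs)
best = rank("usenet news", docs)[0]
```

Boolean search only answers yes or no per document, while the vector model gives a score you can sort by. The probabilistic combination of per-term scores mentioned above is not shown here.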

This was the state of the art until 1999 (I wrote a thesis on a Usenet news search engine in 1998). Then Google came along and the whole theory went into the trash bin of academic stupidity and mischievous mistakes.

Google was not based on classic IR theory. Read the link that Srirangan gave you. It is simply an ad hoc relevance function that combines many sources. You will not find anything in this area beyond blah-blah white paper advertising. These algorithms are the business secrets and the capital of search engine companies.

For simple search engines, look at the Lucene library or dtSearch, which has always been my choice for an embedded search engine library.

In the open source world, there are not many code samples or much available information on IR technology. Most projects, like Lucene, simply implement the most primitive operations. You have to buy books and go to a university library to get access to the scientific literature.

As a reference, I would recommend starting with this book (cover image: http://ecx.images-amazon.com/images/I/41HKJYHTQDL._BO2,204,203,200_PIsitb-sticker-arrow-click,TopRight,35,-76_AA240_SH20_OU01_.jpg).
