Search Engine Recommendations

This is more a question of theory than of practice. I am working on a project that is a fairly simple directory of links. The model is similar to the Dmoz or Yahoo directory, except that each entry has some additional attributes.

I have a hierarchical taxonomy working across all entries through a many-to-many relationship. All entries are sorted into these categories, and everything works fine. But what use is a directory without a search option?

Here is a little more detail about my models. Each entry has a name, description, URL and several social profiles: YouTube, Twitter, Flickr and a couple of others. Each entry can have a logo attached to it and a hidden tag field. In addition, the title and description are stored in three different languages. So basically I would like the search results to be:

  • Relevant (including taxonomy)
  • Perhaps with logos
  • Perhaps with 100% completed profiles

I tried Sphinx and am currently working with Lucene, but it seems I have not grasped the theory of search correctly. I hope it makes sense that complete entries should rank higher than the others, but I cannot figure out the numbers. I would not want entries to appear at the top just because of an incidental word match somewhere in the full description, since a match in the title is much more relevant.

So my question is: are there any books, methods, or even other search engines (if Sphinx and Lucene aren't good enough) that you would recommend for this problem? I would like not only to have full control over the search results and their ranking, but also to give my visitors correct and relevant information.

Links to interesting articles are also appreciated!

And no, I'm not trying to rebuild Google :)

Thanks:)

+7
search search-engine full-text-search lucene sphinx
4 answers

I am pretty sure that Lucene is enough. We solved a similar problem with it, and it worked well. Here are some tips, looking back at my own Lucene.Net project.

Taxonomy:

  • Each category is represented by an integer key in the database, so each document has several instances of a numeric "CATEGORY" field. For example, the document [1, 2, 5, 10, 'Wheel'] means that the wheel belongs to each of the categories 1, 2, 5, and 10.
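
The multi-valued category field can be pictured with a minimal Python sketch. It is not Lucene code; the document ids, names, and the `build_category_index` helper are hypothetical and only illustrate how one entry can be found under every category it is filed in.

```python
from collections import defaultdict

# Hypothetical entries: each one carries every category key it belongs to,
# mirroring the repeated "CATEGORY" field described above.
documents = {
    101: {"name": "Wheel", "category": [1, 2, 5, 10]},
    102: {"name": "Tyre", "category": [2, 7]},
}

def build_category_index(docs):
    """Map each category key to the set of document ids filed under it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for category_id in doc["category"]:
            index[category_id].add(doc_id)
    return index

category_index = build_category_index(documents)

# Restricting a search to category 2 is then a plain set lookup.
in_category_2 = sorted(category_index[2])
```

A real index does the same thing internally: every value of a repeated field points back to the same document.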

Non-searchable fields (logos, social profiles):

  • Of course, you can keep values that are not searched in stored, non-indexed Lucene fields. But we kept all the product information in the database, so that we would not have to rebuild the Lucene index when it changes. Thus Lucene holds only the product identifier and indexed, but not stored, values for the key fields.

Three languages and several fields:

  • We have only 2 languages. The product names in the different languages can be stored in one Lucene document that refers to a single product identifier (as mentioned above, the ID refers to the database). This lets you find a product even if the user's query mixes languages.
  • Obviously, the title, tags, and description should carry different weights in the search result. Lucene handles this through per-field boosts.
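
As a rough illustration of per-field weights, here is a toy scorer in Python. It is not Lucene's scoring formula; the weights in `FIELD_WEIGHTS` and the sample entries are assumptions chosen only to show why a title match should outrank a description match.

```python
# Hypothetical per-field weights; the numbers are illustrative only.
FIELD_WEIGHTS = {"title": 10.0, "tags": 5.0, "description": 1.0}

def weighted_score(query, doc, weights=FIELD_WEIGHTS):
    """Sum, over all fields, the field weight times the number of query terms found there."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in weights.items():
        tokens = doc.get(field, "").lower().split()
        score += weight * sum(1 for term in terms if term in tokens)
    return score

title_hit = {"title": "Alloy wheel shop", "description": "Parts and accessories"}
body_hit = {"title": "Car blog", "description": "A post that mentions a wheel once"}

title_score = weighted_score("wheel", title_hit)  # match in the title
body_score = weighted_score("wheel", body_hit)    # match only in the description
```

With these weights the entry matching in the title scores ten times higher than the one matching only in the description.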
+4

Great Book: Lucene in Action (2nd Edition)

When we started with Lucene we had the first edition, and it really covers everything you need, step by step. Highly recommended. The second edition is updated for the latest and greatest version (3.x).

The TF-IDF algorithm works very well on (large) texts, but with a record-like structure such as yours it can have unpleasant side effects: documents with few terms are considered more "relevant" than documents with many terms. With Lucene you can get it to work, but you have to get your hands dirty.

What you basically need to do is boost the title field, so that a match there counts for more. You can also change the scoring mechanism to assign higher scores to documents that contain more information.
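
The length effect and the two remedies can be sketched in a few lines of Python. The `field_score` function follows the shape of Lucene's classic TF-IDF similarity (square root of the term frequency times one over the square root of the field length) with the idf factor left out; `boosted_score` and its completeness argument are hypothetical and only show one possible way to favour fuller entries.

```python
import math

def length_norm(num_terms):
    """Classic length normalisation: shorter fields get a larger factor."""
    return 1.0 / math.sqrt(num_terms)

def field_score(term_freq, num_terms):
    """tf * lengthNorm for a single term; idf is omitted for brevity."""
    return math.sqrt(term_freq) * length_norm(num_terms)

def boosted_score(text_score, completeness):
    """Hypothetical boost: completeness is a fraction between 0.0 and 1.0."""
    return text_score * (1.0 + completeness)

short_entry = field_score(term_freq=1, num_terms=4)    # sparse 4-word description
long_entry = field_score(term_freq=2, num_terms=100)   # full 100-word description

# The sparse entry wins on text score alone, which is the side effect described above.
sparse_wins = short_entry > long_entry

# A completeness boost on the long entry is one way to push it back up.
long_entry_boosted = boosted_score(long_entry, completeness=1.0)
```

How strong the boost should be is a tuning question; the multiplier here is only a placeholder.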

Enjoy. If you can't figure it out, there is great support on the Lucene mailing list.

+5

I will try to add to the fine answers from Matthijs, Dewfy, and Carusell. Basically, you are trying to improve the relevance of your search. I suggest you read Grant Ingersoll's "Debugging Search Relevance Search" and his Optimizing Search Capabilities in Lucene and Solr, as well as his Practical Relevance slides.

For different languages and for faceting, I suggest you use Solr. It is a search server built on top of Lucene and is easy to use. It can support multiple languages by using a separate Solr core for each language.

+2

Lucene or Solr would do the job. Solr is built on top of Lucene; see here for details.

I would go with Solr. It is quick and easy to download and configure. Start with the tutorial and my collection of links. Relevance should be good out of the box with Solr, and it is easy to tune.

Take a look at the answers from Dewfy and Matthijs Bierman, which make some good points.

Then choose the dismax request handler, which lets you favor documents with specific properties.

E.g. for the percentage of profile completeness, define a separate field "profile_completeness" and add it to the bf (boost function) parameter of the dismax handler: the more complete the profile, the more those documents are boosted.

I already mentioned that you can easily adjust relevance. For example, you can weight the searched fields with qf, e.g. qf=title^10 tags^5, and keep the completeness boost in bf, e.g. bf=profile_completeness^1

"Perhaps with logos" can be solved with a boost query: bq=logo:[* TO *]^1 . Here logo:[* TO *] matches only documents that have a value in the logo field, so those documents are boosted.
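
Put together, such a request might be assembled as in the Python sketch below. The host, port, and path in `SOLR_SELECT_URL` and the "description" field name are assumptions about a local setup, and the sketch assumes that per-field weights go in the dismax qf parameter while bf takes the function to add to the score. It only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr select handler; adjust to the real core.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def build_dismax_params(user_query):
    """Collect the dismax parameters discussed above for one user query."""
    return {
        "defType": "dismax",
        "q": user_query,
        # Per-field weights: title matches count most, then tags.
        "qf": "title^10 tags^5 description",
        # Boost function: more complete profiles get a higher score.
        "bf": "profile_completeness^1",
        # Boost query: entries that have a logo get an extra push.
        "bq": "logo:[* TO *]^1",
    }

url = SOLR_SELECT_URL + "?" + urlencode(build_dismax_params("wheel"))
```

Keeping the parameters in one function makes it easy to experiment with different weights while comparing result lists.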

To display a deeply nested category tree, you will need to build the tree in memory and feed it to Solr with a custom import. We have a working application that does this, and you can use our approach.
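
One way to build such a tree in memory is sketched below in Python. This is not the application mentioned above; the row layout (id, parent id, name) and the helper names are hypothetical. The idea is to resolve each category to its full chain of ancestors, so that an entry can be indexed under every level of the tree.

```python
from collections import defaultdict

# Hypothetical flat rows as they might come from the database: (id, parent_id, name).
rows = [(1, None, "Vehicles"), (2, 1, "Cars"), (5, 2, "Parts"), (10, 5, "Wheels")]

def build_tree(rows):
    """Return name, parent, and children lookups for the category tree."""
    names, parents, children = {}, {}, defaultdict(list)
    for cat_id, parent_id, name in rows:
        names[cat_id] = name
        parents[cat_id] = parent_id
        if parent_id is not None:
            children[parent_id].append(cat_id)
    return names, parents, children

def ancestors_and_self(cat_id, parents):
    """Walk up from a category to the root and return the chain root-first."""
    chain = []
    while cat_id is not None:
        chain.append(cat_id)
        cat_id = parents[cat_id]
    return chain[::-1]

names, parents, children = build_tree(rows)
wheel_categories = ancestors_and_self(10, parents)
wheel_path = "/".join(names[c] for c in wheel_categories)
```

The resulting list of ids is what would go into a multi-valued category field, and the path string can be used for display.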

If you need more help, feel free to comment.

+1