How to create a related questions engine?

One of our larger sites has a section where users can send questions to the website owner; the staff evaluate each one personally. When the same question comes up very often, they can add it to the FAQ.

To keep them from getting dozens of similar questions per day, we would like to provide a feature similar to the “Related Questions” on this site (Stack Overflow).

What are the ways to build this kind of feature? I know I need to somehow evaluate the question and compare it with the questions in the FAQ, but how does that comparison work? Are keywords extracted, and if so, how?

It may be worth mentioning that the site is built on the LAMP stack, so those are the technologies available to us.

Thanks!

+6
php mysql recommendation-engine lamp
5 answers

I don't know how Stack Overflow does it, but I would guess it uses tags to find related questions. For example, for this question, the closest related questions would share the recommendation-engine tag. I would guess that matches on rarer tags count for more than matches on common tags.
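The tag idea above can be sketched roughly as follows. This is only an illustration, not how Stack Overflow actually does it: the function name `related_by_tags`, the dict layout, and the logarithmic weight are all assumptions made for the example.

```python
import math
from collections import Counter

def related_by_tags(query_tags, questions, top_n=5):
    """Rank questions by shared tags, weighting rare tags more heavily.

    questions: dict mapping a question id to its set of tags
    (a hypothetical layout chosen for this sketch).
    """
    total = len(questions)
    # Count in how many questions each tag appears.
    tag_counts = Counter(tag for tags in questions.values() for tag in tags)

    scores = {}
    for qid, tags in questions.items():
        shared = set(query_tags) & set(tags)
        # A tag used by few questions gets a large weight: log(total / count).
        score = sum(math.log(total / tag_counts[t]) for t in shared)
        if score > 0:
            scores[qid] = score
    # Highest score first.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A tag that appears on every question gets a weight of zero here, so sharing only that tag does not make two questions related.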

You could also look into term frequency-inverse document frequency (TF-IDF).

+3

If you want to build something like this from scratch, you should use something called TF-IDF: term frequency / inverse document frequency. To simplify, this means you find the words in the query that are uncommon in the corpus as a whole, and then find the documents that contain those words.

In other words, if someone enters the query “I want to buy an elephant”, then of the words in the query, “elephant” is probably the least common word in your corpus, and “buy” is probably next. So you rank the documents (in your case, previous questions) by how much they contain the word “elephant”, and then by how much they contain the word “buy”. The words “I”, “to” and “an” are probably on a stop list, so you ignore them completely. You score each document by how many matching words it contains, weighted by inverse document frequency (that is, a high weight for uncommon words), and show the top ones.
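A minimal sketch of that scoring, using the elephant example. The stop list, the tokenizer, the `+ 1.0` smoothing on the IDF, and the names `tokenize` and `rank_related` are assumptions for illustration; a real implementation would also want stemming and a much longer stop list.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"i", "to", "an", "a", "the", "do", "how", "can", "is", "my"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]

def rank_related(query, documents, top_n=3):
    """Return indexes of the documents that best match the query, best first."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    df = Counter(w for tokens in doc_tokens for w in set(tokens))
    query_words = set(tokenize(query))

    scored = []
    for i, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for w in query_words:
            if w in tf:
                # Rare words (low df) get a high inverse document frequency.
                idf = math.log(n_docs / df[w]) + 1.0
                score += tf[w] * idf
        if score > 0:
            scored.append((score, i))
    # Sort by descending score, breaking ties by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_n]]
```

With previous questions such as "Where can I buy an elephant?" and "How do I buy a ticket?", the first ranks higher for the query above because it matches both "buy" and the rarer "elephant".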

I have oversimplified, and you will need to read up on it to get the details right, but it really is not that hard to implement in a simple way. The Wikipedia page can be a good place to start:

http://en.wikipedia.org/wiki/Tf%E2%80%93idf

+4

Given that you are working on the LAMP stack, you should be able to make good use of MySQL's full-text search features. I believe they work on TF-IDF principles, and it should be fairly easy to build the “related questions” feature you want with them.
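As a rough sketch, a natural-language full-text query could look like the one built below. The table `faq` and its columns `id` and `question` are hypothetical, and the query assumes a FULLTEXT index exists on `question`. The helper `build_related_query` is written in Python only to keep the example self-contained; the SQL itself is what matters and can be issued from PHP just as well.

```python
def build_related_query(limit=5):
    """Return a parameterized SQL statement for MySQL full-text matching.

    Assumes a hypothetical table faq(id, question) with a FULLTEXT index
    on the question column.
    """
    limit = int(limit)  # Coerce to int so it is safe to inline in the SQL.
    return (
        "SELECT id, question, "
        "MATCH(question) AGAINST (%s IN NATURAL LANGUAGE MODE) AS relevance "
        "FROM faq "
        "WHERE MATCH(question) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        "ORDER BY relevance DESC "
        f"LIMIT {limit}"
    )
```

The two `%s` placeholders both receive the text of the user's new question, passed as bound parameters by the database driver rather than concatenated into the string.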

+1

There's an excellent O'Reilly book - Programming Collective Intelligence - which covers discovering groups, recommendations, and other similar topics. The examples are in Python, but I found them easy to follow coming from a PHP background, and within a few hours I had built something similar to what you need.

Yahoo has a term extraction web service at http://developer.yahoo.com/search/content/V1/termExtraction.html

+1

You could use a spell-checking approach, where the corpus is the titles / text of the existing FAQ entries:

How do you implement "did you mean"?
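One simple way to approximate this, as a sketch only, is fuzzy string matching of the new question against the FAQ titles. The example uses `difflib.get_close_matches` from the Python standard library; the function name `suggest_faq` and the cutoff value are assumptions, and this is a swapped-in stand-in for a proper "did you mean" implementation rather than the method described in the linked question.

```python
import difflib

def suggest_faq(question, faq_titles, max_results=3, cutoff=0.5):
    """Return FAQ titles whose text is close to the submitted question."""
    # Compare case-insensitively, but return the titles as originally written.
    lowered = {title.lower(): title for title in faq_titles}
    matches = difflib.get_close_matches(
        question.lower(), list(lowered), n=max_results, cutoff=cutoff
    )
    # get_close_matches returns the most similar candidates first.
    return [lowered[m] for m in matches]
```

This tolerates typos well but compares whole strings, so it works best when the new question is phrased much like an existing FAQ title.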

0