Algorithm for finding articles with similar text

I have many articles in a database (with title and text), and I am looking for an algorithm that finds the X most similar articles to a given one, something like Stack Overflow's "Related Questions" list when you ask a question.

I tried searching for this, but only found pages about other "similar text" problems, mostly along the lines of comparing every article with all the others and storing the similarities somewhere. SO does this in "real time" on the text I have just typed.

How?

+55
string language-agnostic similarity
Oct 29 '08 at 14:16
14 answers

Edit distance is not a likely candidate, as it is spelling- and word-order-dependent, and much more computationally expensive than Will is leading you to believe, considering the size and number of documents you would actually be interested in searching.

Something like Lucene is the way to go. You index all your documents, and then when you want to find documents similar to a particular document, you turn that document into a query and search the index. Internally Lucene will use tf-idf and an inverted index to make the whole process take an amount of time proportional to the number of documents that could possibly match, not the total number of documents in the collection.
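To make the tf-idf part concrete, here is a minimal, naive sketch in JavaScript (the helper names are my own, not Lucene's API, and unlike a real engine it scores every document instead of walking an inverted index):

function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

function termFrequencies(tokens) {
  const tf = {};
  for (const t of tokens) tf[t] = (tf[t] || 0) + 1;
  return tf;
}

function buildIndex(docs) {
  // docs: array of article strings; returns per-document term counts plus document frequencies
  const tfs = docs.map(d => termFrequencies(tokenize(d)));
  const df = {};
  for (const tf of tfs) for (const term in tf) df[term] = (df[term] || 0) + 1;
  return { tfs, df, n: docs.length };
}

function tfidfVector(tf, df, n) {
  // weight = term frequency * inverse document frequency (rare terms count more)
  const vec = {};
  for (const term in tf) vec[term] = tf[term] * Math.log((1 + n) / (1 + (df[term] || 0)));
  return vec;
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (const t in a) { na += a[t] * a[t]; if (t in b) dot += a[t] * b[t]; }
  for (const t in b) nb += b[t] * b[t];
  return dot === 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function mostSimilar(queryText, docs, topX) {
  const { tfs, df, n } = buildIndex(docs);
  const query = tfidfVector(termFrequencies(tokenize(queryText)), df, n);
  return tfs
    .map((tf, i) => ({ index: i, score: cosine(query, tfidfVector(tf, df, n)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topX);
}

Under this scheme, mostSimilar(newArticleText, allArticleTexts, 5) returns the indices and scores of the five closest articles; Lucene performs the equivalent ranking far more efficiently.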

+32
Oct 30 '08 at 23:36

It depends on your definition of similar.

The edit-distance algorithm is the standard algorithm for (Latin-language) dictionary suggestions, and can work on whole texts. Two texts are similar if they have basically the same words (letters, eh) in the same order. So the following two book reviews would be fairly similar:

1) "This is a great book"

2) "These are not great books"

(The number of letters you have to insert, delete or change to turn (2) into (1) is termed the "edit distance".)

To implement this you would want to visit every review programmatically. This may not be as costly as it sounds, and if it is too costly you could do the comparisons as a background task and store the n most similar reviews in a database field itself.
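For reference, here is a minimal sketch of the standard Levenshtein (edit distance) dynamic program - the function name is mine. It runs in O(len(a) * len(b)) time, which is exactly why doing it as a background task is advisable for whole articles:

function editDistance(a, b) {
  const m = a.length, n = b.length;
  let prev = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const cur = [i];
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1,         // delete a character
                        cur[j - 1] + 1,      // insert a character
                        prev[j - 1] + cost); // change a character (or keep it)
    }
    prev = cur;
  }
  return prev[n];
}

editDistance("This is a great book", "This is not a great book"); // 4 (insert "not ")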

Another approach is to understand something of the structure of (Latin) languages. If you strip out short (non-capitalised or quoted) words and assign weights to words (or prefixes) that are common or unique, you can do a Bayesian-style comparison. The following two book reviews could be matched and found to be similar:

3) "The French Revolution was blah blah blah blah blah blah blah, France." → France / French (2) Revolution (1) War (1) Peace (1) (note that the dictionary is used to unite France and France)

4) "This book is a blah blah aa revolution in French cuisine." → France (1) Revolution (1)

To implement this, you would want to identify the "keywords" in a review as it is created/updated, and to find similar reviews use those keywords in the where clause of a query (ideally "full-text" searching if the database supports it), perhaps with post-processing of the result set to score the candidates found.
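A rough sketch of that keyword idea - the tiny stop-word list and the "longer than three letters" filter stand in for the dictionary/stemming step a real implementation would want:

const STOP_WORDS = new Set(["a", "an", "and", "the", "is", "was", "this", "in", "of"]);

function extractKeywords(text) {
  // Run this when the review is created/updated and store the result with the review.
  const counts = {};
  for (const w of text.toLowerCase().match(/[a-z]+/g) || []) {
    if (w.length > 3 && !STOP_WORDS.has(w)) counts[w] = (counts[w] || 0) + 1;
  }
  return counts; // word -> count map for the review
}

function keywordOverlapScore(a, b) {
  // The where clause (or full-text search) narrows the candidate set;
  // this scoring runs as the post-processing step over the results.
  let score = 0;
  for (const w in a) if (w in b) score += Math.min(a[w], b[w]);
  return score;
}

keywordOverlapScore(extractKeywords(review3), extractKeywords(review4)) would then rank review (4) by the weight of the keywords it shares with review (3).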

Books also have categories - are thrillers set in France similar to historical studies of France, and so on? Metadata beyond the title and text can be useful for keeping results relevant.

+14
Oct 29 '08 at 14:57

Tutorial at this link

+9
Oct 29 '08 at 14:21

I suggest indexing your articles using Apache Lucene, a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Once indexed, you can easily find related articles.

+3
Oct 29 '08 at 14:21

One common algorithm used is the Self-Organizing Map. It is a type of neural network that will automatically classify your articles. Then you can simply find the location that a current article maps to, and all the articles near it are related. The important part of the algorithm is how you vector-quantize your data. There are several ways to do this with text: you can hash your document/title, you can count words and use that as an n-dimensional vector, and so on. Hope this helps, although I may have opened a Pandora's box of endless journeys into AI.
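As a tiny illustration of the word-count option, here is a sketch that turns an article into an n-dimensional vector over a fixed (purely illustrative) vocabulary; vectors like these are what the SOM would then cluster:

function toVector(text, vocabulary) {
  // Count how often each vocabulary term occurs in the text.
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return vocabulary.map(term => words.filter(w => w === term).length);
}

const vocabulary = ["france", "revolution", "war", "peace", "cuisine"];
toVector("The French Revolution was war and peace in France", vocabulary);
// -> [1, 1, 1, 1, 0]  ("french" would need stemming before it counts towards "france")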

+2
Oct 29 '08 at 15:00

SO does the comparison only on the title, not on the body text of the question, so only on rather short strings.

You could use their algorithm (I have no idea what it looks like) on the article title and keywords, and, if you have more CPU time to burn, also on the abstracts of your articles.

+1
Oct 29 '08 at 14:22

Seconding the Lucene suggestion for full-text search, but note that Java is not a requirement; a .NET port is available. Also see the main Lucene page for links to other projects, including Lucy, a C port.

+1
Oct 29 '08 at 14:31

Perhaps what you are looking for is something that does paraphrasing. I only have superficial knowledge of this, but paraphrasing is a natural-language-processing concept for determining whether two pieces of text actually mean the same thing, even though they may use entirely different words.

Unfortunately I do not know of any tools that let you do this (although I would be interested in finding one).

+1
Oct 29 '08 at 14:33

You could use a SQL Server full-text index to get smart comparisons; I believe SO uses an ajax call that executes a query to return the similar questions.

What technologies do you use?

0
Oct 29 '08 at 14:19

If you are looking for words that sound alike, you could convert them to soundex and then match the soundex codes... worked for me.
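A simplified Soundex sketch - it skips the official algorithm's h/w edge cases, so treat it as illustrative only:

// Consonants map to digits; vowels plus h, w and y are dropped.
const SOUNDEX_CODES = {
  b: "1", f: "1", p: "1", v: "1",
  c: "2", g: "2", j: "2", k: "2", q: "2", s: "2", x: "2", z: "2",
  d: "3", t: "3",
  l: "4",
  m: "5", n: "5",
  r: "6"
};

function soundex(word) {
  const letters = word.toLowerCase().replace(/[^a-z]/g, "");
  if (!letters) return "";
  let code = letters[0].toUpperCase();
  let prev = SOUNDEX_CODES[letters[0]] || "";
  for (const ch of letters.slice(1)) {
    const digit = SOUNDEX_CODES[ch] || "";
    if (digit && digit !== prev) code += digit;   // skip repeats of the same code
    prev = digit;
  }
  return (code + "000").slice(0, 4);              // pad/truncate to letter + 3 digits
}

soundex("Robert"); // "R163"
soundex("Rupert"); // "R163" - same code, so the two names "sound alike"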

0
Oct 29 '08 at 14:50

I tried a few methods, but none works really well. One can get relatively good results like this: First: get a SimHash code (Google's SimHash) for every paragraph of all the text and store the codes in the database. Second: build an index on the SimHash codes. Third: process the text to be compared in the same way, get its SimHash code and search all the texts via the SimHash index, keeping those within a Hamming distance of, say, 5-10. Then compare the candidates' similarity with a term vector. This may work for big data.
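A simplified 32-bit sketch of the SimHash plus Hamming distance part (production systems typically use 64-bit fingerprints and stronger hash functions than this one):

function hash32(str) {
  // FNV-1a style string hash - good enough for a demo
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

function simhash(text) {
  // For every bit position, vote +1/-1 per word; the sign of the total gives the fingerprint bit.
  const counts = new Array(32).fill(0);
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) || []) {
    const h = hash32(word);
    for (let bit = 0; bit < 32; bit++) counts[bit] += ((h >>> bit) & 1) ? 1 : -1;
  }
  let fingerprint = 0;
  for (let bit = 0; bit < 32; bit++) if (counts[bit] > 0) fingerprint |= 1 << bit;
  return fingerprint >>> 0;
}

function hammingDistance(a, b) {
  let x = (a ^ b) >>> 0, d = 0;
  while (x) { d += x & 1; x >>>= 1; }
  return d;
}

// Paragraphs whose fingerprints differ by only a few bits are candidates for
// the more expensive term-vector comparison described above.
hammingDistance(simhash("the french revolution"), simhash("the french revolutions"));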

0
Jul 22 '13 at 6:47
0
Mar 11 '16 at 6:29

The link in @alex77's answer points to the Sorensen-Dice Coefficient, which was independently discovered by the author of that article - the article is very well written and worth reading.

I have used this coefficient for my own needs. However, the original coefficient can give erroneous results when dealing with

  • pairs of three-letter words that contain one spelling error, e.g. [and, amd], and
  • pairs of three-letter words that are anagrams, e.g. [and, dan]

In the first case, Dice erroneously reports a coefficient of zero, while in the second case the coefficient comes out at 0.5, which is misleadingly high.

An improvement has been proposed which essentially consists of taking the first and last characters of the word and creating an additional bigram.

In my view the improvement is really only required for three-letter words - for longer words the other bigrams have a buffering effect that covers up the problem. My code that implements this improvement is given below.

function wordPairCount(word) {
  // Letter bigrams for a single word; 3-letter words get an extra
  // first+last-character bigram (the improvement described above).
  var i, rslt = [], len = word.length - 1;
  for (i = 0; i < len; i++) rslt.push(word.substr(i, 2));
  if (2 == len) rslt.push(word[0] + word[len]);
  return rslt;
}

function pairCount(arr) {
  // Bigrams for a whole phrase: lower-case, split on spaces, concatenate per-word bigrams.
  var i, rslt = [];
  arr = arr.toLowerCase().split(' ');
  for (i = 0; i < arr.length; i++) rslt = rslt.concat(wordPairCount(arr[i]));
  return rslt;
}

function commonCount(a, b) {
  // Number of bigrams the two lists share.
  var t;
  if (b.length > a.length) t = b, b = a, a = t;
  return a.filter(function (e) { return b.indexOf(e) > -1; }).length;
}

function myDice(a, b) {
  // Dice coefficient: 2 * |intersection| / (|A| + |B|)
  var aPairs = pairCount(a), bPairs = pairCount(b);
  return 2 * commonCount(aPairs, bPairs) / (aPairs.length + bPairs.length);
}

$('#rslt1').text(myDice('WEB Applications', 'PHP Web Application'));
$('#rslt2').text(myDice('And', 'Dan'));
$('#rslt3').text(myDice('and', 'aMd'));
$('#rslt4').text(myDice('abracadabra', 'abracabadra'));

* { font-family: arial; }
table { width: 80%; margin: auto; border: 1px solid silver; }
thead > tr > td { font-weight: bold; text-align: center; background-color: aqua; }

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min.js"></script>
<table>
  <thead>
    <tr>
      <td>Phrase 1</td>
      <td>Phrase 2</td>
      <td>Dice</td>
    </tr>
  </thead>
  <tbody>
    <tr><td>WEB Applications</td><td>PHP Web Application</td><td id='rslt1'></td></tr>
    <tr><td>And</td><td>Dan</td><td id='rslt2'></td></tr>
    <tr><td>and</td><td>aMd</td><td id='rslt3'></td></tr>
    <tr><td>abracadabra</td><td>abracabadra</td><td id='rslt4'></td></tr>
  </tbody>
</table>

Note the deliberate typo in the last example: abraca dabra vs abraca badra. Even though no bigram correction is applied here, the coefficient reported is 0.9. With the correction it would have been 0.91.

Hope this helps others who come across this thread.

0
Feb 27 '17 at 9:44

The simplest and fastest way to compare similarity between abstracts is probably to use the concept of sets. First convert the abstract texts into sets of words. Then check how much each pair of sets overlaps. Python's set type is very efficient for this task. You would be surprised to see how well this method compares with the "similar/related documents" options provided by GScholar, ADS, WOS or Scopus.
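A sketch of that set-overlap idea, written in JavaScript for consistency with the other snippets in this thread (the answer above refers to Python's set type, but any set implementation works the same way); the Jaccard index used here is one common way to turn the overlap into a score:

function wordSet(text) {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) || []);
}

function jaccard(textA, textB) {
  // |intersection| / |union| of the two word sets
  const a = wordSet(textA), b = wordSet(textB);
  let shared = 0;
  for (const w of a) if (b.has(w)) shared++;
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

jaccard("The French Revolution was a revolution in France",
        "This book is a revolution in French cuisine"); // ~0.36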

-1
Apr 18 '15 at 20:04


