Percent similarity analysis (Java)

I have the following situation:

String a = "A web crawler is a computer program that automatically scans the Internet World Wide Web"; String b = "The web scanner computer program is browsing the World Wide Web";

Is there any idea or standard algorithm for calculating percent similarity?

For example, in the above case the similarity, estimated by eye, should be 90%+.

My idea is to tokenize both strings and count the tokens they agree on. Something like (7 matching tokens / 10 total tokens) * 100. But, of course, this method is generally ineffective. Comparing the number of matching characters also seems ineffective...
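
For concreteness, here is a minimal sketch of that naive token-overlap baseline (lower-cased whole-word matching against the union of tokens; the class name is just for illustration):

    import java.util.*;

    public class NaiveOverlap {
        // Percentage of distinct lower-cased words shared by the two strings.
        static double similarity(String a, String b) {
            Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
            Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
            Set<String> shared = new HashSet<>(ta);
            shared.retainAll(tb);                  // tokens both strings agree on
            Set<String> union = new HashSet<>(ta);
            union.addAll(tb);                      // all tokens seen in either string
            return union.isEmpty() ? 0 : 100.0 * shared.size() / union.size();
        }
    }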

Can anyone give some recommendations?

Above is part of my project, a plagiarism analyzer.

Therefore, the matched words must be exactly identical; no synonyms are involved.

The only question in this case is how to calculate a fairly accurate percentage of similarity.

Thanks so much for any help.

+4
java similarity
Mar 06
6 answers

As Konrad noted, your question depends heavily on what you mean by "similar." In general, I would say the following guidelines should be useful:

  • normalize the input by stemming each word to its base form and lower-casing it
  • use a word-frequency list (readily available on the Internet) and make a word's "similarity relevance" inversely proportional to how common it is, so that frequent words like "the" count for less
  • compute the overall sentence similarity as the total relevance of the words shared by both sentences, divided by the total relevance of all words in both sentences (see the sketch below)
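
A minimal sketch of this frequency-weighted overlap, assuming a tiny hard-coded rank table; the ranks and the fallback value below are invented for illustration, and a real implementation would load a full frequency list and add a proper stemmer:

    import java.util.*;

    public class WeightedOverlap {
        // Hypothetical frequency ranks (1 = most common); load a real list in practice.
        private static final Map<String, Integer> RANK = new HashMap<>();
        static {
            RANK.put("the", 1);   RANK.put("is", 7);        RANK.put("a", 6);
            RANK.put("web", 950); RANK.put("program", 880);
        }

        // Rare words (high rank, or absent from the list) get more weight.
        private static double relevance(String word) {
            return Math.log(1 + RANK.getOrDefault(word, 100_000));
        }

        public static double similarity(String a, String b) {
            Set<String> ta = tokens(a), tb = tokens(b);
            Set<String> union = new HashSet<>(ta);
            union.addAll(tb);
            double shared = 0, total = 0;
            for (String w : union) {
                total += relevance(w);
                if (ta.contains(w) && tb.contains(w)) shared += relevance(w);
            }
            return total == 0 ? 0 : 100.0 * shared / total;
        }

        // Normalization: lower-case and split on non-letters (no real stemming here).
        private static Set<String> tokens(String s) {
            Set<String> t = new HashSet<>(Arrays.asList(s.toLowerCase().split("[^a-z]+")));
            t.remove(""); // split can leave an empty leading token
            return t;
        }
    }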

You can refine the technique to account for differences in word forms, word order, synonym lists, and so on. You will never get perfect results, but there is a lot of room for tuning, and I believe that in general you can obtain quite good similarity measures this way.

+5
Mar 06

It depends on your idea of similarity. Formally, you need to define a metric on what you consider "similar" strings in order to apply statistics to them. This is usually done via the hypothetical question: "How likely is it that the first string is a modified version of the second string, into which errors were introduced (for example, by mistyping it)?"

A very simple but effective measure of such similarity (or rather, of its inverse) is the edit distance between two strings, which can be computed with dynamic programming in O(nm) time, where n and m are the lengths of the strings.
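
A minimal dynamic-programming sketch of that edit distance (Levenshtein distance, with insertions, deletions, and substitutions each costing 1):

    // Levenshtein edit distance via the classic O(n*m) dynamic-programming table.
    public class EditDistance {
        static int editDistance(String a, String b) {
            int n = a.length(), m = b.length();
            int[][] d = new int[n + 1][m + 1];
            for (int i = 0; i <= n; i++) d[i][0] = i; // delete all of a's prefix
            for (int j = 0; j <= m; j++) d[0][j] = j; // insert all of b's prefix
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(d[i - 1][j - 1] + subst,            // match/substitute
                              Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1)); // delete / insert
                }
            }
            return d[n][m];
        }
    }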

Depending on your use case, more elaborate measures (or completely unrelated ones, such as the Soundex metric) may be required.

In your case, if you directly apply token matching (i.e. simple word counting), you will never get a similarity > 90%. To obtain such high similarity in a meaningful way, advanced semantic analysis is required. If you manage that, please publish a paper, because this is still a largely unsolved problem.

+4
Mar 06 '10 at 16:01

I second what Konrad Rudolph said.

Others may recommend different distance measures. What I'm going to say complements them and looks more at the problem of semantic matching.

Given what you seem to be looking for, I recommend applying some standard text-processing methods. All of them have potential drawbacks, so I list them in order of application and of the complexity required to do them well (a pipeline sketch follows the list):

  • sentence splitting: determine your units of comparison
  • stop-word removal: drop a, an, of, and so on
  • bag-of-words percentage: what percentage of the total words match, regardless of order
  • (much more aggressive) synonym expansion, which treats synonyms as matching words
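
A minimal sketch of the first three steps; the stop-word set here is deliberately tiny and purely illustrative:

    import java.util.*;

    public class BagOfWords {
        // Hypothetical, deliberately tiny stop-word list; use a real one in practice.
        private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "of", "the", "is", "that"));

        // Step 1: sentence splitting - pick the unit of comparison.
        static String[] sentences(String text) {
            return text.split("[.!?]+\\s*");
        }

        // Step 2: stop-word removal after lower-casing and tokenizing.
        static Set<String> contentWords(String sentence) {
            Set<String> words = new HashSet<>();
            for (String w : sentence.toLowerCase().split("[^a-z]+")) {
                if (!w.isEmpty() && !STOP_WORDS.contains(w)) words.add(w);
            }
            return words;
        }

        // Step 3: bag-of-words percentage, order-insensitive.
        static double bagOfWordsPercent(String a, String b) {
            Set<String> wa = contentWords(a), wb = contentWords(b);
            Set<String> shared = new HashSet<>(wa);
            shared.retainAll(wb);
            Set<String> union = new HashSet<>(wa);
            union.addAll(wb);
            return union.isEmpty() ? 0 : 100.0 * shared.size() / union.size();
        }
    }
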
+2
Mar 06 '10 at 16:10

The problem with this question is that similarity can mean either human-perceived similarity (as in your "90%+" estimate) or statistical similarity (Konrad Rudolph's answer).

Human-perceived similarity can never be easily calculated. Consider, for example, these three word pairs:

cellphone   mobile
car         automobile
message     post

Their statistical similarity is very low, yet to a human reader they are quite similar. So this problem is hard to solve, and the only pointers I can give you are Bayesian filtering or artificial-intelligence techniques based on Bayesian networks.

+1
Mar 06 '10 at 16:10

One of the common measures is the Levenshtein distance, a special case of string edit distance. It is also implemented in the Apache Commons string utilities.
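
For example, with Commons Lang 3 on the classpath (StringUtils.getLevenshteinDistance; in newer releases the method has moved to Commons Text's LevenshteinDistance class), a raw distance can be turned into a rough percentage:

    import org.apache.commons.lang3.StringUtils;

    public class LevenshteinDemo {
        public static void main(String[] args) {
            String a = "web crawler";
            String b = "web scanner";
            int d = StringUtils.getLevenshteinDistance(a, b);
            // Normalize the raw distance against the longer string's length.
            double percent = 100.0 * (1 - (double) d / Math.max(a.length(), b.length()));
            System.out.printf("distance=%d, similarity=%.0f%%%n", d, percent);
        }
    }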

+1
Mar 06 '10 at 16:12

The longest common subsequence is a well-known measure of string fuzziness; it can be computed with dynamic programming.
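
A minimal dynamic-programming sketch of the LCS length (the same recurrence works word-by-word on token arrays for a plagiarism checker):

    // Length of the longest common subsequence, classic O(n*m) DP.
    public class Lcs {
        static int lcsLength(String a, String b) {
            int n = a.length(), m = b.length();
            int[][] L = new int[n + 1][m + 1];
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    L[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                            ? L[i - 1][j - 1] + 1                 // extend the match
                            : Math.max(L[i - 1][j], L[i][j - 1]); // skip one character
                }
            }
            return L[n][m];
        }
    }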

0
Mar 06


