How to find the percentage of similarity between two multi-line strings?

Question

I have two multi-line strings. I use the following code to measure the similarity between them. It uses the Levenshtein distance algorithm.

public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) {
        longer = s2;
        shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) {
        return 1.0; /* both strings are zero length */
    }
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}

public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();
    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0)
                costs[j] = j;
            else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1))
                        newValue = Math.min(Math.min(newValue, lastValue), costs[j]) + 1;
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0)
            costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
}

But the above code does not work as expected.

For example, let's say that we have the following two strings, S1 and S2:

S1 → How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both? How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?

S2 → How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both? How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?

Then I pass the above strings to the similarity method, but it does not return the exact percentage difference. How can I improve the algorithm?

Update

Below is my main method:

public static boolean authQuestion(String question) throws SQLException {
    boolean isQuestionAvailable = false;
    Connection dbCon = null;
    try {
        dbCon = MyResource.getConnection();
        String query = "SELECT * FROM WORDBANK where WORD ~* ?;";
        PreparedStatement checkStmt = dbCon.prepareStatement(query);
        checkStmt.setString(1, question);
        ResultSet rs = checkStmt.executeQuery();
        while (rs.next()) {
            double re = similarity(rs.getString("question"), question);
            if (re > 0.6) {
                isQuestionAvailable = true;
            } else {
                isQuestionAvailable = false;
            }
        }
    } catch (URISyntaxException e1) {
        e1.printStackTrace();
    } catch (SQLException sqle) {
        sqle.printStackTrace();
    } catch (Exception e) {
        if (dbCon != null)
            dbCon.close();
    } finally {
        if (dbCon != null)
            dbCon.close();
    }
    return isQuestionAvailable;
}

+7

java algorithm levenshtein distance

Stanly moses Jan 03 '17 at 5:44


3 answers

Your similarity method returns a number from 0 to 1 (both ends inclusive), where 1 means that the strings are the same (the edit distance is zero).

However, in your authQuestion method you act as if it returns a number from 0 to 100, as this line indicates:

 if(re > 60){

You need to change this to

 if(re > .6){

Or

 if(re * 100 > 60){
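To illustrate the scale mismatch, here is a minimal sketch. The class name `ThresholdDemo` and the helper `isSimilar` are hypothetical and not part of the question's code; the score is assumed to be a fraction in [0, 1], as the similarity method returns.

```java
final class ThresholdDemo {
    // Hypothetical helper: returns true when the score exceeds the threshold.
    // Both values must be on the same scale for the comparison to be meaningful.
    static boolean isSimilar(double score, double threshold) {
        return score > threshold;
    }

    public static void main(String[] args) {
        double score = 0.95; // a fraction in [0, 1], i.e. 95% similar

        // Mixed scales: a fraction can never exceed 60, so this is always false.
        System.out.println(isSimilar(score, 60));        // false

        // Both on the 0..1 scale.
        System.out.println(isSimilar(score, 0.6));       // true

        // Both on the 0..100 scale.
        System.out.println(isSimilar(score * 100, 60));  // true
    }
}
```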

+3

Erwin bolwidt Jan 03 '17 at 7:35


Since you use all of S1 in the WHERE clause of your SQL query, it will either find a perfect match or return no result at all.

As @ErwinBolwidt mentioned, if it returns nothing, then isQuestionAvailable always remains false. And if it returns a perfect match, then you will definitely get 100% similarity.

What you can do: use a substring of S1 to search for questions matching that part.

You can make the following change in the authQuestion method:

 checkStmt.setString(1, question.substring(0,20)); // for example

You can then compare each returned row with your question using the similarity method.
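The idea can be sketched without a database as follows. This is only an illustration under stated assumptions: the class `QuestionMatcher` and the helpers `prefix` and `anySimilar` are hypothetical names, the candidate list stands in for the rows returned by the query, and the edit distance is a standard two-row Levenshtein rather than the question's exact code. Two choices here are assumptions about the intended behavior: `prefix` guards against questions shorter than the requested length (where `substring(0, 20)` would throw), and `anySimilar` returns as soon as one candidate passes the threshold.

```java
import java.util.List;

final class QuestionMatcher {
    // Returns at most maxLen leading characters, so that short questions
    // do not make substring() throw StringIndexOutOfBoundsException.
    static String prefix(String s, int maxLen) {
        return s.substring(0, Math.min(maxLen, s.length()));
    }

    // Standard two-row Levenshtein distance, case-insensitive.
    static int editDistance(String a, String b) {
        a = a.toLowerCase();
        b = b.toLowerCase();
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity as a fraction in [0, 1], relative to the longer string.
    static double similarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 1.0; // both strings are empty
        }
        return (longer - editDistance(a, b)) / (double) longer;
    }

    // True if at least one candidate scores above the threshold (0..1 scale).
    static boolean anySimilar(List<String> candidates, String question, double threshold) {
        for (String candidate : candidates) {
            if (similarity(candidate, question) > threshold) {
                return true;
            }
        }
        return false;
    }
}
```

In the real method, `prefix(question, 20)` would be passed to `checkStmt.setString`, and each value read from the `ResultSet` would play the role of a candidate.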

+1

Dhaval simaria Jan 03 '17 at 10:22


Daniel · Accepted Answer · 2017-01-03T06:37:52+0000

I can offer you an approach ...

You are using the edit distance, which gives you the number of characters in S1 that you need to change, add, or delete to turn it into S2.

So for example:

 S1 = "abc"
 S2 = "cde"

The edit distance is 3, and the strings are 100% different (if you compare them character by character).

That way you can get an approximate percentage if you do:

 S1 = "abc"
 S2 = "cde"
 edit = edit_distance(S1, S2)
 percentage = min(edit/S1.length(), edit/S2.length())

min is a workaround for cases where the strings are very different, for example:

 S1 = "abc"
 S2 = "defghijklmno"

Here the edit distance (12) is greater than the length of S1 (3), so dividing by the length of S1 would give a percentage above 100%. Dividing by the larger length is therefore probably the better choice.
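As a rough sketch of that last suggestion, the following divides the edit distance by the length of the longer string, which keeps the result between 0 and 1. The class name `PercentDifference` and the method `difference` are hypothetical, and the distance is a standard two-row Levenshtein that, unlike the question's code, is case-sensitive.

```java
final class PercentDifference {
    // Standard two-row Levenshtein distance (case-sensitive in this sketch).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Fraction of the longer string that has to be edited, in [0, 1].
    // The edit distance never exceeds the longer length, so this cannot pass 1.
    static double difference(String s1, String s2) {
        int longer = Math.max(s1.length(), s2.length());
        if (longer == 0) {
            return 0.0; // two empty strings are identical
        }
        return editDistance(s1, s2) / (double) longer;
    }

    public static void main(String[] args) {
        System.out.println(difference("abc", "cde"));          // 1.0 (3 edits / 3)
        System.out.println(difference("abc", "defghijklmno")); // 1.0 (12 edits / 12)
    }
}
```

Multiplying the result by 100 gives a percentage, and `1 - difference(s1, s2)` gives the similarity on the same scale.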

hope that helps
