Comparison of the names of composers or How to find a "fairly close" match between two lines?

Question

Comparison of the names of composers or How to find a "fairly close" match between two lines?

After finishing two cs classes, I started working on a personal Java project. I am writing a program that will browse the music collection and try to set the "Composer" tag by looking at the file name and meta tags. I compare them to a composition list that I created as a simple text file. My question is this:

What is a good method of comparing two strings to try to find the best match? For example, in my case, suppose I have a file called "Pulenc - Gloria in excelsis Deo.flac". On my list of composers, I have Pulenc, Francis. I want to be able to read "Pulenc" and see that it is very close to "Poulenc", so that I can set the composer tag correctly. A friend suggested that I study the Cosine Distance (which I had never heard of before), and another recommended the Levenshtein distance. Do they have a good approach or are there other methods that may work better?

+4

java string

Ben May 20, '09 at 12:55

source share

6 answers

Here's a project called SimMetrics , run by the University of Sheffield in the UK, that can help you. I wrote a little about this on my blog from a .NET perspective, but I believe the project also has a Java implementation.

+2

Zombiehehep May 20, '09 at 12:59

source share

Levenshtein’s distance is a reasonable idea, although if you have many composers in your system, this may not work well. Unlike Soundex (or Metaphone , or NYSIIS ), distance editing algorithms allow you to compare your name with the spelling composer with all other composers in the system. Depending on how much there is, this may take some time.

As a (premature?) Optimization, it can only be worth calculating the Levenshtein distance for composers whose name begins with the correct letter.

0

Simon nickerson May 20, '09 at 13:23

source share

Peter Norwig wrote the wonderful play, "How to Write a Spelling Corrector," which you may find useful and that you can customize your specific needs.

0

David koelle May 20, '09 at 13:30

source share

Apache Commons Lang lib implements a Levenshtein distance calculator: http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringUtils.html#getLevenshteinDistance(java.lang.String,%20java. lang.String)

0

matt May 21, '09 at 2:31

source share

I think in your case, Damerau-Levenshtein distanc e should work fine. If you have more data, use it. In the absence of a good algorithm, a large amount of data can be compensated.

0

Hao wooi lim May 21, '09 at 2:37

source share

Michael borgwardt · Accepted Answer · 2009-05-20T13:04:51+0000

It seems that Levenshtein Distance is exactly what you need. Cosine Distance seems to be dealing with longer texts, and phonetic algorithms like Soundex are likely to give poor results for names, most of which are not intended to be spoken using English proclamation rules.

Comparison of the names of composers or How to find a "fairly close" match between two lines?

More articles: