Comparison of the names of composers or How to find a "fairly close" match between two lines?

After finishing two cs classes, I started working on a personal Java project. I am writing a program that will browse the music collection and try to set the "Composer" tag by looking at the file name and meta tags. I compare them to a composition list that I created as a simple text file. My question is this:

What is a good method of comparing two strings to try to find the best match? For example, in my case, suppose I have a file called "Pulenc - Gloria in excelsis Deo.flac". On my list of composers, I have Pulenc, Francis. I want to be able to read "Pulenc" and see that it is very close to "Poulenc", so that I can set the composer tag correctly. A friend suggested that I study the Cosine Distance (which I had never heard of before), and another recommended the Levenshtein distance. Do they have a good approach or are there other methods that may work better?

+4
source share
6 answers

It seems that Levenshtein Distance is exactly what you need. Cosine Distance seems to be dealing with longer texts, and phonetic algorithms like Soundex are likely to give poor results for names, most of which are not intended to be spoken using English proclamation rules.

+5
source

Here's a project called SimMetrics , run by the University of Sheffield in the UK, that can help you. I wrote a little about this on my blog from a .NET perspective, but I believe the project also has a Java implementation.

+2
source

Levenshtein’s distance is a reasonable idea, although if you have many composers in your system, this may not work well. Unlike Soundex (or Metaphone , or NYSIIS ), distance editing algorithms allow you to compare your name with the spelling composer with all other composers in the system. Depending on how much there is, this may take some time.

As a (premature?) Optimization, it can only be worth calculating the Levenshtein distance for composers whose name begins with the correct letter.

0
source

Peter Norwig wrote the wonderful play, "How to Write a Spelling Corrector," which you may find useful and that you can customize your specific needs.

0
source

I think in your case, Damerau-Levenshtein distanc e should work fine. If you have more data, use it. In the absence of a good algorithm, a large amount of data can be compensated.

0
source

All Articles