I am looking for recommendations on which methods / algorithms I should research in order to solve the following problem. I currently have an algorithm that groups similar MP3 files into clusters using acoustic fingerprints. For each cluster, I have the metadata (song / artist / album) of every file in it. From that, I would like to select the "best" song / artist / album metadata and match it to an existing row in my database or, if there is no good match, decide to insert a new row.
There is usually some correct metadata in a cluster, but the individual files have many kinds of problems:
- the artist / song is named completely wrong or is just slightly misspelled
- the artist / song / album is missing, but the rest of the information is correct
- the song is actually a live recording, but only some of the files in the cluster are marked as such.
- there may be very little metadata, in some cases just the file name, which may be artist - song.mp3, or artist - album - song.mp3, or some other variation
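To make the file name case concrete, here is a minimal sketch of how metadata might be guessed from the two patterns mentioned above. The function name `parse_filename` and the choice of " - " as the separator are assumptions made for illustration; real file names will need more patterns than this.

```python
import re
from pathlib import PurePath


def parse_filename(filename):
    """Guess metadata from 'artist - song.mp3' or 'artist - album - song.mp3'.

    Hypothetical helper: any other layout falls back to treating the whole
    name (without extension) as the song title.
    """
    stem = PurePath(filename).stem  # drop the directory and the extension
    # Split on a dash surrounded by whitespace, so "AC-DC" stays in one piece.
    parts = [p.strip() for p in re.split(r"\s+-\s+", stem) if p.strip()]
    if len(parts) == 2:
        return {"artist": parts[0], "album": None, "song": parts[1]}
    if len(parts) == 3:
        return {"artist": parts[0], "album": parts[1], "song": parts[2]}
    return {"artist": None, "album": None, "song": stem.strip() or None}
```

Fields that cannot be recovered are left as `None`, so they can be treated the same way as missing tags from the other files in the cluster.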
A simple voting algorithm works well enough, but I would like something that I can train on a large data set and that can pick up more nuances than what I have now. I would be very grateful for any links to papers or similar projects.
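For reference, a simple per-field vote of the kind mentioned above could look roughly like the sketch below. This is an illustration under assumptions, not my exact code: each file is assumed to be a dict with optional "artist", "album" and "song" string values, and the names `normalize` and `vote` are made up for the example.

```python
from collections import Counter


def normalize(value):
    """Lowercase and collapse whitespace so trivial variants vote together."""
    return " ".join(value.lower().split())


def vote(cluster, fields=("artist", "album", "song")):
    """Pick the most common value per field across the files of one cluster."""
    result = {}
    for field in fields:
        # Ignore files where the field is missing or empty.
        values = [f[field] for f in cluster if f.get(field)]
        if not values:
            result[field] = None
            continue
        # Vote on the normalized form...
        winner = Counter(normalize(v) for v in values).most_common(1)[0][0]
        # ...then return the most frequent original spelling of the winner.
        spellings = Counter(v for v in values if normalize(v) == winner)
        result[field] = spellings.most_common(1)[0][0]
    return result
```

Each field is decided independently here, which is one of the limitations: the vote knows nothing about which artist / album / song combinations actually belong together, or about near-duplicates that differ by more than case and whitespace.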
Thanks!