If you use hashes to compare two pieces of data, the inputs have to be exactly the same to produce exactly the same output (barring the unlikely event of a collision, where two different inputs happen to hash to the same value). So if you compare two MP3 files by hashing each entire file, the audio data can be identical, but because ID3 tags are stored inside the file, any difference in the tags makes the two hashes completely different. With a hash you will never notice that perhaps 99% of the two files is identical, because the outputs bear no resemblance to each other.
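A minimal sketch of that effect, using made-up byte strings rather than real MP3s: the "audio" payload is identical and only one tag byte differs, yet the whole-file digests have nothing in common.

```python
import hashlib

# Two hypothetical "files": identical audio payload, one byte of tag data differs.
audio_payload = b"\xffidentical-audio-frames" * 1000
file_a = b"TAG Artist A" + audio_payload
file_b = b"TAG Artist B" + audio_payload

# Hashing the whole file: the digests share nothing, even though
# almost every byte of the two inputs is identical.
print(hashlib.sha256(file_a).hexdigest())
print(hashlib.sha256(file_b).hexdigest())
```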
If you really want to use a hash for this, hash only the audio data, excluding any tags that may be attached to the file. Even that is not reliable if the music was ripped from CDs, for example: the same disc ripped twice can be encoded/compressed differently depending on the rip settings, so the audio bytes themselves will differ.
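A rough sketch of "hash only the audio" for MP3s follows. It assumes the only metadata present is an ID3v2 tag at the front and/or an ID3v1 tag at the end, and ignores less common cases (APE tags, appended ID3v2 tags, ID3v2 footers); the file names in the usage comment are hypothetical.

```python
import hashlib

def audio_only_hash(path):
    """Hash an MP3's audio frames, skipping ID3v2 (front) and ID3v1 (end) tags."""
    with open(path, "rb") as f:
        data = f.read()

    start, end = 0, len(data)

    # ID3v2 tag: "ID3" magic, 10-byte header whose last 4 bytes are a
    # syncsafe (7 bits per byte) size of the tag body.
    if data[:3] == b"ID3" and len(data) >= 10:
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        start = 10 + size

    # ID3v1 tag: fixed 128 bytes at the very end, starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        end = len(data) - 128

    return hashlib.sha256(data[start:end]).hexdigest()

# Hypothetical usage:
# print(audio_only_hash("song_copy1.mp3") == audio_only_hash("song_copy2.mp3"))
```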
A better (but much more complicated) alternative would be to decode both files and compare the uncompressed audio sample values. With some trial and error on known inputs, that could be turned into a decent algorithm. Getting it right will be very difficult (if it is possible at all), but anything noticeably better than 50% accuracy would still beat going through everything manually.
Please note that even an algorithm that can detect whether two songs are close (for example, the same song ripped with different settings) would probably have to be far more complex than it is worth before it could tell that a live version is the same song as a studio version. If you can do that, there is money to be made!
Coming back to the initial question of how to quickly determine whether two files are duplicates: a hash would be much faster, but also much less accurate, than any algorithm of the kind above. It is a trade-off between speed on one side and accuracy and complexity on the other.