Author Topic: Replaygain support / ignore loudness component in similarity computation  (Read 10390 times)


  • Jr. Member
  • Posts: 7
Currently, identical MP3s that differ only in loudness are not identified as identical (100% similar) by Similarity.

This is problematic because it reduces the accuracy of the similarity measurement: some files come out less similar than they should be (those with amplification differences), while others come out relatively more similar (those without such differences).
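One way to make the comparison loudness-invariant is to normalize each track's amplitude to a common level (e.g., a common RMS) before comparing. The sketch below is purely illustrative, not Similarity's actual algorithm; the sample values and the target level are assumptions.

```python
# Minimal sketch: RMS-normalize two sample arrays before comparing them,
# so a pure gain difference no longer affects the similarity result.
# (Illustrative only; not Similarity's actual algorithm.)

def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def normalize(samples, target_rms=0.1):
    """Scale samples so their RMS matches target_rms."""
    level = rms(samples)
    if level == 0:
        return list(samples)  # silence: nothing to scale
    gain = target_rms / level
    return [s * gain for s in samples]

# A track and a 6 dB louder copy of it (gain factor 2.0):
original = [0.1, -0.2, 0.15, -0.05, 0.3]
louder = [2.0 * s for s in original]

a = normalize(original)
b = normalize(louder)
# After normalization the two waveforms match, so a sample-wise
# comparison would report them as 100% similar.
```

The same idea applies before fingerprinting: if the fingerprint is computed from gain-normalized (or gain-invariant) features, loudness differences stop affecting the score.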


  • Guest
Yes, this is a problem. Fade durations are another. Consider a pair of songs with 0% tag similarity, different filenames, different bit rates, and, of course, different sizes. The lengths differ too, but only by 1 to 4 seconds; it's just a longer fade-out. (Almost) the whole sound content is identical. For me, this pair is at least 90% similar, but today Similarity rates it as less similar than a classical piece compared with a punk song. Really! In the same situation but with 100% tag similarity and identical filenames, it gives me only about ~50% similarity.

At the very least, allow sorting/filtering/grouping songs by a compound sort key (e.g., %content + difference between lengths + %tags).

Other improvements:

Give more freedom to the user. Each person knows best how to classify their own collection.
Allow adding more info columns to the browser (and sorting, filtering, and grouping by them), including calculated data (release an API with special variables).
Add a weighted overall score combining %content and %tags, and perhaps other information.

I downloaded about 400,000 songs; I now have ~260,000, and I estimate 40% of them are still duplicates. With Similarity 9.360 I could get down to ~190,000 (an optimistic estimate). With these improvements I could, relatively safely and in a reasonable amount of time, get down to 160,000 songs.

I am available for more information and suggestions.

Thank you very much.


  • Administrator
  • Hero Member
  • Posts: 625
What does the experimental algorithm show on these files?


  • Guest
The experimental scan hasn't started yet. So far Similarity has scanned and cached only 26% of my songs, and it gets progressively slower as there are more pairs to compare. To "help," my country had a huge blackout and I had to start Similarity again. The cache saved a lot of time, but many extra hours were still needed. If the experimental algorithm takes 3 or 4 times longer to compare, I guess Similarity will finish in 2010.

Similarity is a good tool, but as of today, huge collections like mine are pretty hard to manage. A better interface with more flexibility would be a welcome improvement. I have to dig deeper than I'd like to find duplicates. Similarity helps, but it doesn't give me reliable, accessible information to manage my collection more automatically. I hope this weakness is temporary.

I think I have found some bugs in the algorithm, but I will wait to post more accurate information.