Similarity Forum

General Category => Bugs => Topic started by: hsei on July 10, 2010, 11:01:56

Title: false 100% similarity
Post by: hsei on July 10, 2010, 11:01:56
The program seems to scan for similarity only during roughly the first minute. Even if two songs differ in length by minutes, they get a 100% score if they start the same (e.g. a complete live CD versus its first track alone).
This may be unavoidable for performance reasons, but it is very dangerous if you rely on automark: one of the two files is deleted even though they differ greatly.
Even worse: if a track is crippled by a missing piece in the middle, it will still be ranked as 100% similar, and in the worst case the complete track is deleted while the damaged one remains.
A configurable limit on the allowable track-length difference (file size would be another topic) would be very nice. At the very least there should be a warning when track times differ considerably (e.g. a red color for the duration entry). The loss in performance should be negligible.
Title: false 100% similarity
Post by: Admin on July 10, 2010, 21:21:52
Similarity is designed for scanning music compositions, and yes, it scans only one minute of each song. We are thinking about how to solve the problem with long durations.
Title: false 100% similarity
Post by: djluckyluciano on July 11, 2010, 01:31:41
Hi,
I am confused by a 70% similarity score for two titles: one is a 70-minute megamix,
the other a 3-minute short version of a song...
Title: false 100% similarity
Post by: hsei on July 11, 2010, 11:00:18
It's not only a problem of long durations: two files of e.g. 2 minutes with a high similarity score that differ by 10 seconds are a strong indication of corruption.
I actually use that for identifying corrupted files, but at the moment it has to be done "manually" by looking for significant duration mismatches in high-score groups.
Title: Re: false 100% similarity
Post by: FtMgAl on August 06, 2010, 20:28:14
The first time I used the program I selected a small folder with about 100 tracks that I knew had no more than a couple of duplicates. The program found 22 supposed duplicates. The reason is that these were mostly live performances and the first minute contained a lot of applause.

I would suggest adding a criterion that the lengths must match within X%. If two tracks differ in length by more than 25%, I find it hard to believe anyone would consider them similar, but with a configurable 0-100% setting, even people who would could have that option. And, as someone else mentioned, eliminating duplicates by track length could significantly improve speed.
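A minimal sketch of such a length pre-filter, assuming durations are known in seconds (the function name and default threshold are illustrative, not part of Similarity):

```python
def durations_compatible(len_a: float, len_b: float, tolerance: float = 0.25) -> bool:
    """Return True if two track lengths (in seconds) differ by no more
    than `tolerance`, expressed as a fraction of the longer track.

    A hypothetical pre-filter: pairs failing this test could be skipped
    (or at least flagged) before any acoustic comparison is attempted.
    """
    longer = max(len_a, len_b)
    if longer == 0:
        return True  # two zero-length entries are trivially "compatible"
    return abs(len_a - len_b) / longer <= tolerance

# e.g. a 70-minute megamix vs. a 3-minute edit is rejected outright
print(durations_compatible(70 * 60, 3 * 60))   # False
print(durations_compatible(180, 170))          # True (~5.6% difference)
```

Because this check needs no decoding at all, it would cost essentially nothing while eliminating the most obvious mismatches up front.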

You might also want to consider using the second minute to reduce the false positives on live tracks.
Title: Re: false 100% similarity
Post by: Admin on August 07, 2010, 18:22:47
...

A duration test will be added in future versions.
Title: Re: false 100% similarity
Post by: hsei on October 24, 2010, 09:29:43
The newly introduced duration check helps to get rid of most of the false positives, but a few 100% "precise" pairs with equal length still remain. They can easily be identified by their tag score below 10% and standard score below 50%, but the implication is: you still can't rely on totally automatic removal of duplicates; you have to look at the lists.
A hint: all false-positive pairs I found had durations below 1 minute. So it is not a severe bug, but it is one.
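The heuristic described above could be expressed as a review filter. This is only a sketch of the thresholds reported in this post; the score names are hypothetical and scores are assumed to be fractions in [0, 1]:

```python
def suspect_false_positive(precise_score: float, standard_score: float,
                           tag_score: float, duration_s: float) -> bool:
    """Flag a pair for manual review instead of automatic deletion.

    Thresholds as observed in this thread: a pair scoring 100% on the
    "precise" check, yet with a tag score under 10% and a standard
    score under 50%, is suspicious; the observed cases were all
    shorter than one minute.
    """
    return (precise_score >= 1.0
            and tag_score < 0.10
            and standard_score < 0.50
            and duration_s < 60)

print(suspect_false_positive(1.0, 0.45, 0.05, 40))   # True  -> review manually
print(suspect_false_positive(1.0, 0.90, 0.80, 240))  # False -> looks like a real dupe
```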
Title: Re: false 100% similarity
Post by: TBacker on October 30, 2010, 05:09:44
Similarity is designed for scanning music compositions, and yes, it scans only one minute of each song. We are thinking about how to solve the problem with long durations.

I'm a new user, but I am a radio broadcast engineer with a bit of experience writing some audio apps for my job and personal use (VB6/VB.Net).

How about taking three or four short (30-second) samples across the length of a file? Say a 30-second sample at 0%, 25%, 50%, and 75% of the length of the valid audio data (ignoring those metadata headers/tails!). You would have to seek past any silence at the head for the first sample (as the silence can vary even if the cut is the same).

In theory this would produce a "fingerprint" representative of most of the audio without having to scan the whole thing, and it would be more accurate than judging the whole file by one sample.

If this data is compared to a duplicate, and the duplicate has the same audio and length, the data from each of the four samples should match up waveform-wise. If the duplicate's length is different, say an extra interlude on a remix, the last two or three samples will not match the original.

This would also detect a file that is corrupted halfway through: samples 1 and 2 might match, but 3 and 4 would be random noise on the bad cut.
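The window placement described above can be sketched as follows. This is only an illustration of the proposal, not the program's actual code; all names are made up for the example:

```python
def sample_windows(audio_len: float, lead_silence: float,
                   sample_len: float = 30.0,
                   positions: tuple = (0.0, 0.25, 0.5, 0.75)) -> list:
    """Pick short windows spread across the valid audio, seeking past
    any leading silence for the first one.

    Returns a list of (start, end) offsets in seconds, with each window
    clamped to the end of the audio.
    """
    windows = []
    for i, p in enumerate(positions):
        start = audio_len * p
        if i == 0:
            start = max(start, lead_silence)  # skip the variable head silence
        end = min(start + sample_len, audio_len)
        if end > start:
            windows.append((start, end))
    return windows

# A 10-minute track with 2 s of head silence yields four 30 s windows:
print(sample_windows(600.0, 2.0))
# -> [(2.0, 32.0), (150.0, 180.0), (300.0, 330.0), (450.0, 480.0)]
```

Only the decoded windows are fingerprinted, so the total decode budget stays close to the current single-sample approach while covering the whole file.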

One last caveat: I don't know how your comparison code works, but if the levels differ between the original and the dupe, you would need to compensate (make the highest peaks of the samples match, i.e. normalize the quieter sample to the hotter one) before comparing the waveforms.

Sorry for the long post  :-[

Title: Re: false 100% similarity
Post by: Admin on October 30, 2010, 16:46:09
Thanks for your message. We have already fixed the problem with 100% similarity; the fix will be publicly available in the next version.

About duration, the problem is speed: the more you decode, the more time is needed to analyze a file. We must balance speed against quality.
But thanks for your comments, we'll think about your ideas.
Title: Re: false 100% similarity
Post by: hsei on October 31, 2010, 19:49:35
a) TBacker's proposal does not necessarily mean more effort: taking e.g. three 20-second excerpts at the beginning, middle and end requires approximately the same decoding time as a single 60-second probe, but gives higher reliability. There is a little more trouble at the borders of the excerpts, but dropped samples in one file and drastically different fade-outs (which are missed in the current version) would then most likely show up. This is probably worth the small loss in speed.

b) You can only be sure to detect all corrupted frames if you decode the whole file. That's clearly a matter of balancing speed vs. quality.

c) Normalizing to the highest peak of a sample/excerpt would not be a good idea. The standard approach for comparison in the frequency domain is to normalize to the overall average (in other words, the component at frequency bin 0).
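The difference between the two normalization strategies can be shown on plain sample magnitudes. A minimal sketch, assuming non-negative magnitude values (function names are illustrative):

```python
def normalize_mean(samples: list) -> list:
    """Scale a block of magnitude samples so its mean is 1.0: hsei's
    suggestion, equivalent to normalizing by the bin-0 (DC) component,
    which is robust against a single stray peak."""
    m = sum(samples) / len(samples)
    return [s / m for s in samples] if m else list(samples)

def normalize_peak(samples: list) -> list:
    """Scale so the highest peak is 1.0 (TBacker's original idea); a
    single click or clipped sample can skew the whole comparison."""
    p = max(samples)
    return [s / p for s in samples] if p else list(samples)

quiet = [0.1, 0.2, 0.1, 0.2]
loud = [s * 4 for s in quiet]  # same audio, just hotter by a factor of 4
print(normalize_mean(quiet) == normalize_mean(loud))  # True: level-independent
```

Either method removes the overall level difference; the mean-based one simply averages out isolated peaks instead of being dominated by them.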
Title: Re: false 100% similarity
Post by: hsei on October 31, 2010, 19:53:18
@admin: The last posts would better fit to wishlist.
Title: Re: false 100% similarity
Post by: TBacker on October 31, 2010, 19:56:20
c) Normalizing to the highest peak of a sample/excerpt would not be a good idea. The standard approach for comparison in the frequency domain is to normalize to the overall average (in other words, the component at frequency bin 0).

I guess my point wasn't clear in this respect. I basically meant that the amplitudes of the two samples should be made to match (in the compare procedure) before frequency analysis, to ensure the best accuracy.
Title: Re: false 100% similarity
Post by: Admin on November 01, 2010, 17:40:10
Thanks for your comments. We are thinking about some modifications to the algorithms, but this is not as simple as it seems.
Title: Re: false 100% similarity
Post by: GIL on January 07, 2011, 12:28:58
I understand the performance limitation, but I would prefer to be able to set the sample length.
Better to wait than to erase the wrong track.
Title: Re: false 100% similarity
Post by: 7b683d4548 on January 10, 2011, 13:33:03
How about uniquely identifying a song based on its Discogs release, Amplifind ID, MusicBrainz unique ID or such?

Ought to save you lots of scanning.

(http://img440.imageshack.us/img440/9987/tagsz.th.png) (http://img440.imageshack.us/i/tagsz.png/)



Title: Re: false 100% similarity
Post by: hsei on January 10, 2011, 20:18:18
The key point of Similarity is that it works just by comparing two tracks acoustically, not by comparison with a database entry. MusicBrainz IDs are no longer updated; they were replaced by Picard IDs (PUIDs). Both cover only part of the released material. If you are not into the mainstream you may get matches for only a minor fraction of your candidates (I typically get less than 10%). Additionally, the problem of different instances of the same recording (with differing quality) isn't addressed at all.
Title: Re: false 100% similarity
Post by: Admin on January 10, 2011, 20:57:28
How about uniquely identifying a song based on its Discogs release, Amplifind ID, MusicBrainz unique ID or such?

Ought to save you lots of scanning.


The main idea of Similarity is to not depend on any online databases. Similarity uses its own "acoustic fingerprint" mechanism. Many files don't contain any identifying information, so an ID has to be calculated mathematically. That is a time-consuming calculation, and Similarity can do it itself.
Title: Re: false 100% similarity
Post by: mitul on March 06, 2012, 18:43:12
So how many of the suggestions are implemented in the latest version? I am especially inquiring about taking samples at different locations rather than just the first minute of the song. If it is not implemented, is it on the timeline?
Title: Re: false 100% similarity
Post by: Admin on March 08, 2012, 18:47:54
So how many of the suggestions are implemented in the latest version? I am especially inquiring about taking samples at different locations rather than just the first minute of the song. If it is not implemented, is it on the timeline?
We are working on a configurable comparison algorithm, but it will come after we introduce image comparison.
Title: Re: false 100% similarity
Post by: mitul on May 01, 2014, 16:09:14
Checking progress of this post....