
Similarity seems not optimized for 300,000+ mp3 files


I have a huge number of mp3 files (300,000+) and usually use the following config when checking for duplicates (collection vs new files):
Audio comparison method: precise 95%
Duration check: enable 95%
Skip video files

My collection is stored across several drives: my laptop drive and 2 EHDD (USB 3.0)
The Similarity cache shows a total of 393,693 files

And every time I want to search for duplicates I have to deal with the following:
1) when I open the program, it takes between 1 and 3 minutes to load and for the GUI to appear
2) then, when I start the comparison: a couple of times I have compared all my files (I have not created groups), and it takes 4 to 5 full days to complete. The first time I launched this comparison, I thought it took longer because the cache was being built, but then I launched the exact same comparison again and it took the same 4 to 5 days. Isn't it supposed to take much less? I had not added or deleted any files; I just launched the same comparison again to check the difference between building the cache and having the cache already created

Are there any plans to further optimize Similarity? To really take advantage of the cache and dramatically reduce the time? And also to process a huge collection more efficiently?


It takes far fewer than 300k tracks to be annoying, I tell you. Similarity keeps telling me I've got some twenty hours left, give or take, and it can do that for days. It is down to sixteen now, two days after I completed a scan and started a new one, just to test.
My disk is USB2 only. If the bottleneck were read speed, it would not be using 90 percent CPU all the time. For days.

That is the point, I don't know why the CPU usage is high but Similarity is taking ages to do its job.

My 2 EHDD are USB 3.0, but what concerns me is that it appears that the cache is useless (at least for a very high volume of files).

I can't believe that I performed a full scan, added several new files to the cache, and that running the exact same scan (with no newly added files) takes the same amount of time as the initial scan.

Any feedback or comments from the developers?

Similarity's algorithms aren't linear or, even better, logarithmic; they are quadratic. The complexity of Similarity's content-based algorithm is N^2. Computed directly, each new file needs to be compared with all previous ones (fingerprints can't be looked up through an index as in a relational database, because they can't be ordered as greater or lower). For example, for 300K files we have the sum of an arithmetic progression, (1 + 300000) * 300000 / 2 = 90000300000 / 2 comparisons; compare that with 100K files, (1 + 100000) * 100000 / 2 = 10000100000 / 2. You see, comparing 300K files takes 9 times longer than comparing 100K, not just 3 times. Even worse, if your computer (all CPUs and GPUs) can compare 1 million fingerprints per second (a very, very fast computer), processing 300K files takes about 12.5 hours.
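
To make the arithmetic above easy to re-run for other collection sizes, here is a small sketch. It only reproduces the sum-of-progression count from the post; the function name `comparisons` and the constant `RATE` are made up for illustration, and the rate is the hypothetical 1 million comparisons per second mentioned above, not a measured figure for Similarity.

```python
def comparisons(n: int) -> int:
    """Sum of the arithmetic progression 1 + 2 + ... + n, as used in the post."""
    return (1 + n) * n // 2


# Hypothetical throughput taken from the post: 1 million fingerprint comparisons per second.
RATE = 1_000_000

for n in (100_000, 300_000):
    c = comparisons(n)
    hours = c / RATE / 3600
    print(f"{n:>7} files: {c:,} comparisons, about {hours:.1f} h at {RATE:,}/s")

# Tripling the file count multiplies the work by roughly nine.
print(f"ratio: {comparisons(300_000) / comparisons(100_000):.2f}")
```

Counting only distinct pairs would give N * (N - 1) / 2 instead, which differs by N and does not change the picture.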
To optimize this we added the duration check, which dramatically decreases the number of comparisons.
We are also already working on a new algorithm that can be used to compare 1 million files and will be linear, but it is still far from completion.
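
As a rough illustration of why a duration check cuts the comparison count, the sketch below sorts files by duration and only pairs files whose durations are close. This is not Similarity's actual implementation: the function `candidate_pairs` is hypothetical, and it assumes that "95%" means the shorter duration must be at least 95% of the longer one.

```python
from typing import Iterator, Sequence, Tuple


def candidate_pairs(durations: Sequence[float],
                    min_ratio: float = 0.95) -> Iterator[Tuple[int, int]]:
    """Yield index pairs (i, j) whose shorter/longer duration ratio is >= min_ratio.

    Assumption: this is one plausible reading of a "duration check";
    only these pairs would go on to the expensive fingerprint comparison.
    """
    # Indices of the files, ordered from shortest to longest duration.
    order = sorted(range(len(durations)), key=lambda k: durations[k])
    n = len(order)
    for a in range(n):
        i = order[a]
        for b in range(a + 1, n):
            j = order[b]
            # durations[i] <= durations[j] because of the sort order.
            if durations[i] < min_ratio * durations[j]:
                break  # every later file is longer still, so stop early
            yield i, j


# Example with made-up durations in seconds.
durations = [100.0, 200.0, 98.0, 300.0, 205.0]
pairs = list(candidate_pairs(durations))
total = len(durations) * (len(durations) - 1) // 2
print(f"{len(pairs)} of {total} possible pairs survive the duration check: {pairs}")
```

The saving depends on how spread out the durations are. If most tracks have nearly the same length, almost every pair survives and the work is still quadratic.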

