ferbaena:
Problem:
I know for a fact that there are two duplicate files in two different
folders (there are actually more, but this example uses two).
I create a new folder and copy the two duplicate files into it.
I run Similarity (content only) on this new folder with different settings,
but it only detects the duplicates when the content threshold is lowered to 0.65.
Now the big problem:
I have around 72,000 MP3s to scan; the cache holds 74,150 entries.
If I set the content threshold to 0.65, there will be more than a couple of million
duplicates by the end of the scan (right now I am scanning 29,404 files and it has already found 400,016 duplicates after checking only 7,462), and the experimental algorithm will take a couple of days to finish.
Now the question:
Why does it take a setting of 0.65 to find these duplicates now, when running the program before with content settings between 0.85 and 0.95 found the majority of the others?
I know it is difficult and it's not a Similarity-only problem.
I bought a license for Phelix from Phonome Labs a couple of years ago and it does not find all of the duplicates either.
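For scale, here is a rough back-of-the-envelope sketch in Python using the figures quoted above. The per-file extrapolation is my own crude assumption for illustration only; it is not how Similarity actually compares content or reports results.

```python
from math import comb

# Figures quoted in the post above.
files_total = 72_000
files_checked = 7_462
pairs_found = 400_016

# Crude linear extrapolation: duplicate pairs reported per checked file,
# projected over the whole library at the 0.65 content setting.
pairs_per_checked_file = pairs_found / files_checked
projected_pairs = pairs_per_checked_file * files_total
print(f"~{pairs_per_checked_file:.0f} duplicate pairs per checked file")
print(f"projected total at 0.65: ~{projected_pairs:,.0f} pairs")

# For scale: the number of possible file pairs grows quadratically with
# library size, which is why a looser threshold inflates results so fast.
print(f"possible pairs among {files_total:,} files: {comb(files_total, 2):,}")
```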
Please explain how this problem was resolved, based on the following quotes:
Admin:
ferbaena
It's an empirical value.
What does the experimental algorithm show on these 2 files? (Only these 2 files; there is no need to scan the others.)
Thanks
Admin:
ferbaena
Please send these 6 files to our email and we will check them and the algorithm.
PS. The new version is designed to show multiple duplicates for 1 file (over time we remove unnecessary records, like 1,2 and 2,1, but we do not remove triples such as 1,2,3 and 2,3,4).
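A minimal sketch of that record clean-up idea, assuming duplicate records are stored as ordered (a, b) pairs; this only illustrates dropping mirrored pairs while keeping overlapping groups, and is not Similarity's actual code:

```python
def dedupe_pair_records(records: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Drop mirrored pair records such as (1, 2) vs (2, 1),
    keeping the first occurrence of each unordered pair."""
    seen: set[frozenset[int]] = set()
    unique: list[tuple[int, int]] = []
    for a, b in records:
        key = frozenset((a, b))
        if key not in seen:
            seen.add(key)
            unique.append((a, b))
    return unique

# (2, 1) is dropped as a mirror of (1, 2); overlapping records such as
# (1, 2) and (2, 3) are kept, so a group like 1,2,3 stays visible.
print(dedupe_pair_records([(1, 2), (2, 1), (2, 3), (1, 3)]))
# -> [(1, 2), (2, 3), (1, 3)]
```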
Please indicate whether the problem shown above was resolved.
Thanks