Author Topic: Several questions about similarity (Read 41159 times)

lakecityransom · « **on:** May 19, 2010, 09:45:23 »

First of all, thank you for a great free program. I have 50,000+ MP3s and many, many dupes. This program is doing a great job of finding them without crashing or eating up memory (198,400 results at 63.8% completion, 2 days of run time).

1. What do the "Cache:" and "New: x/x" numbers represent? For example, Cache: 35,278, New: 32,141/50,193? I am a little confused, because the completion percentage has swayed back and forth.

2. I know there is a clear cache button, but I don't know the purpose or consequences of using it. Does the cache help the search go faster next time if I close Similarity? I know I cannot currently save my result list, which is disturbing since this is a very long number-crunching process. I just hope it doesn't crash!

3. In the options there are sensitivity settings for content and tags. Currently they are set at .75 each, yet on each result I have many songs with content % matches of less than 1%, or something unnecessary like 20%. Are these going to be purged at the end of the result processing? If so, and you change the sensitivity settings during result computation, will it use the new value at the end?

Some of these questions I could probably answer through my own experimentation with a small file set, but Similarity cannot be opened twice as it cannot access the cache that is in use. I figure that if I attempt another search on the 2nd opened similarity it will erase the cache or something and mess up my 2 day+ search? I just got a little trigger happy on using the program before completely understanding it.

Admin · « **Reply #1 on:** May 19, 2010, 20:08:32 »

1. They show the number of files already processed relative to the number of all processable files.
2. No, the cache accelerates scanning: with the cache there is no need to decompress a file and analyze it again for comparing.
3. This behaviour will be changed in the next release; there is no more sensitivity, only a threshold.

You can run Similarity with the "/portable" argument; it then stores all caches in a local folder, not the shared user folder.

lakecityransom · « **Reply #2 on:** May 21, 2010, 11:02:47 »

I took some results that I had processed with the 'browse...' feature (i.e. 20% match, 60%, 90% etc.), put them on a jumpdrive with Similarity, and did some sensitivity setting testing on another computer in the meantime. Different sensitivity settings changed the percentages, but I don't see the pattern... For example, if you set 90% sensitivity, are the similarity percentages scaled so that 90% song similarity counts as 100% similarity? In other words, if a song was 89% similar, would it have shown up as ~99%?

I knew I was asking for trouble, but I tried to sort 175,000 matches by filename and either Similarity crashed or it was going to take ages. On the bright side, it was nice that it did not crash when sorting by content filter as I was doing. At any rate, I had to start from scratch again. It's been about 2 more days and I'm nearing 175,000 matches again in the same time frame. I'm not sure how the cache helped. I was hoping the cache would quickly churn out results with 36,725 cached out of ~50,000+, but I don't see much of a time difference. No files were moved or altered in any way between the 2 runs.

Thanks for the replies and it is still a great program.

edit: I accidentally sorted by file again but it succeeded after a long wait

Admin · « **Reply #3 on:** May 21, 2010, 18:44:35 »

175,000 is a very large number of files; in practice it is very hard to compare that many. In theory:

1) Memory consumption for caching (32-bit): 2 GB / 175,000 = ~11,000 bytes per song cache (in the current version 1.2 it is approximately 50-60 KB), which is not possible in the current implementation.
2) Time: the current method, "everything with everyone", needs a number of compares given by an arithmetic progression:
(175000+1)/2*175000 = 15,312,587,500 compares. If every compare takes 0.001 sec (an impossible number), it completes in 177 days, which is very long in the current implementation.
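
As a rough check of the arithmetic above, here is a minimal Python sketch. The function names are made up for illustration, and the 0.001 sec per compare is the hypothetical figure from this post, not a measurement.

```python
def total_compares(n_files: int) -> int:
    """Sum of the arithmetic progression 1 + 2 + ... + n_files."""
    return (n_files + 1) * n_files // 2


def days_needed(n_files: int, seconds_per_compare: float) -> float:
    """Total run time in days if every compare takes the same time."""
    seconds_per_day = 86_400
    return total_compares(n_files) * seconds_per_compare / seconds_per_day


if __name__ == "__main__":
    print(total_compares(175_000))             # 15312587500
    print(round(days_needed(175_000, 0.001)))  # 177
```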

The only solution is to divide the files into groups and compare groups only with other groups, not with themselves. This feature (working with groups) will be added in future versions.
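
The groups feature is only described here as a plan, so the following Python sketch is just one possible reading of it: files are compared across groups but never inside their own group. The function names and the example file names are hypothetical.

```python
from itertools import combinations, product


def cross_group_pairs(groups):
    """Yield (a, b) pairs where a and b come from different groups."""
    for left, right in combinations(groups, 2):
        yield from product(left, right)


def count_cross_group_compares(groups) -> int:
    """Number of compares when files are never compared within a group."""
    return sum(len(left) * len(right) for left, right in combinations(groups, 2))


if __name__ == "__main__":
    groups = [["a1.mp3", "a2.mp3", "a3.mp3"], ["b1.mp3", "b2.mp3"]]
    for pair in cross_group_pairs(groups):
        print(pair)
    print(count_cross_group_compares(groups))  # 6
```

With 5 files, comparing every distinct pair would take 10 compares; splitting them into groups of 3 and 2 leaves 6, because the 3 + 1 compares inside the groups are skipped.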

lakecityransom · « **Reply #4 on:** May 21, 2010, 19:10:11 »

Oh no it was 175,000 results of duplicates, most false positives. Most of what I was interested in was 90%-100% range which would have cut that number dramatically. The file count was somewhere around 50,000-55,000. I thought the program called results from a cache file instead of recomputing, but you're saying the cache puts songs into memory for quicker comparison? From this method it would obviously not be possible to do with many songs. I must be wrong in my thinking, because computed results cannot be saved yet.

At any rate, similarity started crawling to a halt somewhere around 35,000 processed songs, (AMD 6400+ 3.2ghz dual core, 3GB free ram on XP). Of course I don't blame similarity for this. I just have to take it chunks at a time.

Was I right about how the sensitivity setting works, by the way? I just wish I could restrict content matches to 90-100%; it would make life much easier.

Admin · « **Reply #5 on:** May 21, 2010, 19:23:43 »

lakecityransom,
use version 1.2, where there are no more sensitivity options. Disable the tags and content comparing methods, enable only precise, and set the threshold to 60% or 80%; it should work well in this configuration. I think the slowdown is not in comparing but in showing the results.

The cache is saved to disk every time a new file is added to it, and restored when Similarity runs again.
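
To illustrate the save-on-add, restore-on-start behaviour described above, here is a minimal Python sketch. It is not Similarity's actual implementation: the class name, the JSON file format, and the `analyze` callback are all assumptions made for the example.

```python
import json
import os


class FingerprintCache:
    """Toy on-disk cache: file path -> previously computed analysis result."""

    def __init__(self, cache_path):
        self.cache_path = cache_path
        self.entries = {}
        # Restore earlier results when the program starts again.
        if os.path.exists(cache_path):
            with open(cache_path, "r", encoding="utf-8") as fh:
                self.entries = json.load(fh)

    def get_or_compute(self, file_path, analyze):
        """Return the cached result, analyzing the file only on a cache miss."""
        if file_path not in self.entries:
            self.entries[file_path] = analyze(file_path)
            self._save()  # written every time a new file is added
        return self.entries[file_path]

    def _save(self):
        with open(self.cache_path, "w", encoding="utf-8") as fh:
            json.dump(self.entries, fh)
```

A cache like this only skips the per-file decoding and analysis; the pairwise compares between files would still have to run on every search.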

lakecityransom · « **Reply #6 on:** May 21, 2010, 19:28:42 »

Oh great, beta right when I need it the most. I'll have you know I've been working on my music for a year so now that I am at this step at this point in time precision comparison is a godsend.

I'll give some feedback on it. Thanks a ton.

edit: Sorry, I misunderstood; I mean the threshold is very useful. The precision will be useful in the future, I'm sure, but the majority of my files are mistagged, so it is not going to help me as much as it would with other libraries.

lakecityransom · « **Reply #7 on:** May 21, 2010, 21:19:06 »

So far I can tell you it's working great and has churned through ~20,000 results out of ~20,000 in about 2 hours on content 90%. I did try content precision at first, but it was going much slower, probably for the reason I explained above. It looks like you are right about the false positives taking up so much processing time.
