Similarity Forum

General Category => News => Topic started by: Admin on November 16, 2009, 19:00:37

Title: Version 1.0.0 released
Post by: Admin on November 16, 2009, 19:00:37
+ New tag editor (results tab)
+ Some improvements to the comparison algorithm
+ Some visual improvements
- Many small visual/translation bugs fixed

Release version

If you are interested in translating Similarity into your language, write to the support email.
Title: Version 1.0.0 released
Post by: emn13 on November 16, 2009, 19:22:39
Interesting date there for the 1.0 release ;-).

Good work, though!
Title: Version 1.0.0 released
Post by: ferbaena on November 20, 2009, 04:37:18
Version 0936 used the 4 cores and 8 threads on an i7 860.

The new version 1.0 barely uses 4 threads, and the performance decrease is noticeable.

The edit option is nice, though.

Keep up the good work.
Title: Version 1.0.0 released
Post by: Admin on November 20, 2009, 18:20:40
ferbaena
Yes, this is a bug; only a single thread was used. Please download the new version.
Thanks for your message.
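
Roughly, the fixed scan fans the comparisons out over all CPU cores again; a simplified sketch of the idea only, not our actual code:

# Simplified illustration: spread pairwise comparisons over all
# available cores instead of a single worker thread.
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
import os

def compare(pair):
    a, b = pair
    return (a, b, 0.0)   # stand-in for the real content-similarity computation

def scan(files):
    workers = os.cpu_count() or 1   # e.g. 8 hardware threads on an i7 860
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compare, combinations(files, 2)))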
Title: Version 1.0.0 released
Post by: ferbaena on November 23, 2009, 18:03:31
Yes, the new 101 now behaves like the 0936: all cores, all threads.


Problem:

I know for a fact that there are two duplicate files in two different folders (actually there are more, but this example is with two).

I create a new folder and make copies of the two duplicate files in that folder.

I run Similarity (content only) on this new folder with different settings, but it is only when content is down to 0.65 that it detects the duplicates.

Now the big problem:

I have around 72,000 MP3s to scan. The cache is 74,150.

If I set the content at 0.65 there will be more than a couple of million duplicates by the end of the scan (right now I tried scanning 29,404 and it has found 400,016 duplicates, having checked only 7,462), and the experimental algorithm will take a couple of days to finish.
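
Just to give a sense of scale (my own rough arithmetic, nothing to do with Similarity's internals), the number of candidate pairs grows with the square of the library size:

# Back-of-the-envelope: a pairwise content scan of n files has
# n * (n - 1) / 2 candidate comparisons.
def candidate_pairs(n):
    return n * (n - 1) // 2

print(candidate_pairs(72000))   # 2,591,964,000 candidate pairs
print(candidate_pairs(29404))   # 432,282,906 for the partial scan above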

Now the question:

Why does it take a setting of 0.65 to find these duplicates now, when running the program before with content settings between 0.85 and 0.95 found the majority of the others?

I know it is difficult and it's not a Similarity-only problem.

I bought a license for Phelix from Phonome Labs a couple of years ago and it does not find all of the duplicates either.

Thank you
Title: Version 1.0.0 released
Post by: Admin on November 24, 2009, 10:42:52
ferbaena
It's an empirical value.
What does the experimental algorithm show for these 2 files? (Scan only these 2 files; you don't need to scan the others.)
Title: Version 1.0.0 released
Post by: surbaniak on November 26, 2009, 20:24:41
TAG EDITOR looks amazing!
(Will give more feedback after I work with it for a while)
Title: Version 1.0.0 released
Post by: surbaniak on November 26, 2009, 20:42:31
Found first problem with the TagEditor (minor):
The Duration and Size fields are interchanged in the file list table below.
Title: Version 1.0.0 released
Post by: surbaniak on November 26, 2009, 20:47:38
Found second problem with the TagEditor (minor):
First I applied the string "Test" to the Album field. <- Worked great.
Then I tried to apply the string "" (empty string) to the Album field. <- Did not work. The field remains populated with "Test".
So now I can't blank out that field; likely all STR fields behave that way. I think <empty string> should be a valid entry... or did you reserve it as a special value?
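
To illustrate the distinction I mean (a hypothetical sketch, not the TagEditor's actual code), a separate "no change" sentinel would let the empty string legitimately clear a field:

# Hypothetical sketch: use a dedicated sentinel for "leave untouched" so
# that "" can mean "clear this field".
NO_CHANGE = object()

def apply_tag(tags, field, value=NO_CHANGE):
    if value is NO_CHANGE:
        return tags                 # field left as it is
    updated = dict(tags)
    updated[field] = value          # "" blanks the field instead of being ignored
    return updated

tags = {"Album": "Test"}
print(apply_tag(tags, "Album", ""))   # {'Album': ''}      -> blanked out
print(apply_tag(tags, "Album"))       # {'Album': 'Test'}  -> unchanged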
Title: Version 1.0.0 released
Post by: gbowers on December 24, 2009, 23:59:25
ferbaena:
Problem:

I know for a fact that there are two duplicate files in two different folders (actually there are more, but this example is with two).

I create a new folder and make copies of the two duplicate files in that folder.

I run Similarity (content only) on this new folder with different settings, but it is only when content is down to 0.65 that it detects the duplicates.

Now the big problem:

I have around 72,000 MP3s to scan. The cache is 74,150.

If I set the content at 0.65 there will be more than a couple of million duplicates by the end of the scan (right now I tried scanning 29,404 and it has found 400,016 duplicates, having checked only 7,462), and the experimental algorithm will take a couple of days to finish.

Now the question:

Why does it take a setting of 0.65 to find these duplicates now, when running the program before with content settings between 0.85 and 0.95 found the majority of the others?

I know it is difficult and it's not a Similarity-only problem.

I bought a license for Phelix from Phonome Labs a couple of years ago and it does not find all of the duplicates either.

Please explain how this problem was resolved via the following quote:

Admin:
ferbaena
It's an empirical value.
What does the experimental algorithm show for these 2 files? (Scan only these 2 files; you don't need to scan the others.)

 
Thanks
Title: Version 1.0.0 released
Post by: Admin on December 26, 2009, 03:13:41
gbowers
Simple: I need to know whether the experimental algorithm shows better results or not.
Title: Version 1.0.0 released
Post by: ferbaena on January 26, 2010, 03:03:19
The problem has not been solved...


I am trying the newest version, 110.

6 files in the folder (3 repeats)

Compare method: Content only; Experimental enabled

1.00 down to 0.89 finds 0
0.88 down to 0.76 finds 2   experimental: 63.1% each
0.75 down to 0.72 finds 4   experimental: 78.3% and 63.1% pairs
0.71 and below    finds 6   experimental: 63.1%, 78.3% and 5.9%


Does the setting used during the first pass, when the cache is created, affect the results of future comparisons at different settings?

thank you
Title: Version 1.0.0 released
Post by: ferbaena on January 26, 2010, 05:45:48
... and speaking of the new version, 110:

I liked the previous two-column presentation better.

Much easier to read the results.
Title: Version 1.0.0 released
Post by: Admin on January 26, 2010, 12:11:47
ferbaena
Please send these 6 files to our email; we will check them and the algorithm.

PS. The new version is designed to show multiple duplicates per file (over time we remove unnecessary records, like 1,2 and 2,1, but we do not remove triples like 1,2,3 and 2,3,4).
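
The idea, in a simplified sketch (not the actual implementation): keep each unordered pair once, so the mirrored record 2,1 goes away while groups like 1,2,3 stay fully represented.

# Simplified sketch: drop mirrored pair records such as (2, 1) when (1, 2)
# is already listed, without losing any distinct pair of a larger group.
def dedupe_pairs(records):
    seen = set()
    result = []
    for a, b in records:
        key = (a, b) if a <= b else (b, a)   # normalize the order
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result

print(dedupe_pairs([(1, 2), (2, 1), (2, 3), (1, 3)]))
# [(1, 2), (2, 3), (1, 3)]  -- the triple 1,2,3 is still fully covered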
Title: Version 1.0.0 released
Post by: gbowers on February 16, 2010, 18:43:32
Admin:
ferbaena
Please send these 6 files to our email; we will check them and the algorithm.

PS. The new version is designed to show multiple duplicates per file (over time we remove unnecessary records, like 1,2 and 2,1, but we do not remove triples like 1,2,3 and 2,3,4).

Please indicate whether you resolved the problem shown above.

Thanks
Title: Version 1.0.0 released
Post by: Admin on February 16, 2010, 23:16:10
gbowers
You didn't send us any files to test.
Title: Version 1.0.0 released
Post by: ferbaena on May 25, 2010, 01:49:59
The problem has not been solved...


I am trying this on version 101; it's easier to compare and read the results.

14 files in the folder (7 repeats)

Compare method: Content only; Experimental enabled.

Tags slider on 0.00.

Sensitivity slider:

1.00 down to 0.96 finds 0
0.95 down to 0.92 finds 1
0.91              finds 2
0.90 down to 0.88 finds 3
0.87 down to 0.74 finds 4
0.73 down to 0.72 finds 5
0.71 down to 0.66 finds 6
0.65 down to 0.49 finds 7
Below 0.49        finds too many non-repeats


Do not get me wrong, your program has helped me find a lot of duplicate files.

I do not expect you to find all the duplicates for me, because I understand there are too many variables, especially with files that come from different sources, files that have been poorly ripped, files truncated at the beginning or the end, etc., but this sample shows 7 songs that have duplicates (experimental results over 70%), and the sensitivity has to go down across seven runs to find them all.

Dealing with 14 files is not that bad, but with over 70,000 the results are overwhelming, as I wrote to you before, and besides, it takes more than 3 days for every run at each setting of the sensitivity slider.

I had written the above almost two months ago.

Today (05/24/10), I downloaded the new version 120 and tried again.

The results with the 14 files are about the same; only the settings for what is now called the Content slider are a little off, BUT...

...the speed working on the other folders is remarkable, and it keeps finding duplicates at the regular setting of 90% for Content.

At this pace I can rescan all the folders with different settings in a couple of hours.

I kept Content Precise OFF; I still do not understand the philosophy behind requiring a network connection for it to work.

As for the presentation, I will have to live with it; the old one (101) was easier to read.

Congratulations,

Keep up the good work

ferbaena
Title: Version 1.0.0 released
Post by: Admin on May 25, 2010, 18:19:08
We understand our mistake in adding sensitivity and removed it in the beta; all results are shown as they are, with no more scaling or other mathematical adjustments. Try the "precise" algorithm; this is the old experimental one, but very fast. We need the network for future algorithms.
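
In other words, the result list is now just the raw scores filtered against the slider; a simplified illustration (not the actual code):

# Simplified illustration: no rescaling, just raw similarity scores
# compared directly against the slider threshold.
def filter_results(scores, threshold):
    return [(pair, s) for pair, s in scores if s >= threshold]

raw = [(("a.mp3", "b.mp3"), 0.783), (("a.mp3", "c.mp3"), 0.631)]
print(filter_results(raw, 0.70))   # only the 0.783 pair passes at 70%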