Author Topic: Similarity scanned for three days; I stop it; restart. It starts from 0!!  (Read 8814 times)

cjfarmer

  • Jr. Member
  • **
  • Posts: 1
    • View Profile
Ok,

So it is suggested to put all the files in one parent folder.  My problem is that two external hard drives crashed and the music was recovered and saved in multiple places.  I thought this program could tease out the 150,000 songs!  It took three days to scan, found 95,000 duplicates, then wouldn't save!

Help, please.

hsei

  • Jr. Member
  • **
  • Posts: 67
    • View Profile
I would not recommend working on such a large number of files in one pass. Even with 8 GB of RAM on a 64-bit system, Similarity tends to become unstable somewhere between 50k and 100k files. You can work on part of your data, move it into a master directory tree, and add further chunks of files as additional trees in later passes using the group separation feature. Keep your cache between passes; this speeds up the process drastically. If you haven't lost your cache file, Similarity should run much faster after a restart, even though it seems to start from 0 again. On a 32-bit system, your cache file might have exhausted the 2 GB RAM limit.
If you have enough RAM and CPU power, you can run several instances of Similarity in parallel with the /portable switch. That way you can use more than one cache (each located in its program directory). You should use separate copies of the Similarity program directory (each with a copy of the cache from previous runs) to make full use of this approach. If one instance crashes and cannot be restarted, you don't lose all of your work.
It's the general idea of modularity: break the problem down into smaller ones that are easier to handle (divide and conquer).
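The chunked approach above can be sketched in a few lines. This is only an illustration of the divide-and-conquer idea, not Similarity's actual implementation; the file names and the match() test are stand-ins for real audio fingerprints.

```python
# Sketch: instead of comparing all N files against each other in one pass,
# compare each new chunk against a growing master set. Files already
# matched are reported as duplicates; unique ones join the master set.

def match(a, b):
    # placeholder similarity test (the real tool compares audio fingerprints)
    return a == b

def find_duplicates_incremental(master, chunks):
    duplicates = []
    for chunk in chunks:
        for f in chunk:
            if any(match(f, m) for m in master):
                duplicates.append(f)   # duplicate of something already kept
            else:
                master.append(f)       # new unique file joins the master set
    return duplicates

master = ["songA", "songB"]
chunks = [["songB", "songC"], ["songA", "songD"]]
print(find_duplicates_incremental(master, chunks))  # ['songB', 'songA']
```

The payoff is the same as in the forum advice: each pass is small enough to survive a crash, and the master set plays the role of the kept cache.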

Admin

  • Administrator
  • Hero Member
  • *****
  • Posts: 624
    • View Profile
    • http://www.smilarityapp.com
Thanks hsei,
Similarity isn't designed for a large number of files. It can theoretically process 200,000 files, but that takes a lot of time.
We are now actively working on a 4th algorithm that has all the properties of "precise" but can compare elements stored on disk (like an SQL database). The current implementation needs to keep all data in RAM, because everything is compared with everything else; it can't be optimized like a binary search or other kinds of sorting.
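The "everything compared with everything" remark is why file counts hurt so much: the number of pairwise comparisons grows quadratically. A quick back-of-the-envelope check (my own illustration, not from the developers):

```python
# Number of unordered pairs among n files: n * (n - 1) / 2.
# This is the comparison count for a brute-force all-pairs scan.

def pair_count(n):
    return n * (n - 1) // 2

for n in (1_000, 50_000, 200_000):
    print(f"{n:>7} files -> {pair_count(n):,} comparisons")
# 200,000 files already mean ~20 billion pairwise comparisons
```

Doubling the file count roughly quadruples the work, which is why a disk-backed index that avoids the full n² scan matters at this scale.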

hsei

  • Jr. Member
  • **
  • Posts: 67
    • View Profile
I suppose that will be a useful feature. Brute-force comparison (everything with everything) is not feasible with large numbers of files. Reducing the search space by thresholds such as length was a first improvement, and introducing the group-vs-group feature brought large collections down to acceptable times.
The option, introduced in the latest version 1.6.2, to restrict fingerprint comparison to only those files whose tags sufficiently coincide brought my processing times down from hours to minutes. Of course you lose duplicates that are completely mistagged, but this can often be tolerated; a more exhaustive search can still be done later.
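The tag-restriction idea can be sketched as a bucketing step: only files whose tags roughly agree ever reach the expensive fingerprint comparison. The tag data and normalization below are illustrative, not Similarity's actual logic.

```python
# Sketch of tag prefiltering: bucket files by normalized (artist, title),
# then run fingerprint comparison only within each bucket, shrinking the
# O(n^2) search space dramatically.

from collections import defaultdict
from itertools import combinations

files = [
    {"path": "a.mp3", "artist": "Queen", "title": "Bohemian Rhapsody"},
    {"path": "b.mp3", "artist": "queen", "title": "bohemian rhapsody "},
    {"path": "c.mp3", "artist": "ABBA",  "title": "Waterloo"},
]

def tag_key(f):
    # crude normalization so near-identical tags land in the same bucket
    return (f["artist"].strip().lower(), f["title"].strip().lower())

buckets = defaultdict(list)
for f in files:
    buckets[tag_key(f)].append(f)

# fingerprint comparison would now run only on these candidate pairs
candidate_pairs = [
    (x["path"], y["path"])
    for group in buckets.values()
    for x, y in combinations(group, 2)
]
print(candidate_pairs)  # [('a.mp3', 'b.mp3')]
```

As noted above, the trade-off is that a totally mistagged duplicate lands in the wrong bucket and is never compared, which is why an exhaustive pass can still be worthwhile afterwards.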
I prefer software that can be tailored to one's needs by configuration.