Author Topic: Similarity efficiency (speed) and recovery/restart.  (Read 12995 times)


  • Jr. Member
Similarity efficiency (speed) and recovery/restart.
« on: October 12, 2012, 21:24:24 »
I bought Similarity Premium about a year ago. I am currently running version 1.7.1.

Over the years I have been digitizing my entire LP and CD collection. In some cases I have not only the original vinyl LP but also the 1st CD pressing and subsequent remasters. I am at the point where I would like to review similar albums, select the best audio version, and make only that one available for a jukebox-type application. Listening to each album and making a subjective analysis and selection is not possible. I purchased Similarity to give me a tool that would make a technical analysis of each track, provide an overall evaluation, and give me some guidance. In the end, before I remove the technically inferior versions, I listen to each track to confirm that the technical evaluation is in line with my subjective evaluation. Occasionally, a technically inferior version (e.g. excessive clipping, lower high-frequency range) may in fact be the better sounding.

For my purposes, looking for duplicate tracks, or comparing individual tracks to cherry-pick the best, is not a requirement. Cobbling together a single album by selecting the technically best tracks from several versions results in an uneven and confusing listening experience. Neither is processing or comparing pictures a requirement.

My music collection is organized by Artist/ReleaseYear-Album Title-ReissueYear/Track. So I simply want Similarity to process each track sequentially, analyze each track, and log the result in alphabetic order (the same order it was processed in). This would result in each similar album being logged near its predecessor and make visual comparison easy. Subsequent analysis, using column sorting to identify particularly bad tracks or some sort of search and filter function, would be useful.

Up to now I have been using Similarity to analyze the occasional album. I have been trying to learn how to interpret the results. I have only recently realized the relationship between the dynamic range analysis provided to me by a Foobar add-on and the Similarity Max (abs) field. I have reviewed the documentation for the Mean and Abs fields several times but still can't relate them to the listening experience.

I now want to use Similarity to process the complete collection to identify the best album versions. I am running the program on a backup computer in my home. It is an 'older' computer (older than my main computer but still sufficiently fast to provide 24/7 server functionality) that I have highly optimized for Similarity processing, but it still takes about two days to process 30,000 files. The time is not a big issue. I turn Similarity on, come back in two days, and all would be good, except when an instability occurs in the program/computer and I have to restart Similarity. I don't want to spend another two days going through the whole analysis.

I have reviewed the forum for any similar experiences and came across a discussion here
If I understand the Admin response, Similarity analyses each file, puts results in memory, and does comparisons on the file. This naturally results in memory limitations, slowing down the process as more and more files are added, and if the results are not written to disk, all will be lost if the program fails. The point was that Similarity was not designed to process large collections.
The Admin also suggests that somehow, not all is lost, because when Similarity restarts it does not reanalyse the files in the cache.

I apologize for the long prologue to my questions, but I want to make sure you understand where I am coming from.

1. While Similarity is analysing, the status bar shows a count of total files and a count of files currently processed. Is this a count of audio files only, or does it also include images being processed?

2. The initial Similarity scan failed after processing about 15,000 files (computer blue screen of death and Similarity may not have been the cause). When I restarted Similarity, I saw no update of the cache counter. The second run that completed successfully appeared to be a little faster but that was probably because I raised the processing priority of Similarity and not due to any previously stored analysis results. I now have 30,000 files analysed but I am afraid to turn off Similarity because I don't know how to get the files into the cache. I don't want to have to rerun the complete analysis every time I start up Similarity. So how do I get the analysis stored for future viewing?

3. I selected 30,000 files as an initial run, realizing the processing time constraints. I will now go in and remove bad albums, fix some file names and tags, and do other file editing functions. I then want to add a new batch of albums and do a similar run against them. I want Similarity to review the already scanned files for changes, remove deleted files from its database, re-analyse any that have been changed, ignore those that have not changed but leave them in the analysis result lists, then proceed to analyse the newly added files and add all changes and additions to the cache for subsequent runs. After several runs, I expect that I should have a visible log of analyses for all files in the collection. It is not clear from the response to the above forum entry that Similarity can do this. Can it?

4. The Admin response to the forum question suggested that the processing limitations of Similarity are all due to all analysis results having to be maintained in memory, resulting in slowdown and instability of the program. If that is the case, as a Similarity customer I would prefer being able to identify which files to include in the analysis (audio or image, audio and image) and whether or not a comparison should take place in real time. In my case I would turn off image processing and real-time comparison. This would result in only a few analysis results being in memory at a time (just enough to optimize disk writes) and in faster processing, because all the program has to do is sequentially read the file list, determine changes, deletions, and additions, process as required, and write the results to disk. No need to do comparative analysis in this run. If the results are written to disk shortly after analysis is performed, and the database is checkpointed properly, there would be no issue of losing the results of a long run. Am I misinterpreting the Admin's response? The suggestion is that I can't tell Similarity that I don't want it to keep analysis results in memory or to do a comparison during the analysis.
If a time saving is possible by turning off image and real-time comparison, I would offset the saving by adding the option to do a spectrum and sonogram analysis during the analysis scan, and adding that to the collection data base. The time and space to do this may not be acceptable to all Similarity users so they should be selectable options.
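The checkpointing idea in question 4 can be sketched as below. This is only an illustration of the requested behaviour, not a description of how Similarity works internally. The names `scan`, `load_done`, and `analyze`, and the JSON-lines results file, are hypothetical and chosen just for the example.

```python
import json
import os


def load_done(results_path):
    """Return {path: result} for every record already written to disk."""
    done = {}
    if os.path.exists(results_path):
        with open(results_path, encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip a record truncated by a crash mid-write
                done[record["path"]] = record["result"]
    return done


def _ends_with_newline(path):
    """True if the file is empty or its last byte is a newline."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        if fh.tell() == 0:
            return True
        fh.seek(-1, os.SEEK_END)
        return fh.read(1) == b"\n"


def scan(paths, results_path, analyze):
    """Analyze files in sorted order, writing each result to disk at once."""
    done = load_done(results_path)
    repair = os.path.exists(results_path) and not _ends_with_newline(results_path)
    with open(results_path, "a", encoding="utf-8") as fh:
        if repair:
            fh.write("\n")  # terminate a truncated last line before appending
        for path in sorted(paths):
            if path in done:
                continue  # analysed in an earlier run, nothing to redo
            result = analyze(path)  # hypothetical per-file analysis
            fh.write(json.dumps({"path": path, "result": result}) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # push the record to disk before moving on
            done[path] = result
    return done
```

With this layout a crash costs at most the file being analysed at that moment: the next call to `scan` reloads the records on disk and continues with the first path that has none. Syncing after every file is the safest choice and the slowest; batching a few records per sync would trade a little recovery for speed.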


  • Administrator
  • Hero Member
Re: Similarity efficiency (speed) and recovery/restart.
« Reply #1 on: October 17, 2012, 22:36:03 »
0. About Max and Mean (also dynamic range, they are the same): when listening it is not always possible to hear a difference. Put simply, if a composition is digitized at 16 bits, a Max value < 0.5 (1.0 means the full dynamic range is used) means that 1 bit is not used (you actually have a 15-bit digitization), < 0.25 means 2 bits are not used, < 0.125 means 3 bits, and so on.
1. Similarity shows the number of correctly processed files against the number of audio and image files, both those that still need processing and those already processed.
2. It means that the cache file was corrupted and all data after the corruption was simply skipped. You can back up the cache file (cache.dat in the "%appdata%\Similarity" folder) after a successful run.
3. You need to use folder groups in the Folder tab: mark the old, good folders as group #1 and the newly added folders as group #2. Similarity only scans for duplicates between different groups, not inside the same group.
4. Please wait for version 1.8.0. The next version of Similarity (coming this month) will be more resilient to crashes. The main stability problem is the decoders: almost all of them are third-party, and they can simply hang or crash for many reasons, after which they crash Similarity itself. In the next version all decoders will run outside the main process, in separate processes, and if they crash or hang they are simply restarted without corrupting the main process.
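The arithmetic in point 0 can be written out as a small sketch. It assumes, as the reply implies, that Max is the peak absolute sample value normalized so that 1.0 is full scale. The function names are made up for this example and are not part of Similarity.

```python
def unused_bits(peak):
    """Count whole bits left unused by a normalized peak (1.0 = full scale)."""
    if not 0.0 < peak <= 1.0:
        raise ValueError("peak must be in the range (0, 1]")
    bits = 0
    while peak < 0.5:  # each halving of the peak leaves one more bit unused
        peak *= 2.0
        bits += 1
    return bits


def effective_bits(peak, bit_depth=16):
    """Bits of the container that the signal actually exercises."""
    return bit_depth - unused_bits(peak)


for peak in (1.0, 0.4, 0.2, 0.1):
    print(peak, unused_bits(peak), effective_bits(peak))
```

For a 16-bit file this prints 0, 1, 2, and 3 unused bits for peaks of 1.0, 0.4, 0.2, and 0.1, which corresponds to 16, 15, 14, and 13 effective bits. Each unused bit is roughly 6 dB of headroom that the recording never reaches.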


  • Jr. Member
Re: Similarity efficiency (speed) and recovery/restart.
« Reply #2 on: October 25, 2012, 21:45:27 »
1. So that means that the total files in the counter includes images. Which explains the large number. I didn't think that I had that many audio files.

2. Good suggestion to back up the cache file. The only scan that added entries to the cache counter was a single short analysis. All the long analysis runs either failed during execution (not necessarily caused by Similarity) or lost their results when the program was shut down after a long analysis (including the last one I mentioned in the above post).

3. I don't quite understand the use of groups but will investigate. Are you suggesting that if the file folders of a Similarity analysis run are not assigned to a group, no file comparisons take place? That is what I am trying to accomplish. I do not want an automated duplicate file check to take place during the analysis. I don't have a problem with Similarity getting a fingerprint of a few seconds of each song during the analysis and then storing the results in the db, but I don't want the added overhead of the actual dup check processing during analysis. Those are things I would want to have the option of turning off. If I want a duplicate check at a later date, I want Similarity to use the initial analysis results to do the dup check.

4. I will look forward to any improvement in stability.
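The change described for version 1.8.0, decoders running in separate processes that are restarted if they crash or hang, can be sketched roughly as below. This is a generic illustration of process isolation and not Similarity's code. The helper `run_isolated` is a hypothetical name, and a child Python process stands in for a real decoder.

```python
import subprocess
import sys


def run_isolated(cmd, timeout=30.0, retries=1):
    """Run cmd in a child process; return its stdout, or None if it keeps failing."""
    for _ in range(retries + 1):
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            continue  # the child hung and was killed; start a fresh one
        if proc.returncode == 0:
            return proc.stdout.strip()
        # non-zero exit: the child crashed, but this process is unaffected
    return None


if __name__ == "__main__":
    # Stand-in "decoder": a child Python process that prints a value.
    print(run_isolated([sys.executable, "-c", "print('decoded')"]))
```

The parent only ever sees a result or `None`, so a decoder that crashes or hangs costs one file rather than the whole scan.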

You did not answer two very important questions.

During the analysis (in this case, if the files/folders are not assigned to a group), are the results stored in memory to perform immediate dup checks, or are they periodically written to disk to make recovery easier, use less memory, and perform faster? The latter is what I would prefer, and I would want options to set those preferences.

The other: since the music db is constantly being maintained (e.g. audio tags changed, file and folder names changed, one track replaced by another, tracks deleted, etc.), does Similarity recognize these changes (at startup, or on request) and reanalyze only those files that have changed, or does Similarity have to reanalyze the whole database, which, as I have indicated, could take a significant amount of time?
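The kind of change detection being asked about could look something like the sketch below. It shows one common approach (comparing file size and modification time between two snapshots) and makes no claim about what Similarity actually does. `snapshot` and `diff` are hypothetical names.

```python
import os


def snapshot(root):
    """Map each file path under root to its (size, modification time)."""
    state = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            info = os.stat(path)
            state[path] = (info.st_size, info.st_mtime_ns)
    return state


def diff(old, new):
    """Split paths into added, removed, and changed relative to an old snapshot."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    return added, removed, changed
```

Only the `added` and `changed` paths would need reanalysis, and `removed` paths would be dropped from the stored results. Two limits of this approach: a renamed file or folder shows up as one removal plus one addition, and a tag edit counts as a change even when the audio is untouched. Avoiding either would need a hash of the audio content rather than the file metadata.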
« Last Edit: November 08, 2012, 23:58:11 by muse2u »