Setting thresholds for duplicates and outliers

Several threshold values control the operation of various parts of the fastdup run, and could be used for steering results towards a desired output. The most useful ones are listed below.

Most fastdup threshold are added as arguments to the .run() method. Thresholds may require adjustment for specific use cases, or when using your own model. They may change from one dataset to another, based on the specific needs.

For a complete list, see the V1 API - run() reference.

Threshold for Similarity clusters

Default: cc_threshold=0.96

The ccthreshold parameter controls the distance used for determining whether two image belong to the same similarity cluster, calculated using Connected components algorithm.

The best threshold depends on both data and use case:

  • A higher threshold would cluster images that are highly similar and even duplicates, but would cluster less images.
  • A lower threshold would cluster more similar images together, but clusters would have more diversity and a larger possible difference between images.

📘

Example - lowering the threshold to get more samples in each image similarity cluster

See in Cleaning and preparing a dataset

Similarity threshold

Default:threshold=0.9

The similarity threshold sets the minimal similarity value that is useful for a fastdup analysis. Two images that do not meet this threshold will not be joined together under any case, with the exception of outlier analysis.

A lower threshold will include more image pairs in the analyzed data, which could be used for similarity galleries.

k Nearest-Neighbors parameters

Default: nearest_neighbors_k=2

The nearest_neighbors_k parameter controls the amount of neighbors included for each image or object. Use a higher value for getting more results when using the similarity galleries. In some cases a higher value may also increase the nearest-neighbor recall and give better results for image cluster analysis. We have found the default value to be sufficient in the vast majority of cases, with higher values giving little to no added value.

Outliers threshold

Default:outlier_percentile=0.05

The outlier threshold sets the bottom percentile of similarity values to be regarded as outliers during analysis. For the default value of 0.05 (=5%), images having their nearest neighbor in the bottom 5%, meaning the image closest to them is farther away than 95% of other pair distances.

A lower value would yield less outliers but each would have a relatively stronger difference than the rest of the distribution.