V0.2xx API

This page is the API reference for the fastdup V0.2xx API, which remains fully supported in current releases and includes a few features not yet covered by the V1.0 API.

fastdup

run

def run(input_dir='',
        work_dir='.',
        test_dir='',
        compute='cpu',
        verbose=False,
        num_threads=-1,
        num_images=0,
        turi_param='nnmodel=0',
        distance='cosine',
        threshold=0.9,
        lower_threshold=0.05,
        model_path=model_path_full,
        license='',
        version=False,
        nearest_neighbors_k=2,
        d=576,
        run_mode=0,
        nn_provider='nnf',
        min_offset=0,
        max_offset=0,
        nnf_mode="HNSW32",
        nnf_param="",
        bounding_box="",
        batch_size=1,
        resume=0,
        high_accuracy=False)

Run the fastdup tool to find duplicates, near duplicates, outliers and clusters of related images in a corpus of images.
The only mandatory argument is input_dir.

Arguments:

input_dir (str):
Location of the images/videos to analyze.

  • A folder

  • A remote folder (s3 or minio starting with s3:// or minio://). When using minio append the minio server name for example minio://google/visual_db/sku110k.

  • A file containing absolute filenames each on its own row.

  • A file containing s3 full paths or minio paths each on its own row.

  • A python list with absolute filenames.

  • A python list with absolute folders; all images and videos in those folders are added recursively.

  • For run_mode=2, a folder containing fastdup binary features or a file containing list of atrain_feature.dat.csv files in multiple folders

  • yolov5 yaml input file containing train and test folders (single folder supported for now)

  • We support jpg, jpeg, tiff, tif, giff, heif, heic, bmp, png, mp4, avi. In addition we support tar, tar.gz, tgz and zip files containing images.
    If you have other image extensions that are readable by opencv imread() you can give them in a file (each image on its own row) and then we do not check for the
    known extensions and use opencv to read those formats.

  • Note - It is not possible to mix compressed (videos or tars/zips) and regular images. Use the flag turi_param='tar_only=1' if you want to ignore images and run from compressed files.

  • Note - We assume image sizes should be larger or equal to 10x10 pixels. Smaller images (either on width or on height) will be ignored with a warning shown.

  • Note - It is possible to skip small images also by defining minimum allowed file size using turi_param='min_file_size=1000' (in bytes).

  • Note - For performance reasons it is always preferred to copy s3 images to local disk and then run fastdup on the local disk, since copying images from s3 in a loop is very slow.
    Alternatively you can use the flag turi_param='sync_s3_to_local=1' to copy all images from the remote s3 bucket to disk ahead of the run.

  • Note - fastdup plus beta version now supports bounding boxes on the c++ side. To use it, prepare an input file with the following csv header: filename,col_x,row_y,width,height where each row has an image file
    and bounding box information in the above format. fastdup will run on the bounding box level and the reports will be generated on the bounding box level. To use bounding boxes please sign up
    for our free beta program at https://visual-layer.com or send an email to [email protected].

  • work_dir str - Path for storing intermediate files and results.

  • test_dir str - Optional path for test data. When given similarity of train and test images is compared (vs. train/train or test/test which are not performed).
    The following options are supported.

    • test_dir can be a local folder path
    • An s3:// or minio:// path.
    • A python list with absolute filenames
    • A file containing absolute filenames each on its own row.
  • compute str - Compute type [cpu|gpu] Note: gpu is supported only in the enterprise version.

  • verbose boolean - Verbosity.

  • num_threads int - Number of threads. If no value is specified num threads is auto configured by the number of cores.

  • num_images unsigned long long - Number of images to run on. By default, run on all the images in the input_dir folder.

  • turi_param str - Optional turi parameters separated by commas. Example: turi_param='ccthreshold=0.99'
    The following parameters are supported.

    • ccthreshold=xx, Threshold for running connected components to find clusters of similar images. Allowed values 0->1. The default ccthreshold is 0.96. This groups very similar images together, for example identical images or images that underwent
      simple transformations like scaling, flip, zoom in. The higher the score, the more similar the grouped images must be, and the smaller the clusters you will get.
      A score of 0.9 is pretty broad and will cluster images together even if their fine details are not similar.
      It is recommended to experiment with this parameter based on your dataset and then visualize the results using fastdup.create_components_gallery().
    • run_cc=0|1 run connected components on the resulting similarity graph. Default is 1.
    • run_pagerank=0|1 run pagerank on the resulting similarity graph. Default is 1.
    • delete_tar=0|1 when working with tar files obtained from cloud storage delete the tar after download
    • delete_img=0|1 when working with images obtained from cloud storage delete the image after download
    • tar_only=0|1 run only on tar files and ignore images in folders. Default is 0.
    • run_stats=0|1 compute image statistics. Default is 1.
    • sync_s3_to_local=0|1 In case of using s3 bucket sync s3 to local folder to improve performance. Assumes there is enough local disk space to contain the data. Default is 0.
  • distance str - Distance metric for the Nearest Neighbors algorithm. The default is 'cosine' which works well in most cases.
    For nn_provider='nnf' the following distance metrics are supported.
    When using nnf_mode='Flat': 'cosine', 'euclidean', 'l1','linf','canberra','braycurtis','jensenshannon' are supported.
    Otherwise 'cosine' and 'euclidean' are supported.

  • threshold float - Similarity measure in the range 0->1, where 1 is totally identical, 0.98 and above is almost identical.

  • lower_threshold float - Similarity percentile used to flag images that are far away (outliers) relative to the total distribution. For example, 0.05 means 5% of the total similarities computed.

  • model_path str - Optional location of ONNX model file, should not be used.

  • version bool - Print out the version number. This function takes no argument.

  • nearest_neighbors_k int - For each image, how many similar images to look for.

  • d int - Length of the feature vector. By default it is 576. When you use your own onnx model, change this parameter to the length of the model's output feature vector.

  • run_mode int - Selects which stages of the run are performed.
    0 (the default) does the feature extraction and NN embedding to compute all pairs similarities.
    It uses the input_dir argument to find the directory to run on (or a list of files to run on).
    The features are extracted and saved into the work_dir path (the default features output file name is
    features.dat in the same folder for storing the numpy features and features.dat.csv for storing the
    image file names corresponding to the numpy features).
    For larger datasets it may be wise to split the run into two, to make sure intermediate results are stored in case you encounter an error.

    1 computes the extracted features and stores them, does not compute the NN embedding.
    For large datasets, it is possible to run on a few computing nodes, to extract the features, in parallel.
    Use the min_offset and max_offset flags to allocate a subset of the images for each computing node.
    Offsets start from 0 to n-1 where n is the number of images in the input_dir folder.

    2 reads a stored feature file and computes the NN embedding to provide similarities.
    The input_dir param is ignored, and the work_dir is used to point to the numpy feature file. (Give a full path and filename).

    3 Reads the NN model stored by nnf.index from the work_dir and computes all pairs similarity on all images
    given by the test_dir parameter. input_dir should point to the location of the train data.
    This mode is used for scoring similarities on a new test dataset given a precomputed similarity index on a train dataset.

    4 reads the NN model stored by nnf.index from the work_dir and computes all pairs similarity on pre extracted feature vectors computed by run_mode=1.

  • min_offset unsigned long long - Optional min offset to start iterating on the full file list.

  • max_offset unsigned long long - Optional max offset at which to stop iterating on the full file list.

  • nnf_mode str - When nn_provider='nnf' selects the nnf model mode.
    Default is HNSW32. Flat is more accurate.

  • nnf_param str - Assigns optional parameters. The following parameters are supported.
    num_em_iter, number of KMeans EM iterations to run. Default is 20.
    num_clusters, number of KMeans clusters to use. Default is 100.

  • bounding_box str - Optional bounding box to crop images, given as bounding_box='row_y=xx,col_x=xx,height=xx,width=xx'. This defines a global bounding box to be used for all images.
    Beta release features (need to sign up at https://visual-layer.com): It is possible to set bounding_box='face' to crop the face from the image (in case a face is present).
    In addition, you can set bounding_box='yolov5s' and we will run yolov5s to create and crop bounding boxes on your data. (We do not host this model, it is downloaded from the relevant github project).
    For the face/yolov5 crop the margin around the face is defined by turi_param='augmentation_horiz=0.2,augmentation_vert=0.2' where 0.2 means 20% additional margin around the face relative to the width and height respectively.
    It is possible to change the margin, the lowest value is 0 (no margin) and upper allowed value is 1. Default is 0.2.

  • batch_size int - Optional batch size when computing inference. Allowed values < 200. Note: batch_size > 1 is enabled in the enterprise version.

  • resume int - Optional flag to resume from a previous run.

  • high_accuracy bool - Compute a more accurate model. Runtime is increased by about 15% and feature vector storage size/memory is increased by about 60%. The upside is that the model can better distinguish minute details in images with many objects.
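The turi_param argument packs several options into a single comma-separated string of key=value pairs. A minimal sketch of assembling that string from a dictionary follows. The helper name build_turi_param is hypothetical and not part of fastdup; the option names are the ones listed above.

```python
def build_turi_param(options):
    """Join option names and values into a 'key=value,key=value' string.

    Hypothetical helper for illustration, not part of fastdup.
    """
    parts = []
    for key, value in options.items():
        # Booleans are written as 0/1, matching the 0|1 flags documented above.
        if isinstance(value, bool):
            value = int(value)
        parts.append(f"{key}={value}")
    return ",".join(parts)


turi_param = build_turi_param({
    "ccthreshold": 0.96,
    "run_stats": False,
    "min_file_size": 1000,
})
print(turi_param)  # ccthreshold=0.96,run_stats=0,min_file_size=1000
```

The resulting string could then be passed to run through its turi_param argument.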

Returns:

  • ret int - Status code 0 = success, 1 = error.
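When run_mode=1 is spread over several computing nodes, each node needs its own min_offset and max_offset. The sketch below splits n images into contiguous ranges. split_offsets is a hypothetical helper, and it treats max_offset as exclusive, which this page does not specify, so verify the convention on your version before relying on it.

```python
def split_offsets(num_images, num_nodes):
    """Split the range [0, num_images) into num_nodes contiguous chunks.

    Returns a list of (min_offset, max_offset) pairs. max_offset is treated
    as exclusive here (an assumption, not confirmed by the fastdup docs).
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    base, extra = divmod(num_images, num_nodes)
    ranges = []
    start = 0
    for node in range(num_nodes):
        # The first `extra` nodes take one additional image each.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


for node, (lo, hi) in enumerate(split_offsets(10, 3)):
    print(f"node {node}: min_offset={lo}, max_offset={hi}")
```

Each node would then call run with run_mode=1 and its own pair of offsets.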

run_on_webdataset

def run_on_webdataset(input_dir='',
                      work_dir='.',
                      test_dir='',
                      compute='cpu',
                      verbose=False,
                      num_threads=-1,
                      num_images=0,
                      turi_param='nnmodel=0',
                      distance='cosine',
                      threshold=0.9,
                      lower_threshold=0.05,
                      model_path=model_path_full,
                      license='',
                      version=False,
                      nearest_neighbors_k=2,
                      d=576,
                      nn_provider='nnf',
                      min_offset=0,
                      max_offset=0,
                      nnf_mode="HNSW32",
                      nnf_param="",
                      bounding_box="",
                      batch_size=1)

Run the FastDup software on a web dataset.
This run is composed of two stages. First extract all feature vectors using run_mode=1, then run the nearest neighbor model using run_mode=2.
Make sure that work_dir has enough free space for extracting tar files. Tar files are extracted temporarily into work_dir/tmp folder.
You can control the free space using the flags turi_param='delete_tar=1|0' and turi_param='delete_img=1|0'. When delete_tar=1 the tars are processed one by one and deleted after processing.
When delete_img=1 the images are processed one by one and deleted after processing.
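Since tar files are extracted under work_dir, it can help to check free disk space before starting. A minimal sketch using the standard library's shutil.disk_usage follows. The helper name has_free_space and the 50 GB figure are illustrative assumptions, not fastdup requirements.

```python
import shutil


def has_free_space(work_dir, required_bytes):
    """Return True if the filesystem holding work_dir has at least required_bytes free."""
    usage = shutil.disk_usage(work_dir)
    return usage.free >= required_bytes


# Example: require roughly 50 GB before extracting tars (an arbitrary figure).
if not has_free_space(".", 50 * 1024**3):
    print("Low disk space: consider turi_param='delete_tar=1,delete_img=1'")
```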

load_binary_feature

def load_binary_feature(filename, d=576)

Python function for loading the stored binary features written by fastdup and their matching filenames and analyzing them in Python.

Arguments:

  • filename str - The binary feature file location
  • d int - Feature vector length

Returns:

  • filenames list - A list of all image file names, of length X.
  • np_array np.array - An np matrix of shape X rows x d cols (default d is 576). Each row corresponds to the feature vector of a single image.

Example:

import fastdup
# FILENAME_FEATURES is the path to the binary feature file written by fastdup
file_list, mat_features = fastdup.load_binary_feature(FILENAME_FEATURES)
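Once loaded, feature rows can be compared directly. The sketch below computes cosine similarity, the default distance in run, on plain Python lists standing in for rows of the feature matrix. The toy vectors are made up for illustration; with real output you would pass rows of the returned matrix instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional rows standing in for 576-dimensional feature vectors.
row_a = [1.0, 0.0, 2.0]
row_b = [2.0, 0.0, 4.0]
row_c = [0.0, 3.0, 0.0]
print(cosine_similarity(row_a, row_b))  # parallel vectors, close to 1.0
print(cosine_similarity(row_a, row_c))  # orthogonal vectors, 0.0
```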

save_binary_feature

def save_binary_feature(save_path, filenames, np_array)

Function for saving data to be used by fastdup. Given a list of images and their matching feature vectors in a numpy array,
the function saves the data in a format readable by fastdup. This skips the feature extraction step, so the saved features can be used with run_mode=2, which runs
the nearest neighbor model on the stored feature vectors.

Arguments:

  • save_path str - Working folder to save the files to
  • filenames list - A list of file location of the images (absolute paths) of length n images
  • np_array np.array - Numpy array of size n x d. Each row is a feature vector of one file.

Returns:

  • ret int - 0 in case of success, otherwise 1
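Before saving, it may be worth checking that the inputs are consistent: n absolute file names and n rows of length d. The pre-check below is a hypothetical helper, not part of fastdup, and works on any sequence of rows (for a numpy array, the rows of the array).

```python
import os


def check_feature_inputs(filenames, features, d=576):
    """Validate inputs before saving (hypothetical pre-check, not part of fastdup).

    filenames: list of absolute image paths (length n)
    features:  sequence of n rows, each of length d
    Returns a list of problem descriptions (empty when everything is consistent).
    """
    problems = []
    if len(filenames) != len(features):
        problems.append(
            f"{len(filenames)} filenames but {len(features)} feature rows"
        )
    for name in filenames:
        if not os.path.isabs(name):
            problems.append(f"not an absolute path: {name}")
    for i, row in enumerate(features):
        if len(row) != d:
            problems.append(f"row {i} has length {len(row)}, expected {d}")
    return problems


# A relative path and a short row are both reported.
for problem in check_feature_inputs(["b.jpg"], [[0.0] * 3], d=4):
    print(problem)
```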

create_duplicates_gallery

def create_duplicates_gallery(similarity_file,
                              save_path,
                              num_images=20,
                              descending=True,
                              lazy_load=False,
                              get_label_func=None,
                              slice=None,
                              max_width=None,
                              get_bounding_box_func=None,
                              get_reformat_filename_func=None,
                              get_extra_col_func=None,
                              input_dir=None,
                              work_dir=None,
                              threshold=None,
                              **kwargs)

Function to create and display a gallery of duplicate/near duplicate images as computed by the similarity metric.

In addition, it is possible to compute a hierarchical gallery of duplicate/near duplicate clusters. To do so:
(A) Run fastdup to compute similarity on work_dir
(B) Run connected components on the work_dir saving the component results to save_path (need to run with lazy_load=True)
(C) Run create_duplicates_gallery() on the components to find pairs of similar components. Point the similarity_file to similarity_hierarchical_XX.csv file where XX is the
connected components threshold (ccthreshold=XX).

Example:

import fastdup
fastdup.run('input_folder', 'output_folder')
fastdup.create_duplicates_gallery('output_folder', save_path='.', get_label_func = lambda x: x.split('/')[1], slice='hamburger')

Regarding get_label_func, this example assumes that the second folder name is the class name for example my_data/hamburger/image001.jpg. You can change it to match your own labeling convention.
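The lambda above takes the second component of a relative path. Because get_label_func receives an absolute path, a version based on the parent folder may be more robust. The sketch below assumes the class name is the folder that directly contains the image; label_from_parent_folder is a hypothetical name.

```python
from pathlib import PurePosixPath


def label_from_parent_folder(path):
    """Return the name of the folder that directly contains the image."""
    return PurePosixPath(path).parent.name


# The lambda from the example takes the second path component...
second_component = lambda x: x.split('/')[1]
print(second_component("my_data/hamburger/image001.jpg"))  # hamburger

# ...which picks the wrong folder for absolute paths; the parent folder does not.
print(second_component("/home/user/my_data/hamburger/image001.jpg"))  # home
print(label_from_parent_folder("/home/user/my_data/hamburger/image001.jpg"))  # hamburger
```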

Arguments:

  • similarity_file str - csv file with the computed similarities by the fastdup tool, or a work_dir path, or a pandas dataframe containing the similarities.
  • save_path str - output folder location for the visuals
  • num_images int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
  • descending boolean - If False, print the similarities from the least similar to the most similar. Default is True.
  • lazy_load boolean - If False, write all images inside html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • slice str - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
    slice could be a specific label, e.g. slice='hamburger', and in that case only similarities between hamburger and other classes are presented.
    Two reserved arguments for slice are "diff" and "same". When using "diff" the report only shows similarities between classes. When using "same" the report will show only similarities inside the same class.
    Note that when using slice, get_label_func must be provided.
  • max_width int - Optional parameter to set the max width of the gallery.
  • get_bounding_box_func callable - Optional parameter to allow plotting bounding boxes on top of the image.
    The input is an absolute path to the image and the output is a list of bounding boxes.
    Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
    Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
    Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists
  • get_reformat_filename_func callable - Optional parameter to allow changing the presented filename into another string.
    The input is an absolute path to the image and the output is the string to display instead of the filename.
  • get_extra_col_func callable - Optional parameter to allow adding additional column to the report
  • input_dir str - Optional parameter to specify the input directory of webdataset tar files,
    in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'
  • work_dir str - Optional parameter to specify fastdup work_dir, when using a pd.DataFrame instead of a duplicate file path
  • threshold float - Optional parameter to specify the threshold for similarity score to be considered as duplicate. Values above the threshold will be considered as duplicate.
    Allowed values are between 0 and 1.
  • save_artifacts boolean - Optional parameter to allow saving the intermediate artifacts (raw images, csv with results) to the output folder
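The bounding box csv uses col_x,row_y,width,height, while get_bounding_box_func expects x1, y1, x2, y2. A minimal sketch of converting one into a dictionary of the other follows. It assumes (col_x, row_y) is the top-left corner and (x2, y2) the bottom-right corner; boxes_from_csv and the sample rows are hypothetical.

```python
import csv
import io


def boxes_from_csv(csv_text):
    """Build {filename: [[x1, y1, x2, y2], ...]} from rows of
    filename,col_x,row_y,width,height."""
    boxes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        x1 = int(row["col_x"])
        y1 = int(row["row_y"])
        # Assumed convention: the second corner is the first plus the box size.
        x2 = x1 + int(row["width"])
        y2 = y1 + int(row["height"])
        boxes.setdefault(row["filename"], []).append([x1, y1, x2, y2])
    return boxes


sample = """filename,col_x,row_y,width,height
/data/img1.jpg,0,0,100,100
/data/img1.jpg,20,30,50,40
/data/img2.jpg,5,5,10,10
"""
bbox_dict = boxes_from_csv(sample)
print(bbox_dict["/data/img1.jpg"])  # [[0, 0, 100, 100], [20, 30, 70, 70]]
```

A dictionary like bbox_dict matches the dictionary form of get_bounding_box_func described above.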

create_duplicate_videos_gallery

def create_duplicate_videos_gallery(similarity_file,
                                    save_path,
                                    num_images=20,
                                    descending=True,
                                    lazy_load=False,
                                    get_label_func=None,
                                    slice=None,
                                    max_width=None,
                                    get_bounding_box_func=None,
                                    get_reformat_filename_func=None,
                                    get_extra_col_func=None,
                                    input_dir=None,
                                    work_dir=None,
                                    threshold=None,
                                    **kwargs)

Function to create and display a gallery of duplicate videos computed by the similarity metrics.

Example:

import fastdup
fastdup.run('input_folder', 'output_folder', run_mode=1) # extract frames from videos
fastdup.run('input_folder', 'output_folder', run_mode=2) # run fastdup
fastdup.create_duplicate_videos_gallery('output_folder', save_path='.')

Arguments:

  • similarity_file str - csv file with the computed similarities by the fastdup tool, or a work_dir path, or a pandas dataframe containing the similarities.
  • save_path str - output folder location for the visuals
  • num_images int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
  • descending boolean - If False, print the similarities from the least similar to the most similar. Default is True.
  • lazy_load boolean - If False, write all images inside html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • slice str - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
    slice could be a specific label, e.g. slice='hamburger', and in that case only similarities between hamburger and other classes are presented.
    Two reserved arguments for slice are "diff" and "same". When using "diff" the report only shows similarities between classes. When using "same" the report will show only similarities inside the same class.
    Note that when using slice, get_label_func must be provided.
  • max_width int - Optional parameter to set the max width of the gallery.
  • get_bounding_box_func callable - Optional parameter to allow plotting bounding boxes on top of the image.
    The input is an absolute path to the image and the output is a list of bounding boxes.
    Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
    Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
    Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists
  • get_reformat_filename_func callable - Optional parameter to allow changing the presented filename into another string.
    The input is an absolute path to the image and the output is the string to display instead of the filename.
  • get_extra_col_func callable - Optional parameter to allow adding additional column to the report
  • input_dir str - Optional parameter to specify the input directory of webdataset tar files,
    in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'
  • work_dir str - Optional parameter to specify fastdup work_dir, when using a pd.DataFrame instead of a duplicate file path
  • threshold float - Optional parameter to specify the threshold for similarity score to be considered as duplicate. Values above the threshold will be considered as duplicate.
    Allowed values are between 0 and 1.
  • save_artifacts boolean - Optional parameter to allow saving the intermediate artifacts (raw images, csv with results) to the output folder
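The slice options described above can be pictured as filters over labeled similarity pairs. The sketch below only illustrates the described behavior and is not fastdup's code. filter_pairs is a hypothetical name, and for a specific label it assumes a pair is kept when at least one side carries that label.

```python
def filter_pairs(pairs, slice_value):
    """Filter (label_from, label_to, score) tuples by a slice setting.

    'same' keeps pairs within one class, 'diff' keeps pairs across classes,
    and any other string keeps pairs where either side has that label
    (an assumption about the specific-label case).
    """
    if slice_value == "same":
        return [p for p in pairs if p[0] == p[1]]
    if slice_value == "diff":
        return [p for p in pairs if p[0] != p[1]]
    return [p for p in pairs if slice_value in (p[0], p[1])]


pairs = [
    ("hamburger", "hamburger", 0.99),
    ("hamburger", "pizza", 0.95),
    ("pizza", "salad", 0.91),
]
print(filter_pairs(pairs, "same"))  # [('hamburger', 'hamburger', 0.99)]
print(filter_pairs(pairs, "diff"))  # [('hamburger', 'pizza', 0.95), ('pizza', 'salad', 0.91)]
```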

create_outliers_gallery

def create_outliers_gallery(outliers_file,
                            save_path,
                            num_images=20,
                            lazy_load=False,
                            get_label_func=None,
                            how='one',
                            slice=None,
                            max_width=None,
                            get_bounding_box_func=None,
                            get_reformat_filename_func=None,
                            get_extra_col_func=None,
                            input_dir=None,
                            work_dir=None,
                            **kwargs)

Function to create and display a gallery of images computed by the outliers metrics.
Outliers are computed using the fastdup tool, by embedding each image to a short feature vector, finding top k similar neighbors
and finding images that are further away from all other images, i.e. outliers.
By default fastdup saves the outliers into a file called outliers.csv inside the work_dir folder.
It is possible to load this file using pandas to get the list of outlier images.
Note that the number of images included in the outliers file depends on the lower_threshold parameter in the fastdup run. This command line argument is a percentile
i.e. 0.05 means top 5% of the images that are further away from the rest of the images are considered outliers.
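To make the percentile idea concrete, the sketch below keeps the fraction of images with the lowest nearest-neighbor similarity. It is illustrative only and fastdup's internal selection may differ. lowest_fraction and the toy scores are hypothetical.

```python
def lowest_fraction(similarities, fraction=0.05):
    """Return the (name, score) items whose similarity is in the lowest `fraction`.

    similarities: dict mapping image name -> similarity to its nearest neighbor
    At least one item is returned for a non-empty input.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1])
    count = max(1, int(len(ranked) * fraction)) if ranked else 0
    return ranked[:count]


# Twenty toy images with increasing similarity scores.
scores = {f"img{i}.jpg": 0.5 + i * 0.02 for i in range(20)}
print(lowest_fraction(scores, 0.05))  # [('img0.jpg', 0.5)]
```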

Arguments:

  • outliers_file str - csv file with the computed outliers by the fastdup tool, or a work_dir path, or a pandas dataframe containing the outliers
  • save_path str - output folder location for the visuals
  • num_images int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
  • lazy_load boolean - If False, write all images inside html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • how str - Optional outlier selection method. one = take the image that is far away from any one image (but may have other images close to it).
    all = take the image that is far away from all other images. Default is one.
  • slice str - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
  • max_width int - Optional parameter to set the max width of the gallery.
  • get_bounding_box_func callable - Optional parameter to allow plotting bounding boxes on top of the image.
    The input is an absolute path to the image and the output is a list of bounding boxes.
    Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
    Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
    Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists
  • get_reformat_filename_func callable - Optional parameter to allow changing the presented filename into another string.
    The input is an absolute path to the image and the output is the string to display instead of the filename.
  • get_extra_col_func callable - Optional parameter to allow adding additional column to the report
  • input_dir str - Optional parameter to specify the input directory of webdataset tar files,
    in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'
  • work_dir str - Optional parameter to specify fastdup work_dir, when using a pd.DataFrame instead of an outliers file path

create_components_gallery

def create_components_gallery(work_dir,
                              save_path,
                              num_images=20,
                              lazy_load=False,
                              get_label_func=None,
                              group_by='visual',
                              slice=None,
                              max_width=None,
                              max_items=None,
                              get_bounding_box_func=None,
                              get_reformat_filename_func=None,
                              get_extra_col_func=None,
                              threshold=None,
                              metric=None,
                              descending=True,
                              min_items=None,
                              keyword=None,
                              input_dir=None,
                              **kwargs)

Function to create and display a gallery of images for the largest graph components

Arguments:

  • work_dir str - path to fastdup work_dir, or a path to connected component csv file. Alternatively, a dataframe with the connected_components.csv content from a previous fastdup run.
  • save_path str - output folder location for the visuals
  • num_images int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
  • lazy_load boolean - If False, write all images inside html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • group_by str - [visual|label]. Group the report using the visual properties of the image or using the labels of the images. Default is visual.
  • slice str or list - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
  • max_width int - Optional parameter to set the max html width of images in the gallery. Default is None.
  • max_items int - Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.
  • get_bounding_box_func callable - Optional parameter to allow plotting bounding boxes on top of the image.
    The input is an absolute path to the image and the output is a list of bounding boxes.
    Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
    Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
    Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists
  • get_reformat_filename_func callable - Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.
  • get_extra_col_func callable - Optional parameter to allow adding more information to the report.
  • threshold float - Optional parameter to set the threshold for choosing components. Default is None.
  • metric str - Optional parameter to set the metric to use (like blur) for choosing components. Default is None.
  • descending boolean - Optional parameter to set the order of the components. Default is True namely list components from largest to smallest.
  • min_items int - Optional parameter to select components with min_items or more items. Default is None.
  • keyword str - Optional parameter to select components with keyword as a subset of the label. Default is None.
  • input_dir str - Optional parameter to specify the input directory of webdataset tar files,
    in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'
  • kwargs dict - Optional parameter to pass additional parameters to the function.
  • split_sentence_to_label_list boolean - Optional parameter to split the label into a list of labels. Default is False.
  • limit_labels_printed int - Optional parameter to limit the number of labels printed in the html report. Default is max_items.
  • nrows int - limit the number of read rows for debugging purposes of the report
  • save_artifacts bool - Optional param to save intermediate artifacts like image paths used for generating the component

Returns:

  • ret int - 0 in case of success, otherwise 1
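
As a rough illustration of the callable arguments described above, the sketch below defines one function suitable for get_bounding_box_func and one for get_reformat_filename_func. The annotation dictionary, the paths and the helper names are hypothetical, and returning an empty list for images without annotations is an assumption rather than documented behaviour.

```python
import os

# Hypothetical annotations: absolute image path -> list of [x1, y1, x2, y2] boxes.
BOXES = {
    "/data/images/cat.jpg": [[0, 0, 100, 100], [20, 30, 80, 90]],
}


def get_bounding_boxes(filename):
    """Return the bounding boxes for an image as a list of [x1, y1, x2, y2]."""
    # Assumption: an empty list is acceptable for images with no annotations.
    return BOXES.get(filename, [])


def short_name(filename):
    """Show only the parent folder and file name instead of the full path."""
    parent = os.path.basename(os.path.dirname(filename))
    return f"{parent}/{os.path.basename(filename)}"
```

These could then be passed as get_bounding_box_func=get_bounding_boxes and get_reformat_filename_func=short_name. Since a dictionary is also accepted, BOXES itself could be passed as get_bounding_box_func.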

create_component_videos_gallery

def create_component_videos_gallery(work_dir,
                                    save_path,
                                    num_images=20,
                                    lazy_load=False,
                                    get_label_func=None,
                                    group_by='visual',
                                    slice=None,
                                    max_width=None,
                                    max_items=None,
                                    get_bounding_box_func=None,
                                    get_reformat_filename_func=None,
                                    get_extra_col_func=None,
                                    threshold=None,
                                    metric=None,
                                    descending=True,
                                    min_items=None,
                                    keyword=None,
                                    input_dir=None,
                                    **kwargs)

Function to create and display a gallery of similar videos based on the graph components

Arguments:

  • work_dir str - path to fastdup work_dir
  • save_path str - output folder location for the visuals
  • num_images int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
  • lazy_load boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces the html file size).
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • group_by str - [visual|label]. Group the report using the visual properties of the image or using the labels of the images. Default is visual.
  • slice str or list - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
  • max_width int - Optional parameter to set the max html width of images in the gallery. Default is None.
  • max_items int - Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.
  • get_bounding_box_func callable - Optional parameter to allow plotting bounding boxes on top of the image.
    The input is an absolute path to the image and the output is a list of bounding boxes.
    Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
    Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
    Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists
  • get_reformat_filename_func callable - Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.
  • get_extra_col_func callable - Optional parameter to allow adding more information to the report.
  • threshold float - Optional parameter to set the threshold for choosing components. Default is None.
  • metric str - Optional parameter to set the metric to use (like blur) for choosing components. Default is None.
  • descending boolean - Optional parameter to set the order of the components. Default is True namely list components from largest to smallest.
  • min_items int - Optional parameter to select components with min_items or more items. Default is None.
  • keyword str - Optional parameter to select components with keyword as a subset of the label. Default is None.
  • input_dir str - Optional parameter to specify the input directory of webdataset tar files,
    in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

Returns:

  • ret int - 0 in case of success, otherwise 1
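
The get_label_func argument accepts either a callable or a dictionary. A minimal sketch of both forms follows, assuming a hypothetical dataset layout in which each image sits in a folder named after its class; the helper names are illustrative only.

```python
import os


def label_from_folder(filename):
    """Use the name of the directory that contains the image as its label."""
    return os.path.basename(os.path.dirname(filename))


def labels_as_dict(filenames):
    """Build the dictionary form: absolute file name -> label."""
    return {name: label_from_folder(name) for name in filenames}
```

Either label_from_folder or the dictionary returned by labels_as_dict could be passed as get_label_func.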

create_kmeans_clusters_gallery

def create_kmeans_clusters_gallery(work_dir,
                                   save_path,
                                   num_images=20,
                                   lazy_load=False,
                                   get_label_func=None,
                                   slice=None,
                                   max_width=None,
                                   max_items=None,
                                   get_bounding_box_func=None,
                                   get_reformat_filename_func=None,
                                   get_extra_col_func=None,
                                   threshold=None,
                                   metric=None,
                                   descending=True,
                                   min_items=None,
                                   keyword=None,
                                   input_dir=None,
                                   **kwargs)

Function to visualize the kmeans clusters.

Arguments:

  • work_dir str - path to fastdup work_dir
  • save_path str - output folder location for the visuals
  • num_images int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
  • lazy_load boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces the html file size).
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • slice str or list - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
  • max_width int - Optional parameter to set the max html width of images in the gallery. Default is None.
  • max_items int - Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.
  • get_bounding_box_func callable - Optional parameter to allow plotting bounding boxes on top of the image.
    The input is an absolute path to the image and the output is a list of bounding boxes.
    Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
    Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
    Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists
  • get_reformat_filename_func callable - Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.
  • get_extra_col_func callable - Optional parameter to allow adding more information to the report.
  • threshold float - Optional parameter to set the threshold for choosing components. Default is None.
  • metric str - Optional parameter to set the metric to use (like blur) for choosing components. Default is None.
  • descending boolean - Optional parameter to set the order of the components. Default is True namely list components from largest to smallest.
  • min_items int - Optional parameter to select components with min_items or more items. Default is None.
  • keyword str - Optional parameter to select components with keyword as a subset of the label. Default is None.
  • input_dir str - Optional parameter to specify the input directory of webdataset tar files,
    in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

Returns:

  • ret int - 0 in case of success, otherwise 1
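
When get_label_func is given as a filename, the file needs an index,label header and one label per image, in the same order as the image list file. The sketch below writes such a file with the standard csv module. The labels and the output path are hypothetical, and using a zero-based running number for the index column is an assumption.

```python
import csv
import os
import tempfile


def write_label_file(labels, path):
    """Write one label per image, in the given order, under an index,label header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["index", "label"])
        for index, label in enumerate(labels):
            # Assumption: the index column is a zero-based running number.
            writer.writerow([index, label])
    return path


# Hypothetical labels, listed in the same order as the image list file.
label_path = write_label_file(
    ["cat", "dog", "cat"], os.path.join(tempfile.mkdtemp(), "labels.csv")
)
```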

remove_duplicates

def remove_duplicates(input_dir, work_dir=None, distance=0.96, dry_run=True)

Function to automate deletion of duplicate images whose similarity is above the distance parameter.

Arguments:

  • input_dir see fastdup.run()
  • work_dir see fastdup.run()
  • distance delete only components whose similarity is larger than distance.
  • dry_run bool - if True, does not delete but prints the rm commands that would be used; otherwise deletes

Returns:

  • ret list - list of deleted files
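
Because remove_duplicates deletes files when dry_run=False, it may help to wrap it so that a dry run always happens first. The helper below is a hypothetical sketch that uses only the parameters listed above. The name preview_then_remove and the confirm flag are not part of fastdup, and whatever list the call returns is passed through unchanged.

```python
def preview_then_remove(input_dir, work_dir, distance=0.96, confirm=False):
    """Run a dry run first; delete only when confirm is True."""
    import fastdup  # imported lazily so the helper can be defined without fastdup

    # dry_run=True prints the rm commands without deleting anything.
    preview = fastdup.remove_duplicates(
        input_dir, work_dir=work_dir, distance=distance, dry_run=True
    )
    if not confirm:
        return preview
    # Second pass: actually delete the files.
    return fastdup.remove_duplicates(
        input_dir, work_dir=work_dir, distance=distance, dry_run=False
    )
```

Calling preview_then_remove(input_dir, work_dir) only prints the commands; passing confirm=True afterwards performs the deletion.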

delete_components

def delete_components(top_components, to_delete=None, min_distance=0.96, how='one', dry_run=True)

function to automate deletion of duplicate images using the connected components analysis.

Example:

import fastdup
fastdup.run('/path/to/data', '/path/to/output')
top_components = fastdup.find_top_components('/path/to/output')

top_components = top_components[top_components['distance'] > 0.99] # keep only components whose similarity is above 0.99
fastdup.delete_components(top_components, None, how = 'one', dry_run = False)

Arguments:

  • top_components pd.DataFrame - largest components as found by the function find_top_components().
  • to_delete list - a list of integer component ids to delete. On default None which means delete duplicates from all components.
  • min_distance delete only components whose similarity is larger than min_distance.
  • how str - either 'all' (deletes the whole component) or 'one' (leaves one image and deletes the rest of the duplicates)
  • dry_run bool - if True, does not delete but prints the rm commands that would be used; otherwise deletes

Returns:

  • ret list - list of deleted files
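
To make the how argument concrete, the sketch below shows which files of a single component would be selected under each mode. It illustrates the documented semantics and is not fastdup's implementation; in particular, keeping the first file under how='one' is an assumption, since the text does not say which image is left.

```python
def files_to_delete(component_files, how="one"):
    """Return the files that would be removed from a single component."""
    if how == "all":
        return list(component_files)
    if how == "one":
        # Assumption: the first file is the one that is kept.
        return list(component_files[1:])
    raise ValueError("how must be 'one' or 'all'")
```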

delete_components_by_label

def delete_components_by_label(top_components_file,
                               min_items=10,
                               min_distance=0.96,
                               how='one',
                               dry_run=True)

function to automate deletion of duplicate images using the connected components analysis.

Arguments:

  • top_components pd.DataFrame - largest components as found by the function find_top_components().
  • to_delete list - a list of integer component ids to delete
  • min_distance delete only components whose similarity is larger than min_distance.
  • how str - either 'all' (deletes the whole component) or 'majority' (leaves one image with the dominant label count and deletes the rest)
  • dry_run bool - if True, does not delete but prints the rm commands that would be used; otherwise deletes

Returns:

  • ret list - list of deleted files

delete_or_retag_stats_outliers

def delete_or_retag_stats_outliers(stats_file,
                                   metric,
                                   filename_col='filename',
                                   label_col=None,
                                   lower_percentile=None,
                                   upper_percentile=None,
                                   lower_threshold=None,
                                   upper_threshold=None,
                                   get_reformat_filename_func=None,
                                   dry_run=True,
                                   how='delete',
                                   save_path=None,
                                   work_dir=None)

function to automate deletion of outlier files based on computed statistics.

Example:

import fastdup
fastdup.run('/my/data/', work_dir="out")
# delete 5% of the brightest images and delete 2% of the darkest images
fastdup.delete_or_retag_stats_outliers("out", metric="mean", lower_percentile=0.02, upper_percentile=0.95, dry_run=False)

It is recommended to run with dry_run=True first, to see the list of files deleted before actually deleting.

Example:

This example first finds wrong labels using the similarity gallery and then deletes anything with score < 51.
Score is in range 0-100 where 100 means this image is similar only to images from the same class label.
Score 0 means this image is only similar to images from other class labels.

import fastdup
df2 = fastdup.create_similarity_gallery(..., get_label_func=...)
fastdup.delete_or_retag_stats_outliers(df2, metric='score', filename_col = 'from', lower_threshold=51, dry_run=True)

  • Note - it is possible to run with both lower_percentile and upper_percentile at once. It is not possible to run with lower_percentile and lower_threshold at once since they may be conflicting.

Arguments:

  • stats_file str - one of:
    • a folder pointing to the fastdup work_dir
    • a file pointing to the work_dir/atrain_stats.csv file
    • a pandas DataFrame containing the list of files given in the filename_col column and a metric column
  • metric str - statistic metric, should be one of "blur", "mean", "min", "max", "stdv", "unique", "width", "height", "size"
  • filename_col str - column name in the stats_file to use as the filename
  • lower_percentile float - lower percentile to use for the threshold. Values are 0->1, where 0.05 means remove 5% of the lowest values.
  • upper_percentile float - upper percentile to use for the threshold. Values are 0->1, where 0.95 means remove 5% of the upper values.
  • lower_threshold float - lower threshold to use for the threshold. Only used if lower_percentile is None.
  • upper_threshold float - upper threshold to use for the threshold. Only used if upper_percentile is None.
  • get_reformat_filename_func callable - Optional parameter to allow changing the filename into another string. Useful in the case fastdup was run on a different folder or machine and you would like to delete files in another folder.
  • dry_run bool - if True, does not delete but prints the rm commands that would be used; otherwise deletes
  • how str - either 'delete', 'move' or 'retag'. In case of retag, the allowed values are retag=labelImg or retag=cvat
  • save_path str - optional. In case of a folder and how == 'retag' the label files will be moved to this folder.
  • work_dir str - optional. In case of stats dataframe, point to fastdup work_dir.

Returns:

  • ret list - list of deleted files (or moved or retagged files)
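
The percentile arguments select the tails of the metric distribution: lower_percentile=0.05 targets the lowest 5% of values and upper_percentile=0.95 targets the highest 5%. The sketch below illustrates that selection on made-up data using a simple rank-based rule. The function name and the data are hypothetical, and fastdup's exact percentile computation may differ.

```python
def select_by_percentile(stats, lower_percentile=None, upper_percentile=None):
    """stats maps filename -> metric value; return the filenames in the selected tails."""
    ordered = sorted(stats, key=stats.get)  # filenames from lowest to highest metric
    n = len(ordered)
    selected = []
    if lower_percentile is not None:
        selected += ordered[: round(lower_percentile * n)]
    if upper_percentile is not None:
        selected += ordered[round(upper_percentile * n):]
    return selected


# Hypothetical stats: 100 images whose mean brightness is 0, 1, ..., 99.
stats = {f"img{i:03d}.jpg": i for i in range(100)}
darkest = select_by_percentile(stats, lower_percentile=0.05)
both = select_by_percentile(stats, lower_percentile=0.02, upper_percentile=0.95)
```

Here darkest holds the 5 images with the lowest mean, and both holds the 2 darkest plus the 5 brightest images.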

export_to_tensorboard_projector

def export_to_tensorboard_projector(work_dir,
                                    log_dir,
                                    sample_size=900,
                                    sample_method='random',
                                    with_images=True,
                                    get_label_func=None,
                                    d=576,
                                    file_list=None)

Export feature vector embeddings to be visualized using tensorboard projector app.

Example:

import fastdup
fastdup.run('/my/data/', work_dir='out')
fastdup.export_to_tensorboard_projector(work_dir='out', log_dir='logs')

After the data is exported, run the tensorboard projector:

%load_ext tensorboard
%tensorboard --logdir=logs

Arguments:

  • work_dir str - work_dir where fastdup results are stored
  • log_dir str - output dir where tensorboard will read from
  • sample_size int - how many images to view. Default is 900.
  • sample_method str - how to sample, currently 'random' is supported.
  • with_images bool - add images to the visualization (default True)
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • d int - dimension of the embedding vector. Default is 576.
  • file_list list - Optional parameter to specify a list of files to be used for the visualization. If not specified, filenames are taken from the work_dir/atrain_features.dat.csv file
  • Note - be careful here as the order of the file_list matters, need to keep the exact same order as the atrain_features.dat.csv file!

Returns:

  • ret int - 0 in case of success, 1 in case of failure

generate_sprite_image

def generate_sprite_image(img_list,
                          sample_size,
                          log_dir,
                          get_label_func=None,
                          h=0,
                          w=0,
                          alternative_filename=None,
                          alternative_width=None,
                          max_width=None)

Generate a sprite image of images for tensorboard projector. A sprite image is a large image composed of grid of smaller images.

Arguments:

  • img_list list - list of image filenames (full path)
  • sample_size int - how many images to plot
  • log_dir str - directory to save the sprite image
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • h int - optional requested height of each subimage
  • w int - optional requested width of each subimage
  • alternative_filename str - optional parameter to save the resulting image to a different name
  • alternative_width int - optional parameter to control the number of images per row
  • max_width int - optional parameter to control the resulting width of the image

Returns:

  • path str - path to sprite image
  • labels list - list of labels
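
A sprite image places equally sized thumbnails on a grid. The sketch below computes where each thumbnail would go for a row-major layout. It is only an illustration: the function name is hypothetical, the near-square default grid is an assumption (the layout used by generate_sprite_image is not stated here), and images_per_row merely plays the role that alternative_width is described as having.

```python
import math


def sprite_layout(num_images, w, h, images_per_row=None):
    """Return (sprite_width, sprite_height, positions) for a row-major grid of tiles."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    if images_per_row is None:
        # Assumption: a near-square grid, a common choice for projector sprites.
        images_per_row = math.ceil(math.sqrt(num_images))
    rows = math.ceil(num_images / images_per_row)
    # Top-left pixel of each tile, filled left to right, then top to bottom.
    positions = [
        ((i % images_per_row) * w, (i // images_per_row) * h)
        for i in range(num_images)
    ]
    return images_per_row * w, rows * h, positions
```

For example, five 32x32 thumbnails end up on a 3x2 grid, so the sprite is 96 pixels wide and 64 pixels high.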

find_top_components

def find_top_components(work_dir,
                        get_label_func=None,
                        group_by='visual',
                        slice=None,
                        threshold=None,
                        metric=None,
                        descending=True,
                        min_items=None,
                        max_items=None,
                        keyword=None,
                        save_path=None,
                        comp_type="component",
                        **kwargs)

Function to find the largest components of duplicate images

Arguments:

  • work_dir str - working directory where fastdup.run was run.
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • group_by str - optional parameter to group by 'visual' or 'label'. When grouping by visual fastdup aggregates visually similar images together.
    When grouping by 'label' fastdup aggregates images with the same label together.
  • slice str - optional parameter to slice the results by a specific label. For example, if you want to slice by 'car' then pass 'car' as the slice parameter.
  • threshold float - optional threshold to select only distances larger than the threshold
  • metric str - optional metric to sort by. Valid values are mean,min,max,unique,blur,size
  • descending bool - optional value to sort the components, default is True
  • min_items int - optional value, select only components with at least min_items
  • max_items int - optional value, select only components with at most max_items
  • keyword str - optional, select labels with keyword value inside
  • save_path str - optional, save path
  • comp_type str - optional, either component or cluster

Returns:

  • df pd.DataFrame - of top components. The column component_id includes the component name.
    The column files includes a list of all image files in this component.
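
Conceptually, a component is a connected group of images in the similarity graph: two images belong to the same component if a chain of sufficiently similar pairs links them. The union-find sketch below illustrates the idea on made-up pairs. It is not fastdup's implementation, and the function name and the input format are hypothetical.

```python
def connected_components(pairs, threshold):
    """Group files linked by a similarity above threshold; largest group first."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    # Merge the two groups of every pair that is similar enough.
    for src, dst, similarity in pairs:
        if similarity > threshold:
            parent[find(src)] = find(dst)

    # Collect the members of each group under its root.
    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), []).append(item)
    return sorted((sorted(g) for g in groups.values()), key=len, reverse=True)


# Hypothetical similarity pairs: (from, to, similarity).
pairs = [
    ("a.jpg", "b.jpg", 0.99),
    ("b.jpg", "c.jpg", 0.97),
    ("d.jpg", "e.jpg", 0.98),
    ("e.jpg", "f.jpg", 0.50),
]
components = connected_components(pairs, threshold=0.96)
```

Note that a.jpg and c.jpg end up in the same component even though they were never compared directly, because both are linked to b.jpg.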

init_search

def init_search(k, work_dir, d=576, model_path=model_path_full, verbose=False)

Initialize real time search and precomputed nnf data.
This function should be called only once before running searches. The search function is search().

Arguments:

  • k int - number of nearest neighbors to search for
  • work_dir str - working directory where fastdup.run was run.
  • d int - dimension of the feature vector. Default is 576.
  • model_path str - path to the onnx model file. Optional.
  • verbose bool - (Optional): True for verbose mode.

Example:

import fastdup
input_dir = "/my/input/dir"
work_dir = "/my/work/dir"
fastdup.run(input_dir, work_dir)
        
# point to the work_dir where fastdup was run
fastdup.init_search(10, work_dir, verbose=True)

# The below code can be executed multiple times, each time with a new searched image
df = fastdup.search("myimage.jpg", None, verbose=True)

# optional: display search output
fastdup.create_duplicates_gallery(df, ".", input_dir=input_dir)

Note: the fastdup model was trained with images resized via Resampling.NEAREST and with the BGR channels swapped to RGB.
If you use another model, check its preprocessing requirements.

Returns:

  • ret int - 0 in case of success, otherwise 1.

search

def search(img, size, verbose=0)

Search for similar images in the image database.

Arguments:

  • img str - the image to search for
  • size int - image size width x height
  • verbose int - run in verbose mode

Returns:

  • ret int - 0 in case of success, 1 in case of failure.
    The output file is created at work_dir/similarity.csv, using the work_dir given to init_search

create_stats_gallery

def create_stats_gallery(stats_file,
                         save_path,
                         num_images=20,
                         lazy_load=False,
                         get_label_func=None,
                         metric='blur',
                         slice=None,
                         max_width=None,
                         descending=False,
                         get_bounding_box_func=None,
                         get_reformat_filename_func=None,
                         get_extra_col_func=None,
                         input_dir=None,
                         work_dir=None,
                         **kwargs)

Function to create and display a gallery of images computed by the statistics metrics.
Supported metrics are: mean (color), max (color), min (color), stdv (color), unique (number of unique colors), and blurriness (computed by the variance of the Laplacian method,
see https://theailearner.com/2021/10/30/blur-detection-using-the-variance-of-the-laplacian-method/).
The metrics are created by fastdup.run() and stored in the work_dir in a file named atrain_stats.csv. Note that the metrics are computed
on the fly: fastdup loads and resizes every image only once.

Arguments:

  • stats_file str - csv file with the computed image statistics by the fastdup tool, alternatively a pandas dataframe. Default stats file is saved by fastdup.run() into the folder work_dir as atrain_stats.csv.
  • save_path str - output folder location for the visuals
  • num_images int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
  • lazy_load boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces the html file size).
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • metric str - Optional metric selection. Supported metrics are:
    • width - of original image before resize
    • height - of original image before resize
    • size - area
    • file_size - file size in bytes
    • blur - variance of the laplacian
    • unique - number of unique colors, 0..255
    • mean - mean color 0..255
    • max - max color 0..255
    • min - min color 0..255
      Advanced metrics (to compute them, run with turi_param='run_advanced_stats=1'):
    • contrast
    • rms_contrast - square root of mean sum of stdv/mean per channel
    • mean_rel_intensity_r
    • mean_rel_intensity_b
    • mean_rel_intensity_g
    • mean_hue - transform to HSV and compute mean H
    • mean_saturation - transform to HSV and compute mean S
    • mean_val - transform to HSV and compute mean V
    • edge_density - using canny filter
    • mean_r - mean of R channel
    • mean_g - mean of G channel
    • mean_b - mean of B channel
  • slice str - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
  • max_width int - Optional parameter to select the maximal image width in the report
  • descending bool - Optional parameter to control the order of the metric
  • get_bounding_box_func callable - Optional parameter to allow plotting bounding boxes on top of the image.
    The input is an absolute path to the image and the output is a list of bounding boxes.
    Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
    Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
    Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists
  • get_reformat_filename_func callable - Optional parameter to allow changing the presented filename into another string.
    The input is an absolute path to the image and the output is the string to display instead of the filename.
  • get_extra_col_func callable - Optional parameter to allow adding extra columns to the gallery.
  • input_dir str - Optional parameter to specify the input directory of webdataset tar files,
    in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'
  • work_dir str - Optional parameter to fastdup work_dir. Needed when stats file is a pd.DataFrame.

Returns:

  • ret int - 0 in case of success, otherwise 1.
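
The blur metric is described as the variance of the Laplacian: sharp images produce strong, varied edge responses, while blurry or flat images produce responses close to a constant. The pure-Python sketch below applies the 4-neighbour Laplacian to a small grayscale image given as nested lists (at least 3x3). It only illustrates the technique; the kernel, border handling and resizing that fastdup actually uses are not specified here.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels of a 2D image."""
    rows, cols = len(gray), len(gray[0])
    responses = [
        gray[r - 1][c] + gray[r + 1][c] + gray[r][c - 1] + gray[r][c + 1]
        - 4 * gray[r][c]
        for r in range(1, rows - 1)
        for c in range(1, cols - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((x - mean) ** 2 for x in responses) / len(responses)


# A flat image has no edges; a checkerboard has an edge at every pixel.
flat = [[128] * 5 for _ in range(5)]
checker = [[255 * ((r + c) % 2) for c in range(5)] for r in range(5)]
```

A lower value suggests a blurrier image, so laplacian_variance(flat) is zero while laplacian_variance(checker) is large.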

create_similarity_gallery

def create_similarity_gallery(similarity_file,
                              save_path,
                              num_images=20,
                              lazy_load=False,
                              get_label_func=None,
                              slice=None,
                              max_width=None,
                              descending=False,
                              get_bounding_box_func=None,
                              get_reformat_filename_func=None,
                              get_extra_col_func=None,
                              input_dir=None,
                              work_dir=None,
                              min_items=2,
                              max_items=None,
                              **kwargs)

Function to create and display a gallery of images computed by the similarity metric. In each table row one query image is
displayed and num_images most similar images are displayed next to it on the right.

In case the dataset is labeled, the user can specify the label using the function get_label_func. In this case a score metric is computed to reflect how similar the query image is to its most similar images in terms of class label.
Score 100 means that out of the top num_images similar images, all are from the same class. Score 0 means that the image is similar only to images from a different class.
Score 50 means that the query image is similar to the same number of images from the same class and from other classes. The report is sorted by the score metric.
For a high quality labeled dataset we expect the score to be high; a low score may indicate class label issues.
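
One simple reading of this score is the percentage of nearest neighbours that share the query image's label. The sketch below implements that reading and reproduces the three reference points above (100, 0 and 50). The function name is hypothetical, and the exact formula fastdup uses is an assumption; it may, for instance, weight neighbours by distance.

```python
def label_score(query_label, neighbor_labels):
    """Percentage of nearest neighbours that share the query image's label."""
    if not neighbor_labels:
        raise ValueError("at least one neighbour is required")
    same = sum(1 for label in neighbor_labels if label == query_label)
    return 100.0 * same / len(neighbor_labels)
```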

Arguments:

  • similarity_file str - csv file with the computed image statistics by the fastdup tool, or a path to the work_dir,
    alternatively a pandas dataframe. In case of a pandas dataframe need to set work_dir to point to fastdup work_dir.
  • save_path str - output folder location for the visuals
  • num_images int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
  • lazy_load boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces the html file size).
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • slice str - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
    A special value is 'label_score' which is used for comparing both images and labels of the nearest neighbors. The score values are 0->100 where 0 means the query image is only similar to images outside its class, 100 means the query image is only similar to images from the same class.
  • max_width int - Optional param to limit the image width
  • descending bool - Optional param to control the order of the metric
  • get_bounding_box_func callable - Optional parameter to allow plotting bounding boxes on top of the image.
    The input is an absolute path to the image and the output is a list of bounding boxes.
    Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
    Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
    Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists
  • get_reformat_filename_func callable - Optional parameter to allow changing the presented filename into another string.
  • get_extra_col_func callable - Optional parameter to allow adding extra columns to the report
  • input_dir str - Optional parameter to specify the input directory of webdataset tar files,
    needed when working with webdataset tar files where the images were deleted after the run using turi_param='delete_img=1'
  • work_dir str - Optional parameter to fastdup work_dir. Needed when similarity_file is a pd.DataFrame.
  • min_items int - Optional parameter to select components with min_items or more
  • max_items int - Optional parameter to limit the number of items displayed

Returns:

  • ret pd.DataFrame - similarity dataframe, for each image filename returns a list of top K similar images.
    each row has the columns 'from', 'to', 'label' (optional), 'distance'
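
The callable and dictionary forms accepted by get_label_func and get_bounding_box_func, as described in the arguments above, could look roughly like the sketch below. It is an illustration only and not part of fastdup: the helper names (label_from_parent_folder, xywh_to_xyxy, get_bounding_boxes), the example path, and the folder-per-class layout are all assumptions. It also assumes that a col_x,row_y,width,height box maps to x1,y1,x2,y2 by adding the width and height to the top-left corner.

```python
import os

def label_from_parent_folder(path):
    """Hypothetical get_label_func: use the parent folder name as the label."""
    return os.path.basename(os.path.dirname(path))

def xywh_to_xyxy(col_x, row_y, width, height):
    """Convert a (col_x, row_y, width, height) box to [x1, y1, x2, y2].

    Assumes (col_x, row_y) is the top-left corner of the box.
    """
    return [col_x, row_y, col_x + width, row_y + height]

# Dictionary form: absolute filename -> list of bounding boxes.
# The path below is made up for the example.
boxes_by_file = {
    "/data/cats/001.jpg": [xywh_to_xyxy(10, 20, 100, 50)],
}

def get_bounding_boxes(path):
    """Hypothetical get_bounding_box_func: return the boxes for one image."""
    return boxes_by_file.get(path, [])
```

Either the functions or the dictionary itself could then be passed as the corresponding gallery argument.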

create_aspect_ratio_gallery

def create_aspect_ratio_gallery(stats_file,
                                save_path,
                                get_label_func=None,
                                lazy_load=False,
                                max_width=None,
                                num_images=0,
                                slice=None,
                                get_filename_reformat_func=None,
                                input_dir=None,
                                **kwargs)

Function to create and display a gallery of aspect ratio distribution.

Arguments:

  • stats_file str - csv file with the image statistics computed by the fastdup tool, or a work_dir path, or a pandas dataframe with the stats computed by fastdup.
  • save_path str - output folder location for the visuals
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • lazy_load boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).
  • max_width int - optional parameter to limit the plot image width
  • num_images int - optional number of images to compute the statistics on (default computes on all images)
  • slice str - optional parameter to slice the stats file based on a specific label or a list of labels.
  • get_filename_reformat_func callable - optional function to reformat the filename before displaying it.
  • input_dir str - Optional parameter to specify the input directory of webdataset tar files,
    needed when working with webdataset tar files where the images were deleted after the run using turi_param='delete_img=1'

Returns:

  • ret int - 0 in case of success, otherwise 1.

export_to_cvat

def export_to_cvat(files, labels, save_path)

Function to export a collection of files that needs to be annotated again to cvat batch job format.
This creates a file named fastdup_label.zip in the directory save_path.
The files can be retagged in cvat using Tasks -> Add (plus button) -> Create from backup -> choose the location of the fastdup_label.zip file.

Arguments:

files (str):
labels (str):
save_path (str):

Returns:

  • ret int - 0 in case of success, otherwise 1.

export_to_labelImg

def export_to_labelImg(files, labels, save_path)

Function to export a collection of files that needs to be annotated again to cvat batch job format.
This creates a file named fastdup_label.zip in the directory save_path.
The files can be retagged in cvat using Tasks -> Add (plus button) -> Create from backup -> choose the location of the fastdup_label.zip file.

Arguments:

files (str):
labels (str):
save_path (str):

Returns:

  • ret int - 0 in case of success, otherwise 1.

top_k_label

def top_k_label(labels_col,
                distance_col,
                k=10,
                threshold=None,
                min_count=None,
                unknown_class=None)

Function to classify examples based on their label using the top k nearest neighbors.
Decision is made by accounting for the majority of the neighbors.

Arguments:

  • labels_col list - list of labels
  • distance_col list - list of distances
  • k int - optional parameter
  • threshold float - optional parameter to consider only neighbors with similarity larger than threshold
  • min_count int - optional parameter to consider only examples with at least min_count neighbors with the same label
  • unknown_class - optional parameter to add decisions to unknown class in cases there is no majority

Returns:

computed label
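
The majority-vote idea described above can be sketched in a few lines. This is not the fastdup implementation: the function name majority_label is hypothetical, the neighbors are assumed to arrive sorted from most to least similar, distance values are treated as similarities (larger means closer), and returning unknown_class when no neighbors remain or min_count is not met is an assumption about the intended behavior.

```python
from collections import Counter

def majority_label(labels, distances, k=10, threshold=None,
                   min_count=None, unknown_class=None):
    """Pick the most common label among the top-k neighbors (illustrative)."""
    # Keep the first k neighbors (assumed sorted, most similar first).
    pairs = list(zip(labels, distances))[:k]
    # Optionally keep only neighbors whose similarity exceeds the threshold.
    if threshold is not None:
        pairs = [(label, sim) for label, sim in pairs if sim > threshold]
    if not pairs:
        return unknown_class
    # Majority vote over the remaining neighbor labels.
    label, count = Counter(label for label, _ in pairs).most_common(1)[0]
    # Require at least min_count neighbors agreeing on the winning label.
    if min_count is not None and count < min_count:
        return unknown_class
    return label
```

For example, three neighbors labeled cat, cat, dog yield cat, unless min_count=3 is requested, in which case the sketch falls back to unknown_class.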

create_knn_classifier

def create_knn_classifier(work_dir, k, get_label_func, threshold=None)

Function to create a knn classifier out of fastdup run. We assume there are existing labels to the datapoints.

Arguments:

  • work_dir str - fastdup work_dir, or location of a similarity file, or a pandas DataFrame with the computed similarities
  • k int - (unused)
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • threshold float - optional threshold to consider only neighbors with similarity larger than threshold.
    The classifier predicts, for each image, one of the given classes.

Returns:

  • df pd.DataFrame - List of predictions using knn method

create_kmeans_classifier

def create_kmeans_classifier(work_dir, k, get_label_func, threshold=None)

Function to create a kmeans classifier out of fastdup run. We assume there are existing labels to the datapoints.

Arguments:

  • work_dir str - fastdup work_dir, or location of a similarity file, or a pandas DataFrame with the computed similarities
  • k int - (unused)
  • get_label_func callable - optional function given an absolute path to an image return the image label.
    Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
    Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
  • threshold float - (unused)

Returns:

  • df pd.DataFrame - dataframe with filename, label and predicted label. Row per each image

run_kmeans

def run_kmeans(input_dir='',
               work_dir='.',
               verbose=False,
               num_clusters=100,
               num_em_iter=20,
               num_threads=-1,
               num_images=0,
               model_path=model_path_full,
               license='',
               nearest_neighbors_k=2,
               d=576,
               bounding_box="",
               high_accuracy=False)

Run KMeans algorithm on a folder of images given by input_dir and save the results to work_dir.
Fastdup will extract feature vectors using the model specified by model_path and then run KMeans to cluster the vectors.
The results will be saved to work_dir in the following format:

  • kmeans_centroids.csv: a csv file containing the centroids of the clusters.
  • kmeans_assignments.csv: assignment of each data point to the closest centroids (number of centroids given by nearest_neighbors_k).
    After running kmeans you can use create_kmeans_clusters_gallery to view the results.

Arguments:

  • input_dir str - path to the folder containing the images to be clustered. See fastdup.run for more details.
  • work_dir str - path to the folder where the results will be saved.
  • verbose bool - verbosity level, default False
  • num_clusters int - Number of KMeans clusters to use
  • num_em_iter int - Number of em iterations
  • num_threads int - Number of threads for performing the feature vector extraction
  • num_images int - Limit the number of images
  • model_path str - Model path for the model to be used for feature vector extraction
  • license str - License string
  • nearest_neighbors_k int - When assigning an image into a cluster, how many clusters to assign to (starting from the closest)
  • d int - Dimension of the feature vector
  • bounding_box str - Optional bounding box see fastdup:::run for more details
  • high_accuracy bool - Use higher accuracy model for the feature extraction

Returns:

  • ret int - 0 in case of success, 1 in case of error

run_kmeans_on_extracted

def run_kmeans_on_extracted(input_dir='',
                            work_dir='.',
                            verbose=False,
                            num_clusters=100,
                            num_em_iter=20,
                            num_threads=-1,
                            num_images=0,
                            model_path=model_path_full,
                            license='',
                            nearest_neighbors_k=2,
                            d=576)

Run KMeans algorithm on a folder of extracted feature vectors (created by default when running fastdup:::run).
The results will be saved to work_dir in the following format:

  • kmeans_centroids.csv: a csv file containing the centroids of the clusters. In each row one centroid. In total num_clusters rows.
  • kmeans_assignments.csv: assignment of each data point to the closest centroids (number of centroids given by nearest_neighbors_k). Each row lists the image filename, the centroid id (starting from zero) and the L2 distance to the centroid.
    After running kmeans you can use fastdup:::create_kmeans_clusters_gallery to view the results.

Arguments:

  • input_dir str - path to the folder containing the images to be clustered. See fastdup:::run for more details.
  • work_dir str - path to the folder where the results will be saved.
  • verbose bool - verbosity level, default False
  • num_clusters int - Number of KMeans clusters to use
  • num_em_iter int - Number of em iterations
  • num_threads int - Number of threads for performing the feature vector extraction
  • num_images int - Limit the number of images
  • model_path str - Model path for the model to be used for feature vector extraction
  • license str - License string
  • nearest_neighbors_k int - When assigning an image into a cluster, how many clusters to assign to (starting from the closest)
  • d int - Dimension of the feature vector

Returns:

  • ret int - 0 in case of success, 1 in case of error
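
To make the role of nearest_neighbors_k concrete, the sketch below assigns one feature vector to its closest centroids by L2 distance, mirroring the centroid id and distance columns described for kmeans_assignments.csv. It is a toy illustration rather than fastdup code: the function name assign_to_centroids and the two-dimensional example vectors are assumptions.

```python
import math

def assign_to_centroids(point, centroids, nearest_neighbors_k=2):
    """Return (centroid_id, L2 distance) pairs for the closest centroids."""
    # Compute the L2 distance from the point to every centroid.
    dists = [(idx, math.dist(point, c)) for idx, c in enumerate(centroids)]
    # Sort by distance and keep the nearest_neighbors_k closest.
    dists.sort(key=lambda pair: pair[1])
    return dists[:nearest_neighbors_k]

# Toy 2-D centroids; real feature vectors have d dimensions (576 by default).
centroids = [[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]]
assignments = assign_to_centroids([0.0, 0.0], centroids, nearest_neighbors_k=2)
```

With nearest_neighbors_k=1 each image would be assigned to a single cluster only.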

extract_video_frames

def extract_video_frames(input_dir,
                         work_dir,
                         verbose=False,
                         num_threads=-1,
                         num_images=0,
                         min_offset=0,
                         max_offset=0,
                         turi_param="",
                         model_path=model_path_full,
                         d=576,
                         resize_video=0,
                         keyframes_only=1,
                         license="")

A function to go over a collection of videos and extract them into frames. The output is saved to the work_dir/tmp
subfolder.

Arguments:

input_dir (str):
Location of the videos to extract.

  • A folder
  • A remote folder (s3 or minio starting with s3:// or minio://). When using minio append the minio server name for example minio://google/visual_db/sku110k.
  • A file containing absolute filenames each on its own row.
  • A file containing s3 full paths or minio paths each on its own row.
  • A python list with absolute filenames
  • We support avi/mp4 video formats.
  • work_dir str - Optional path for storing intermediate files and results.
  • verbose boolean - Verbosity.
  • num_threads int - Number of threads. If no value is specified num threads is auto configured by the number of cores.
  • num_images unsigned long long - Number of images to run on. By default, run on all the images in the input_dir folder.
  • turi_param str - Optional turi parameters separated by commas. Example run: turi_param='nnmodel=0,ccthreshold=0.99'
    The following parameters are supported.
    • nnmodel=xx, Nearest Neighbor model for clustering the features together. Supported options are 0 = brute_force (exact), 1 = ball_tree and 2 = lsh (both approximate).
    • ccthreshold=xx, Threshold for running connected components to find clusters of similar images. Allowed values 0->1. The default ccthreshold is 0.96. This groups very similar images together, for example identical images or images that underwent
      simple transformations like scaling, flip, zoom in. The higher the score, the more similar the grouped images are, and the smaller the clusters
      you will get. A score of 0.9 is pretty broad and will cluster images together even if their fine details are not similar.
      It is recommended to experiment with this parameter based on your dataset and then visualize the results using fastdup.create_components_gallery().
    • run_cc=0|1 run connected components on the resulting similarity graph. Default is 1.
    • run_pagerank=0|1 run pagerank on the resulting similarity graph. Default is 1.
    • delete_tar=0|1 when working with tar files obtained from cloud storage delete the tar after download
    • delete_img=0|1 when working with images obtained from cloud storage delete the image after download
    • tar_only=0|1 run only on tar files and ignore images in folders. Default is 0.
    • run_stats=0|1 compute image statistics. Default is 1.
    • sync_s3_to_local=0|1 In case of using s3 bucket sync s3 to local folder to improve performance. Assumes there is enough local disk space to contain the data. Default is 0.
  • min_offset unsigned long long - Optional min offset to start iterating on the full file list.
  • max_offset unsigned long long - Optional max offset to stop iterating on the full file list.
  • resize_video int - 0 = do not resize video, 1 = resize video based on the model_path dimensions
  • keyframes_only int - 0 = extract all frames, 1 = extract only keyframes
  • model_path str - optional string to point to an alternative onnx or ort model
  • d int - output feature vector dimension for the model

Returns:

  • ret int - Status code 0 = success, 1 = error.
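
Since turi_param is a single comma-separated string of key=value pairs, it can be convenient to assemble it from keyword arguments. The helper below is a hypothetical convenience, not part of fastdup. It only formats the string and checks that ccthreshold lies in the documented 0->1 range; it does not validate the option names.

```python
def build_turi_param(**options):
    """Join key=value pairs with commas, e.g. 'nnmodel=0,ccthreshold=0.99'."""
    # ccthreshold is documented as accepting values between 0 and 1.
    cc = options.get("ccthreshold")
    if cc is not None and not 0 <= cc <= 1:
        raise ValueError("ccthreshold must be between 0 and 1")
    # Keyword arguments keep their insertion order in Python 3.7+.
    return ",".join(f"{key}={value}" for key, value in options.items())

turi_param = build_turi_param(nnmodel=0, ccthreshold=0.99)
```

The resulting string could then be passed as the turi_param argument of the functions documented on this page.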