V0.2xx API
This page holds the API reference for the fastdup V0.2xx API, which is fully supported in current releases and includes a few features not yet covered by the V1.0 API.
fastdup
run
def run(input_dir='',
work_dir='.',
test_dir='',
compute='cpu',
verbose=False,
num_threads=-1,
num_images=0,
turi_param='nnmodel=0',
distance='cosine',
threshold=0.9,
lower_threshold=0.05,
model_path=model_path_full,
license='',
version=False,
nearest_neighbors_k=2,
d=576,
run_mode=0,
nn_provider='nnf',
min_offset=0,
max_offset=0,
nnf_mode="HNSW32",
nnf_param="",
bounding_box="",
batch_size=1,
resume=0,
high_accuracy=False)
Run the fastdup tool to find duplicates, near duplicates, outliers and clusters of related images in a corpus of images.
The only mandatory argument is input_dir.
Arguments:
input_dir (str):
Location of the images/videos to analyze.
- A folder.
- A remote folder (s3 or minio, starting with s3:// or minio://). When using minio, append the minio server name, for example minio://google/visual_db/sku110k.
- A file containing absolute filenames, each on its own row.
- A file containing s3 full paths or minio paths, each on its own row.
- A python list with absolute filenames.
- A python list with absolute folders; all images and videos in those folders are added recursively.
- For run_mode=2, a folder containing fastdup binary features, or a file containing a list of atrain_feature.dat.csv files in multiple folders.
- A yolov5 yaml input file containing train and test folders (single folder supported for now).
- We support jpg, jpeg, tiff, tif, giff, heif, heic, bmp, png, mp4, avi. In addition we support tar, tar.gz, tgz and zip files containing images.
If you have other image extensions that are readable by opencv imread(), you can give them in a file (each image on its own row); in that case we do not check for the
known extensions and use opencv to read those formats.
- Note: It is not possible to mix compressed (videos or tars/zips) and regular images. Use the flag turi_param='tar_only=1' if you want to ignore images and run from compressed files.
- Note: We assume image sizes are larger than or equal to 10x10 pixels. Smaller images (either in width or in height) are ignored and a warning is shown.
- Note: It is also possible to skip small images by defining a minimum allowed file size using turi_param='min_file_size=1000' (in bytes).
- Note: For performance reasons it is always preferable to copy s3 images to local disk and then run fastdup on the local disk, since copying images from s3 in a loop is very slow.
Alternatively, you can use the flag turi_param='sync_s3_to_local=1' to copy all images from the remote s3 bucket to disk ahead of the run.
- Note: The fastdup plus beta version now supports bounding boxes on the c++ side. To use it, prepare an input file with the following csv header: filename,col_x,row_y,width,height where each row has an image file
and bounding box information in the above format. Fastdup will run on the bounding box level and the reports will be generated on the bounding box level. To use bounding boxes, please sign up
for our free beta program at https://visual-layer.com or send an email to [email protected].
work_dir (str): Path for storing intermediate files and results.
test_dir (str): Optional path for test data. When given, similarity is computed between train and test images only (train/train and test/test comparisons are not performed).
The following options are supported:
- A local folder path.
- An s3:// or minio:// path.
- A python list with absolute filenames.
- A file containing absolute filenames, each on its own row.
compute (str): Compute type [cpu|gpu]. Note: gpu is supported only in the enterprise version.
verbose (boolean): Verbosity.
num_threads (int): Number of threads. If no value is specified, the number of threads is configured automatically from the number of cores.
num_images (unsigned long long): Number of images to run on. By default, run on all the images in the input_dir folder.
turi_param (str): Optional turi parameters separated by commas. Example: turi_param='ccthreshold=0.99'.
The following parameters are supported:
- ccthreshold=xx: Threshold for running connected components to find clusters of similar images. Allowed values are 0->1. The default ccthreshold is 0.96. This groups very similar images together, for example identical images or images that went through
simple transformations like scaling, flip, or zoom in. The higher the score, the more similar the grouped images are and the smaller the clusters you get.
A score of 0.9 is pretty broad and will cluster images together even if their fine details are not similar.
It is recommended to experiment with this parameter based on your dataset and then visualize the results using fastdup.create_components_gallery().
- run_cc=0|1: Run connected components on the resulting similarity graph. Default is 1.
- run_pagerank=0|1: Run pagerank on the resulting similarity graph. Default is 1.
- delete_tar=0|1: When working with tar files obtained from cloud storage, delete the tar after download.
- delete_img=0|1: When working with images obtained from cloud storage, delete the image after download.
- tar_only=0|1: Run only on tar files and ignore images in folders. Default is 0.
- run_stats=0|1: Compute image statistics. Default is 1.
- sync_s3_to_local=0|1: When using an s3 bucket, sync s3 to a local folder to improve performance. Assumes there is enough local disk space to contain the data. Default is 0.
distance (str): Distance metric for the Nearest Neighbors algorithm. The default is 'cosine', which works well in most cases.
For nn_provider='nnf' the following distance metrics are supported.
When using nnf_mode='Flat': 'cosine', 'euclidean', 'l1', 'linf', 'canberra', 'braycurtis', 'jensenshannon' are supported.
Otherwise 'cosine' and 'euclidean' are supported.
threshold (float): Similarity measure in the range 0->1, where 1 is totally identical and 0.98 and above is almost identical.
lower_threshold (float): Similarity percentile used to mark images that are far away (outliers) relative to the total distribution. The default of 0.05 means 5% of the total similarities computed.
model_path (str): Optional location of an ONNX model file; should not be used.
version (bool): Print out the version number. This function takes no argument.
nearest_neighbors_k (int): For each image, how many similar images to look for.
d (int): Length of the feature vector. The default is 576. When you use your own onnx model, change this parameter to the length of the model's output feature vector.
run_mode (int):
- 0 (the default) does the feature extraction and NN embedding to compute all pairs similarities.
It uses the input_dir argument to find the directory to run on (or a list of files to run on).
The features are extracted and saved into the work_dir path (the default features output file name is
features.dat in the same folder for storing the numpy features, and features.dat.csv for storing the
image file names corresponding to the numpy features).
For larger datasets it may be wise to split the run into two, to make sure intermediate results are stored in case you encounter an error.
- 1 computes the extracted features and stores them; it does not compute the NN embedding.
For large datasets, it is possible to run on a few computing nodes to extract the features in parallel.
Use the min_offset and max_offset flags to allocate a subset of the images to each computing node.
Offsets run from 0 to n-1, where n is the number of images in the input_dir folder.
- 2 reads a stored feature file and computes the NN embedding to provide similarities.
The input_dir param is ignored, and work_dir is used to point to the numpy feature file (give a full path and filename).
- 3 reads the NN model stored by nnf.index from the work_dir and computes all pairs similarity on all images
given by the test_dir parameter. input_dir should point to the location of the train data.
This mode is used for scoring similarities on a new test dataset given a precomputed similarity index on a train dataset.
- 4 reads the NN model stored by nnf.index from the work_dir and computes all pairs similarity on pre-extracted feature vectors computed by run_mode=1.
min_offset (unsigned long long): Optional min offset at which to start iterating on the full file list.
max_offset (unsigned long long): Optional max offset at which to stop iterating on the full file list.
nnf_mode (str): When nn_provider='nnf', selects the nnf model mode.
The default is HNSW32. Flat is more accurate.
nnf_param (str): Assigns optional parameters.
num_em_iter, number of KMeans EM iterations to run. Default is 20.
num_clusters, number of KMeans clusters to use. Default is 100.
bounding_box (str): Optional bounding box to crop images, given as bounding_box='row_y=xx,col_x=xx,height=xx,width=xx'. This defines a global bounding box to be used for all images.
Beta release features (need to sign up at https://visual-layer.com): It is possible to set bounding_box='face' to crop the face from the image (in case a face is present).
In addition, you can set bounding_box='yolov5s' and we will run yolov5s to create and crop bounding boxes on your data. (We do not host this model; it is downloaded from the relevant github project.)
For the face/yolov5 crop, the margin around the face is defined by turi_param='augmentation_horiz=0.2,augmentation_vert=0.2', where 0.2 means a 20% additional margin around the face relative to the width and height respectively.
It is possible to change the margin; the lowest value is 0 (no margin) and the upper allowed value is 1. Default is 0.2.
batch_size (int): Optional batch size when computing inference. Allowed values < 200. Note: batch_size > 1 is enabled in the enterprise version.
resume (int): Optional flag to resume from a previous run.
high_accuracy (bool): Compute a more accurate model. Runtime increases by about 15% and feature vector storage size/memory increases by about 60%. The upside is that the model can better distinguish minute details in images with many objects.
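Several of the options above are passed through turi_param as a single comma-separated string of name=value pairs. The following is a minimal sketch of assembling that string from a dictionary; build_turi_param is a hypothetical helper written for this page, not part of fastdup, and it only formats the string without validating option names.

```python
def build_turi_param(options):
    """Join option name/value pairs into a comma-separated name=value string."""
    parts = []
    for name, value in options.items():
        # Booleans are written as 0/1 to match the 0|1 flags documented above.
        if isinstance(value, bool):
            value = int(value)
        parts.append(f"{name}={value}")
    return ",".join(parts)


# Hypothetical usage: the option names are taken from the list above.
turi_param = build_turi_param(
    {"ccthreshold": 0.99, "run_cc": True, "min_file_size": 1000}
)
print(turi_param)  # ccthreshold=0.99,run_cc=1,min_file_size=1000
```

The resulting string can then be passed as the turi_param argument of run().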
Returns:
ret (int): Status code. 0 = success, 1 = error.
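For run_mode=1 on several computing nodes, each node needs its own min_offset and max_offset. The sketch below divides n images as evenly as possible between nodes. It assumes half-open ranges, where max_offset is one past the last image a node handles; the text above does not state whether max_offset is inclusive, so check this against your fastdup version. split_offsets is a hypothetical helper, not part of fastdup.

```python
def split_offsets(num_images, num_nodes):
    """Return one (min_offset, max_offset) pair per node, covering 0..num_images."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    base, extra = divmod(num_images, num_nodes)
    ranges = []
    start = 0
    for node in range(num_nodes):
        # The first `extra` nodes take one additional image each.
        size = base + (1 if node < extra else 0)
        # Assumption: max_offset is exclusive (one past the last image).
        ranges.append((start, start + size))
        start += size
    return ranges


for min_offset, max_offset in split_offsets(num_images=10, num_nodes=3):
    print(min_offset, max_offset)
```

Each pair would then be passed to a separate run() call with run_mode=1 on its node.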
run_on_webdataset
def run_on_webdataset(input_dir='',
work_dir='.',
test_dir='',
compute='cpu',
verbose=False,
num_threads=-1,
num_images=0,
turi_param='nnmodel=0',
distance='cosine',
threshold=0.9,
lower_threshold=0.05,
model_path=model_path_full,
license='',
version=False,
nearest_neighbors_k=2,
d=576,
nn_provider='nnf',
min_offset=0,
max_offset=0,
nnf_mode="HNSW32",
nnf_param="",
bounding_box="",
batch_size=1)
Run fastdup on a web dataset.
This run is composed of two stages. First, extract all feature vectors using run_mode=1, then run the nearest neighbor model using run_mode=2.
Make sure that work_dir has enough free space for extracting tar files. Tar files are extracted temporarily into the work_dir/tmp folder.
You can control the free space using the flags turi_param='delete_tar=1|0' and turi_param='delete_img=1|0'. When delete_tar=1 the tars are processed one by one and deleted after processing.
When delete_img=1 the images are processed one by one and deleted after processing.
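Because tar files are extracted under work_dir, it can help to check free disk space before starting. The sketch below is a rough pre-flight check using only the standard library. It assumes tars are processed one at a time (as with delete_tar=1), so only the largest tar has to fit; the expansion_factor of 2.0 is an arbitrary guess, and has_room_for_tars is a hypothetical helper, not part of fastdup.

```python
import os
import shutil


def has_room_for_tars(work_dir, tar_paths, expansion_factor=2.0):
    """Return True if work_dir has room for the largest tar times expansion_factor."""
    if not tar_paths:
        return True
    # Assumption: tars are handled one by one, so only the largest matters.
    largest = max(os.path.getsize(path) for path in tar_paths)
    free_bytes = shutil.disk_usage(work_dir).free
    return free_bytes >= largest * expansion_factor
```

If the check fails, the flags delete_tar=1 and delete_img=1 described above are the documented way to reduce disk usage.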
load_binary_feature
def load_binary_feature(filename, d=576)
Python function for loading the stored binary features written by fastdup, together with their matching filenames, for analysis in Python.
Arguments:
filename (str): The binary feature file location.
d (int): Feature vector length.
Returns:
filenames (list): A list of all image file names, of length X.
np_array (np.array): An np matrix of shape rows x d cols (default d is 576). Each row corresponds to the feature vector of a single image.
Example:
import fastdup
file_list, mat_features = fastdup.load_binary_feature(FILENAME_FEATURES)
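Once the filenames and feature rows are loaded, similarity can be inspected directly. The sketch below computes cosine similarity between rows and lists pairs at or above a threshold. It is a brute-force illustration in pure Python, not the nearest neighbor search fastdup uses internally; cosine_similarity and similar_pairs are hypothetical helpers, and the rows can be any sequences of numbers (for example the rows of the loaded matrix).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as not similar.
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(filenames, features, threshold=0.9):
    """Return (file_a, file_b, score) for every pair scoring at or above threshold."""
    pairs = []
    for i in range(len(filenames)):
        for j in range(i + 1, len(filenames)):
            score = cosine_similarity(features[i], features[j])
            if score >= threshold:
                pairs.append((filenames[i], filenames[j], score))
    return pairs
```

The quadratic loop is only practical for small samples; for full datasets the run() function is the intended route.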
save_binary_feature
def save_binary_feature(save_path, filenames, np_array)
Function for saving data to be used by fastdup. Given a list of images and their matching feature vectors in a numpy array,
the function saves the data in a format readable by fastdup. This skips the feature extraction step; the saved features can then be used with run_mode=2,
namely to run the nearest neighbor model on the feature vectors.
Arguments:
save_path (str): Working folder to save the files to.
filenames (list): A list of image file locations (absolute paths), of length n images.
np_array (np.array): Numpy array of size n x d. Each row is the feature vector of one file.
Returns:
ret (int): 0 in case of success, otherwise 1.
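Since the function expects n filenames and an n x d array, a quick shape check before saving can catch mismatches early. This is a minimal sketch in pure Python; check_features is a hypothetical helper, not part of fastdup, and it works on any sequence of rows that supports len().

```python
def check_features(filenames, rows, d=576):
    """Raise ValueError if filenames and feature rows do not line up as n x d."""
    if len(filenames) != len(rows):
        raise ValueError(
            f"{len(filenames)} filenames but {len(rows)} feature rows"
        )
    for index, row in enumerate(rows):
        if len(row) != d:
            raise ValueError(
                f"row {index} has length {len(row)}, expected {d}"
            )
    # Return the validated shape (n, d).
    return len(rows), d
```

Pass the same d here as the d argument later given to run().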
create_duplicates_gallery
def create_duplicates_gallery(similarity_file,
save_path,
num_images=20,
descending=True,
lazy_load=False,
get_label_func=None,
slice=None,
max_width=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
input_dir=None,
work_dir=None,
threshold=None,
**kwargs)
Function to create and display a gallery of duplicate/near duplicate images as computed by the similarity metric.
In addition, it is possible to compute a hierarchical gallery of duplicate/near duplicate clusters. To do so:
(A) Run fastdup to compute similarity on work_dir.
(B) Run connected components on the work_dir, saving the component results to save_path (need to run with lazy_load=True).
(C) Run create_duplicates_gallery() on the components to find pairs of similar components. Point similarity_file to the similarity_hierarchical_XX.csv file, where XX is the
connected components threshold (ccthreshold=XX).
Example:
import fastdup
fastdup.run('input_folder', 'output_folder')
fastdup.create_duplicates_gallery('output_folder', save_path='.', get_label_func = lambda x: x.split('/')[1], slice='hamburger')
Regarding get_label_func, this example assumes that the second folder name in the path is the class name, for example my_data/hamburger/image001.jpg. You can change it to match your own labeling convention.
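The lambda in the example splits on '/' and takes index 1, which fits relative paths such as my_data/hamburger/image001.jpg. For absolute paths, taking the name of the folder that directly contains the image may be more robust. A minimal sketch, assuming the class name is the immediate parent folder; label_from_parent_folder is a hypothetical helper, not part of fastdup.

```python
import os


def label_from_parent_folder(path):
    """Return the name of the folder that directly contains the image."""
    return os.path.basename(os.path.dirname(path))


print(label_from_parent_folder("/data/my_data/hamburger/image001.jpg"))  # hamburger
```

The function can be passed as get_label_func in place of the lambda.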
Arguments:
similarity_file (str): csv file with the similarities computed by the fastdup tool, or a work_dir path, or a pandas dataframe containing the similarities.
save_path (str): Output folder location for the visuals.
num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
descending (boolean): If False, print the similarities from the least similar to the most similar. Default is True.
lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
slice (str): Optional parameter to select a slice of the similarity file based on a specific label or a list of labels.
slice could be a specific label, e.g. slice='hamburger', in which case only similarities between hamburger and other classes are presented.
Two reserved arguments for slice are "diff" and "same". When using "diff" the report only shows similarities between classes. When using "same" the report only shows similarities inside the same class.
Note that when using slice, get_label_func should be provided.
max_width (int): Optional parameter to set the max width of the gallery.
get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func (callable): Optional parameter to allow adding an additional column to the report.
input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
for the case where the images were deleted after the run using turi_param='delete_img=1'.
work_dir (str): Optional parameter to specify the fastdup work_dir, when using a pd.DataFrame instead of a duplicate file path.
threshold (float): Optional parameter to specify the threshold for a similarity score to be considered a duplicate. Values above the threshold are considered duplicates.
Allowed values are between 0 and 1.
save_artifacts (boolean): Optional parameter to allow saving the intermediate artifacts (raw images, csv with results) to the output folder.
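The bounding box callable expects corner coordinates (x1, y1, x2, y2), while the csv format above uses col_x,row_y,width,height. The sketch below converts between the two and wraps the result in a callable. It assumes (col_x, row_y) is the top-left corner, so x2 = col_x + width and y2 = row_y + height; to_corner_boxes and the sample path are hypothetical, not part of fastdup.

```python
def to_corner_boxes(crops):
    """Convert {filename: [(col_x, row_y, width, height), ...]} to x1, y1, x2, y2 lists."""
    boxes = {}
    for filename, rects in crops.items():
        # Assumption: (col_x, row_y) is the top-left corner of each box.
        boxes[filename] = [
            [int(x), int(y), int(x + w), int(y + h)] for x, y, w, h in rects
        ]
    return boxes


crops = {"/data/img001.jpg": [(10, 20, 100, 50)]}
boxes = to_corner_boxes(crops)


def get_bounding_box_func(path):
    """Return the list of boxes for a file, or an empty list if none are known."""
    return boxes.get(path, [])


print(get_bounding_box_func("/data/img001.jpg"))  # [[10, 20, 110, 70]]
```

Returning an empty list for unknown files is a design choice of this sketch, so that images without annotations are still shown.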
create_duplicate_videos_gallery
def create_duplicate_videos_gallery(similarity_file,
save_path,
num_images=20,
descending=True,
lazy_load=False,
get_label_func=None,
slice=None,
max_width=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
input_dir=None,
work_dir=None,
threshold=None,
**kwargs)
Function to create and display a gallery of duplicate videos computed by the similarity metrics.
Example:
import fastdup
fastdup.run('input_folder', 'output_folder', run_mode=1) # extract frames from videos
fastdup.run('input_folder', 'output_folder', run_mode=2) # run fastdup
fastdup.create_duplicate_videos_gallery('output_folder', save_path='.')
Arguments:
similarity_file (str): csv file with the similarities computed by the fastdup tool, or a work_dir path, or a pandas dataframe containing the similarities.
save_path (str): Output folder location for the visuals.
num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
descending (boolean): If False, print the similarities from the least similar to the most similar. Default is True.
lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
slice (str): Optional parameter to select a slice of the similarity file based on a specific label or a list of labels.
slice could be a specific label, e.g. slice='hamburger', in which case only similarities between hamburger and other classes are presented.
Two reserved arguments for slice are "diff" and "same". When using "diff" the report only shows similarities between classes. When using "same" the report only shows similarities inside the same class.
Note that when using slice, get_label_func should be provided.
max_width (int): Optional parameter to set the max width of the gallery.
get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func (callable): Optional parameter to allow adding an additional column to the report.
input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
for the case where the images were deleted after the run using turi_param='delete_img=1'.
work_dir (str): Optional parameter to specify the fastdup work_dir, when using a pd.DataFrame instead of a duplicate file path.
threshold (float): Optional parameter to specify the threshold for a similarity score to be considered a duplicate. Values above the threshold are considered duplicates.
Allowed values are between 0 and 1.
save_artifacts (boolean): Optional parameter to allow saving the intermediate artifacts (raw images, csv with results) to the output folder.
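The slice rules described above ("same", "diff", or a specific label) can be illustrated on a plain list of file pairs. This is only a sketch of the described behaviour, not fastdup's implementation. In particular, it assumes that a specific label keeps every pair where at least one side carries that label; filter_pairs and slice_rule are hypothetical names.

```python
def filter_pairs(pairs, get_label, slice_rule):
    """Keep (file_a, file_b) pairs according to 'same', 'diff' or a specific label."""
    kept = []
    for file_a, file_b in pairs:
        label_a, label_b = get_label(file_a), get_label(file_b)
        if slice_rule == "same":
            keep = label_a == label_b
        elif slice_rule == "diff":
            keep = label_a != label_b
        else:
            # Assumption: a specific label keeps pairs where either side has it.
            keep = slice_rule in (label_a, label_b)
        if keep:
            kept.append((file_a, file_b))
    return kept
```

Here get_label plays the role of get_label_func, which is why the slice option requires labels to be available.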
create_outliers_gallery
def create_outliers_gallery(outliers_file,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
how='one',
slice=None,
max_width=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
input_dir=None,
work_dir=None,
**kwargs)
Function to create and display a gallery of images computed by the outliers metrics.
Outliers are computed using the fastdup tool, by embedding each image to a short feature vector, finding the top k similar neighbors
and finding images that are further away from all other images, i.e. outliers.
By default fastdup saves the outliers into a file called outliers.csv inside the work_dir folder.
It is possible to load this file using pandas to get the list of outlier images.
Note that the number of images included in the outliers file depends on the lower_threshold parameter in the fastdup run. This parameter is a percentile,
i.e. 0.05 means the top 5% of the images that are further away from the rest of the images are considered outliers.
Arguments:
outliers_file (str): csv file with the outliers computed by the fastdup tool, or a work_dir path, or a pandas dataframe containing the outliers.
save_path (str): Output folder location for the visuals.
num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
how (str): Optional outlier selection method. one = take the image that is far away from any one image (but may have other images close to it).
all = take the image that is far away from all other images. Default is one.
slice (str): Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
max_width (int): Optional parameter to set the max width of the gallery.
get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func (callable): Optional parameter to allow adding an additional column to the report.
input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
for the case where the images were deleted after the run using turi_param='delete_img=1'.
work_dir (str): Optional parameter to specify the fastdup work_dir, when using a pd.DataFrame instead of an outliers file path.
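The outliers.csv file in work_dir can also be inspected without building a gallery. The sketch below reads it with the standard csv module and sorts the rows by a numeric column in ascending order. The column name "distance" is an assumption about the file layout and is exposed as a parameter so it can be adjusted; read_outliers is a hypothetical helper, not part of fastdup.

```python
import csv


def read_outliers(csv_path, distance_column="distance"):
    """Read an outliers csv and return its rows sorted by distance_column, ascending."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    # Assumption: distance_column holds a numeric score for each row.
    rows.sort(key=lambda row: float(row[distance_column]))
    return rows
```

Each returned row is a dictionary keyed by the csv header, so the file names can be read from whichever columns the file provides.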
create_components_gallery
def create_components_gallery(work_dir,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
group_by='visual',
slice=None,
max_width=None,
max_items=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
threshold=None,
metric=None,
descending=True,
min_items=None,
keyword=None,
input_dir=None,
**kwargs)
Function to create and display a gallery of images for the largest graph components.
Arguments:
work_dir (str): Path to the fastdup work_dir, or a path to a connected components csv file. Alternatively, a dataframe with the connected_components.csv content from a previous fastdup run.
save_path (str): Output folder location for the visuals.
num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
group_by (str): [visual|label]. Group the report using the visual properties of the image or using the labels of the images. Default is visual.
slice (str or list): Optional parameter to select a slice of the results based on a specific label or a list of labels.
max_width (int): Optional parameter to set the max html width of images in the gallery. Default is None.
max_items (int): Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.
get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func (callable): Optional parameter to allow adding more information to the report.
threshold (float): Optional parameter to set the threshold for choosing components. Default is None.
metric (str): Optional parameter to set the metric to use (like blur) for choosing components. Default is None.
descending (boolean): Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.
min_items (int): Optional parameter to select components with min_items or more items. Default is None.
keyword (str): Optional parameter to select components with keyword as a subset of the label. Default is None.
input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
for the case where the images were deleted after the run using turi_param='delete_img=1'.
kwargs (dict): Optional parameter to pass additional parameters to the function.
split_sentence_to_label_list (boolean): Optional parameter to split the label into a list of labels. Default is False.
limit_labels_printed (int): Optional parameter to limit the number of labels printed in the html report. Default is max_items.
nrows (int): Limit the number of rows read, for debugging the report.
save_artifacts (bool): Optional parameter to save intermediate artifacts, like the image paths used for generating the components.
Returns:
ret (int): 0 in case of success, otherwise 1.
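Long absolute paths can clutter a gallery, which is what get_reformat_filename_func is for. The sketch below keeps only the last two path components, for example the class folder and the file name. short_name is a hypothetical helper, not part of fastdup, and the choice of two components is arbitrary.

```python
import os


def short_name(path, parts=2):
    """Return the last `parts` components of a path, joined with '/'."""
    pieces = os.path.normpath(path).split(os.sep)
    return "/".join(pieces[-parts:])


print(short_name("/data/my_data/hamburger/image001.jpg"))  # hamburger/image001.jpg
```

The function takes an absolute path and returns the string to display, matching the callable contract described above.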
create_component_videos_gallery
def create_component_videos_gallery(work_dir,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
group_by='visual',
slice=None,
max_width=None,
max_items=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
threshold=None,
metric=None,
descending=True,
min_items=None,
keyword=None,
input_dir=None,
**kwargs)
Function to create and display a gallery of similar videos based on the graph components.
Arguments:
work_dir (str): Path to the fastdup work_dir.
save_path (str): Output folder location for the visuals.
num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
group_by (str): [visual|label]. Group the report using the visual properties of the image or using the labels of the images. Default is visual.
slice (str or list): Optional parameter to select a slice of the results based on a specific label or a list of labels.
max_width (int): Optional parameter to set the max html width of images in the gallery. Default is None.
max_items (int): Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.
get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func (callable): Optional parameter to allow adding more information to the report.
threshold (float): Optional parameter to set the threshold for choosing components. Default is None.
metric (str): Optional parameter to set the metric to use (like blur) for choosing components. Default is None.
descending (boolean): Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.
min_items (int): Optional parameter to select components with min_items or more items. Default is None.
keyword (str): Optional parameter to select components with keyword as a subset of the label. Default is None.
input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
for the case where the images were deleted after the run using turi_param='delete_img=1'.
Returns:
retint - 0 in case of success, otherwise 1
create_kmeans_clusters_gallery
def create_kmeans_clusters_gallery(work_dir,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
slice=None,
max_width=None,
max_items=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
threshold=None,
metric=None,
descending=True,
min_items=None,
keyword=None,
input_dir=None,
**kwargs)
Function to visualize the kmeans clusters.
Arguments:
work_dir (str): path to fastdup work_dir
save_path (str): output folder location for the visuals
num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise, use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
slice (str or list): Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
max_width (int): Optional parameter to set the max html width of images in the gallery. Default is None.
max_items (int): Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.
get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func (callable): Optional parameter to allow adding more information to the report.
threshold (float): Optional parameter to set the threshold for choosing components. Default is None.
metric (str): Optional parameter to set the metric to use (like blur) for choosing components. Default is None.
descending (boolean): Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.
min_items (int): Optional parameter to select components with min_items or more items. Default is None.
keyword (str): Optional parameter to select components with keyword as a subset of the label. Default is None.
input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'
Returns:
ret (int): 0 in case of success, otherwise 1
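The callable and dictionary forms of get_label_func and get_bounding_box_func described above could look like the sketch below. It assumes a hypothetical layout in which each image sits in a folder named after its class; the paths and box values are made up for illustration, and the fastdup call itself is left as a comment.

```python
import os

def label_from_parent_folder(filename):
    """Return the name of the folder containing the image as its label."""
    return os.path.basename(os.path.dirname(filename))

# Hypothetical bounding boxes keyed by absolute filename: [x1, y1, x2, y2].
BOXES = {
    "/data/cats/img1.jpg": [[0, 0, 100, 100]],
    "/data/dogs/img2.jpg": [[10, 20, 60, 80], [5, 5, 30, 30]],
}

def boxes_for(filename):
    """Return the list of boxes for an image, or an empty list if none are known."""
    return BOXES.get(filename, [])

# Dictionary form of get_label_func: absolute filename -> label.
LABELS = {path: label_from_parent_folder(path) for path in BOXES}

# Assumed usage, following the signature documented above:
# fastdup.create_kmeans_clusters_gallery("out", "report",
#     get_label_func=label_from_parent_folder,
#     get_bounding_box_func=boxes_for)
```

Either the functions or the dictionaries can be passed, since the argument descriptions accept both forms.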
remove_duplicates
def remove_duplicates(input_dir, work_dir=None, distance=0.96, dry_run=True)
Function to automate deletion of duplicate images, where similarity is above the distance param.
Arguments:
input_dir: see fastdup.run()
work_dir: see fastdup.run()
distance: delete only components whose similarity is larger than this value.
dry_run (bool): if True, does not delete but prints the rm commands used; otherwise deletes
Returns:
ret (list): list of deleted files
delete_components
def delete_components(top_components, to_delete=None, min_distance=0.96, how='one', dry_run=True)
Function to automate deletion of duplicate images using the connected components analysis.
Example:
import fastdup
fastdup.run('/path/to/data', '/path/to/output')
top_components = fastdup.find_top_components('/path/to/output')
top_components = top_components[top_components['distance'] > 0.99]  # select only components whose similarity is above 0.99 for removal
fastdup.delete_components(top_components, None, how='one', dry_run=False)
Arguments:
top_components (pd.DataFrame): largest components as found by the function find_top_components().
to_delete (list): a list of integer component ids to delete. The default is None, which means delete duplicates from all components.
min_distance: delete only components whose similarity is larger than min_distance.
how (str): either 'all' (deletes the whole component) or 'one' (leaves one image and deletes the rest of the duplicates)
dry_run (bool): if True, does not delete but prints the rm commands used; otherwise deletes
Returns:
ret (list): list of deleted files
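To make the how and dry_run semantics concrete, here is a small pure-Python illustration of which files each mode would select. It is not fastdup's implementation: the helper name files_to_delete and the component dictionary are hypothetical, and which image survives under how='one' (here, the first) is an assumption.

```python
def files_to_delete(components, how="one"):
    """components maps a component id to its list of image files."""
    if how not in ("one", "all"):
        raise ValueError("how must be 'one' or 'all'")
    selected = []
    for files in components.values():
        # 'one' keeps the first file of each component, 'all' keeps none.
        selected.extend(files[1:] if how == "one" else files)
    return selected

components = {7: ["a.jpg", "a_copy.jpg", "a_copy2.jpg"], 9: ["b.jpg", "b_copy.jpg"]}

# Mimic dry_run=True: print the rm commands instead of deleting anything.
for path in files_to_delete(components, how="one"):
    print("rm", path)
```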
delete_components_by_label
def delete_components_by_label(top_components_file,
min_items=10,
min_distance=0.96,
how='one',
dry_run=True)
Function to automate deletion of duplicate images using the connected components analysis.
Arguments:
top_components (pd.DataFrame): largest components as found by the function find_top_components().
to_delete (list): a list of integer component ids to delete
min_distance: delete only components whose similarity is larger than min_distance.
how (str): either 'all' (deletes the whole component) or 'majority' (leaves one image with the dominant label count and deletes the rest)
dry_run (bool): if True, does not delete but prints the rm commands used; otherwise deletes
Returns:
ret (list): list of deleted files
delete_or_retag_stats_outliers
def delete_or_retag_stats_outliers(stats_file,
metric,
filename_col='filename',
label_col=None,
lower_percentile=None,
upper_percentile=None,
lower_threshold=None,
upper_threshold=None,
get_reformat_filename_func=None,
dry_run=True,
how='delete',
save_path=None,
work_dir=None)
Function to automate deletion of outlier files based on computed statistics.
Example:
import fastdup
fastdup.run('/my/data/', work_dir='out')
# delete the 5% of images with the lowest mean color value (the darkest images)
fastdup.delete_or_retag_stats_outliers("out", metric="mean", lower_percentile=0.05, dry_run=False)
It is recommended to run with dry_run=True first, to see the list of files deleted before actually deleting.
Example:
This example first finds wrong labels using the similarity gallery and then deletes anything with score < 51.
Score is in range 0-100 where 100 means this image is similar only to images from the same class label.
Score 0 means this image is only similar to images from other class labels.
import fastdup
df2 = fastdup.create_similarity_gallery(..., get_label_func=...)
fastdup.delete_or_retag_stats_outliers(df2, metric='score', filename_col = 'from', lower_threshold=51, dry_run=True)
Note: it is possible to run with both lower_percentile and upper_percentile at once. It is not possible to run with lower_percentile and lower_threshold at once, since they may conflict.
Arguments:
stats_file (str):
- folder pointing to fastdup workdir or
- file pointing to work_dir/atrain_stats.csv file or
- pandas DataFrame containing the list of files given in the filename_col column and a metric column.
metric (str): statistic metric, should be one of "blur", "mean", "min", "max", "stdv", "unique", "width", "height", "size"
filename_col (str): column name in the stats_file to use as the filename
lower_percentile (float): lower percentile to use for the threshold. Values are 0->1, where 0.05 means remove 5% of the lowest values.
upper_percentile (float): upper percentile to use for the threshold. Values are 0->1, where 0.95 means remove 5% of the upper values.
lower_threshold (float): lower threshold to use for the threshold. Only used if lower_percentile is None.
upper_threshold (float): upper threshold to use for the threshold. Only used if upper_percentile is None.
get_reformat_filename_func (callable): Optional parameter to allow changing the filename into another string. Useful when fastdup was run on a different folder or machine and you would like to delete files in another folder.
dry_run (bool): if True, does not delete but prints the rm commands used; otherwise deletes
how (str): either 'delete' or 'move' or 'retag'. In case of retag the allowed values are retag=labelImg or retag=cvat
save_path (str): optional. In case of a folder and how == 'retag' the label files will be moved to this folder.
work_dir (str): optional. In case of a stats dataframe, point to the fastdup work_dir.
Returns:
ret (list): list of deleted files (or moved or retagged files)
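The interaction between the percentile and threshold arguments can be illustrated in plain Python. The sketch below is an assumption about the selection rule as described in the argument list, not fastdup's code; select_outliers is a hypothetical helper, and the percentile is a simple cut on the sorted values.

```python
def select_outliers(stats, lower_percentile=None, lower_threshold=None):
    """stats is a list of (filename, metric_value) pairs."""
    if lower_percentile is not None and lower_threshold is not None:
        raise ValueError("use lower_percentile or lower_threshold, not both")
    ordered = sorted(stats, key=lambda item: item[1])
    if lower_percentile is not None:
        # Select the lowest fraction of the files, e.g. 0.05 -> lowest 5%.
        count = int(len(ordered) * lower_percentile)
        return [name for name, _ in ordered[:count]]
    if lower_threshold is not None:
        return [name for name, value in ordered if value < lower_threshold]
    return []

# 20 hypothetical files whose mean brightness grows with the index.
stats = [("img%02d.jpg" % i, 10.0 * i) for i in range(20)]
print(select_outliers(stats, lower_percentile=0.05))
print(select_outliers(stats, lower_threshold=25))
```

Rejecting the combination up front mirrors the note that the two lower bounds may conflict.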
export_to_tensorboard_projector
def export_to_tensorboard_projector(work_dir,
log_dir,
sample_size=900,
sample_method='random',
with_images=True,
get_label_func=None,
d=576,
file_list=None)
Export feature vector embeddings to be visualized using the tensorboard projector app.
Example:
import fastdup
fastdup.run('/my/data/', work_dir='out')
fastdup.export_to_tensorboard_projector(work_dir='out', log_dir='logs')
After the data is exported, run the tensorboard projector:
%load_ext tensorboard
%tensorboard --logdir=logs
Arguments:
work_dir (str): work_dir where fastdup results are stored
log_dir (str): output dir where tensorboard will read from
sample_size (int): how many images to view. Default is 900.
sample_method (str): how to sample, currently 'random' is supported.
with_images (bool): add images to the visualization (default True)
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
d (int): dimension of the embedding vector. Default is 576.
file_list (list): Optional parameter to specify a list of files to be used for the visualization. If not specified, filenames are taken from the work_dir/atrain_features.dat.csv file.
Note: be careful here, as the order of the file_list matters. It must keep the exact same order as the atrain_features.dat.csv file.
Returns:
ret (int): 0 in case of success, 1 in case of failure
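When labels are supplied as a file, the text above requires a header row index,label and one row per image in the same order as the image list. A minimal sketch of writing such a file with the standard csv module follows; the labels and output path are hypothetical, and numbering the index column from zero is an assumption.

```python
import csv
import os
import tempfile

def write_label_file(path, labels):
    """Write labels in image-list order with the header row index,label."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["index", "label"])
        for index, label in enumerate(labels):
            writer.writerow([index, label])

# Labels must follow the order of the image list written by fastdup.run().
labels = ["cat", "dog", "cat"]
label_path = os.path.join(tempfile.mkdtemp(), "labels.csv")
write_label_file(label_path, labels)

with open(label_path, newline="") as handle:
    rows = list(csv.reader(handle))
print(rows[0], len(rows) - 1)
```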
generate_sprite_image
def generate_sprite_image(img_list,
sample_size,
log_dir,
get_label_func=None,
h=0,
w=0,
alternative_filename=None,
alternative_width=None,
max_width=None)
Generate a sprite image of images for the tensorboard projector. A sprite image is a large image composed of a grid of smaller images.
Arguments:
img_list (list): list of image filenames (full path)
sample_size (int): how many images to plot
log_dir (str): directory to save the sprite image
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
h (int): optional requested height of each subimage
w (int): optional requested width of each subimage
alternative_filename (str): optional parameter to save the resulting image under a different name
alternative_width (int): optional parameter to control the number of images per row
max_width (int): optional parameter to control the resulting width of the image
Returns:
path (str): path to the sprite image
labels (list): list of labels
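A sprite image packs the thumbnails into a grid. The sketch below only illustrates the layout arithmetic for a square, row-major grid, which is the layout the TensorBoard projector is generally documented to expect; it is not the fastdup implementation, and the function names are hypothetical.

```python
import math

def sprite_grid_side(num_images):
    """Smallest n such that an n x n grid holds num_images thumbnails."""
    return math.ceil(math.sqrt(num_images))

def cell_origin(index, side, w, h):
    """Top-left pixel (x, y) of thumbnail `index` in a row-major grid."""
    row, col = divmod(index, side)
    return col * w, row * h

side = sprite_grid_side(900)  # 900 is the default sample_size of the projector export
print(side, cell_origin(31, side, 32, 32))
```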
find_top_components
def find_top_components(work_dir,
get_label_func=None,
group_by='visual',
slice=None,
threshold=None,
metric=None,
descending=True,
min_items=None,
max_items=None,
keyword=None,
save_path=None,
comp_type="component",
**kwargs)
Function to find the largest components of duplicate images.
Arguments:
work_dir (str): working directory where fastdup.run was run.
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
group_by (str): optional parameter to group by 'visual' or 'label'. When grouping by 'visual' fastdup aggregates visually similar images together.
When grouping by 'label' fastdup aggregates images with the same label together.
slice (str): optional parameter to slice the results by a specific label. For example, if you want to slice by 'car' then pass 'car' as the slice parameter.
threshold (float): optional threshold to select only distances larger than the threshold
metric (str): optional metric to sort by. Valid values are mean, min, max, unique, blur, size
descending (bool): optional value to sort the components, default is True
min_items (int): optional value, select only components with at least min_items
max_items (int): optional value, select only components with at most max_items
keyword (str): optional, select labels with keyword value inside
save_path (str): optional, save path
comp_type (str): optional, either component or cluster
Returns:
df (pd.DataFrame): top components. The column component_id includes the component name.
The column files includes a list of all image files in this component.
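Conceptually, a component is a group of images linked by pairwise similarities above a threshold. The following union-find sketch shows that idea on a hypothetical list of (from, to, similarity) pairs; it illustrates the concept only and makes no claim about how fastdup computes its components.

```python
def connected_components(pairs, threshold):
    """Group files linked by a similarity above threshold; largest group first."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for left, right, similarity in pairs:
        if similarity > threshold:
            parent[find(left)] = find(right)

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), []).append(item)
    return sorted((sorted(g) for g in groups.values()), key=len, reverse=True)

pairs = [("a.jpg", "b.jpg", 0.99), ("b.jpg", "c.jpg", 0.97),
         ("d.jpg", "e.jpg", 0.98), ("e.jpg", "f.jpg", 0.50)]
print(connected_components(pairs, threshold=0.96))
```

Images that never appear in a pair above the threshold (f.jpg here) do not form a component in this sketch.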
init_search
def init_search(k, work_dir, d=576, model_path=model_path_full, verbose=False)
Initialize real time search and precomputed nnf data.
This function should be called only once before running searches. The search function is search().
Arguments:
k (int): number of nearest neighbors to search for
work_dir (str): working directory where fastdup.run was run.
d (int): dimension of the feature vector. Default is 576.
model_path (str): path to the onnx model file. Optional.
verbose (bool): (Optional) True for verbose mode.
Example:
import fastdup
input_dir = "/my/input/dir"
work_dir = "/my/work/dir"
fastdup.run(input_dir, work_dir)
# point to the work_dir where fastdup was run
fastdup.init_search(10, work_dir, verbose=True)
# The below code can be executed multiple times, each time with a new searched image
df = fastdup.search("myimage.jpg", None, verbose=True)
# optional: display search output
fastdup.create_duplicates_gallery(df, ".", input_dir=input_dir)
Note: the fastdup model was trained with image resize via Resampling.NEAREST and with the BGR channels swapped to RGB.
If you use other models, check their requirements.
Returns:
ret (int): 0 in case of success, otherwise 1.
search
def search(img, size, verbose=0)
Search for similar images in the image database.
Arguments:
img (str): the image to search for
size (int): image size width x height
verbose (int): run in verbose mode
Returns:
ret (int): 0 in case of success, 1 in case of failure
The output file is created at work_dir/similarity.csv, using the work_dir initialized by init_search.
create_stats_gallery
def create_stats_gallery(stats_file,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
metric='blur',
slice=None,
max_width=None,
descending=False,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
input_dir=None,
work_dir=None,
**kwargs)
Function to create and display a gallery of images computed by the statistics metrics.
Supported metrics are: mean (color), max (color), min (color), stdv (color), unique (number of unique colors), blurriness (computed by the variance of the Laplacian method,
see https://theailearner.com/2021/10/30/blur-detection-using-the-variance-of-the-laplacian-method/).
The metrics are created by fastdup.run() and stored in the work_dir in a file named atrain_stats.csv. Note that the metrics are computed
on the fly: fastdup loads and resizes every image only once.
Arguments:
stats_file (str): csv file with the image statistics computed by the fastdup tool, alternatively a pandas dataframe. The default stats file is saved by fastdup.run() into the folder work_dir as atrain_stats.csv.
save_path (str): output folder location for the visuals
num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise, use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
metric (str): Optional metric selection. Supported metrics are:
- width - of original image before resize
- height - of original image before resize
- size - area
- file_size - file size in bytes
- blur - variance of the laplacian
- unique - number of unique colors, 0..255
- mean - mean color 0..255
- max - max color 0..255
- min - min color 0..255
Advanced metrics include (for running advanced metrics, run with turi_param='run_advanced_stats=1'):
- contrast
- rms_contrast - square root of mean sum of stdv/mean per channel
- mean_rel_intensity_r
- mean_rel_intensity_b
- mean_rel_intensity_g
- mean_hue - transform to HSV and compute mean H
- mean_saturation - transform to HSV and compute mean S
- mean_val - transform to HSV and compute mean V
- edge_density - using canny filter
- mean_r - mean of R channel
- mean_g - mean of G channel
- mean_b - mean of B channel
slice (str): Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
max_width (int): Optional parameter to select the maximal image width in the report
descending (bool): Optional parameter to control the order of the metric
get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func (callable): Optional parameter to allow adding extra columns to the gallery.
input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'
work_dir (str): Optional parameter pointing to the fastdup work_dir. Needed when the stats file is a pd.DataFrame.
Returns:
ret (int): 0 in case of success, otherwise 1.
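The blur metric is described above as the variance of the Laplacian. As a rough, dependency-free illustration of that idea (not fastdup's implementation, which may use a different kernel, border handling, or library), the sketch below applies a 4-neighbour Laplacian to a small grayscale grid and takes the variance of the response.

```python
from statistics import pvariance

def laplacian_variance(gray):
    """gray is a list of rows of pixel intensities; border pixels are skipped."""
    responses = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            responses.append(gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1]
                             + gray[y][x + 1] - 4 * gray[y][x])
    return pvariance(responses)

flat = [[128] * 5 for _ in range(5)]  # no edges at all
checker = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]

print(laplacian_variance(flat), laplacian_variance(checker))
```

A low value indicates few sharp edges, which is why low scores under this method are usually read as blurry.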
create_similarity_gallery
def create_similarity_gallery(similarity_file,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
slice=None,
max_width=None,
descending=False,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
input_dir=None,
work_dir=None,
min_items=2,
max_items=None,
**kwargs)
Function to create and display a gallery of images computed by the similarity metric. In each table row one query image is
displayed and the num_images most similar images are displayed next to it on the right.
If the dataset is labeled, the user can specify the labels using get_label_func. In this case a score metric is computed to reflect how similar the query image is to its most similar images in terms of class label.
Score 100 means that out of the top num_images similar images, all are from the same class. Score 0 means that the image is similar only to images from a different class.
Score 50 means that the query image is similar to the same number of images from the same class and from other classes. The report is sorted by the score metric.
For a high quality labeled dataset we expect the score to be high; a low score may indicate class label issues.
Arguments:
similarity_file (str): csv file with the computed image statistics by the fastdup tool, or a path to the work_dir,
alternatively a pandas dataframe. In case of a pandas dataframe, set work_dir to point to the fastdup work_dir.
save_path (str): output folder location for the visuals
num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise, use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
slice (str): Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
A special value is 'label_score', which is used for comparing both images and labels of the nearest neighbors. The score values are 0->100, where 0 means the query image is only similar to images outside its class and 100 means the query image is only similar to images from the same class.
max_width (int): Optional param to limit the image width
descending (bool): Optional param to control the order of the metric
get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.
get_extra_col_func (callable): Optional parameter to allow adding extra columns to the report
input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'
work_dir (str): Optional parameter pointing to the fastdup work_dir. Needed when similarity_file is a pd.DataFrame.
min_items (int): Optional parameter to select components with min_items or more
max_items (int): Optional parameter to limit the number of items displayed
Returns:
ret (pd.DataFrame): similarity dataframe; for each image filename returns a list of the top K similar images.
Each row has the columns 'from', 'to', 'label' (optional), 'distance'
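The score described above can be read as the percentage of returned neighbours that share the query's label. The sketch below encodes that reading; the exact formula fastdup uses is not stated here, so treat label_score as a hypothetical helper.

```python
def label_score(query_label, neighbor_labels):
    """Percentage (0-100) of neighbours with the same label as the query."""
    if not neighbor_labels:
        return 0.0  # assumption: no neighbours means no supporting evidence
    same = sum(1 for label in neighbor_labels if label == query_label)
    return 100.0 * same / len(neighbor_labels)

print(label_score("cat", ["cat", "cat", "dog", "dog"]))
```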
create_aspect_ratio_gallery
def create_aspect_ratio_gallery(stats_file,
save_path,
get_label_func=None,
lazy_load=False,
max_width=None,
num_images=0,
slice=None,
get_filename_reformat_func=None,
input_dir=None,
**kwargs)
Function to create and display a gallery of aspect ratio distribution.
Arguments:
stats_file (str): csv file with the image statistics computed by the fastdup tool, or a work_dir path, or a pandas dataframe with the stats computed by fastdup.
save_path (str): output folder location for the visuals
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise, use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).
max_width (int): optional parameter to limit the plot image width
num_images (int): optional number of images to compute the statistics on (default computes on all images)
slice (str): optional parameter to slice the stats file based on a specific label or a list of labels.
get_filename_reformat_func (callable): optional function to reformat the filename before displaying it.
input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'
Returns:
ret (int): 0 in case of success, otherwise 1.
export_to_cvat
def export_to_cvat(files, labels, save_path)
Function to export a collection of files that need to be annotated again to cvat batch job format.
This creates a file named fastdup_label.zip in the directory save_path.
The files can be retagged in cvat using Tasks -> Add (plus button) -> Create from backup -> choose the location of the fastdup_label.zip file.
Arguments:
files (str):
labels (str):
save_path (str):
Returns:
ret (int): 0 in case of success, otherwise 1.
export_to_labelImg
def export_to_labelImg(files, labels, save_path)
Function to export a collection of files that need to be annotated again to cvat batch job format.
This creates a file named fastdup_label.zip in the directory save_path.
The files can be retagged in cvat using Tasks -> Add (plus button) -> Create from backup -> choose the location of the fastdup_label.zip file.
Arguments:
files (str):
labels (str):
save_path (str):
Returns:
ret (int): 0 in case of success, otherwise 1.
top_k_label
def top_k_label(labels_col,
distance_col,
k=10,
threshold=None,
min_count=None,
unknown_class=None)
Function to classify examples based on their label using the top k nearest neighbors.
The decision is made by taking the majority label among the neighbors.
Arguments:
labels_col (list): list of labels
distance_col (list): list of distances
k (int): optional parameter
threshold (float): optional parameter to consider neighbors with similarity larger than threshold
min_count (int): optional parameter to consider only examples with at least min_count neighbors with the same label
unknown_class: optional parameter to assign decisions to an unknown class in cases where there is no majority
Returns:
computed label
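A majority vote over the nearest neighbours, with the optional threshold, min_count, and unknown_class arguments, might look like the sketch below. This is an interpretation of the argument descriptions rather than the library code; in particular, returning unknown_class on ties or on too few votes is an assumption, and majority_label is a hypothetical name.

```python
from collections import Counter

def majority_label(labels, similarities, k=10, threshold=None,
                   min_count=None, unknown_class=None):
    """Vote among the first k neighbours whose similarity exceeds threshold."""
    votes = [label for label, sim in list(zip(labels, similarities))[:k]
             if threshold is None or sim > threshold]
    if not votes:
        return unknown_class
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if tied or (min_count is not None and top_count < min_count):
        return unknown_class
    return top_label

print(majority_label(["cat", "cat", "dog"], [0.99, 0.97, 0.95], threshold=0.9))
```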
create_knn_classifier
def create_knn_classifier(work_dir, k, get_label_func, threshold=None)
Function to create a knn classifier out of a fastdup run. We assume there are existing labels for the datapoints.
Arguments:
work_dir (str): fastdup work_dir, or location of a similarity file, or a pandas DataFrame with the computed similarities
k (int): (unused)
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
threshold (float): optional threshold to consider neighbors with similarity larger than threshold
prediction per image to one of the given classes.
Returns:
df (pd.DataFrame): List of predictions using the knn method
create_kmeans_classifier
def create_kmeans_classifier(work_dir, k, get_label_func, threshold=None)
Function to create a kmeans classifier out of a fastdup run. We assume there are existing labels for the datapoints.
Arguments:
work_dir (str): fastdup work_dir, or location of a similarity file, or a pandas DataFrame with the computed similarities
k (int): (unused)
get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
threshold (float): (unused)
Returns:
df (pd.DataFrame): dataframe with filename, label and predicted label. One row per image
run_kmeans
def run_kmeans(input_dir='',
work_dir='.',
verbose=False,
num_clusters=100,
num_em_iter=20,
num_threads=-1,
num_images=0,
model_path=model_path_full,
license='',
nearest_neighbors_k=2,
d=576,
bounding_box="",
high_accuracy=False)
Run KMeans algorithm on a folder of images given by input_dir and save the results to work_dir.
Fastdup will extract feature vectors using the model specified by model_path and then run KMeans to cluster the vectors.
The results will be saved to work_dir in the following format:
- kmeans_centroids.csv: a csv file containing the centroids of the clusters.
- kmeans_assignments.csv: assignment of each data point to the closest centroids (number of centroids given by nearest_neighbors_k).
After running kmeans you can use create_kmeans_clusters_gallery to view the results.
Arguments:
input_dir (str): path to the folder containing the images to be clustered. See fastdup.run for more details.
work_dir (str): path to the folder where the results will be saved.
verbose (bool): verbosity level, default False
num_clusters (int): Number of KMeans clusters to use
num_em_iter (int): Number of em iterations
num_threads (int): Number of threads for performing the feature vector extraction
num_images (int): Limit the number of images
model_path (str): Model path for the model to be used for feature vector extraction
license (str): License string
nearest_neighbors_k (int): When assigning an image into a cluster, how many clusters to assign to (starting from the closest)
d (int): Dimension of the feature vector
bounding_box (str): Optional bounding box, see fastdup.run for more details
high_accuracy (bool): Use a higher accuracy model for the feature extraction
Returns:
ret (int): 0 in case of success, 1 in case of error
run_kmeans_on_extracted
def run_kmeans_on_extracted(input_dir='',
work_dir='.',
verbose=False,
num_clusters=100,
num_em_iter=20,
num_threads=-1,
num_images=0,
model_path=model_path_full,
license='',
nearest_neighbors_k=2,
d=576)Run KMeans algorithm on a folder of extracted feature vectors (created on default when running fastdup:::run).
The results will be saved to work_dir in the following format:
kmeans_centroids.csv: a CSV file containing the centroids of the clusters, one centroid per row, num_clusters rows in total.
kmeans_assignments.csv: the assignment of each data point to its closest centroids (the number of centroids is given by nearest_neighbors_k). Each row lists the image filename, the centroid id (starting from zero) and the L2 distance to the centroid.
After running KMeans you can use fastdup.create_kmeans_clusters_gallery to view the results.
Arguments:
input_dir (str) - Path to the folder containing the images to be clustered. See fastdup.run for more details.
work_dir (str) - Path to the folder where the results will be saved.
verbose (bool) - Verbosity level, default False.
num_clusters (int) - Number of KMeans clusters to use.
num_em_iter (int) - Number of EM iterations.
num_threads (int) - Number of threads for performing the feature vector extraction.
num_images (int) - Limit the number of images.
model_path (str) - Model path for the model to be used for feature vector extraction.
license (str) - License string.
nearest_neighbors_k (int) - When assigning an image to a cluster, how many clusters to assign it to (starting from the closest).
d (int) - Dimension of the feature vector.
Returns:
ret (int) - 0 in case of success, 1 in case of error.
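The assignments file described above can be post-processed with the standard library alone. The sketch below groups filenames by centroid id. It assumes three comma-separated columns in the order filename, centroid id, L2 distance; the exact header row (if any) is not documented here, so rows that do not parse are skipped. The helper name group_by_centroid and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def group_by_centroid(csv_file):
    """Map centroid id -> list of (filename, distance), closest first.

    Assumes three comma-separated columns per row, in the order
    filename, centroid id, L2 distance. Rows whose centroid id or
    distance cannot be parsed (for example a header row) are skipped.
    """
    clusters = defaultdict(list)
    for row in csv.reader(csv_file):
        if len(row) < 3:
            continue
        try:
            centroid_id = int(row[1])
            distance = float(row[2])
        except ValueError:
            continue
        clusters[centroid_id].append((row[0], distance))
    for members in clusters.values():
        members.sort(key=lambda item: item[1])
    return dict(clusters)


# Made-up sample standing in for work_dir/kmeans_assignments.csv.
sample = io.StringIO(
    "filename,centroid,distance\n"
    "/data/a.jpg,0,0.42\n"
    "/data/a.jpg,3,0.57\n"
    "/data/b.jpg,0,0.13\n"
)
clusters = group_by_centroid(sample)
print(clusters)
```

To read a real result, pass a file opened with open(path, newline="") instead of the in-memory sample.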
extract_video_frames
def extract_video_frames(input_dir,
work_dir,
verbose=False,
num_threads=-1,
num_images=0,
min_offset=0,
max_offset=0,
turi_param="",
model_path=model_path_full,
d=576,
resize_video=0,
keyframes_only=1,
license="")
Go over a collection of videos and extract them into frames. The output is saved to the work_dir/tmp subfolder.
Arguments:
input_dir (str):
Location of the videos to extract.
- A folder
- A remote folder (s3 or minio starting with s3:// or minio://). When using minio append the minio server name for example minio://google/visual_db/sku110k.
- A file containing absolute filenames each on its own row.
- A file containing s3 full paths or minio paths each on its own row.
- A python list with absolute filenames
- We support avi/mp4 video formats.
work_dir (str) - Optional path for storing intermediate files and results.
verbose (boolean) - Verbosity.
num_threads (int) - Number of threads. If no value is specified, the number of threads is configured automatically from the number of cores.
num_images (unsigned long long) - Number of images to run on. By default, run on all the images in the input_dir folder.
turi_param (str) - Optional turi parameters separated by commas. Example: turi_param='nnmodel=0,ccthreshold=0.99'
The following parameters are supported.
- nnmodel=xx, Nearest Neighbor model for clustering the features together. Supported options are 0 = brute_force (exact), 1 = ball_tree and 2 = lsh (both approximate).
- ccthreshold=xx, Threshold for running connected components to find clusters of similar images. Allowed values are between 0 and 1. The default ccthreshold is 0.96. This groups very similar images together, for example identical images or images that underwent simple transformations like scaling, flipping or zooming in. The higher the score, the more similar the grouped images are and the smaller the clusters you get. A score of 0.9 is fairly broad and will cluster images together even if their fine details are not similar. It is recommended to experiment with this parameter based on your dataset and then visualize the results using fastdup.create_components_gallery().
- run_cc=0|1 run connected components on the resulting similarity graph. Default is 1.
- run_pagerank=0|1 run pagerank on the resulting similarity graph. Default is 1.
- delete_tar=0|1 when working with tar files obtained from cloud storage, delete the tar after download.
- delete_img=0|1 when working with images obtained from cloud storage, delete the image after download.
- tar_only=0|1 run only on tar files and ignore images in folders. Default is 0.
- run_stats=0|1 compute image statistics. Default is 1.
- sync_s3_to_local=0|1 when using an s3 bucket, sync s3 to a local folder to improve performance. Assumes there is enough local disk space to contain the data. Default is 0.
min_offset (unsigned long long) - Optional min offset to start iterating on the full file list.
max_offset (unsigned long long) - Optional max offset to start iterating on the full file list.
resize_video (int) - 0 = do not resize video, 1 = resize video based on the model_path dimensions.
keyframes_only (int) - 0 = extract all frames, 1 = extract only keyframes.
model_path (str) - Optional string pointing to an alternative onnx or ort model.
d (int) - Output feature vector dimension for the model.
Returns:
ret (int) - Status code: 0 = success, 1 = error.
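Because turi_param is a single comma-separated string of key=value pairs, it can be convenient to assemble it from a dictionary. The following is a small sketch of that idea; build_turi_param and extract_frames are hypothetical helper names, not part of fastdup, and the wrapper simply forwards to fastdup.extract_video_frames with the signature documented above.

```python
def build_turi_param(options):
    """Join a dict of options into a 'key=value,key=value' string."""
    return ",".join(f"{key}={value}" for key, value in options.items())


turi_param = build_turi_param({"nnmodel": 0, "ccthreshold": 0.99})
print(turi_param)  # nnmodel=0,ccthreshold=0.99


def extract_frames(input_dir, work_dir, **kwargs):
    """Hypothetical thin wrapper around fastdup.extract_video_frames.

    fastdup is imported lazily so this snippet loads without it installed.
    Returns the status code: 0 = success, 1 = error.
    """
    import fastdup

    return fastdup.extract_video_frames(input_dir, work_dir, **kwargs)
```

A call such as extract_frames("videos/", "out/", keyframes_only=1, turi_param=turi_param) would then write the extracted frames under out/tmp; the folder names here are placeholders.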
