V0.2xx API
This page is the API reference for the fastdup V0.2xx API, which is fully supported in current releases and includes a few features not yet covered by the V1.0 API.
fastdup
run
def run(input_dir='',
work_dir='.',
test_dir='',
compute='cpu',
verbose=False,
num_threads=-1,
num_images=0,
turi_param='nnmodel=0',
distance='cosine',
threshold=0.9,
lower_threshold=0.05,
model_path=model_path_full,
license='',
version=False,
nearest_neighbors_k=2,
d=576,
run_mode=0,
nn_provider='nnf',
min_offset=0,
max_offset=0,
nnf_mode="HNSW32",
nnf_param="",
bounding_box="",
batch_size=1,
resume=0,
high_accuracy=False)
Run the fastdup tool to find duplicates, near-duplicates, outliers and clusters of related images in a corpus of images.
The only mandatory argument is input_dir.
Arguments:
input_dir (str):
Location of the images/videos to analyze. One of the following:
- A folder.
- A remote folder (s3 or minio, starting with s3:// or minio://). When using minio, append the minio server name, for example minio://google/visual_db/sku110k.
- A file containing absolute filenames, each on its own row.
- A file containing s3 full paths or minio paths, each on its own row.
- A python list with absolute filenames.
- A python list with absolute folders; all images and videos in those folders are added recursively.
- For run_mode=2, a folder containing fastdup binary features, or a file containing a list of atrain_feature.dat.csv files in multiple folders.
- A yolov5 yaml input file containing train and test folders (single folder supported for now).
- We support jpg, jpeg, tiff, tif, giff, heif, heic, bmp, png, mp4, avi. In addition we support tar, tar.gz, tgz and zip files containing images.
If you have other image extensions that are readable by opencv imread(), you can list them in a file (each image on its own row); in that case we do not check for the known extensions and use opencv to read those formats.
Note: It is not possible to mix compressed inputs (videos or tars/zips) and regular images. Use the flag turi_param='tar_only=1' if you want to ignore images and run from compressed files.
Note: Images are assumed to be at least 10x10 pixels. Smaller images (in either width or height) are ignored and a warning is shown.
Note: It is also possible to skip small images by defining a minimum allowed file size using turi_param='min_file_size=1000' (in bytes).
Note: For performance reasons it is always preferable to copy images from s3 to local disk and then run fastdup on the local disk, since copying images from s3 in a loop is very slow.
Alternatively, you can use the flag turi_param='sync_s3_to_local=1' to copy all images in the remote s3 bucket to disk ahead of time.
Note: The fastdup plus beta version now supports bounding boxes on the c++ side. To use it, prepare an input file with the following csv header: filename,col_x,row_y,width,height, where each row has an image file
and bounding box information in the above format. fastdup will run on the bounding box level and the reports will be generated on the bounding box level. To use bounding boxes, please sign up
for our free beta program at https://visual-layer.com or send an email to [email protected].
work_dir
str - Path for storing intermediate files and results.
test_dir
str - Optional path for test data. When given, similarity is computed between train and test images (train/train and test/test comparisons are not performed).
The following options are supported:
- A local folder path.
- An s3:// or minio:// path.
- A python list with absolute filenames.
- A file containing absolute filenames, each on its own row.
compute
str - Compute type [cpu|gpu]. Note: gpu is supported only in the enterprise version.
verbose
boolean - Verbosity.
num_threads
int - Number of threads. If no value is specified, the number of threads is configured automatically from the number of cores.
num_images
unsigned long long - Number of images to run on. By default, run on all the images in the input_dir folder.
turi_param
str - Optional turi parameters separated by commas. Example: turi_param='ccthreshold=0.99'.
The following parameters are supported:
- ccthreshold=xx: Threshold for running connected components to find clusters of similar images. Allowed values are 0->1. The default ccthreshold is 0.96. This groups very similar images together, for example identical images or images that underwent simple transformations like scaling, flip, zoom in. The higher the score, the more similar the grouped images are and the smaller the clusters you get. A score of 0.9 is pretty broad and will cluster images together even if their fine details are not similar. It is recommended to experiment with this parameter based on your dataset and then visualize the results using fastdup.create_components_gallery().
- run_cc=0|1: Run connected components on the resulting similarity graph. Default is 1.
- run_pagerank=0|1: Run pagerank on the resulting similarity graph. Default is 1.
- delete_tar=0|1: When working with tar files obtained from cloud storage, delete the tar after download.
- delete_img=0|1: When working with images obtained from cloud storage, delete the image after download.
- tar_only=0|1: Run only on tar files and ignore images in folders. Default is 0.
- run_stats=0|1: Compute image statistics. Default is 1.
- sync_s3_to_local=0|1: When using an s3 bucket, sync s3 to a local folder to improve performance. Assumes there is enough local disk space to contain the data. Default is 0.
distance
str - Distance metric for the Nearest Neighbors algorithm. The default is 'cosine', which works well in most cases.
For nn_provider='nnf' the following distance metrics are supported:
When using nnf_mode='Flat': 'cosine', 'euclidean', 'l1', 'linf', 'canberra', 'braycurtis', 'jensenshannon' are supported.
Otherwise 'cosine' and 'euclidean' are supported.
threshold
float - Similarity measure in the range 0->1, where 1 is totally identical and 0.98 and above is almost identical.
lower_threshold
float - Similarity percentile used to flag images that are far away (outliers) relative to the total distribution. The default of 0.05 means 5% of the total similarities computed.
model_path
str - Optional location of an ONNX model file. Should not be used.
version
bool - Print out the version number. This function takes no argument.
nearest_neighbors_k
int - For each image, how many similar images to look for.
d
int - Length of the feature vector. By default it is 576. When you use your own onnx model, change this parameter to the length of the model's output feature vector.
run_mode (int):
- 0 (the default) does the feature extraction and NN embedding to compute all pairs similarities.
It uses the input_dir argument to find the directory to run on (or a list of files to run on).
The features are extracted and saved into the work_dir path (the default features output file name is features.dat in the same folder for storing the numpy features, and features.dat.csv for storing the image file names corresponding to the numpy features).
For larger datasets it may be wise to split the run into two, to make sure intermediate results are stored in case you encounter an error.
- 1 extracts the features and stores them; it does not compute the NN embedding.
For large datasets, it is possible to run on a few computing nodes to extract the features in parallel.
Use the min_offset and max_offset flags to allocate a subset of the images to each computing node.
Offsets run from 0 to n-1, where n is the number of images in the input_dir folder.
- 2 reads a stored feature file and computes the NN embedding to provide similarities.
The input_dir param is ignored, and work_dir is used to point to the numpy feature file (give a full path and filename).
- 3 reads the NN model stored in nnf.index from the work_dir and computes all pairs similarity on all images given by the test_dir parameter. input_dir should point to the location of the train data.
This mode is used for scoring similarities on a new test dataset given a precomputed similarity index on a train dataset.
- 4 reads the NN model stored in nnf.index from the work_dir and computes all pairs similarity on pre-extracted feature vectors computed by run_mode=1.
min_offset
unsigned long long - Optional min offset at which to start iterating on the full file list.
max_offset
unsigned long long - Optional max offset at which to stop iterating on the full file list.
nnf_mode
str - When nn_provider='nnf', selects the nnf model mode.
Default is HNSW32. Flat is more accurate.
nnf_param
str - Assigns optional parameters:
- num_em_iter: number of KMeans EM iterations to run. Default is 20.
- num_clusters: number of KMeans clusters to use. Default is 100.
bounding_box
str - Optional bounding box to crop images, given as bounding_box='row_y=xx,col_x=xx,height=xx,width=xx'. This defines a global bounding box to be used for all images.
Beta release features (need to sign up at https://visual-layer.com): It is possible to set bounding_box='face' to crop the face from the image (in case a face is present).
In addition, you can set bounding_box='yolov5s' and we will run yolov5s to create and crop bounding boxes on your data. (We do not host this model; it is downloaded from the relevant github project.)
For the face/yolov5 crop, the margin around the face is defined by turi_param='augmentation_horiz=0.2,augmentation_vert=0.2', where 0.2 means 20% additional margin around the face relative to the width and height respectively.
It is possible to change the margin; the lowest value is 0 (no margin) and the upper allowed value is 1. Default is 0.2.
batch_size
int - Optional batch size when computing inference. Allowed values are < 200. Note: batch_size > 1 is enabled in the enterprise version.
resume
int - Optional flag to resume from a previous run.
high_accuracy
bool - Compute a more accurate model. Runtime increases by about 15% and feature vector storage size/memory increases by about 60%. The upside is that the model can better distinguish minute details in images with many objects.
Returns:
ret
int - Status code 0 = success, 1 = error.
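Several arguments of run() are plain strings assembled from key=value pairs: turi_param takes comma-separated options, and bounding_box takes the form 'row_y=xx,col_x=xx,height=xx,width=xx'. The following is a minimal sketch of composing such strings. build_turi_param and build_bounding_box are hypothetical helpers, not part of fastdup, and the folder names in the commented call are placeholders.

```python
def build_turi_param(**options):
    """Join keyword options into a comma-separated key=value string.

    Hypothetical helper (not part of fastdup). Option names are passed
    through unchanged, in the order given.
    """
    return ",".join(f"{key}={value}" for key, value in options.items())


def build_bounding_box(row_y, col_x, height, width):
    """Format a global crop box in the documented bounding_box layout."""
    return f"row_y={row_y},col_x={col_x},height={height},width={width}"


turi_param = build_turi_param(ccthreshold=0.99, run_cc=1, min_file_size=1000)
bounding_box = build_bounding_box(row_y=10, col_x=20, height=200, width=300)
print(turi_param)    # ccthreshold=0.99,run_cc=1,min_file_size=1000
print(bounding_box)  # row_y=10,col_x=20,height=200,width=300

# With fastdup installed, the strings would be passed along these lines:
# import fastdup
# fastdup.run('input_folder', 'output_folder',
#             turi_param=turi_param, bounding_box=bounding_box)
```

Building the string from keyword arguments keeps each option on its own line in calling code and avoids stray quotes or separators in a hand-typed string.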
run_on_webdataset
def run_on_webdataset(input_dir='',
work_dir='.',
test_dir='',
compute='cpu',
verbose=False,
num_threads=-1,
num_images=0,
turi_param='nnmodel=0',
distance='cosine',
threshold=0.9,
lower_threshold=0.05,
model_path=model_path_full,
license='',
version=False,
nearest_neighbors_k=2,
d=576,
nn_provider='nnf',
min_offset=0,
max_offset=0,
nnf_mode="HNSW32",
nnf_param="",
bounding_box="",
batch_size=1)
Run fastdup on a web dataset.
This run is composed of two stages. First, extract all feature vectors using run_mode=1; then run the nearest neighbor model using run_mode=2.
Make sure that work_dir has enough free space for extracting tar files. Tar files are extracted temporarily into the work_dir/tmp folder.
You can control the free space using the flags turi_param='delete_tar=1|0' and turi_param='delete_img=1|0'. When delete_tar=1 the tars are processed one by one and deleted after processing.
When delete_img=1 the images are processed one by one and deleted after processing.
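Since tar files are extracted into work_dir/tmp, it can help to check the available disk space before starting. The following is a minimal sketch using only the Python standard library; has_free_space is a hypothetical helper and the 1 GiB requirement is an arbitrary illustration, not a fastdup recommendation.

```python
import shutil


def has_free_space(work_dir, required_bytes):
    """Return True if the filesystem holding work_dir has at least
    required_bytes free, as reported by shutil.disk_usage."""
    return shutil.disk_usage(work_dir).free >= required_bytes


# Hypothetical requirement: 1 GiB of headroom for extracted tar files.
required = 1 * 1024 ** 3
if has_free_space(".", required):
    print("enough free space in work_dir")
else:
    print("consider turi_param='delete_tar=1' or turi_param='delete_img=1'")
```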
load_binary_feature
def load_binary_feature(filename, d=576)
Python function for loading the binary features stored by fastdup, together with their matching filenames, so they can be analyzed in Python.
Arguments:
filename
str - The binary feature file location
d
int - Feature vector length
Returns:
filenames
list - A list with all image file names, of length X.
np_array
np.array - A numpy matrix with X rows and d columns (default d is 576). Each row corresponds to the feature vector of a single image.
Example:
import fastdup
file_list, mat_features = fastdup.load_binary_feature(FILENAME_FEATURES)
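Once the features are loaded, a simple analysis is to compare two rows directly. The sketch below computes the cosine similarity of two feature vectors in plain Python; it works on any equal-length sequences of numbers, including rows of the returned matrix. The short vectors are made-up stand-ins, and the sketch makes no claim about how fastdup computes its own scores internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Tiny made-up vectors standing in for rows of mat_features (d=3 here).
row_a = [1.0, 0.0, 0.0]
row_b = [1.0, 0.0, 0.0]
row_c = [0.0, 1.0, 0.0]
print(cosine_similarity(row_a, row_b))  # 1.0
print(cosine_similarity(row_a, row_c))  # 0.0
```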
save_binary_feature
def save_binary_feature(save_path, filenames, np_array)
Function for saving data to be used by fastdup. Given a list of images and their matching feature vectors in a numpy array,
the function saves the data in a format readable by fastdup. This skips the feature extraction step; use it with run_mode=2 to run the
nearest neighbor model on the saved feature vectors.
Arguments:
save_path
str - Working folder to save the files to
filenames
list - A list of the image file locations (absolute paths), of length n images
np_array
np.array - Numpy array of size n x d. Each row is a feature vector of one file.
Returns:
ret
int - 0 in case of success, otherwise 1
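Before saving, it may be worth checking that the inputs match the documented shapes: one absolute path per image and one row of length d per path. The following is a minimal sketch; check_feature_inputs is a hypothetical helper, not part of fastdup, and it accepts any sequence of rows (nested lists or the rows of a numpy array).

```python
import os


def check_feature_inputs(filenames, rows, d=576):
    """Raise ValueError if the inputs do not match the documented shapes."""
    if len(filenames) != len(rows):
        raise ValueError("need exactly one feature row per filename")
    for name, row in zip(filenames, rows):
        if not os.path.isabs(name):
            raise ValueError(f"not an absolute path: {name}")
        if len(row) != d:
            raise ValueError(f"{name}: expected {d} values, got {len(row)}")
    return True


# Made-up inputs: two images with 576-dimensional feature vectors.
names = ["/data/a.jpg", "/data/b.jpg"]
features = [[0.0] * 576, [0.1] * 576]
print(check_feature_inputs(names, features))  # True
```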
create_duplicates_gallery
def create_duplicates_gallery(similarity_file,
save_path,
num_images=20,
descending=True,
lazy_load=False,
get_label_func=None,
slice=None,
max_width=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
input_dir=None,
work_dir=None,
threshold=None,
**kwargs)
Function to create and display a gallery of duplicate/near-duplicate images as computed by the similarity metric.
In addition, it is possible to compute a hierarchical gallery of duplicate/near-duplicate clusters. To do so:
(A) Run fastdup to compute similarity on work_dir.
(B) Run connected components on the work_dir, saving the component results to save_path (need to run with lazy_load=True).
(C) Run create_duplicates_gallery() on the components to find pairs of similar components. Point similarity_file to the similarity_hierarchical_XX.csv file, where XX is the
connected components threshold (ccthreshold=XX).
Example:
import fastdup
fastdup.run('input_folder', 'output_folder')
fastdup.create_duplicates_gallery('output_folder', save_path='.', get_label_func = lambda x: x.split('/')[1], slice='hamburger')
Regarding get_label_func, this example assumes that the second folder name is the class name for example my_data/hamburger/image001.jpg. You can change it to match your own labeling convention.
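The lambda in the example depends on how many folders precede the class folder; for an absolute path such as /home/user/my_data/hamburger/image001.jpg, x.split('/')[1] would return 'home'. A minimal alternative sketch, assuming the class name is always the folder that directly contains the image, is shown below. label_from_parent_folder is a hypothetical name, and the paths are made up.

```python
from pathlib import PurePosixPath


def label_from_parent_folder(path):
    """Return the name of the folder that directly contains the image."""
    return PurePosixPath(path).parent.name


print(label_from_parent_folder("my_data/hamburger/image001.jpg"))         # hamburger
print(label_from_parent_folder("/home/user/my_data/pizza/image002.jpg"))  # pizza
```

Such a function could then be passed as get_label_func in place of the lambda.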
Arguments:
similarity_file
str - csv file with the similarities computed by the fastdup tool, or a work_dir path, or a pandas dataframe containing the similarities.
save_path
str - Output folder location for the visuals.
num_images
int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
descending
boolean - If False, print the similarities from the least similar to the most similar. Default is True.
lazy_load
boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces the html file size).
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
slice
str - Optional parameter to select a slice of the similarity file based on a specific label or a list of labels.
slice can be a specific label, e.g. slice='hamburger', in which case only similarities between hamburger and other classes are presented.
Two reserved values for slice are "diff" and "same". When using "diff" the report only shows similarities between classes. When using "same" the report only shows similarities inside the same class.
Note that when using slice, get_label_func should be provided.
max_width
int - Optional parameter to set the max width of the gallery.
get_bounding_box_func
callable - Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func can be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func can be a csv containing index,filename,col_x,row_y,width,height, or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func
callable - Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func
callable - Optional parameter to allow adding an additional column to the report.
input_dir
str - Optional parameter to specify the input directory of webdataset tar files,
for the case of working with webdataset tar files where the images were deleted after the run using turi_param='delete_img=1'.
work_dir
str - Optional parameter to specify the fastdup work_dir, when using a pd.DataFrame instead of a duplicates file path.
threshold
float - Optional parameter to specify the similarity score threshold for a pair to be considered a duplicate. Values above the threshold are considered duplicates.
Allowed values are between 0 and 1.
save_artifacts
boolean - Optional parameter to allow saving the intermediate artifacts (raw images, csv with results) to the output folder.
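For get_bounding_box_func, boxes are expected as [x1, y1, x2, y2], while the csv form uses col_x,row_y,width,height. The sketch below converts between the two and shows both the dictionary and the callable form. It assumes (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner, which the text above does not state explicitly; the helper names and annotations are made up.

```python
def to_corner_box(col_x, row_y, width, height):
    """Convert a (col_x, row_y, width, height) box to [x1, y1, x2, y2].

    Assumes (x1, y1) is the top-left and (x2, y2) the bottom-right corner.
    """
    return [int(col_x), int(row_y), int(col_x + width), int(row_y + height)]


# Hypothetical annotations keyed by absolute filename.
annotations = {
    "/data/images/img001.jpg": [(0, 0, 100, 100)],
    "/data/images/img002.jpg": [(10, 20, 30, 40), (5, 5, 50, 60)],
}

# Dictionary form: filename -> list of [x1, y1, x2, y2] boxes.
bounding_boxes = {
    filename: [to_corner_box(*box) for box in boxes]
    for filename, boxes in annotations.items()
}


def get_bounding_box(filename):
    """Callable form: return the box list for a file ([] if unknown)."""
    return bounding_boxes.get(filename, [])


print(get_bounding_box("/data/images/img002.jpg"))  # [[10, 20, 40, 60], [5, 5, 55, 65]]
```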
create_duplicate_videos_gallery
def create_duplicate_videos_gallery(similarity_file,
save_path,
num_images=20,
descending=True,
lazy_load=False,
get_label_func=None,
slice=None,
max_width=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
input_dir=None,
work_dir=None,
threshold=None,
**kwargs)
Function to create and display a gallery of duplicate videos as computed by the similarity metric.
Example:
import fastdup
fastdup.run('input_folder', 'output_folder', run_mode=1) # extract frames from videos
fastdup.run('input_folder', 'output_folder', run_mode=2) # run fastdup
fastdup.create_duplicate_videos_gallery('output_folder', save_path='.')
Arguments:
similarity_file
str - csv file with the similarities computed by the fastdup tool, or a work_dir path, or a pandas dataframe containing the similarities.
save_path
str - Output folder location for the visuals.
num_images
int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
descending
boolean - If False, print the similarities from the least similar to the most similar. Default is True.
lazy_load
boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces the html file size).
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
slice
str - Optional parameter to select a slice of the similarity file based on a specific label or a list of labels.
slice can be a specific label, e.g. slice='hamburger', in which case only similarities between hamburger and other classes are presented.
Two reserved values for slice are "diff" and "same". When using "diff" the report only shows similarities between classes. When using "same" the report only shows similarities inside the same class.
Note that when using slice, get_label_func should be provided.
max_width
int - Optional parameter to set the max width of the gallery.
get_bounding_box_func
callable - Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func can be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func can be a csv containing index,filename,col_x,row_y,width,height, or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func
callable - Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func
callable - Optional parameter to allow adding an additional column to the report.
input_dir
str - Optional parameter to specify the input directory of webdataset tar files,
for the case of working with webdataset tar files where the images were deleted after the run using turi_param='delete_img=1'.
work_dir
str - Optional parameter to specify the fastdup work_dir, when using a pd.DataFrame instead of a duplicates file path.
threshold
float - Optional parameter to specify the similarity score threshold for a pair to be considered a duplicate. Values above the threshold are considered duplicates.
Allowed values are between 0 and 1.
save_artifacts
boolean - Optional parameter to allow saving the intermediate artifacts (raw images, csv with results) to the output folder.
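For the file form of get_label_func, the label file needs a first row of index,label followed by one label per image, in the same order as the image list. The following is a minimal sketch of writing such a file with the standard csv module. write_label_file is a hypothetical helper, and using a 0-based running number for the index column is an assumption.

```python
import csv
import os
import tempfile


def write_label_file(path, labels):
    """Write labels as a csv with header index,label, one row per image.

    Assumption: the index column is a 0-based running row number.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["index", "label"])
        for index, label in enumerate(labels):
            writer.writerow([index, label])


# Made-up labels, listed in the same order as the image list file.
labels = ["hamburger", "pizza", "hamburger"]
label_path = os.path.join(tempfile.mkdtemp(), "labels.csv")
write_label_file(label_path, labels)
with open(label_path, newline="") as handle:
    print(handle.read())
```

The resulting path could then be passed as get_label_func.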
create_outliers_gallery
def create_outliers_gallery(outliers_file,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
how='one',
slice=None,
max_width=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
input_dir=None,
work_dir=None,
**kwargs)
Function to create and display a gallery of images computed by the outliers metric.
Outliers are computed using the fastdup tool, by embedding each image into a short feature vector, finding the top k similar neighbors,
and finding images that are further away from all other images, i.e. outliers.
By default fastdup saves the outliers into a file called outliers.csv inside the work_dir folder.
It is possible to load this file using pandas to get the list of outlier images.
Note that the number of images included in the outliers file depends on the lower_threshold parameter in the fastdup run. This argument is a percentile,
i.e. 0.05 means that the top 5% of images that are furthest away from the rest of the images are considered outliers.
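To inspect the outliers file outside of the gallery, any csv reader will do. The text mentions pandas; the sketch below uses the standard csv module instead so that it has no dependencies. read_csv_rows is a hypothetical helper, and it does not assume any particular column names: the keys of each row come from whatever header the file contains.

```python
import csv


def read_csv_rows(csv_path, limit=None):
    """Read a csv file into a list of dicts keyed by its header row."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    return rows if limit is None else rows[:limit]


# After a fastdup run, usage would look along these lines
# ('output_folder' is a placeholder for the work_dir):
# for row in read_csv_rows('output_folder/outliers.csv', limit=10):
#     print(row)
```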
Arguments:
outliers_file
str - csv file with the outliers computed by the fastdup tool, or a work_dir path, or a pandas dataframe containing the outliers.
save_path
str - Output folder location for the visuals.
num_images
int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load
boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces the html file size).
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
how
str - Optional outlier selection method. one = take the image that is far away from any one image (but may have other images close to it).
all = take the image that is far away from all other images. Default is one.
slice
str - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
max_width
int - Optional parameter to set the max width of the gallery.
get_bounding_box_func
callable - Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func can be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func can be a csv containing index,filename,col_x,row_y,width,height, or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func
callable - Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func
callable - Optional parameter to allow adding an additional column to the report.
input_dir
str - Optional parameter to specify the input directory of webdataset tar files,
for the case of working with webdataset tar files where the images were deleted after the run using turi_param='delete_img=1'.
work_dir
str - Optional parameter to specify the fastdup work_dir, when using a pd.DataFrame instead of an outliers file path.
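For get_reformat_filename_func, a common need is to shorten long absolute paths in the report. The following is a minimal sketch that keeps only the last components of the path; short_display_name and its keep_parts parameter are hypothetical, and the path is made up.

```python
from pathlib import PurePosixPath


def short_display_name(path, keep_parts=2):
    """Keep only the last keep_parts components of a path for display."""
    parts = PurePosixPath(path).parts
    return "/".join(parts[-keep_parts:])


print(short_display_name("/home/user/my_data/hamburger/image001.jpg"))  # hamburger/image001.jpg
```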
create_components_gallery
def create_components_gallery(work_dir,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
group_by='visual',
slice=None,
max_width=None,
max_items=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
threshold=None,
metric=None,
descending=True,
min_items=None,
keyword=None,
input_dir=None,
**kwargs)
Function to create and display a gallery of images for the largest graph components
Arguments:
work_dir
str - Path to the fastdup work_dir, or a path to a connected components csv file. Alternatively, a dataframe with the connected_components.csv content from a previous fastdup run.
save_path
str - Output folder location for the visuals.
num_images
int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load
boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces the html file size).
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
group_by
str - [visual|label]. Group the report using the visual properties of the image or using the labels of the images. Default is visual.
slice
str or list - Optional parameter to select a slice of the components file based on a specific label or a list of labels.
max_width
int - Optional parameter to set the max html width of images in the gallery. Default is None.
max_items
int - Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.
get_bounding_box_func
callable - Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func can be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func can be a csv containing index,filename,col_x,row_y,width,height, or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func
callable - Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func
callable - Optional parameter to allow adding more information to the report.
threshold
float - Optional parameter to set the threshold for choosing components. Default is None.
metric
str - Optional parameter to set the metric to use (like blur) for choosing components. Default is None.
descending
boolean - Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.
min_items
int - Optional parameter to select components with min_items or more items. Default is None.
keyword
str - Optional parameter to select components with keyword as a subset of the label. Default is None.
input_dir
str - Optional parameter to specify the input directory of webdataset tar files,
for the case of working with webdataset tar files where the images were deleted after the run using turi_param='delete_img=1'.
kwargs
dict - Optional parameter to pass additional parameters to the function.
split_sentence_to_label_list
boolean - Optional parameter to split the label into a list of labels. Default is False.
limit_labels_printed
int - Optional parameter to limit the number of labels printed in the html report. Default is max_items.
nrows
int - Limit the number of rows read, for debugging purposes of the report.
save_artifacts
bool - Optional parameter to save intermediate artifacts, like the image paths used for generating the components.
Returns:
ret
int - 0 in case of success, otherwise 1
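For the dictionary form of get_label_func, each absolute filename maps to a label or to a list of labels. The sketch below builds such a dictionary from (filename, label) pairs, collecting repeated filenames into a list. build_label_dict is a hypothetical helper and the pairs are made up.

```python
def build_label_dict(pairs):
    """Map each absolute filename to its label, or to a list of labels
    when the same file appears with more than one label."""
    collected = {}
    for filename, label in pairs:
        collected.setdefault(filename, []).append(label)
    return {
        filename: labels[0] if len(labels) == 1 else labels
        for filename, labels in collected.items()
    }


# Made-up annotations: the second image carries two labels.
pairs = [
    ("/data/images/a.jpg", "cat"),
    ("/data/images/b.jpg", "dog"),
    ("/data/images/b.jpg", "puppy"),
]
print(build_label_dict(pairs))
```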
create_component_videos_gallery
def create_component_videos_gallery(work_dir,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
group_by='visual',
slice=None,
max_width=None,
max_items=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
threshold=None,
metric=None,
descending=True,
min_items=None,
keyword=None,
input_dir=None,
**kwargs)
Function to create and display a gallery of similar videos based on the graph components
Arguments:
work_dir
str - Path to the fastdup work_dir.
save_path
str - Output folder location for the visuals.
num_images
int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load
boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces the html file size).
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
group_by
str - [visual|label]. Group the report using the visual properties of the image or using the labels of the images. Default is visual.
slice
str or list - Optional parameter to select a slice of the components file based on a specific label or a list of labels.
max_width
int - Optional parameter to set the max html width of images in the gallery. Default is None.
max_items
int - Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.
get_bounding_box_func
callable - Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of a valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func can be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func can be a csv containing index,filename,col_x,row_y,width,height, or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func
callable - Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func
callable - Optional parameter to allow adding more information to the report.
threshold
float - Optional parameter to set the threshold for choosing components. Default is None.
metric
str - Optional parameter to set the metric to use (like blur) for choosing components. Default is None.
descending
boolean - Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.
min_items
int - Optional parameter to select components with min_items or more items. Default is None.
keyword
str - Optional parameter to select components with keyword as a subset of the label. Default is None.
input_dir
str - Optional parameter to specify the input directory of webdataset tar files,
for the case of working with webdataset tar files where the images were deleted after the run using turi_param='delete_img=1'.
Returns:
ret
int - 0 in case of success, otherwise 1
create_kmeans_clusters_gallery
def create_kmeans_clusters_gallery(work_dir,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
slice=None,
max_width=None,
max_items=None,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
threshold=None,
metric=None,
descending=True,
min_items=None,
keyword=None,
input_dir=None,
**kwargs)
Function to visualize the kmeans clusters.
Arguments:
work_dir
str - path to fastdup work_dir
save_path
str - output folder location for the visuals
num_images
int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load
boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.
slice
str or list - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
max_width
int - Optional parameter to set the max html width of images in the gallery. Default is None.
max_items
int - Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.
get_bounding_box_func
callable - Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func
callable - Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func
callable - Optional parameter to allow adding more information to the report.
threshold
float - Optional parameter to set the threshold for choosing components. Default is None.
metric
str - Optional parameter to set the metric to use (like blur) for choosing components. Default is None.
descending
boolean - Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.
min_items
int - Optional parameter to select components with min_items or more items. Default is None.
keyword
str - Optional parameter to select components with keyword as a subset of the label. Default is None.
input_dir
str - Optional parameter to specify the input directory of webdataset tar files,
used when working with webdataset tar files where the image was deleted after the run using turi_param='delete_img=1'
Returns:
ret
int - 0 in case of success, otherwise 1
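The callable and dictionary forms of get_label_func and get_bounding_box_func described above can be sketched as follows. This is only an illustration: the helper names label_from_parent_folder and fixed_bounding_box and the example paths are hypothetical, and taking the label from the parent folder name is an assumption about how a dataset might be laid out.

```python
import os

def label_from_parent_folder(filename):
    """Hypothetical get_label_func: use the parent folder name as the label."""
    return os.path.basename(os.path.dirname(filename))

def fixed_bounding_box(filename):
    """Hypothetical get_bounding_box_func: one box given as x1, y1, x2, y2."""
    return [[0, 0, 100, 100]]

# Dictionary form: absolute file name -> label (or list of labels).
files = ["/data/cats/img1.jpg", "/data/dogs/img2.jpg"]
label_dict = {f: label_from_parent_folder(f) for f in files}
print(label_dict)
```

Either the function or the dictionary could then be passed as get_label_func to the gallery functions.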
remove_duplicates
def remove_duplicates(input_dir, work_dir=None, distance=0.96, dry_run=True)
Function to automate deletion of duplicate images whose similarity is above the distance param.
Arguments:
input_dir
see fastdup.run()
work_dir
see fastdup.run()
distance
float - delete only components whose similarity is larger than distance.
dry_run
bool - if True, does not delete but prints the rm commands that would be used; otherwise deletes the files
Returns:
ret
list - list of deleted files
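The dry_run behaviour described above can be pictured with a small stand-alone sketch. The helper delete_files is hypothetical and not part of fastdup; it only mirrors the documented contract (print the rm commands when dry_run is True, delete otherwise, and return the list of handled files).

```python
import os
import tempfile

def delete_files(files, dry_run=True):
    """Illustrative helper (not fastdup code): delete files or only print the commands."""
    handled = []
    for f in files:
        if dry_run:
            print(f"rm {f}")   # only show what would be removed
        else:
            os.remove(f)       # actually remove the file
        handled.append(f)
    return handled

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dup.jpg")
    open(path, "w").close()
    delete_files([path], dry_run=True)
    still_there = os.path.exists(path)   # dry run leaves the file in place
    delete_files([path], dry_run=False)
    gone = not os.path.exists(path)      # real run removes it
```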
delete_components
def delete_components(top_components, to_delete=None, min_distance=0.96, how='one', dry_run=True)
Function to automate deletion of duplicate images using the connected components analysis.
Example:
import fastdup
fastdup.run('/path/to/data', '/path/to/output')
top_components = fastdup.find_top_components('/path/to/output')
# keep only the components whose similarity is above 0.99
top_components = top_components[top_components['distance'] > 0.99]
fastdup.delete_components(top_components, None, how='one', dry_run=False)
Arguments:
top_components
pd.DataFrame - largest components as found by the function find_top_components().
to_delete
list - a list of integer component ids to delete. Defaults to None, which means delete duplicates from all components.
min_distance
float - delete only components whose similarity is larger than min_distance.
how
str - either 'all' (deletes the whole component) or 'one' (leaves one image and deletes the rest of the duplicates)
dry_run
bool - if True, does not delete but prints the rm commands that would be used; otherwise deletes the files
Returns:
ret
list - list of deleted files
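As a rough sketch of the two how modes, the hypothetical helper below picks the files to delete from a single component. Which image survives under how='one' is not specified above; keeping the first file is an assumption made only for this illustration.

```python
def files_to_delete(component_files, how="one"):
    """Illustrative helper (not fastdup code): choose files of one component to delete."""
    if how == "all":
        return list(component_files)       # delete the whole component
    if how == "one":
        return list(component_files[1:])   # assumption: keep the first file
    raise ValueError("how should be 'one' or 'all'")

component = ["a.jpg", "a_copy.jpg", "a_copy2.jpg"]
print(files_to_delete(component, how="one"))
print(files_to_delete(component, how="all"))
```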
delete_components_by_label
def delete_components_by_label(top_components_file,
min_items=10,
min_distance=0.96,
how='one',
dry_run=True)
Function to automate deletion of duplicate images using the connected components analysis.
Arguments:
top_components
pd.DataFrame - largest components as found by the function find_top_components().
to_delete
list - a list of integer component ids to delete
min_distance
float - delete only components whose similarity is larger than min_distance.
how
str - either 'all' (deletes the whole component) or 'majority' (leaves one image with the dominant label count and deletes the rest)
dry_run
bool - if True, does not delete but prints the rm commands that would be used; otherwise deletes the files
Returns:
ret
list - list of deleted files
delete_or_retag_stats_outliers
def delete_or_retag_stats_outliers(stats_file,
metric,
filename_col='filename',
label_col=None,
lower_percentile=None,
upper_percentile=None,
lower_threshold=None,
upper_threshold=None,
get_reformat_filename_func=None,
dry_run=True,
how='delete',
save_path=None,
work_dir=None)
Function to automate deletion of outlier files based on computed statistics.
Example:
import fastdup
fastdup.run('/my/data/', work_dir='out')
# delete the 5% of images with the lowest mean value (the darkest images)
fastdup.delete_or_retag_stats_outliers("out", metric="mean", lower_percentile=0.05, dry_run=False)
It is recommended to run with dry_run=True first, to see the list of files to be deleted before actually deleting them.
Example:
This example first finds wrong labels using the similarity gallery and then deletes anything with score < 51.
Score is in the range 0-100, where 100 means this image is similar only to images from the same class label.
Score 0 means this image is only similar to images from other class labels.
import fastdup
df2 = fastdup.create_similarity_gallery(..., get_label_func=...)
fastdup.delete_or_retag_stats_outliers(df2, metric='score', filename_col='from', lower_threshold=51, dry_run=True)
Note
- It is possible to run with both lower_percentile and upper_percentile at once. It is not possible to run with lower_percentile and lower_threshold at once since they may conflict.
Arguments:
stats_file (str):
- folder pointing to fastdup workdir or
- file pointing to work_dir/atrain_stats.csv file or
- pandas DataFrame containing the list of files given in the filename_col column and a metric column.
metric
str - statistic metric, should be one of "blur", "mean", "min", "max", "stdv", "unique", "width", "height", "size"
filename_col
str - column name in the stats_file to use as the filename
lower_percentile
float - lower percentile to use for the threshold. Values are 0->1, where 0.05 means remove 5% of the lowest values.
upper_percentile
float - upper percentile to use for the threshold. Values are 0->1, where 0.95 means remove 5% of the upper values.
lower_threshold
float - lower threshold to use for the threshold. Only used if lower_percentile is None.
upper_threshold
float - upper threshold to use for the threshold. Only used if upper_percentile is None.
get_reformat_filename_func
callable - Optional parameter to allow changing the filename into another string. Useful in the case fastdup was run on a different folder or machine and you would like to delete files in another folder.
dry_run
bool - if True, does not delete but prints the rm commands that would be used; otherwise deletes the files
how
str - either 'delete' or 'move' or 'retag'. In case of retag the allowed values are retag=labelImg or retag=cvat
save_path
str - optional. In case of a folder and how == 'retag' the label files will be moved to this folder.
work_dir
str - optional. In case of a stats dataframe, point to the fastdup work_dir.
Returns:
ret
list - list of deleted files (or moved or retagged files)
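To make the percentile semantics concrete (0.05 selects the lowest 5% of values, 0.95 selects the highest 5%), here is a minimal sketch. The helper select_by_percentile is hypothetical, and the simple rank-based cut is an assumption; fastdup may compute its thresholds differently.

```python
def select_by_percentile(values, lower_percentile=None, upper_percentile=None):
    """Illustrative helper (not fastdup code): indices of values in the selected tails."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    selected = []
    if lower_percentile is not None:
        # e.g. 0.05 -> the 5% of items with the lowest values
        selected += order[:round(lower_percentile * n)]
    if upper_percentile is not None:
        # e.g. 0.95 -> the 5% of items with the highest values
        selected += order[round(upper_percentile * n):]
    return sorted(set(selected))

mean_values = list(range(100))  # stand-in for a metric column such as "mean"
darkest = select_by_percentile(mean_values, lower_percentile=0.05)
brightest = select_by_percentile(mean_values, upper_percentile=0.95)
print(darkest, brightest)
```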
export_to_tensorboard_projector
def export_to_tensorboard_projector(work_dir,
log_dir,
sample_size=900,
sample_method='random',
with_images=True,
get_label_func=None,
d=576,
file_list=None)
Export feature vector embeddings to be visualized using tensorboard projector app.
Example:
import fastdup
fastdup.run('/my/data/', work_dir='out')
fastdup.export_to_tensorboard_projector(work_dir='out', log_dir='logs')
After the data is exported, run the tensorboard projector:
%load_ext tensorboard
%tensorboard --logdir=logs
Arguments:
work_dir
str - work_dir where fastdup results are stored
log_dir
str - output dir where tensorboard will read from
sample_size
int - how many images to view. Default is 900.
sample_method
str - how to sample, currently 'random' is supported.
with_images
bool - add images to the visualization (default True)
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.
d
int - dimension of the embedding vector. Default is 576.
file_list
list - Optional parameter to specify a list of files to be used for the visualization. If not specified, filenames are taken from the work_dir/atrain_features.dat.csv file.
Note
- Be careful here as the order of the file_list matters; it must keep the exact same order as the atrain_features.dat.csv file!
Returns:
ret
int - 0 in case of success, 1 in case of failure
generate_sprite_image
def generate_sprite_image(img_list,
sample_size,
log_dir,
get_label_func=None,
h=0,
w=0,
alternative_filename=None,
alternative_width=None,
max_width=None)
Generate a sprite image of images for tensorboard projector. A sprite image is a large image composed of a grid of smaller images.
Arguments:
img_list
list - list of image filenames (full path)
sample_size
int - how many images to plot
log_dir
str - directory to save the sprite image
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.
h
int - optional requested height of each subimage
w
int - optional requested width of each subimage
alternative_filename
str - optional parameter to save the resulting image to a different name
alternative_width
int - optional parameter to control the number of images per row
max_width
int - optional parameter to control the resulting width of the image
Returns:
path
str - path to sprite image
labels
list - list of labels
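The grid idea behind a sprite image can be sketched in a few lines. The helper sprite_layout is hypothetical; the row-major placement and the default of ceil(sqrt(n)) images per row are assumptions for illustration and not necessarily what generate_sprite_image does internally.

```python
import math

def sprite_layout(num_images, images_per_row=None):
    """Illustrative helper: (row, column) grid cell for each image, in row-major order."""
    if images_per_row is None:
        # assumption: use a roughly square grid when no width is requested
        images_per_row = math.ceil(math.sqrt(num_images))
    return [(i // images_per_row, i % images_per_row) for i in range(num_images)]

layout = sprite_layout(5)
print(layout)
```

With h and w as the size of each subimage, the pixel offset of a cell would be (row * h, column * w).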
find_top_components
def find_top_components(work_dir,
get_label_func=None,
group_by='visual',
slice=None,
threshold=None,
metric=None,
descending=True,
min_items=None,
max_items=None,
keyword=None,
save_path=None,
comp_type="component",
**kwargs)
Function to find the largest components of duplicate images
Arguments:
work_dir
str - working directory where fastdup.run was run.
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.
group_by
str - optional parameter to group by 'visual' or 'label'. When grouping by 'visual' fastdup aggregates visually similar images together.
When grouping by 'label' fastdup aggregates images with the same label together.
slice
str - optional parameter to slice the results by a specific label. For example, if you want to slice by 'car' then pass 'car' as the slice parameter.
threshold
float - optional threshold to select only distances larger than the threshold
metric
str - optional metric to sort by. Valid values are mean, min, max, unique, blur, size
descending
bool - optional value to sort the components, default is True
min_items
int - optional value, select only components with at least min_items
max_items
int - optional value, select only components with at most max_items
keyword
str - optional, select labels with keyword value inside
save_path
str - optional, save path
comp_type
str - optional, either component or cluster
Returns:
df
pd.DataFrame - of top components. The column component_id includes the component name.
The column files includes a list of all image files in this component.
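The effect of min_items, max_items and descending can be illustrated on plain Python data. The helper filter_components and the sample records are hypothetical, and sorting by the number of files is an assumption about what "largest" means here; the real function returns a pandas DataFrame.

```python
def filter_components(components, min_items=None, max_items=None, descending=True):
    """Illustrative helper (not fastdup code): filter components by size and sort them."""
    kept = [
        c for c in components
        if (min_items is None or len(c["files"]) >= min_items)
        and (max_items is None or len(c["files"]) <= max_items)
    ]
    return sorted(kept, key=lambda c: len(c["files"]), reverse=descending)

components = [
    {"component_id": 0, "files": ["a.jpg", "b.jpg"]},
    {"component_id": 1, "files": ["c.jpg", "d.jpg", "e.jpg"]},
    {"component_id": 2, "files": ["f.jpg"]},
]
top = filter_components(components, min_items=2)
print([c["component_id"] for c in top])
```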
init_search
def init_search(k, work_dir, d=576, model_path=model_path_full, verbose=False)
Initialize real time search and precomputed nnf data.
This function should be called only once before running searches. The search function is search().
Arguments:
k
int - number of nearest neighbors to search for
work_dir
str - working directory where fastdup.run was run.
d
int - dimension of the feature vector. Default is 576.
model_path
str - path to the onnx model file. Optional.
verbose
bool - (Optional): True for verbose mode.
Example:
import fastdup
input_dir = "/my/input/dir"
work_dir = "/my/work/dir"
fastdup.run(input_dir, work_dir)
# point to the work_dir where fastdup was run
fastdup.init_search(10, work_dir, verbose=True)
# The below code can be executed multiple times, each time with a new searched image
df = fastdup.search("myimage.jpg", None, verbose=True)
# optional: display search output
fastdup.create_duplicates_gallery(df, ".", input_dir=input_dir)
Note: the fastdup model was trained with image resize via Resampling.NEAREST and the BGR channel swapped to RGB.
If you use other models, check their requirements.
Returns:
ret
int - 0 in case of success, otherwise 1.
search
def search(img, size, verbose=0)
Search for similar images in the image database.
Arguments:
img
str - the image to search for
size
int - image size width x height
verbose
int - run in verbose mode
Returns:
ret
int - 0 in case of success, 1 in case of failure
The output file is created at work_dir/similarity.csv, using the work_dir initialized by init_search
create_stats_gallery
def create_stats_gallery(stats_file,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
metric='blur',
slice=None,
max_width=None,
descending=False,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
input_dir=None,
work_dir=None,
**kwargs)
Function to create and display a gallery of images computed by the statistics metrics.
Supported metrics are: mean (color), max (color), min (color), stdv (color), unique (number of unique colors), blurriness (computed by the variance of the Laplacian method,
see https://theailearner.com/2021/10/30/blur-detection-using-the-variance-of-the-laplacian-method/).
The metrics are created by fastdup.run() and stored in the work_dir in a file named atrain_stats.csv. Note that the metrics are computed
on the fly: fastdup loads and resizes every image only once.
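For intuition about the blur metric, the variance of the Laplacian can be sketched in pure Python on a small grayscale grid. This is a simplified illustration with a 4-neighbour kernel over interior pixels only; the function name laplacian_variance is hypothetical, and the kernel and border handling used by fastdup are not stated here.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels of a 2D list."""
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

flat = [[128] * 5 for _ in range(5)]  # no edges at all
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(5)] for y in range(5)]  # many edges
print(laplacian_variance(flat), laplacian_variance(checker))
```

A low value means few edges, which usually indicates a blurry image; sharp images give higher values.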
Arguments:
stats_file
str - csv file with the image statistics computed by the fastdup tool, alternatively a pandas dataframe. The default stats file is saved by fastdup.run() into the folder work_dir as atrain_stats.csv.
save_path
str - output folder location for the visuals
num_images
int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load
boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.
metric
str - Optional metric selection. Supported metrics are:
- width - of original image before resize
- height - of original image before resize
- size - area
- file_size - file size in bytes
- blur - variance of the laplacian
- unique - number of unique colors, 0..255
- mean - mean color 0..255
- max - max color 0..255
- min - min color 0..255
Advanced metrics include (for running advanced metrics, run with turi_param='run_advanced_stats=1'):
- contrast
- rms_contrast - square root of mean sum of stdv/mean per channel
- mean_rel_intensity_r
- mean_rel_intensity_b
- mean_rel_intensity_g
- mean_hue - transform to HSV and compute mean H
- mean_saturation - transform to HSV and compute mean S
- mean_val - transform to HSV and compute mean V
- edge_density - using canny filter
- mean_r - mean of R channel
- mean_g - mean of G channel
- mean_b - mean of B channel
slice
str - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
max_width
int - Optional parameter to select the maximal image width in the report
descending
bool - Optional parameter to control the order of the metric
get_bounding_box_func
callable - Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func
callable - Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.
get_extra_col_func
callable - Optional parameter to allow adding extra columns to the gallery.
input_dir
str - Optional parameter to specify the input directory of webdataset tar files,
used when working with webdataset tar files where the image was deleted after the run using turi_param='delete_img=1'
work_dir
str - Optional parameter to fastdup work_dir. Needed when stats file is a pd.DataFrame.
Returns:
ret
int - 0 in case of success, otherwise 1.
create_similarity_gallery
def create_similarity_gallery(similarity_file,
save_path,
num_images=20,
lazy_load=False,
get_label_func=None,
slice=None,
max_width=None,
descending=False,
get_bounding_box_func=None,
get_reformat_filename_func=None,
get_extra_col_func=None,
input_dir=None,
work_dir=None,
min_items=2,
max_items=None,
**kwargs)
Function to create and display a gallery of images computed by the similarity metric. In each table row one query image is
displayed and the num_images most similar images are displayed next to it on the right.
In case the dataset is labeled, the user can specify the label using the function get_label_func. In this case a score
metric is computed to reflect how similar the query image is to the most similar images in terms of class label.
Score 100 means that out of the top num_images similar images, all are from the same class. Score 0 means that the image is similar only to images from a different class.
Score 50 means that the query image is similar to the same number of images from the same class and from other classes. The report is sorted by the score metric.
For a high quality labeled dataset we expect the score to be high; a low score may indicate class label issues.
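One simple formula that agrees with the three reference points above (100, 50 and 0) is the percentage of neighbors that share the query label. The sketch below uses a hypothetical helper label_score; it is an assumption for illustration, and the exact computation inside fastdup may differ.

```python
def label_score(query_label, neighbor_labels):
    """Illustrative score: percentage of neighbors with the same label as the query."""
    if not neighbor_labels:
        raise ValueError("at least one neighbor label is required")
    same = sum(1 for label in neighbor_labels if label == query_label)
    return 100.0 * same / len(neighbor_labels)

print(label_score("cat", ["cat", "cat", "cat", "cat"]))  # all neighbors agree
print(label_score("cat", ["cat", "dog", "cat", "dog"]))  # half of the neighbors agree
print(label_score("cat", ["dog", "dog", "dog", "dog"]))  # no neighbor agrees
```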
Arguments:
similarity_file
str - csv file with the image statistics computed by the fastdup tool, or a path to the work_dir,
alternatively a pandas dataframe. In case of a pandas dataframe, work_dir needs to be set to point to the fastdup work_dir.
save_path
str - output folder location for the visuals
num_images
int - Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.
lazy_load
boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.
slice
str - Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
A special value is 'label_score' which is used for comparing both images and labels of the nearest neighbors. The score values are 0->100 where 0 means the query image is only similar to images outside its class, 100 means the query image is only similar to images from the same class.
max_width
int - Optional param to limit the image width
descending
bool - Optional param to control the order of the metric
get_bounding_box_func
callable - Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists.
get_reformat_filename_func
callable - Optional parameter to allow changing the presented filename into another string.
get_extra_col_func
callable - Optional parameter to allow adding extra columns to the report
input_dir
str - Optional parameter to specify the input directory of webdataset tar files,
used when working with webdataset tar files where the image was deleted after the run using turi_param='delete_img=1'
work_dir
str - Optional parameter to fastdup work_dir. Needed when similarity_file is a pd.DataFrame.
min_items
int - Optional parameter to select components with min_items or more
max_items
int - Optional parameter to limit the number of items displayed
Returns:
ret
pd.DataFrame - similarity dataframe, for each image filename returns a list of top K similar images.
each row has the columns 'from', 'to', 'label' (optional), 'distance'
create_aspect_ratio_gallery
def create_aspect_ratio_gallery(stats_file,
save_path,
get_label_func=None,
lazy_load=False,
max_width=None,
num_images=0,
slice=None,
get_filename_reformat_func=None,
input_dir=None,
**kwargs)
Function to create and display a gallery of aspect ratio distribution.
Arguments:
stats_file
str - csv file with the image statistics computed by the fastdup tool, or work_dir path, or a pandas dataframe with the stats computed by fastdup.
save_path
str - output folder location for the visuals
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.
lazy_load
boolean - If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).
max_width
int - optional parameter to limit the plot image width
num_images
int - optional number of images to compute the statistics on (default computes on all images)
slice
str - optional parameter to slice the stats file based on a specific label or a list of labels.
get_filename_reformat_func
callable - optional function to reformat the filename before displaying it.
input_dir
str - Optional parameter to specify the input directory of webdataset tar files,
used when working with webdataset tar files where the image was deleted after the run using turi_param='delete_img=1'
Returns:
ret
int - 0 in case of success, otherwise 1.
export_to_cvat
def export_to_cvat(files, labels, save_path)
Function to export a collection of files that need to be annotated again to cvat batch job format.
This creates a file named fastdup_label.zip in the directory save_path.
The files can be retagged in cvat using Tasks -> Add (plus button) -> Create from backup -> choose the location of the fastdup_label.zip file.
Arguments:
files (str):
labels (str):
save_path (str):
Returns:
ret
int - 0 in case of success, otherwise 1.
export_to_labelImg
def export_to_labelImg(files, labels, save_path)
Function to export a collection of files that need to be annotated again to cvat batch job format.
This creates a file named fastdup_label.zip in the directory save_path.
The files can be retagged in cvat using Tasks -> Add (plus button) -> Create from backup -> choose the location of the fastdup_label.zip file.
Arguments:
files (str):
labels (str):
save_path (str):
Returns:
ret
int - 0 in case of success, otherwise 1.
top_k_label
def top_k_label(labels_col,
distance_col,
k=10,
threshold=None,
min_count=None,
unknown_class=None)
Function to classify examples based on their label using the top k nearest neighbors.
Decision is made by accounting for the majority of the neighbors.
Arguments:
labels_col
list - list of labels
distance_col
list - list of distances
k
int - optional number of nearest neighbors to use (default 10)
threshold
float - optional parameter to consider neighbors with similarity larger than threshold
min_count
int - optional parameter to consider only examples with at least min_count neighbors with the same label
unknown_class
- optional parameter to add decisions to unknown class in cases there is no majority
Returns:
computed label
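The majority decision described above can be sketched as follows. The helper knn_majority_label is hypothetical and not the fastdup implementation: it assumes the lists are ordered from most to least similar, treats the distance values as similarities, and treats a tie as "no majority".

```python
from collections import Counter

def knn_majority_label(labels, similarities, k=10, threshold=None,
                       min_count=None, unknown_class=None):
    """Illustrative majority vote over the top k neighbors (not fastdup code)."""
    neighbors = list(zip(labels, similarities))[:k]
    if threshold is not None:
        # keep only neighbors whose similarity is larger than threshold
        neighbors = [(l, s) for l, s in neighbors if s > threshold]
    if not neighbors:
        return unknown_class
    ranked = Counter(l for l, _ in neighbors).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return unknown_class  # tie: no majority
    if min_count is not None and top_count < min_count:
        return unknown_class  # not enough neighbors with the same label
    return top_label

print(knn_majority_label(["cat", "cat", "dog"], [0.99, 0.95, 0.90]))
```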
create_knn_classifier
def create_knn_classifier(work_dir, k, get_label_func, threshold=None)
Function to create a knn classifier out of a fastdup run. We assume the datapoints have existing labels.
Arguments:
work_dir
str - fastdup work_dir, or location of a similarity file, or a pandas DataFrame with the computed similarities
k
int - (unused)
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.
threshold
float - optional threshold to consider neighbors with similarity larger than threshold
prediction per image to one of the given classes.
Returns:
df
pd.DataFrame - List of predictions using knn method
create_kmeans_classifier
def create_kmeans_classifier(work_dir, k, get_label_func, threshold=None)
Function to create a kmeans classifier out of a fastdup run. We assume the datapoints have existing labels.
Arguments:
work_dir
str - fastdup work_dir, or location of a similarity file, or a pandas DataFrame with the computed similarities
k
int - (unused)
get_label_func
callable - Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.
threshold
float - (unused)
Returns:
df
pd.DataFrame - dataframe with filename, label and predicted label. Row per each image
run_kmeans
def run_kmeans(input_dir='',
work_dir='.',
verbose=False,
num_clusters=100,
num_em_iter=20,
num_threads=-1,
num_images=0,
model_path=model_path_full,
license='',
nearest_neighbors_k=2,
d=576,
bounding_box="",
high_accuracy=False)
Run KMeans algorithm on a folder of images given by input_dir and save the results to work_dir.
Fastdup will extract feature vectors using the model specified by model_path and then run KMeans to cluster the vectors.
The results will be saved to work_dir in the following format:
kmeans_centroids.csv: a csv file containing the centroids of the clusters.
kmeans_assignments.csv: assignment of each data point to the closest centroids (number of centroids given by nearest_neighbors_k).
After running kmeans you can use create_kmeans_clusters_gallery to view the results.
Arguments:
input_dir
str - path to the folder containing the images to be clustered. See fastdup.run for more details.
work_dir
str - path to the folder where the results will be saved.
verbose
bool - verbosity level, default False
num_clusters
int - Number of KMeans clusters to use
num_em_iter
int - Number of em iterations
num_threads
int - Number of threads for performing the feature vector extraction
num_images
int - Limit the number of images
model_path
str - Model path for the model to be used for feature vector extraction
license
str - License string
nearest_neighbors_k
int - When assigning an image into a cluster, how many clusters to assign to (starting from the closest)
d
int - Dimension of the feature vector
bounding_box
str - Optional bounding box, see fastdup:::run for more details
high_accuracy
bool - Use higher accuracy model for the feature extraction
Returns:
ret
int - 0 in case of success, 1 in case of error
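A call to run_kmeans might be wrapped as follows. This is a sketch only: the wrapper name cluster_images and its run_kmeans parameter are hypothetical, added so the snippet can be exercised without fastdup installed. The keyword arguments and the result file names are taken from the signature and description above.

```python
import os


def cluster_images(input_dir, work_dir, num_clusters=100, run_kmeans=None):
    """Run KMeans through fastdup and return the two result file paths.

    `run_kmeans` is a hypothetical injection point for testing; when it
    is None, fastdup.run_kmeans is used.
    """
    if run_kmeans is None:
        import fastdup  # lazy import, requires fastdup to be installed
        run_kmeans = fastdup.run_kmeans

    ret = run_kmeans(
        input_dir=input_dir,
        work_dir=work_dir,
        num_clusters=num_clusters,
        num_em_iter=20,
        nearest_neighbors_k=2,
    )
    if ret != 0:  # documented return value: 0 on success, 1 on error
        raise RuntimeError(f"run_kmeans failed with status {ret}")

    return (
        os.path.join(work_dir, "kmeans_centroids.csv"),
        os.path.join(work_dir, "kmeans_assignments.csv"),
    )
```

Checking the return code before reading the csv files avoids picking up stale results from an earlier run in the same work_dir.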
run_kmeans_on_extracted
def run_kmeans_on_extracted(input_dir='',
work_dir='.',
verbose=False,
num_clusters=100,
num_em_iter=20,
num_threads=-1,
num_images=0,
model_path=model_path_full,
license='',
nearest_neighbors_k=2,
d=576)
Run KMeans algorithm on a folder of extracted feature vectors (created by default when running fastdup:::run).
The results will be saved to work_dir in the following format:
kmeans_centroids.csv: a csv file containing the centroids of the clusters, one centroid per row, num_clusters rows in total.
kmeans_assignments.csv: assignment of each data point to the closest centroids (number of centroids given by nearest_neighbors_k). Each row lists the image filename, the centroid id (starting from zero) and the L2 distance to the centroid.
After running kmeans you can use fastdup:::create_kmeans_clusters_gallery to view the results.
Arguments:
input_dir
str - path to the folder containing the images to be clustered. See fastdup:::run for more details.
work_dir
str - path to the folder where the results will be saved.
verbose
bool - verbosity level, default False
num_clusters
int - Number of KMeans clusters to use
num_em_iter
int - Number of EM iterations
num_threads
int - Number of threads for performing the feature vector extraction
num_images
int - Limit the number of images
model_path
str - Model path for the model to be used for feature vector extraction
license
str - License string
nearest_neighbors_k
int - When assigning an image into a cluster, how many clusters to assign to (starting from the closest)
d
int - Dimension of the feature vector
Returns:
ret
int - 0 in case of success, 1 in case of error
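To make the meaning of the assignment rows concrete, the sketch below computes, for each feature vector, the nearest_neighbors_k closest centroids and their L2 distances. It is a plain Python illustration of the row layout described above (filename, centroid id starting from zero, L2 distance), not fastdup's implementation; the function name and the toy vectors are made up for the example.

```python
import math


def assign_to_centroids(vectors, centroids, k=2):
    """Return rows of (name, centroid_id, l2_distance), k rows per vector."""
    rows = []
    for name, vec in vectors.items():
        # L2 distance from this vector to every centroid; ids start at zero.
        dists = [(math.dist(vec, c), cid) for cid, c in enumerate(centroids)]
        # Keep the k closest centroids, nearest first.
        for dist, cid in sorted(dists)[:k]:
            rows.append((name, cid, dist))
    return rows


centroids = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
vectors = {"a.jpg": (0.0, 1.0), "b.jpg": (3.0, 3.0)}
rows = assign_to_centroids(vectors, centroids, k=2)
# a.jpg is closest to centroid 0, b.jpg is closest to centroid 1.
```

With k=2 every image contributes two rows, the closest centroid first, which mirrors what nearest_neighbors_k=2 requests.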
extract_video_frames
def extract_video_frames(input_dir,
work_dir,
verbose=False,
num_threads=-1,
num_images=0,
min_offset=0,
max_offset=0,
turi_param="",
model_path=model_path_full,
d=576,
resize_video=0,
keyframes_only=1,
license="")
A function to go over a collection of videos and extract them into frames. The output is saved to the work_dir/tmp subfolder.
Arguments:
input_dir (str):
Location of the videos to extract.
- A folder
- A remote folder (s3 or minio starting with s3:// or minio://). When using minio append the minio server name for example minio://google/visual_db/sku110k.
- A file containing absolute filenames each on its own row.
- A file containing s3 full paths or minio paths each on its own row.
- A python list with absolute filenames
- We support avi/mp4 video formats.
work_dir
str - Optional path for storing intermediate files and results.
verbose
boolean - Verbosity.
num_threads
int - Number of threads. If no value is specified, the number of threads is auto configured by the number of cores.
num_images
unsigned long long - Number of images to run on. By default, run on all the images in the input_dir folder.
turi_param
str - Optional turi parameters separated by commas. Example run: turi_param='nnmodel=0,ccthreshold=0.99'
The following parameters are supported.
- nnmodel=xx, Nearest Neighbor model for clustering the features together. Supported options are 0 = brute_force (exact), 1 = ball_tree and 2 = lsh (both approximate).
- ccthreshold=xx, Threshold for running connected components to find clusters of similar images. Allowed values 0->1. The default ccthreshold is 0.96. This groups very similar images together, for example identical images or images that underwent simple transformations like scaling, flip, zoom in. The higher the score, the more similar the grouped images are and the smaller the clusters you will get. A score of 0.9 is pretty broad and will cluster images together even if their fine details are not similar. It is recommended to experiment with this parameter based on your dataset and then visualize the results using fastdup.create_components_gallery().
- run_cc=0|1 run connected components on the resulting similarity graph. Default is 1.
- run_pagerank=0|1 run pagerank on the resulting similarity graph. Default is 1.
- delete_tar=0|1 when working with tar files obtained from cloud storage delete the tar after download
- delete_img=0|1 when working with images obtained from cloud storage delete the image after download
- tar_only=0|1 run only on tar files and ignore images in folders. Default is 0.
- run_stats=0|1 compute image statistics. Default is 1.
- sync_s3_to_local=0|1 In case of using s3 bucket sync s3 to local folder to improve performance. Assumes there is enough local disk space to contain the data. Default is 0.
min_offset
unsigned long long - Optional min offset to start iterating on the full file list.
max_offset
unsigned long long - Optional max offset to start iterating on the full file list.
resize_video
int - 0 = do not resize video, 1 = resize video based on the model_path dimensions
keyframes_only
int - 0 = extract all frames, 1 = extract only keyframes
model_path
str - optional string to point to alternative onnx or ort model
d
int - output feature vector dimension of the model
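The effect of ccthreshold can be illustrated with a small connected components sketch. This is not fastdup's code: the function name, the toy similarity edges, and the choice of keeping edges whose score is greater than or equal to the threshold are all assumptions made for the example. It only shows why a higher threshold yields smaller clusters.

```python
def connected_components(num_nodes, similarities, threshold):
    """Group nodes linked by edges whose similarity is >= threshold."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in similarities:
        if score >= threshold:  # assumed comparison, see lead-in
            parent[find(a)] = find(b)

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# (image_a, image_b, similarity) triples for five hypothetical images.
edges = [(0, 1, 0.99), (1, 2, 0.93), (3, 4, 0.97)]
strict = connected_components(5, edges, threshold=0.96)
broad = connected_components(5, edges, threshold=0.90)
```

At 0.96 image 2 stays on its own, while at 0.90 the 0.93 edge pulls it into the first cluster, so the broader threshold produces fewer and larger groups.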
Returns:
ret
int - Status code 0 = success, 1 = error.
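Since turi_param is a comma separated list of key=value pairs, it can be convenient to build it from a dictionary. The sketch below does that and passes the result to extract_video_frames. The helper names build_turi_param and extract_keyframes, and the extract parameter used to swap in a stand-in function, are hypothetical; the keyword arguments follow the signature above.

```python
import os


def build_turi_param(options):
    """Join key/value options into the comma separated turi_param string."""
    return ",".join(f"{key}={value}" for key, value in options.items())


def extract_keyframes(input_dir, work_dir, options=None, extract=None):
    """Extract keyframes and return the folder where frames are written.

    `extract` is a hypothetical injection point for testing; when it is
    None, fastdup.extract_video_frames is used.
    """
    if extract is None:
        import fastdup  # lazy import, requires fastdup to be installed
        extract = fastdup.extract_video_frames

    ret = extract(
        input_dir,
        work_dir,
        keyframes_only=1,
        turi_param=build_turi_param(options or {}),
    )
    if ret != 0:  # documented status code: 0 = success, 1 = error
        raise RuntimeError(f"extract_video_frames failed with status {ret}")

    # The reference states that output is saved to the work_dir/tmp subfolder.
    return os.path.join(work_dir, "tmp")


turi_param = build_turi_param({"nnmodel": 0, "ccthreshold": 0.99})
```

Here turi_param reproduces the example string from the argument description, 'nnmodel=0,ccthreshold=0.99'.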