First, let's install the neccessary packages:
- fastdup - To analyze issues in the dataset.
- TIMM (PyTorch Image Models) - To acquire pre-trained models.
pip install -Uq fastdup timm
Now, test the installation. If there's no error message, we are ready to go.
import fastdup fastdup.__version__
In this notebook, we will the Price Match Guarantee Dataset from Shopee from Kaggle.
The dataset consists of images from users who sell products on the Shopee online platform.
Download the dataset here, unzip, and place it in the current directory.
Here's a snapshot showing some of the images from the dataset.
There are over a thousand models on TIMM. Let's list down models that match the keyword
import timm timm.list_models("*dino*", pretrained=True)
['resmlp_12_224.fb_dino', 'resmlp_24_224.fb_dino', 'vit_base_patch8_224.dino', 'vit_base_patch14_dinov2.lvd142m', 'vit_base_patch16_224.dino', 'vit_giant_patch14_dinov2.lvd142m', 'vit_large_patch14_dinov2.lvd142m', 'vit_small_patch8_224.dino', 'vit_small_patch14_dinov2.lvd142m', 'vit_small_patch16_224.dino']
Now, pick a model of your choice. For demonstration, we will go with a relatively new model
vit_small_patch14_dinov2.lvd142m from MetaAI.
DINOv2 models produce high-performance visual features that can be directly employed with classifiers as simple as linear layers on a variety of computer vision tasks; these visual features are robust and perform well across domains without any requirement for fine-tuning. Read more about DINOv2 here.
It makes sense for us to use DINOv2 as a model to create an embedding of the dataset.
Loading TIMM models in fastdup is seamless with the
TimmEncoder wrapper class. This ensures all TIMM models can be used in fastdup to compute the embeddings of your dataset.
Under the hood, the wrapper class loads the model from TIMM excluding the final classification layer.
Next, let's load the DINOv2 model using the
from fastdup.embeddings_timm import TimmEncoder timm_model = TimmEncoder('vit_small_patch14_dinov2.lvd142m')
Here are other the parameters for
model_name(str): The name of the model architecture to use.
num_classes(int): The number of classes for the model. Use num_features=0 to exclude the last layer. Default:
pretrained(bool): Whether to load pretrained weights. Default:
device(str): Which device to load the model on. Choices: "cuda" or "cpu". Default:
torch_compile(bool): Whether to use
torch.compileto optimize model. Default
To start computing embeddings, specify the directory where the images are stored.
Once done, the embeddings are stored in a folder named
saved_embeddings in the current directory as a
numpy array with the appropriate model name.
For this example the file name is
You can optionally specify the
save_dirparameter to specfify a directory to save the embeddings.
Now what's left is to load the embeddings into fastdup and run an analysis to surface dataset issues.
fd = fastdup.create(input_dir=timm_model.img_folder) fd.run(annotations=timm_model.file_paths, embeddings=timm_model.embeddings)
You can use all of fastdup gallery methods to view duplicates, clusters, etc.
Let's view the image clusters.
And duplicates gallery.
In this tutorial, we showed how you can compute embeddings on your dataset using TIMM and run fastdup on top of it to surface dataset issues.
Questions about this tutorial? Reach out to us on our Slack channel!
The team behind fastdup also recently launched VL Profiler, a no-code cloud-based platform that lets you leverage fastdup in the browser.
VL Profiler lets you find:
- Non-useful images.
Here's a highlight of the issues found in the RVL-CDIP test dataset on the VL Profiler.
Use VL Profiler for free to analyze issues on your dataset with up to 1,000,000 images.
Not convinced yet?
Interact with a collection of datasets like ImageNet-21K, COCO, and DeepFashion here.
No sign-ups needed.
Updated about 1 month ago