Zero-Shot Visual Data Enrichment

Enrich your visual data with zero-shot models such as Recognize Anything, Grounding DINO, Segment Anything and more.

fastdup provides a convenient enrichment API that lets you leverage the capabilities of these models in just a few lines of code.

In this post, we show an end-to-end example of how you can enrich the metadata of your visual data using open-source zero-shot models such as Recognize Anything, Grounding DINO, and Segment Anything.

Installation

First, install the necessary packages:

pip install -Uq fastdup mmengine mmdet groundingdino-py git+https://github.com/xinyu1205/recognize-anything.git git+https://github.com/facebookresearch/segment-anything.git gdown

Test the installation. If there's no error message, we are ready to go.

import fastdup
fastdup.__version__
'1.51'

Download Dataset

Download the coco-minitrain dataset, a curated mini training set containing roughly 20% of the COCO 2017 training set: 25,000 images with their annotations.

gdown --fuzzy https://drive.google.com/file/d/1iSXVTlkV1_DhdYpVDqsjlT4NJFQ7OkyK/view
unzip -qq coco_minitrain_25k.zip

Load Images

The archive extracts to the coco_minitrain_25k folder. We load the images from this folder into a DataFrame in the Zero-Shot Classification section below.

Zero-Shot Classification with RAM and Tag2Text

Within fastdup, you can readily use zero-shot classification models such as the Recognize Anything Model (RAM) and Tag2Text. Both exhibit strong recognition ability.

  • RAM is an image tagging model that can recognize any common category with high accuracy. Its authors report that it outperforms CLIP and BLIP on zero-shot tagging.
  • Tag2Text is a vision-language model guided by tagging, which can support caption, retrieval, and tagging.
Output comparison of BLIP, Tag2Text and RAM. Source - GitHub [repo](https://github.com/xinyu1205/recognize-anything).

1. Inference on a single image

We can use these models in fastdup in a few lines of code.

Let's suppose we'd like to run an inference on the following image.

from IPython.display import Image
Image("coco_minitrain_25k/images/val2017/000000181796.jpg")

We can just import the RecognizeAnythingModel and run an inference.

from fastdup.models_ram import RecognizeAnythingModel

model = RecognizeAnythingModel()
result = model.run_inference("coco_minitrain_25k/images/val2017/000000181796.jpg")

Let's inspect the results.

print(result)
bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass

Tip: As shown above, the model outputs all the tags associated with the query image.
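The tags come back as a single string separated by " . ". If you would rather work with a list, a minimal sketch along these lines may help. Note that parse_tags is a hypothetical helper written for this example, not part of fastdup, and the sample string is a shortened copy of the output above.

```python
def parse_tags(tag_string):
    """Split a period-separated tag string into a list of clean tags."""
    # Splitting on "." and stripping tolerates uneven spacing around the separator.
    return [tag.strip() for tag in tag_string.split(".") if tag.strip()]


# Shortened copy of the RAM output shown above (spelling is the model's own).
result = "bean . cup . table . dinning table . plate"
tags = parse_tags(result)
print(tags)
# ['bean', 'cup', 'table', 'dinning table', 'plate']
```

A list makes it straightforward to test for a specific tag, for example "cup" in tags.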

But what if you have a collection of images and would like to run zero-shot classification on all of them? For that, fastdup provides the fd.enrich API.

2. Inference on a batch of images

The fd.enrich API enriches the metadata of images listed in a DataFrame.

Let's first load the images from the folder into a DataFrame.

import pandas as pd
from fastdup.utils import get_images_from_path

fd = fastdup.create(input_dir='./coco_minitrain_25k')
filenames = get_images_from_path(fd.input_dir)
df = pd.DataFrame(filenames, columns=["filename"])

Here's a DataFrame with images loaded from the folder.

filename
0 coco_minitrain_25k/images/val2017/000000314182.jpg
1 coco_minitrain_25k/images/val2017/000000531707.jpg
2 coco_minitrain_25k/images/val2017/000000393569.jpg
3 coco_minitrain_25k/images/val2017/000000001761.jpg
4 coco_minitrain_25k/images/val2017/000000116208.jpg
5 coco_minitrain_25k/images/val2017/000000581781.jpg
6 coco_minitrain_25k/images/val2017/000000449579.jpg
7 coco_minitrain_25k/images/val2017/000000200152.jpg
8 coco_minitrain_25k/images/val2017/000000232563.jpg
9 coco_minitrain_25k/images/val2017/000000493864.jpg
10 coco_minitrain_25k/images/val2017/000000492362.jpg
11 coco_minitrain_25k/images/val2017/000000031217.jpg
12 coco_minitrain_25k/images/val2017/000000171050.jpg
13 coco_minitrain_25k/images/val2017/000000191288.jpg
14 coco_minitrain_25k/images/val2017/000000504074.jpg
15 coco_minitrain_25k/images/val2017/000000006763.jpg
16 coco_minitrain_25k/images/val2017/000000313588.jpg
17 coco_minitrain_25k/images/val2017/000000060090.jpg
18 coco_minitrain_25k/images/val2017/000000043816.jpg
19 coco_minitrain_25k/images/val2017/000000009400.jpg

Running a zero-shot recognition on the DataFrame is as easy as:

NUM_ROWS_TO_ENRICH = 10  # for demonstration, only run on 10 rows

df = fd.enrich(
    task='zero-shot-classification',
    model='recognize-anything-model',  # specify the model
    input_df=df,                       # the DataFrame of image files to enrich
    input_col='filename',              # the name of the filename column
    num_rows=NUM_ROWS_TO_ENRICH,       # number of rows in the DataFrame to enrich
    device="cuda",                     # device to run on: GPU ("cuda") or CPU
)

As a result, a new 'ram_tags' column is added to the DataFrame, listing the relevant tags for each image.

filename ram_tags
0 coco_minitrain_25k/images/val2017/000000314182.jpg appetizer . biscuit . bowl . broccoli . cream . carrot . chip . container . counter top . table . dip . plate . fill . food . platter . snack . tray . vegetable . white . yoghurt
1 coco_minitrain_25k/images/val2017/000000531707.jpg bench . black . seawall . coast . couple . person . sea . park bench . photo . sit . water
2 coco_minitrain_25k/images/val2017/000000393569.jpg bathroom . bed . bunk bed . child . doorway . girl . person . laptop . open . read . room . sit . slide . toilet bowl . woman
3 coco_minitrain_25k/images/val2017/000000001761.jpg plane . fighter jet . bridge . cloudy . fly . formation . jet . sky . water
4 coco_minitrain_25k/images/val2017/000000116208.jpg bacon . bottle . catch . cheese . table . dinning table . plate . food . wine . miss . pie . pizza . platter . sit . topping . tray . wine bottle . wine glass
5 coco_minitrain_25k/images/val2017/000000581781.jpg banana . bin . bundle . crate . display . fruit . fruit market . fruit stand . kiwi . market . produce . sale
6 coco_minitrain_25k/images/val2017/000000449579.jpg ball . catch . court . man . play . racket . red . service . shirt . stretch . swing . swinge . tennis . tennis court . tennis match . tennis player . tennis racket . woman
7 coco_minitrain_25k/images/val2017/000000200152.jpg attach . building . christmas light . flag . hang . light . illuminate . neon light . night . night view . pole . sign . traffic light . street corner . street sign
8 coco_minitrain_25k/images/val2017/000000232563.jpg catch . person . man . pavement . rain . stand . umbrella . walk
9 coco_minitrain_25k/images/val2017/000000493864.jpg beach . carry . catch . coast . man . sea . sand . smile . stand . surfboard . surfer . wet . wetsuit
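Once the tags are in a column, a natural next step is to see which tags occur most often. The following is a rough sketch using only the standard library. count_tags is a hypothetical helper, and the list simply copies four of the ram_tags values from the table above. Skipping non-string entries is a defensive assumption about rows that were not enriched.

```python
from collections import Counter


def count_tags(tag_strings):
    """Count in how many images each tag appears."""
    counts = Counter()
    for tag_string in tag_strings:
        # Assumption: rows that were not enriched hold a non-string value; skip them.
        if not isinstance(tag_string, str):
            continue
        # A set ensures each tag is counted at most once per image.
        unique_tags = {tag.strip() for tag in tag_string.split(".") if tag.strip()}
        counts.update(unique_tags)
    return counts


# Four ram_tags values copied from the table above.
ram_tags = [
    "bench . black . seawall . coast . couple . person . sea . park bench . photo . sit . water",
    "plane . fighter jet . bridge . cloudy . fly . formation . jet . sky . water",
    "catch . person . man . pavement . rain . stand . umbrella . walk",
    "beach . carry . catch . coast . man . sea . sand . smile . stand . surfboard . surfer . wet . wetsuit",
]

tag_counts = count_tags(ram_tags)
print(tag_counts["water"])        # 2
print(tag_counts.most_common(5))  # the five most frequent tags with their counts
```

With a DataFrame, you could pass df["ram_tags"] to the same function.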

Zero-Shot Detection with Grounding DINO

Apart from classification models, fastdup also supports zero-shot detection models like Grounding DINO (and more to come).

Grounding DINO is a powerful open-set zero-shot detection model. It accepts an image and a text prompt as input and outputs bounding boxes for the objects described in the prompt.

1. Inference on single image

fastdup provides an easy way to load the Grounding DINO model and run an inference.

Let's suppose we have the following image and would like to run an inference with the Grounding DINO model.

2. Inference on a DataFrame of images

To run the enrichment on a DataFrame, use the .enrich method and specify model='grounding-dino'. By default, fastdup loads the smaller Swin-T backbone variant for enrichment.

Also specify the DataFrame to run the enrichment on and the name of the column to use as input to the Grounding DINO model. In this example, we take the text prompt from the ram_tags column, which we computed earlier.

3. Searching for Specific Objects with Custom Text Prompt

To search for specific objects in your dataset, create a column in the DataFrame specifying the objects of interest and run the .enrich method.
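As a sketch of that step, the snippet below builds a " . "-separated prompt string, the same layout as the ram_tags values used earlier, and attaches it to every row. The helper build_prompt, the column name custom_prompt, and the object names are all made up for illustration. Plain dictionaries stand in for DataFrame rows so the example runs without extra dependencies.

```python
def build_prompt(objects):
    """Join object names into a ' . '-separated prompt, dropping blanks and duplicates."""
    prompt_parts = []
    for name in objects:
        cleaned = name.strip().lower()
        if cleaned and cleaned not in prompt_parts:
            prompt_parts.append(cleaned)
    return " . ".join(prompt_parts)


# Two filenames taken from the DataFrame listing earlier in this tutorial.
filenames = [
    "coco_minitrain_25k/images/val2017/000000314182.jpg",
    "coco_minitrain_25k/images/val2017/000000531707.jpg",
]

# Hypothetical objects of interest; the duplicate "face" is dropped.
prompt = build_prompt(["Face", " eye ", "hair", "face"])

# Each row carries the same prompt under a hypothetical "custom_prompt" key.
rows = [{"filename": filename, "custom_prompt": prompt} for filename in filenames]
print(rows[0]["custom_prompt"])
# face . eye . hair
```

With pandas, the equivalent would be df["custom_prompt"] = prompt, after which that column name can serve as the input column for the enrichment.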

Zero-Shot Segmentation with SAM

In addition to the zero-shot classification and detection models, fastdup also supports zero-shot segmentation using the Segment Anything Model (SAM) from Meta AI.

SAM produces high quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image.
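Box prompts are only useful if they lie inside the image. The sketch below clips boxes to the image bounds and drops any that collapse to zero area. It assumes boxes are (x_min, y_min, x_max, y_max) in pixels, which is the XYXY layout that the segment-anything predictor documents for box prompts. Whether the boxes in your own DataFrame use exactly this layout is an assumption you should check. clip_box_xyxy is a hypothetical helper, not a fastdup function, and the image size and box values are invented.

```python
def clip_box_xyxy(box, image_width, image_height):
    """Clip an (x_min, y_min, x_max, y_max) box to the image; return None if it collapses."""
    x_min, y_min, x_max, y_max = (float(value) for value in box)
    x_min = min(max(x_min, 0.0), float(image_width))
    x_max = min(max(x_max, 0.0), float(image_width))
    y_min = min(max(y_min, 0.0), float(image_height))
    y_max = min(max(y_max, 0.0), float(image_height))
    # A box with no width or height left after clipping cannot prompt a mask.
    if x_max <= x_min or y_max <= y_min:
        return None
    return (x_min, y_min, x_max, y_max)


# Invented boxes for a 640x480 image: one overflowing, one fully outside, one valid.
boxes = [(-5, 10, 700, 300), (650, 10, 700, 50), (100, 120, 200, 220)]
clipped = (clip_box_xyxy(box, 640, 480) for box in boxes)
valid_boxes = [box for box in clipped if box is not None]
print(valid_boxes)
# [(0.0, 10.0, 640.0, 300.0), (100.0, 120.0, 200.0, 220.0)]
```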

1. Inference on a single image

To run an inference using the SAM model, import the SegmentAnythingModel class and provide an image-bounding box pair as the input.

2. Inference on a DataFrame of images

As in the previous examples, you can use the enrich method to add masks to your DataFrame of images.

In the following code snippet, we load the SAM model and specify input_col='grounding_dino_bboxes' to allow SAM to use the bounding boxes as inputs.

Convert Annotations to COCO Format

Once the enrichment is complete, you can also conveniently export the DataFrame into the COCO .json annotation format. For now, only the bounding boxes and labels are exported. Masks will be added in a future release.
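For readers unfamiliar with the target format, here is a simplified sketch of what a COCO detection file contains. It is not fastdup's exporter. The to_coco function, the input record layout, the image size, and the box values are hypothetical. The one fixed point is that COCO stores each box as [x, y, width, height] measured from the top-left corner, so XYXY boxes need converting.

```python
import json


def to_coco(records, category_names):
    """Build a minimal COCO-style dict from records holding XYXY boxes and labels."""
    category_ids = {name: index for index, name in enumerate(category_names, start=1)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": cid, "name": name} for name, cid in category_ids.items()],
    }
    annotation_id = 1
    for image_id, record in enumerate(records, start=1):
        coco["images"].append({
            "id": image_id,
            "file_name": record["filename"],
            "width": record["width"],
            "height": record["height"],
        })
        for (x_min, y_min, x_max, y_max), label in zip(record["boxes"], record["labels"]):
            width, height = x_max - x_min, y_max - y_min
            coco["annotations"].append({
                "id": annotation_id,
                "image_id": image_id,
                "category_id": category_ids[label],
                "bbox": [x_min, y_min, width, height],  # COCO uses x, y, width, height
                "area": width * height,
                "iscrowd": 0,
            })
            annotation_id += 1
    return coco


# Hypothetical record: the filename is from this tutorial, the numbers are invented.
records = [{
    "filename": "coco_minitrain_25k/images/val2017/000000232563.jpg",
    "width": 640,
    "height": 480,
    "boxes": [(100, 50, 300, 400), (120, 20, 280, 120)],
    "labels": ["person", "umbrella"],
}]

coco = to_coco(records, ["person", "umbrella"])
print(json.dumps(coco["annotations"][0]))
```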

Run fastdup

You can optionally analyze the exported annotations in fastdup to evaluate the quality of the annotations.

Visualize

You can use any of fastdup's gallery methods to view duplicates, clusters, etc.

Let's view the image clusters.

Wrap Up

In this tutorial, we showed how you can run zero-shot models to enrich your dataset.

Questions about this tutorial? Reach out to us on our Slack channel!

VL Profiler - A faster and easier way to diagnose and visualize dataset issues

The team behind fastdup also recently launched VL Profiler, a no-code cloud-based platform that lets you leverage fastdup in the browser.

VL Profiler lets you find:

  • Duplicates/near-duplicates.
  • Outliers.
  • Mislabels.
  • Non-useful images.

Here's a highlight of the issues found in the RVL-CDIP test dataset on the VL Profiler.

Free Usage: Use VL Profiler for free to analyze issues in your dataset with up to 1,000,000 images.

Get started for free.

Not convinced yet?

Interact with a collection of datasets like ImageNet-21K, COCO, and DeepFashion here.

No sign-ups needed.

