This notebook is Part 2 of the dataset enrichment notebook series where we utilize various zero-shot models to enrich datasets.

Part 1 - Dataset Enrichment with Zero-Shot Classification Models
Part 2 - Dataset Enrichment with Zero-Shot Detection Models
Part 3 - Dataset Enrichment with Zero-Shot Segmentation Models

If you haven't checked out Part 1, we highly encourage you to go through it first before proceeding with this notebook.

👍
Purpose
In this notebook, we show an end-to-end example of how you can enrich the metadata of your visual dataset using open-source zero-shot models such as Grounding DINO, building on the output we obtained from Part 1.
By the end of the notebook, you'll learn how to:

Install and load Grounding DINO in fastdup.

Enrich your dataset using bounding boxes and labels generated by the Grounding DINO model.

Run inference using Grounding DINO on a single image.

Specify a custom prompt to search for objects of interest in your dataset.

Export the enriched dataset into COCO .json format.

Installation

First, let's install the necessary packages:

fastdup - To analyze issues in the dataset.
MMEngine, MMDetection, groundingdino-py - To use the Grounding DINO and MMDetection models.
gdown - To download demo data hosted on Google Drive.

Run the following to install all the above packages.

pip install -Uq fastdup mmengine mmdet groundingdino-py gdown

Test the installation. If there's no error message, we are ready to go.

import fastdup
fastdup.__version__

'1.57'

🚧
CUDA Runtime
fastdup runs perfectly on CPUs, but larger models like Grounding DINO run much slower on a CPU than on a GPU.
The code in this notebook can be run on either a CPU or a GPU.
However, we highly recommend running in a CUDA-enabled environment to reduce the run time. Running this notebook in Google Colab or Kaggle is a good start!
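
To check which device is available before running the heavier models, a minimal sketch like the one below may help. It assumes PyTorch is the library exposing CUDA in your environment. The helper name `pick_device` is hypothetical and not part of fastdup. It falls back to the CPU when PyTorch is missing or no GPU is visible.

```python
def pick_device() -> str:
    """Return 'cuda' if PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch  # assumption: PyTorch is the backend that exposes CUDA
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

The returned string uses the same two values, 'cpu' and 'cuda', that the device parameter of fd.enrich accepts.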

Download Dataset

Download the coco-minitrain dataset, a curated mini training set consisting of 20% of the COCO 2017 training dataset. It contains 25,000 images and their annotations.

Let's download and unzip the dataset.

gdown --fuzzy https://drive.google.com/file/d/1iSXVTlkV1_DhdYpVDqsjlT4NJFQ7OkyK/view
unzip -qq coco_minitrain_25k.zip
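
As a quick sanity check after unzipping, you could count the image files on disk. This is only a sketch: the helper `count_images` and the list of extensions are assumptions for illustration, and the folder name is the one produced by the unzip step above.

```python
from pathlib import Path


def count_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Count image files under `root`, searching subfolders recursively."""
    root_path = Path(root)
    if not root_path.is_dir():
        return 0  # folder missing, e.g. the download or unzip step failed
    return sum(
        1
        for path in root_path.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )


# Folder name taken from the unzip step above.
print(count_images("coco_minitrain_25k"))
```

A count of 0 usually means the archive was not extracted into the current working directory.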

Zero-Shot Detection with Grounding DINO

Apart from zero-shot recognition models, fastdup also supports zero-shot detection models like Grounding DINO (and more to come).

Grounding DINO is a powerful open-set zero-shot detection model. It accepts an image-text pair as input and outputs bounding boxes, each with a label and a confidence score, for the objects described in the text.

1. Inference on a bulk of images

In Part 1 of the enrichment notebook series, we utilized zero-shot image tagging models such as Recognize Anything Model and ran an inference over the images in our dataset.

We ended up with a DataFrame consisting of the filename and ram_tags columns, as follows.

	filename	ram_tags
0	coco_minitrain_25k/images/val2017/000000382734.jpg	bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white
1	coco_minitrain_25k/images/val2017/000000508730.jpg	baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy
2	coco_minitrain_25k/images/val2017/000000202339.jpg	bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk
3	coco_minitrain_25k/images/val2017/000000460929.jpg	beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap
4	coco_minitrain_25k/images/val2017/000000181796.jpg	bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass

If you'd like to reproduce the above DataFrame, the Part 1 notebook details the code you need to run.

We can now use the image tags from the above DataFrame in combination with Grounding DINO to further enrich the dataset with bounding boxes.

To run the enrichment on a DataFrame, use the fd.enrich method and specify model='grounding-dino'. By default, fastdup loads the smaller variant of the model (Swin-T backbone) for enrichment.

Also specify the DataFrame to run the enrichment on and the name of the column as the input to the Grounding DINO model. In this example, we take the text prompt from the ram_tags column which we have computed earlier.

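# Point fastdup at the unzipped dataset folder.
fd = fastdup.create(input_dir='./coco_minitrain_25k')

# df is the DataFrame with the ram_tags column produced in Part 1.
df = fd.enrich(task='zero-shot-detection',
               model='grounding-dino',
               input_df=df,
               input_col='ram_tags'
     )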

📘
More on fd.enrich
Enriches an input DataFrame by applying a specified model to perform a specific task.
Currently supports the following parameters:

Parameter	Type	Description	Optional	Default
`task`	str	The task to be performed. Supports `"zero-shot-classification"` , `"zero-shot-detection"` or `"zero-shot-segmentation"` as argument.	No	-
`model`	str	The model to be used. Supports `"grounding-dino"` if `task=='zero-shot-detection'` .	No	-
`input_df`	DataFrame	The Pandas `DataFrame` containing the data to be enriched.	No	-
`input_col`	str	The name of the column in `input_df` to be used as input for the model.	No	-
`num_rows`	int	Number of rows from the top of `input_df` to be processed. If not specified, all rows are processed.	Yes	`None`
`device`	str	The device used to run inference. Supports `'cpu'` or `'cuda'` as argument. Defaults to available devices if not specified.	Yes	`None`
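
To make the parameter table concrete, here is a hedged sketch of a small helper that checks arguments against the values listed above before calling fd.enrich. The helper `build_enrich_kwargs` is hypothetical and not part of fastdup. It only encodes what the table states, so it may be stricter or looser than the library itself.

```python
SUPPORTED_TASKS = {
    "zero-shot-classification",
    "zero-shot-detection",
    "zero-shot-segmentation",
}
SUPPORTED_DEVICES = {"cpu", "cuda"}


def build_enrich_kwargs(task, model, input_col, num_rows=None, device=None):
    """Validate arguments against the table above and return them as a dict."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"Unsupported task: {task!r}")
    # The table lists only "grounding-dino" for the detection task.
    if task == "zero-shot-detection" and model != "grounding-dino":
        raise ValueError(f"Unsupported model for detection: {model!r}")
    if num_rows is not None and num_rows <= 0:
        raise ValueError("num_rows must be a positive integer")
    if device is not None and device not in SUPPORTED_DEVICES:
        raise ValueError(f"Unsupported device: {device!r}")

    kwargs = {"task": task, "model": model, "input_col": input_col}
    # Optional parameters are only included when set, so the defaults apply.
    if num_rows is not None:
        kwargs["num_rows"] = num_rows
    if device is not None:
        kwargs["device"] = device
    return kwargs


kwargs = build_enrich_kwargs(
    task="zero-shot-detection",
    model="grounding-dino",
    input_col="ram_tags",
    num_rows=10,
    device="cpu",
)
print(kwargs)
```

The resulting dict could then be unpacked into the call, for example fd.enrich(input_df=df, **kwargs), to try the enrichment on the first 10 rows on the CPU before committing to the full dataset.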

Once done, you'll notice that three new columns are appended to the DataFrame: grounding_dino_bboxes, grounding_dino_scores, and grounding_dino_labels.

	filename	ram_tags	grounding_dino_bboxes	grounding_dino_scores	grounding_dino_labels
0	coco_minitrain_25k/images/val2017/000000382734.jpg	bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white	[(94.35, 479.79, 236.6, 589.37), (4.91, 3.74, 475.2, 637.34), (95.94, 514.92, 376.52, 638.46), (41.88, 37.47, 425.05, 637.11), (115.28, 602.27, 164.16, 635.21)]	[0.5791, 0.389, 0.4436, 0.3011, 0.36]	[bath, bathroom, floor, glass door, drain]
1	coco_minitrain_25k/images/val2017/000000508730.jpg	baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy	[(3.56, 2.74, 635.14, 475.59), (30.9, 104.9, 301.76, 476.29), (68.56, 105.02, 266.26, 267.82), (359.23, 116.82, 576.6, 475.89), (374.37, 116.78, 557.19, 254.07), (466.91, 0.72, 638.7, 117.03), (266.96, 433.88, 291.04, 476.78), (466.52, 349.26, 525.87, 405.73), (350.64, 272.64, 572.11, 476.47)]	[0.5868, 0.3726, 0.368, 0.3642, 0.3615, 0.3482, 0.3791, 0.3752, 0.3754]	[bathroom, toddler, hair, toddler, hair, bathroom accessory, hairbrush, diaper, chair]
2	coco_minitrain_25k/images/val2017/000000202339.jpg	bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk	[(73.28, 256.73, 135.63, 374.42), (103.57, 105.25, 267.71, 410.16), (98.31, 33.84, 271.81, 434.73), (203.72, 63.87, 463.35, 298.32), (147.5, 106.62, 163.5, 172.89), (164.11, 52.92, 272.88, 152.69), (0.49, 0.76, 82.85, 333.41), (1.95, 2.21, 477.75, 636.09), (1.75, 259.16, 478.42, 637.17), (398.16, 281.21, 479.01, 545.03), (147.03, 106.65, 163.66, 227.85)]	[0.6049, 0.4596, 0.4694, 0.4096, 0.3593, 0.4455, 0.4377, 0.3448, 0.3001, 0.3044, 0.4768]	[shopping bag, business suit, man, city bus, tie, sign, bus, bus station, walk, carry, tie]
3	coco_minitrain_25k/images/val2017/000000460929.jpg	beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap	[(288.12, 1.02, 423.48, 414.45), (178.93, 355.73, 327.0, 636.63), (214.55, 514.74, 280.53, 569.35), (234.04, 369.95, 286.26, 545.45), (1.41, 0.72, 478.31, 279.28), (5.18, 264.36, 476.5, 637.38), (170.21, 351.45, 356.89, 637.53), (18.26, 244.39, 415.3, 637.95), (211.98, 364.96, 287.85, 629.12), (295.22, 275.45, 399.62, 353.16), (1.46, 79.69, 477.83, 234.29)]	[0.5651, 0.3984, 0.3905, 0.3876, 0.3097, 0.3791, 0.3969, 0.3423, 0.4304, 0.3005, 0.3383]	[beer bottle, hot dog, mustard, tomato sauce, car, picnic table, hot dog, tinfoil, chili dog, condiment, car]
4	coco_minitrain_25k/images/val2017/000000181796.jpg	bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass	[(105.15, 0.56, 239.98, 190.18), (214.13, 60.52, 298.47, 154.0), (163.61, 136.45, 501.68, 358.81), (495.3, 47.58, 553.59, 98.79), (520.29, 27.48, 564.41, 58.55), (402.74, 177.07, 594.84, 222.6), (136.47, 45.8, 226.12, 98.31), (478.4, 31.41, 524.09, 67.48), (349.21, 27.77, 610.12, 119.72), (364.94, 264.18, 470.49, 335.62), (1.75, 1.75, 637.54, 358.17), (311.01, 48.55, 509.34, 102.32), (359.97, 245.94, 425.25, 299.72), (359.62, 0.35, 517.83, 28.28), (94.78, 165.47, 228.95, 209.81), (532.25, 0.4, 639.08, 80.23), (404.16, 156.78, 638.89, 344.61), (202.34, 144.76, 380.58, 285.75), (179.9, 138.23, 482.39, 358.21)]	[0.8581, 0.6903, 0.5242, 0.5105, 0.5285, 0.4578, 0.7963, 0.4057, 0.4656, 0.4421, 0.5056, 0.3742, 0.3959, 0.4158, 0.3858, 0.3345, 0.377, 0.3035, 0.3196]	[wine glass, cup, plate, cup, cup, fork, wine, cup, plate, vegetable, table, utensil, potato, platter, utensil, platter, silverware, meat, meal]
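
The three new columns are parallel lists: the i-th box, score, and label of a row describe the same detection. The sketch below pairs them up for row 0 and lists which prompt tags produced no detection. The helper names are hypothetical, and the comparison assumes the returned labels match the prompt tags exactly, which may not hold for every image.

```python
def split_prompt(prompt):
    """Split a '.'-separated prompt such as 'bath . drain' into clean tags."""
    return [tag.strip() for tag in prompt.split(".") if tag.strip()]


def undetected_tags(prompt, detected_labels):
    """Return prompt tags with no matching detection, in prompt order."""
    detected = set(detected_labels)
    return [tag for tag in split_prompt(prompt) if tag not in detected]


# Values copied from row 0 of the table above.
ram_tags = ("bath . bathroom . doorway . drain . floor . glass door . "
            "room . screen door . shower . white")
labels = ["bath", "bathroom", "floor", "glass door", "drain"]
scores = [0.5791, 0.389, 0.4436, 0.3011, 0.36]

# Each label lines up with the score at the same position.
for label, score in zip(labels, scores):
    print(f"{label}: {score:.2f}")

print(undetected_tags(ram_tags, labels))
# ['doorway', 'room', 'screen door', 'shower', 'white']
```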

Now let's plot the results of the enrichment using the plot_annotations function.

from fastdup.models_utils import plot_annotations

plot_annotations(df, 
                 image_col='filename',                # column specifying image filenames
                 tags_col='ram_tags',                 # column specifying image labels
                 bbox_col='grounding_dino_bboxes',    # column specifying bounding boxes
                 scores_col='grounding_dino_scores',  # column specifying label scores
                 labels_col='grounding_dino_labels',  # column specifying label text
                 num_rows=10                          # the number of rows in the dataframe to plot
)

Search for Specific Objects with Custom Text Prompt

Suppose you'd like to search for specific objects in your dataset. You can create a column in the DataFrame specifying the objects of interest and run the .enrich method on it.

Let's create a column in our DataFrame and name it custom_prompt.

df["custom_prompt"] = "face . eye . hair . "
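
If the objects of interest live in a Python list, the prompt string can be assembled rather than typed by hand. The helper `build_prompt` below is a hypothetical sketch, not a fastdup function. It reproduces the " . " separator and trailing " . " used in the prompt above.

```python
def build_prompt(objects):
    """Join object names into a ' . '-separated prompt with a trailing ' . '."""
    cleaned = [name.strip() for name in objects if name.strip()]
    if not cleaned:
        raise ValueError("At least one object name is required")
    return " . ".join(cleaned) + " . "


prompt = build_prompt(["face", "eye", "hair"])
print(repr(prompt))  # 'face . eye . hair . '
```

The result could be assigned to the column in the same way, for example df["custom_prompt"] = prompt.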

Now let's run the enrichment with the custom prompt column.

df = fd.enrich(task='zero-shot-detection',  
               model='grounding-dino',  
               input_df=df,  
               input_col='custom_prompt'  
     )

df

	filename	ram_tags	grounding_dino_bboxes	grounding_dino_scores	grounding_dino_labels	custom_prompt
0	coco_minitrain_25k/images/val2017/000000382734.jpg	bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white	[]	[]	[]	face . eye . hair .
1	coco_minitrain_25k/images/val2017/000000508730.jpg	baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy	[(111.59, 183.91, 211.05, 300.08), (373.02, 117.49, 557.49, 255.62), (429.51, 205.88, 512.16, 275.5), (68.17, 105.79, 267.42, 265.81), (167.08, 234.47, 190.87, 247.09), (486.49, 222.14, 503.53, 232.73), (121.71, 238.6, 144.88, 249.47), (449.39, 223.3, 466.83, 232.75)]	[0.5336, 0.5664, 0.4492, 0.5795, 0.4003, 0.3643, 0.3969, 0.3435]	[face, hair, face, hair, eye, eye, eye, eye]	face . eye . hair .
2	coco_minitrain_25k/images/val2017/000000202339.jpg	bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk	[(135.15, 44.16, 172.74, 96.73), (133.59, 34.84, 179.46, 67.79), (153.42, 59.65, 163.44, 66.16)]	[0.504, 0.367, 0.3161]	[face, hair, eye]	face . eye . hair .
3	coco_minitrain_25k/images/val2017/000000460929.jpg	beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap	[]	[]	[]	face . eye . hair .
4	coco_minitrain_25k/images/val2017/000000181796.jpg	bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass	[]	[]	[]	face . eye . hair .

Not all images contain "face", "eye" and "hair". Let's remove the rows with no detections and plot only the rows with detections.

df = df[df['grounding_dino_labels'].astype(bool)]
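
The filter above relies on empty lists being falsy in Python, so .astype(bool) maps rows without detections to False. Here is a minimal plain-Python sketch of the same idea, using a hypothetical list of row dicts in place of the DataFrame (filenames shortened, labels copied from the table above):

```python
# Hypothetical stand-in for the DataFrame: one dict per image.
rows = [
    {"filename": "000000382734.jpg", "grounding_dino_labels": []},
    {"filename": "000000508730.jpg",
     "grounding_dino_labels": ["face", "hair", "face", "hair",
                               "eye", "eye", "eye", "eye"]},
    {"filename": "000000202339.jpg",
     "grounding_dino_labels": ["face", "hair", "eye"]},
    {"filename": "000000460929.jpg", "grounding_dino_labels": []},
    {"filename": "000000181796.jpg", "grounding_dino_labels": []},
]

# bool([]) is False and bool(["face"]) is True, so only rows with
# at least one detection survive the filter.
rows_with_detections = [
    row for row in rows if bool(row["grounding_dino_labels"])
]

for row in rows_with_detections:
    print(row["filename"], len(row["grounding_dino_labels"]))
```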

And plot the results.

plot_annotations(df, 
                 image_col='filename', 
                 tags_col='custom_prompt', 
                 bbox_col='grounding_dino_bboxes', 
                 scores_col='grounding_dino_scores', 
                 labels_col='grounding_dino_labels', 
                 num_rows=10
)

2. Inference on a single image

fastdup provides an easy way to load the Grounding DINO model and run an inference.

Let's suppose we have the following image and would like to run an inference with the Grounding DINO model.

from IPython.display import Image
Image("coco_minitrain_25k/images/val2017/000000449996.jpg")

You'll have to import the module and provide it with an image-text input pair.

from fastdup.models_grounding_dino import GroundingDINO

model = GroundingDINO()
results = model.run_inference(image_path="coco_minitrain_25k/images/val2017/000000449996.jpg",
                              text_prompt="air field . airliner . plane . airport . airport runway . airport terminal . jet . land . park . raceway . sky . tarmac . terminal",
                              box_threshold=0.3,
                              text_threshold=0.25
          )

📘
Note
Individual phrases in the text prompt must be separated with " . ".

By default, fastdup uses the smaller variant of Grounding DINO (Swin-T backbone).

The results variable contains a dict with labels, scores and bounding boxes.

{'labels': ['sky',
  'airport terminal',
  'plane',
  'airliner',
  'jet',
  'airport terminal',
  'jet',
  'tarmac'],
 'scores': [0.5281, 0.3444, 0.3824, 0.4883, 0.386, 0.3005, 0.3512, 0.3034],
 'boxes': [(1.47, 1.45, 638.46, 241.38),
  (329.36, 291.55, 468.11, 319.69),
  (142.03, 247.3, 261.97, 296.54),
  (443.6, 111.93, 495.47, 130.84),
  (113.85, 290.28, 246.56, 340.23),
  (518.36, 291.7, 638.48, 324.26),
  (391.59, 271.71, 465.11, 295.5),
  (2.34, 277.73, 637.63, 425.31)]}
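
Because labels, scores, and boxes are parallel lists, they can be zipped together and ranked. The sketch below keeps the highest-scoring detections above a chosen cut-off. The helper `top_detections` and the 0.35 cut-off are illustrative assumptions, not part of fastdup.

```python
# Values copied from the results dict above.
results = {
    "labels": ["sky", "airport terminal", "plane", "airliner",
               "jet", "airport terminal", "jet", "tarmac"],
    "scores": [0.5281, 0.3444, 0.3824, 0.4883, 0.386, 0.3005, 0.3512, 0.3034],
    "boxes": [(1.47, 1.45, 638.46, 241.38),
              (329.36, 291.55, 468.11, 319.69),
              (142.03, 247.3, 261.97, 296.54),
              (443.6, 111.93, 495.47, 130.84),
              (113.85, 290.28, 246.56, 340.23),
              (518.36, 291.7, 638.48, 324.26),
              (391.59, 271.71, 465.11, 295.5),
              (2.34, 277.73, 637.63, 425.31)],
}


def top_detections(results, min_score=0.0, k=3):
    """Return up to k (label, score, box) tuples with score >= min_score."""
    detections = zip(results["labels"], results["scores"], results["boxes"])
    kept = [d for d in detections if d[1] >= min_score]
    # Highest confidence first.
    return sorted(kept, key=lambda d: d[1], reverse=True)[:k]


for label, score, box in top_detections(results, min_score=0.35):
    print(f"{label:<10} {score:.4f} {box}")
```

With these values, the three detections kept are sky, airliner, and jet, in that order.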

Let's plot the image and results using the annotate_image convenience function.

from fastdup.models_utils import annotate_image

annotate_image("coco_minitrain_25k/images/val2017/000000449996.jpg", results)

You can optionally load another variant of Grounding DINO (Swin-B backbone) from the official Grounding DINO repo.

Download the weights and config into your local directory and pass them as arguments to the GroundingDINO constructor.

model = GroundingDINO(model_config="GroundingDINO_SwinB_cfg.py", 
                      model_weights="groundingdino_swinb_cogcoor.pth")

Convert Annotations to COCO Format

Once the enrichment is complete, you can also conveniently export the DataFrame into the COCO .json annotation format. For now, only the bounding boxes and labels are exported. Masks will be added in a future release.

from fastdup.models_utils import export_to_coco

export_to_coco(df, 
               bbox_col='grounding_dino_bboxes', 
               label_col='grounding_dino_labels', 
               json_filename='grounding_dino_annot_coco_format.json'
)
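
For readers unfamiliar with the target format, the sketch below builds a minimal COCO-style dictionary by hand. It illustrates the format only and does not describe what export_to_coco does internally. It assumes the boxes are (x_min, y_min, x_max, y_max) tuples, whereas COCO stores boxes as [x, y, width, height]. The helper names are hypothetical, and real COCO files usually also record each image's width and height.

```python
import json


def xyxy_to_xywh(box):
    """Convert (x_min, y_min, x_max, y_max) to COCO's [x, y, width, height]."""
    x_min, y_min, x_max, y_max = box
    return [round(x_min, 2), round(y_min, 2),
            round(x_max - x_min, 2), round(y_max - y_min, 2)]


def to_coco(rows):
    """Build a minimal COCO-style dict from (filename, boxes, labels) rows."""
    categories = {}  # label name -> category id
    images, annotations = [], []
    for image_id, (filename, boxes, labels) in enumerate(rows, start=1):
        images.append({"id": image_id, "file_name": filename})
        for box, label in zip(boxes, labels):
            # Reuse the id of a known label, or assign the next free one.
            category_id = categories.setdefault(label, len(categories) + 1)
            x, y, w, h = xyxy_to_xywh(box)
            annotations.append({
                "id": len(annotations) + 1,
                "image_id": image_id,
                "category_id": category_id,
                "bbox": [x, y, w, h],
                "area": round(w * h, 2),
                "iscrowd": 0,
            })
    return {
        "images": images,
        "annotations": annotations,
        "categories": [{"id": i, "name": n} for n, i in categories.items()],
    }


# One row with two detections taken from the custom-prompt results above.
rows = [(
    "coco_minitrain_25k/images/val2017/000000202339.jpg",
    [(135.15, 44.16, 172.74, 96.73), (133.59, 34.84, 179.46, 67.79)],
    ["face", "hair"],
)]
print(json.dumps(to_coco(rows), indent=2))
```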

Wrap Up

In this tutorial, we showed how you can run zero-shot image detection models to enrich your dataset.

👍
Next Up
Try out the Google Colab and Kaggle notebooks to reproduce this example.
Also, check out Part 3 of the series, where we explore how to enrich the dataset using zero-shot segmentation models. See you there!

Questions about this tutorial? Reach out to us on our Slack channel!

VL Profiler - A faster and easier way to diagnose and visualize dataset issues

The team behind fastdup also recently launched VL Profiler, a no-code cloud-based platform that lets you leverage fastdup in the browser.

VL Profiler lets you find:

Duplicates/near-duplicates.
Outliers.
Mislabels.
Non-useful images.

Here's a highlight of the issues found in the RVL-CDIP test dataset on the VL Profiler.

👍
Free Usage
Use VL Profiler for free to analyze issues on your dataset with up to 1,000,000 images.
Get started for free.

Not convinced yet?

Interact with a collection of datasets like ImageNet-21K, COCO, and DeepFashion here.

No sign-ups needed.
