Metadata Enrichment with Zero-Shot Detection Models

Enrich your visual data with zero-shot image detection models such as Grounding DINO and more to come.


This notebook is Part 2 of the dataset enrichment notebook series where we utilize various zero-shot models to enrich datasets.

  • Part 1 - Dataset Enrichment with Zero-Shot Classification Models
  • Part 2 - Dataset Enrichment with Zero-Shot Detection Models
  • Part 3 - Dataset Enrichment with Zero-Shot Segmentation Models

If you haven't checked out Part 1, we highly encourage you to go through it first before proceeding with this notebook.

👍

Purpose

In this notebook, we show an end-to-end example of how you can enrich the metadata of your visual data using open-source zero-shot models such as Grounding DINO, building on the output we obtained in Part 1.

By the end of the notebook, you'll learn how to:

  • Install and load the Grounding DINO model in fastdup.
  • Enrich your dataset using bounding boxes and labels generated by the Grounding DINO model.
  • Run inference using Grounding DINO on a single image.
  • Specify a custom prompt to search for objects of interest in your dataset.
  • Export the enriched dataset into COCO .json format.

Installation

First, let's install the necessary packages by running the following command:

pip install -Uq fastdup mmengine mmdet groundingdino-py gdown

Test the installation. If there's no error message, we are ready to go.

import fastdup
fastdup.__version__
'1.57'

🚧

CUDA Runtime

fastdup runs perfectly well on CPUs, but larger models like Grounding DINO run much slower on a CPU than on a GPU.

The code in this notebook can be run on either a CPU or a GPU.

However, we highly recommend running it in a CUDA-enabled environment to reduce the run time. Running this notebook in Google Colab or Kaggle is a good start!
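
If you're unsure which device your environment offers, a quick check like the one below can help. This is a minimal sketch, not part of fastdup: pick_device is a hypothetical helper, and it assumes PyTorch is the framework backing the model (it falls back to 'cpu' when PyTorch isn't installed).

```python
import importlib.util


def pick_device() -> str:
    """Return 'cuda' if PyTorch is installed and can see a GPU, otherwise 'cpu'."""
    # Hypothetical helper for illustration; not part of the fastdup API.
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

The resulting string matches the values accepted by the device parameter of fd.enrich, described later in this notebook.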

Download Dataset

We'll use the coco-minitrain dataset, a curated mini training set consisting of 20% of the COCO 2017 training dataset. It contains 25,000 images and their annotations.

First, let's download and unzip the dataset.

gdown --fuzzy https://drive.google.com/file/d/1iSXVTlkV1_DhdYpVDqsjlT4NJFQ7OkyK/view
unzip -qq coco_minitrain_25k.zip

Zero-Shot Detection with Grounding DINO

Apart from zero-shot recognition models, fastdup also supports zero-shot detection models like Grounding DINO (and more to come).

Grounding DINO is a powerful open-set zero-shot detection model. It accepts an image-text pair as input and outputs bounding boxes, each with a matching label and confidence score, for the objects described in the text.

1. Inference on a batch of images

In Part 1 of the enrichment notebook series, we utilized zero-shot image tagging models such as the Recognize Anything Model and ran inference over the images in our dataset.

We ended up with a DataFrame consisting of the filename and ram_tags columns, as follows.

filename ram_tags
0 coco_minitrain_25k/images/val2017/000000382734.jpg bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white
1 coco_minitrain_25k/images/val2017/000000508730.jpg baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy
2 coco_minitrain_25k/images/val2017/000000202339.jpg bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk
3 coco_minitrain_25k/images/val2017/000000460929.jpg beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap
4 coco_minitrain_25k/images/val2017/000000181796.jpg bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass

If you'd like to reproduce the above DataFrame, the Part 1 notebook details the code you need to run.

We can now use the image tags from the above DataFrame in combination with Grounding DINO to further enrich the dataset with bounding boxes.

To run the enrichment on a DataFrame, use the fd.enrich method and specify model='grounding-dino'. By default, fastdup loads the smaller variant of the model (Swin-T backbone) for enrichment.

You also need to specify the DataFrame to run the enrichment on and the name of the column that provides the text input to the Grounding DINO model. In this example, we take the text prompt from the ram_tags column, which we computed earlier.

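fd = fastdup.create(input_dir='./coco_minitrain_25k')

# df is the DataFrame from Part 1 containing the filename and ram_tags columns
df = fd.enrich(task='zero-shot-detection',
               model='grounding-dino',
               input_df=df,
               input_col='ram_tags'    # column providing the text prompts
     )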

📘

More on fd.enrich

Enriches an input DataFrame by applying a specified model to perform a specific task.

Currently supports the following parameters:

  • task (str, required): The task to be performed. Supports "zero-shot-classification", "zero-shot-detection" or "zero-shot-segmentation" as argument.
  • model (str, required): The model to be used. Supports "grounding-dino" if task=='zero-shot-detection'.
  • input_df (DataFrame, required): The Pandas DataFrame containing the data to be enriched.
  • input_col (str, required): The name of the column in input_df to be used as input for the model.
  • num_rows (int, optional, default None): Number of rows from the top of input_df to be processed. If not specified, all rows are processed.
  • device (str, optional, default None): The device used to run inference. Supports 'cpu' or 'cuda' as argument. Defaults to available devices if not specified.

Once done, you'll notice that three new columns are appended to the DataFrame: grounding_dino_bboxes, grounding_dino_scores, and grounding_dino_labels.

filename ram_tags grounding_dino_bboxes grounding_dino_scores grounding_dino_labels
0 coco_minitrain_25k/images/val2017/000000382734.jpg bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white [(94.35, 479.79, 236.6, 589.37), (4.91, 3.74, 475.2, 637.34), (95.94, 514.92, 376.52, 638.46), (41.88, 37.47, 425.05, 637.11), (115.28, 602.27, 164.16, 635.21)] [0.5791, 0.389, 0.4436, 0.3011, 0.36] [bath, bathroom, floor, glass door, drain]
1 coco_minitrain_25k/images/val2017/000000508730.jpg baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy [(3.56, 2.74, 635.14, 475.59), (30.9, 104.9, 301.76, 476.29), (68.56, 105.02, 266.26, 267.82), (359.23, 116.82, 576.6, 475.89), (374.37, 116.78, 557.19, 254.07), (466.91, 0.72, 638.7, 117.03), (266.96, 433.88, 291.04, 476.78), (466.52, 349.26, 525.87, 405.73), (350.64, 272.64, 572.11, 476.47)] [0.5868, 0.3726, 0.368, 0.3642, 0.3615, 0.3482, 0.3791, 0.3752, 0.3754] [bathroom, toddler, hair, toddler, hair, bathroom accessory, hairbrush, diaper, chair]
2 coco_minitrain_25k/images/val2017/000000202339.jpg bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk [(73.28, 256.73, 135.63, 374.42), (103.57, 105.25, 267.71, 410.16), (98.31, 33.84, 271.81, 434.73), (203.72, 63.87, 463.35, 298.32), (147.5, 106.62, 163.5, 172.89), (164.11, 52.92, 272.88, 152.69), (0.49, 0.76, 82.85, 333.41), (1.95, 2.21, 477.75, 636.09), (1.75, 259.16, 478.42, 637.17), (398.16, 281.21, 479.01, 545.03), (147.03, 106.65, 163.66, 227.85)] [0.6049, 0.4596, 0.4694, 0.4096, 0.3593, 0.4455, 0.4377, 0.3448, 0.3001, 0.3044, 0.4768] [shopping bag, business suit, man, city bus, tie, sign, bus, bus station, walk, carry, tie]
3 coco_minitrain_25k/images/val2017/000000460929.jpg beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap [(288.12, 1.02, 423.48, 414.45), (178.93, 355.73, 327.0, 636.63), (214.55, 514.74, 280.53, 569.35), (234.04, 369.95, 286.26, 545.45), (1.41, 0.72, 478.31, 279.28), (5.18, 264.36, 476.5, 637.38), (170.21, 351.45, 356.89, 637.53), (18.26, 244.39, 415.3, 637.95), (211.98, 364.96, 287.85, 629.12), (295.22, 275.45, 399.62, 353.16), (1.46, 79.69, 477.83, 234.29)] [0.5651, 0.3984, 0.3905, 0.3876, 0.3097, 0.3791, 0.3969, 0.3423, 0.4304, 0.3005, 0.3383] [beer bottle, hot dog, mustard, tomato sauce, car, picnic table, hot dog, tinfoil, chili dog, condiment, car]
4 coco_minitrain_25k/images/val2017/000000181796.jpg bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass [(105.15, 0.56, 239.98, 190.18), (214.13, 60.52, 298.47, 154.0), (163.61, 136.45, 501.68, 358.81), (495.3, 47.58, 553.59, 98.79), (520.29, 27.48, 564.41, 58.55), (402.74, 177.07, 594.84, 222.6), (136.47, 45.8, 226.12, 98.31), (478.4, 31.41, 524.09, 67.48), (349.21, 27.77, 610.12, 119.72), (364.94, 264.18, 470.49, 335.62), (1.75, 1.75, 637.54, 358.17), (311.01, 48.55, 509.34, 102.32), (359.97, 245.94, 425.25, 299.72), (359.62, 0.35, 517.83, 28.28), (94.78, 165.47, 228.95, 209.81), (532.25, 0.4, 639.08, 80.23), (404.16, 156.78, 638.89, 344.61), (202.34, 144.76, 380.58, 285.75), (179.9, 138.23, 482.39, 358.21)] [0.8581, 0.6903, 0.5242, 0.5105, 0.5285, 0.4578, 0.7963, 0.4057, 0.4656, 0.4421, 0.5056, 0.3742, 0.3959, 0.4158, 0.3858, 0.3345, 0.377, 0.3035, 0.3196] [wine glass, cup, plate, cup, cup, fork, wine, cup, plate, vegetable, table, utensil, potato, platter, utensil, platter, silverware, meat, meal]
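
Judging by the output above, the three new columns hold parallel lists: the i-th box, score and label in a row describe the same detection. The sketch below pairs them up and keeps only the detections above a score threshold, using the first row of the table. filter_detections and the 0.4 threshold are illustrative choices, not part of fastdup, and we assume the boxes are (x_min, y_min, x_max, y_max) in pixels, which is what the values suggest.

```python
# First row of the enriched DataFrame, copied from the table above.
row = {
    "grounding_dino_bboxes": [
        (94.35, 479.79, 236.6, 589.37),
        (4.91, 3.74, 475.2, 637.34),
        (95.94, 514.92, 376.52, 638.46),
        (41.88, 37.47, 425.05, 637.11),
        (115.28, 602.27, 164.16, 635.21),
    ],
    "grounding_dino_scores": [0.5791, 0.389, 0.4436, 0.3011, 0.36],
    "grounding_dino_labels": ["bath", "bathroom", "floor", "glass door", "drain"],
}


def filter_detections(row, min_score):
    """Keep only the detections whose score is at least min_score."""
    # Hypothetical helper for illustration; not part of the fastdup API.
    detections = zip(
        row["grounding_dino_bboxes"],
        row["grounding_dino_scores"],
        row["grounding_dino_labels"],
    )
    return [(box, score, label) for box, score, label in detections if score >= min_score]


confident = filter_detections(row, min_score=0.4)
for box, score, label in confident:
    print(f"{label}: {score:.2f} at {box}")
```

With a DataFrame, the same helper could be applied row by row, for example with df.apply(..., axis=1).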

Now let's plot the results of the enrichment using the plot_annotations function.

from fastdup.models_utils import plot_annotations

plot_annotations(df, 
                 image_col='filename',                # column specifying image filenames
                 tags_col='ram_tags',                 # column specifying image labels
                 bbox_col='grounding_dino_bboxes',    # column specifying bounding boxes
                 scores_col='grounding_dino_scores',  # column specifying label scores
                 labels_col='grounding_dino_labels',  # column specifying label text
                 num_rows=10                          # the number of rows in the dataframe to plot
)                         

Search for Specific Objects with Custom Text Prompt

Suppose you'd like to search for specific objects in your dataset. You can create a column in the DataFrame specifying the objects of interest and run the .enrich method on it.

Let's create a column in our DataFrame and name it custom_prompt.

df["custom_prompt"] = "face . eye . hair . "
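
If your objects of interest live in a Python list, you can assemble the prompt string programmatically. The helper below is a small sketch (build_prompt is a hypothetical name, not a fastdup function) that joins the names with the " . " separator and appends the trailing separator used in the prompt above.

```python
def build_prompt(objects):
    """Join object names into a ' . '-separated text prompt."""
    # Hypothetical helper for illustration; not part of the fastdup API.
    cleaned = [name.strip() for name in objects if name.strip()]
    if not cleaned:
        return ""
    # Mirror the format of the custom prompt used in this notebook,
    # including the trailing " . ".
    return " . ".join(cleaned) + " . "


prompt = build_prompt(["face", "eye", "hair"])
print(repr(prompt))
```

The result can then be assigned to the column in the same way, e.g. df["custom_prompt"] = prompt.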

Now let's run the enrichment with the custom prompt column.

df = fd.enrich(task='zero-shot-detection',  
               model='grounding-dino',  
               input_df=df,  
               input_col='custom_prompt'  
     )

df
filename ram_tags grounding_dino_bboxes grounding_dino_scores grounding_dino_labels custom_prompt
0 coco_minitrain_25k/images/val2017/000000382734.jpg bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white [] [] [] face . eye . hair .
1 coco_minitrain_25k/images/val2017/000000508730.jpg baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy [(111.59, 183.91, 211.05, 300.08), (373.02, 117.49, 557.49, 255.62), (429.51, 205.88, 512.16, 275.5), (68.17, 105.79, 267.42, 265.81), (167.08, 234.47, 190.87, 247.09), (486.49, 222.14, 503.53, 232.73), (121.71, 238.6, 144.88, 249.47), (449.39, 223.3, 466.83, 232.75)] [0.5336, 0.5664, 0.4492, 0.5795, 0.4003, 0.3643, 0.3969, 0.3435] [face, hair, face, hair, eye, eye, eye, eye] face . eye . hair .
2 coco_minitrain_25k/images/val2017/000000202339.jpg bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk [(135.15, 44.16, 172.74, 96.73), (133.59, 34.84, 179.46, 67.79), (153.42, 59.65, 163.44, 66.16)] [0.504, 0.367, 0.3161] [face, hair, eye] face . eye . hair .
3 coco_minitrain_25k/images/val2017/000000460929.jpg beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap [] [] [] face . eye . hair .
4 coco_minitrain_25k/images/val2017/000000181796.jpg bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass [] [] [] face . eye . hair .

Not all images contain "face", "eye" and "hair". Let's remove the rows with no detections and keep only the rows with detections.

df = df[df['grounding_dino_labels'].astype(bool)]
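
The filter above works because an empty list is falsy in Python, so .astype(bool) maps entries with no detections to False (assuming the column holds Python lists, as the output suggests). The plain-Python sketch below illustrates the same idea on label lists shortened from the table above.

```python
# Labels per image: an empty list means nothing matched the prompt.
labels_per_image = [
    [],
    ["face", "hair", "face", "hair", "eye", "eye", "eye", "eye"],
    ["face", "hair", "eye"],
    [],
    [],
]

# bool([]) is False and bool() of any non-empty list is True,
# which is what the boolean mask in the DataFrame filter relies on.
mask = [bool(labels) for labels in labels_per_image]
kept = [labels for labels, keep in zip(labels_per_image, mask) if keep]

print(mask)
print(len(kept), "images have at least one detection")
```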

And plot the results.

plot_annotations(df, 
                 image_col='filename', 
                 tags_col='custom_prompt', 
                 bbox_col='grounding_dino_bboxes', 
                 scores_col='grounding_dino_scores', 
                 labels_col='grounding_dino_labels', 
                 num_rows=10
)
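
To get a quick overview of what the custom prompt found, you can also count how often each label appears. The sketch below uses collections.Counter on the label lists of the two images with detections shown earlier; in practice you would iterate over df['grounding_dino_labels'] instead of the hard-coded lists.

```python
from collections import Counter

# Labels of the two images with detections, copied from the output above.
labels_per_image = [
    ["face", "hair", "face", "hair", "eye", "eye", "eye", "eye"],
    ["face", "hair", "eye"],
]

# Flatten the per-image lists and count each label.
label_counts = Counter(label for labels in labels_per_image for label in labels)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```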

2. Inference on a single image

fastdup also provides an easy way to load the Grounding DINO model and run inference on a single image.

Let's suppose we have the following image and would like to run an inference with the Grounding DINO model.

from IPython.display import Image
Image("coco_minitrain_25k/images/val2017/000000449996.jpg")

Import the GroundingDINO class and provide it with an image-text input pair.

from fastdup.models_grounding_dino import GroundingDINO

model = GroundingDINO()
results = model.run_inference(image_path="coco_minitrain_25k/images/val2017/000000449996.jpg",
                              text_prompt="air field . airliner . plane . airport . airport runway . airport terminal . jet . land . park . raceway . sky . tarmac . terminal",
                              box_threshold=0.3,
                              text_threshold=0.25
          )

📘

Note

Text prompts must be separated with " . " (a period with a space on each side).

By default, fastdup uses the smaller variant of Grounding DINO (Swin-T backbone).

The results variable contains a dict with labels, scores and bounding boxes.

{'labels': ['sky',
  'airport terminal',
  'plane',
  'airliner',
  'jet',
  'airport terminal',
  'jet',
  'tarmac'],
 'scores': [0.5281, 0.3444, 0.3824, 0.4883, 0.386, 0.3005, 0.3512, 0.3034],
 'boxes': [(1.47, 1.45, 638.46, 241.38),
  (329.36, 291.55, 468.11, 319.69),
  (142.03, 247.3, 261.97, 296.54),
  (443.6, 111.93, 495.47, 130.84),
  (113.85, 290.28, 246.56, 340.23),
  (518.36, 291.7, 638.48, 324.26),
  (391.59, 271.71, 465.11, 295.5),
  (2.34, 277.73, 637.63, 425.31)]}
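
The same label can appear more than once in the output (for example 'jet' and 'airport terminal' above). As a rough sketch of how you might post-process the dict, the snippet below keeps only the highest-scoring box per label. best_per_label is a hypothetical helper, not part of fastdup, and the results dict is copied from the output above.

```python
results = {
    "labels": ["sky", "airport terminal", "plane", "airliner",
               "jet", "airport terminal", "jet", "tarmac"],
    "scores": [0.5281, 0.3444, 0.3824, 0.4883, 0.386, 0.3005, 0.3512, 0.3034],
    "boxes": [
        (1.47, 1.45, 638.46, 241.38),
        (329.36, 291.55, 468.11, 319.69),
        (142.03, 247.3, 261.97, 296.54),
        (443.6, 111.93, 495.47, 130.84),
        (113.85, 290.28, 246.56, 340.23),
        (518.36, 291.7, 638.48, 324.26),
        (391.59, 271.71, 465.11, 295.5),
        (2.34, 277.73, 637.63, 425.31),
    ],
}


def best_per_label(results):
    """Return {label: (score, box)}, keeping the highest-scoring box per label."""
    # Hypothetical helper for illustration; not part of the fastdup API.
    best = {}
    for label, score, box in zip(results["labels"], results["scores"], results["boxes"]):
        if label not in best or score > best[label][0]:
            best[label] = (score, box)
    return best


best = best_per_label(results)
# Print the labels from the most to the least confident detection.
for label, (score, box) in sorted(best.items(), key=lambda item: item[1][0], reverse=True):
    print(f"{label}: {score:.4f} {box}")
```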

Let's plot the image and results using the annotate_image convenience function.

from fastdup.models_utils import annotate_image

annotate_image("coco_minitrain_25k/images/val2017/000000449996.jpg", results)

You can optionally load another variant of Grounding DINO (Swin-B backbone) from the official Grounding DINO repo.

Download the weights and config into your local directory and pass them as arguments to the GroundingDINO constructor.

model = GroundingDINO(model_config="GroundingDINO_SwinB_cfg.py", 
                      model_weights="groundingdino_swinb_cogcoor.pth")
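
Before constructing the model, it can be handy to confirm that both files are actually in place. The check below is a minimal sketch using only the standard library; missing_files is a hypothetical helper, and the file names are the ones used in the call above, assumed to sit in the current working directory.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    # Hypothetical helper for illustration; not part of the fastdup API.
    return [p for p in paths if not Path(p).is_file()]


required = ["GroundingDINO_SwinB_cfg.py", "groundingdino_swinb_cogcoor.pth"]
missing = missing_files(required)

if missing:
    print("Download these files first:", ", ".join(missing))
else:
    print("Config and weights found.")
```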

Convert Annotations to COCO Format

Once the enrichment is complete, you can also conveniently export the DataFrame into the COCO .json annotation format. For now, only the bounding boxes and labels are exported. Masks will be added in a future release.

from fastdup.models_utils import export_to_coco

export_to_coco(df, 
               bbox_col='grounding_dino_bboxes', 
               label_col='grounding_dino_labels', 
               json_filename='grounding_dino_annot_coco_format.json'
)
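
For reference, the COCO annotation format stores each bounding box as [x, y, width, height], with (x, y) at the top-left corner, alongside "images", "annotations" and "categories" lists. The sketch below shows how boxes in (x_min, y_min, x_max, y_max) form could be mapped into that layout. It is an illustration under stated assumptions, not the implementation of export_to_coco: xyxy_to_xywh and to_coco are hypothetical helpers, the input box format is assumed from the values shown earlier, and the image width and height fields are omitted for brevity.

```python
import json


def xyxy_to_xywh(box):
    """Convert (x_min, y_min, x_max, y_max) to COCO-style [x, y, width, height]."""
    x_min, y_min, x_max, y_max = box
    return [round(x_min, 2), round(y_min, 2),
            round(x_max - x_min, 2), round(y_max - y_min, 2)]


def to_coco(records):
    """Build a minimal COCO-style dict from (filename, boxes, labels) tuples."""
    # Hypothetical helper for illustration; not the fastdup exporter.
    categories = {}  # label name -> category id, in order of first appearance
    images, annotations = [], []
    for image_id, (filename, boxes, labels) in enumerate(records, start=1):
        images.append({"id": image_id, "file_name": filename})
        for box, label in zip(boxes, labels):
            category_id = categories.setdefault(label, len(categories) + 1)
            x, y, w, h = xyxy_to_xywh(box)
            annotations.append({
                "id": len(annotations) + 1,
                "image_id": image_id,
                "category_id": category_id,
                "bbox": [x, y, w, h],
                "area": round(w * h, 2),
                "iscrowd": 0,
            })
    return {
        "images": images,
        "annotations": annotations,
        "categories": [{"id": cid, "name": name} for name, cid in categories.items()],
    }


# One image with its custom-prompt detections, copied from the table earlier.
records = [(
    "coco_minitrain_25k/images/val2017/000000202339.jpg",
    [(135.15, 44.16, 172.74, 96.73), (133.59, 34.84, 179.46, 67.79), (153.42, 59.65, 163.44, 66.16)],
    ["face", "hair", "eye"],
)]

coco = to_coco(records)
print(json.dumps(coco, indent=2))
```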

Wrap Up

In this tutorial, we showed how you can run zero-shot image detection models to enrich your dataset.

This notebook is Part 2 of the dataset enrichment notebook series where we utilize various zero-shot models to enrich datasets.

  • Part 1 - Dataset Enrichment with Zero-Shot Classification Models
  • Part 2 - Dataset Enrichment with Zero-Shot Detection Models
  • Part 3 - Dataset Enrichment with Zero-Shot Segmentation Models

👍

Next Up

Try out the Google Colab and Kaggle notebooks to reproduce this example.

Also, check out Part 3 of the series, where we explore how to enrich your dataset using zero-shot segmentation models. See you there!

Questions about this tutorial? Reach out to us on our Slack channel!

VL Profiler - A faster and easier way to diagnose and visualize dataset issues

The team behind fastdup also recently launched VL Profiler, a no-code cloud-based platform that lets you leverage fastdup in the browser.

VL Profiler lets you find:

  • Duplicates/near-duplicates.
  • Outliers.
  • Mislabels.
  • Non-useful images.

Here's a highlight of the issues found in the RVL-CDIP test dataset on the VL Profiler.

👍

Free Usage

Use VL Profiler for free to analyze issues on your dataset with up to 1,000,000 images.

Get started for free.

Not convinced yet?

Interact with a collection of datasets like ImageNet-21K, COCO, and DeepFashion here.

No sign-ups needed.