Metadata Enrichment with Zero-Shot Detection Models
Enrich your visual data with zero-shot image detection models such as Grounding DINO and more to come.
This notebook is Part 2 of the dataset enrichment notebook series where we utilize various zero-shot models to enrich datasets.
- Part 1 - Dataset Enrichment with Zero-Shot Classification Models
- Part 2 - Dataset Enrichment with Zero-Shot Detection Models
- Part 3 - Dataset Enrichment with Zero-Shot Segmentation Models
If you haven't checked out Part 1, we highly encourage you to go through it first before proceeding with this notebook.
Purpose
In this notebook, we show an end-to-end example of how you can enrich the metadata of your visual data using open-source zero-shot models such as Grounding DINO, building on the output we obtained from Part 1.
By the end of the notebook, you'll learn how to:
- Install and load the Grounding DINO model in fastdup.
- Enrich your dataset with bounding boxes and labels generated by the Grounding DINO model.
- Run inference with Grounding DINO on a single image.
- Specify a custom prompt to search for objects of interest in your dataset.
- Export the enriched dataset into the COCO `.json` format.
Installation
First, let's install the necessary packages:
- fastdup - To analyze issues in the dataset.
- mmengine, mmdet, groundingdino-py - To run the Grounding DINO model through MMDetection.
- gdown - To download demo data hosted on Google Drive.
Run the following to install all the above packages.
pip install -Uq fastdup mmengine mmdet groundingdino-py gdown
Test the installation. If there's no error message, we are ready to go.
import fastdup
fastdup.__version__
'1.57'
CUDA Runtime
fastdup runs perfectly on CPUs, but larger models like Grounding DINO run much slower on CPU than on GPU.
The code in this notebook can run on either CPU or GPU.
But we highly recommend running it in a CUDA-enabled environment to reduce the run time. Running this notebook in Google Colab or Kaggle is a good start!
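If you're unsure which device you're on, a quick check with PyTorch (pulled in as a dependency of the detection packages above) tells you whether a CUDA device is visible:

import torch

# Check whether a CUDA-capable GPU is visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))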
Download Dataset
Download the coco-minitrain dataset - a curated mini-training set consisting of 25,000 images (roughly 20% of the COCO 2017 training set) and their annotations.
Run the following to download and unzip the dataset:
gdown --fuzzy https://drive.google.com/file/d/1iSXVTlkV1_DhdYpVDqsjlT4NJFQ7OkyK/view
unzip -qq coco_minitrain_25k.zip
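As an optional sanity check, you can count the extracted images with plain Python before running anything heavier; you should see roughly the 25,000 images mentioned above:

from pathlib import Path

# Count the extracted images to confirm the unzip succeeded.
image_count = sum(1 for _ in Path("coco_minitrain_25k/images").rglob("*.jpg"))
print(f"{image_count} images found")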
Zero-Shot Detection with Grounding DINO
Apart from zero-shot recognition models, fastdup also supports zero-shot detection models like Grounding DINO (and more to come).
Grounding DINO is a powerful open-set zero-shot detection model. It accepts image-text pairs as input and outputs bounding boxes, labels, and confidence scores for the objects described in the text.
1. Inference on multiple images
In Part 1 of the enrichment notebook series, we utilized zero-shot image tagging models such as the Recognize Anything Model (RAM) and ran inference over the images in our dataset.
We ended up with a DataFrame consisting of `filename` and `ram_tags` columns, as follows.
 | filename | ram_tags |
---|---|---|
0 | coco_minitrain_25k/images/val2017/000000382734.jpg | bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white |
1 | coco_minitrain_25k/images/val2017/000000508730.jpg | baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy |
2 | coco_minitrain_25k/images/val2017/000000202339.jpg | bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk |
3 | coco_minitrain_25k/images/val2017/000000460929.jpg | beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap |
4 | coco_minitrain_25k/images/val2017/000000181796.jpg | bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass |
If you'd like to reproduce the above DataFrame, the Part 1 notebook details the code you need to run.
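Alternatively, if you saved the Part 1 output to disk, you can reload it directly. This is a minimal sketch assuming you exported the DataFrame with df.to_csv('ram_tags.csv', index=False) in Part 1; the file name is hypothetical.

import pandas as pd

# Reload the Part 1 output (hypothetical file name).
df = pd.read_csv('ram_tags.csv')
df.head()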
We can now use the image tags from the above DataFrame in combination with Grounding DINO to further enrich the dataset with bounding boxes.
To run the enrichment on a DataFrame, use the `fd.enrich` method and specify `model='grounding-dino'`. By default, fastdup loads the smaller Grounding DINO variant (Swin-T backbone) for enrichment.
Also specify the DataFrame to run the enrichment on and the name of the column to use as input to the Grounding DINO model. In this example, we take the text prompt from the `ram_tags` column which we computed earlier.
fd = fastdup.create(input_dir='./coco_minitrain_25k')
df = fd.enrich(task='zero-shot-detection',
model='grounding-dino',
input_df=df,
input_col='ram_tags'
)
More on fd.enrich
The `fd.enrich` method enriches an input DataFrame by applying a specified model to perform a specific task. It currently supports the following parameters:

Parameter | Type | Description | Optional | Default
---|---|---|---|---
`task` | `str` | The task to be performed. Supports `"zero-shot-classification"`, `"zero-shot-detection"`, or `"zero-shot-segmentation"` as argument. | No | -
`model` | `str` | The model to be used. Supports `"grounding-dino"` if `task=='zero-shot-detection'`. | No | -
`input_df` | `DataFrame` | The Pandas DataFrame containing the data to be enriched. | No | -
`input_col` | `str` | The name of the column in `input_df` to be used as input for the model. | No | -
`num_rows` | `int` | Number of rows from the top of `input_df` to be processed. If not specified, all rows are processed. | Yes | `None`
`device` | `str` | The device used to run inference. Supports `'cpu'` or `'cuda'` as argument. Defaults to the available device if not specified. | Yes | `None`
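For example, using the optional parameters from the table above, you could enrich only the first 100 rows and force GPU inference:

# Enrich only the first 100 rows and run inference on GPU.
df_subset = fd.enrich(task='zero-shot-detection',
                      model='grounding-dino',
                      input_df=df,
                      input_col='ram_tags',
                      num_rows=100,
                      device='cuda')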
Once done, you'll notice that three new columns are appended to the DataFrame, namely `grounding_dino_bboxes`, `grounding_dino_scores`, and `grounding_dino_labels`.
 | filename | ram_tags | grounding_dino_bboxes | grounding_dino_scores | grounding_dino_labels |
---|---|---|---|---|---|
0 | coco_minitrain_25k/images/val2017/000000382734.jpg | bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white | [(94.35, 479.79, 236.6, 589.37), (4.91, 3.74, 475.2, 637.34), (95.94, 514.92, 376.52, 638.46), (41.88, 37.47, 425.05, 637.11), (115.28, 602.27, 164.16, 635.21)] | [0.5791, 0.389, 0.4436, 0.3011, 0.36] | [bath, bathroom, floor, glass door, drain] |
1 | coco_minitrain_25k/images/val2017/000000508730.jpg | baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy | [(3.56, 2.74, 635.14, 475.59), (30.9, 104.9, 301.76, 476.29), (68.56, 105.02, 266.26, 267.82), (359.23, 116.82, 576.6, 475.89), (374.37, 116.78, 557.19, 254.07), (466.91, 0.72, 638.7, 117.03), (266.96, 433.88, 291.04, 476.78), (466.52, 349.26, 525.87, 405.73), (350.64, 272.64, 572.11, 476.47)] | [0.5868, 0.3726, 0.368, 0.3642, 0.3615, 0.3482, 0.3791, 0.3752, 0.3754] | [bathroom, toddler, hair, toddler, hair, bathroom accessory, hairbrush, diaper, chair] |
2 | coco_minitrain_25k/images/val2017/000000202339.jpg | bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk | [(73.28, 256.73, 135.63, 374.42), (103.57, 105.25, 267.71, 410.16), (98.31, 33.84, 271.81, 434.73), (203.72, 63.87, 463.35, 298.32), (147.5, 106.62, 163.5, 172.89), (164.11, 52.92, 272.88, 152.69), (0.49, 0.76, 82.85, 333.41), (1.95, 2.21, 477.75, 636.09), (1.75, 259.16, 478.42, 637.17), (398.16, 281.21, 479.01, 545.03), (147.03, 106.65, 163.66, 227.85)] | [0.6049, 0.4596, 0.4694, 0.4096, 0.3593, 0.4455, 0.4377, 0.3448, 0.3001, 0.3044, 0.4768] | [shopping bag, business suit, man, city bus, tie, sign, bus, bus station, walk, carry, tie] |
3 | coco_minitrain_25k/images/val2017/000000460929.jpg | beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap | [(288.12, 1.02, 423.48, 414.45), (178.93, 355.73, 327.0, 636.63), (214.55, 514.74, 280.53, 569.35), (234.04, 369.95, 286.26, 545.45), (1.41, 0.72, 478.31, 279.28), (5.18, 264.36, 476.5, 637.38), (170.21, 351.45, 356.89, 637.53), (18.26, 244.39, 415.3, 637.95), (211.98, 364.96, 287.85, 629.12), (295.22, 275.45, 399.62, 353.16), (1.46, 79.69, 477.83, 234.29)] | [0.5651, 0.3984, 0.3905, 0.3876, 0.3097, 0.3791, 0.3969, 0.3423, 0.4304, 0.3005, 0.3383] | [beer bottle, hot dog, mustard, tomato sauce, car, picnic table, hot dog, tinfoil, chili dog, condiment, car] |
4 | coco_minitrain_25k/images/val2017/000000181796.jpg | bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass | [(105.15, 0.56, 239.98, 190.18), (214.13, 60.52, 298.47, 154.0), (163.61, 136.45, 501.68, 358.81), (495.3, 47.58, 553.59, 98.79), (520.29, 27.48, 564.41, 58.55), (402.74, 177.07, 594.84, 222.6), (136.47, 45.8, 226.12, 98.31), (478.4, 31.41, 524.09, 67.48), (349.21, 27.77, 610.12, 119.72), (364.94, 264.18, 470.49, 335.62), (1.75, 1.75, 637.54, 358.17), (311.01, 48.55, 509.34, 102.32), (359.97, 245.94, 425.25, 299.72), (359.62, 0.35, 517.83, 28.28), (94.78, 165.47, 228.95, 209.81), (532.25, 0.4, 639.08, 80.23), (404.16, 156.78, 638.89, 344.61), (202.34, 144.76, 380.58, 285.75), (179.9, 138.23, 482.39, 358.21)] | [0.8581, 0.6903, 0.5242, 0.5105, 0.5285, 0.4578, 0.7963, 0.4057, 0.4656, 0.4421, 0.5056, 0.3742, 0.3959, 0.4158, 0.3858, 0.3345, 0.377, 0.3035, 0.3196] | [wine glass, cup, plate, cup, cup, fork, wine, cup, plate, vegetable, table, utensil, potato, platter, utensil, platter, silverware, meat, meal] |
Now let's plot the results of the enrichment using the `plot_annotations` function.
from fastdup.models_utils import plot_annotations
plot_annotations(df,
image_col='filename', # column specifying image filenames
tags_col='ram_tags', # column specifying image labels
bbox_col='grounding_dino_bboxes', # column specifying bounding boxes
scores_col='grounding_dino_scores', # column specifying label scores
labels_col='grounding_dino_labels', # column specifying label text
num_rows=10 # the number of rows in the dataframe to plot
)
Search for Specific Objects with Custom Text Prompt
Suppose you'd like to search for specific objects in your dataset. You can create a column in the DataFrame specifying the objects of interest and run the `.enrich` method.
Let's create a column in our DataFrame and name it `custom_prompt`.
df["custom_prompt"] = "face . eye . hair . "
Now let's run the enrichment with the custom prompt column.
df = fd.enrich(task='zero-shot-detection',
model='grounding-dino',
input_df=df,
input_col='custom_prompt'
)
df
 | filename | ram_tags | grounding_dino_bboxes | grounding_dino_scores | grounding_dino_labels | custom_prompt |
---|---|---|---|---|---|---|
0 | coco_minitrain_25k/images/val2017/000000382734.jpg | bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white | [] | [] | [] | face . eye . hair . |
1 | coco_minitrain_25k/images/val2017/000000508730.jpg | baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy | [(111.59, 183.91, 211.05, 300.08), (373.02, 117.49, 557.49, 255.62), (429.51, 205.88, 512.16, 275.5), (68.17, 105.79, 267.42, 265.81), (167.08, 234.47, 190.87, 247.09), (486.49, 222.14, 503.53, 232.73), (121.71, 238.6, 144.88, 249.47), (449.39, 223.3, 466.83, 232.75)] | [0.5336, 0.5664, 0.4492, 0.5795, 0.4003, 0.3643, 0.3969, 0.3435] | [face, hair, face, hair, eye, eye, eye, eye] | face . eye . hair . |
2 | coco_minitrain_25k/images/val2017/000000202339.jpg | bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk | [(135.15, 44.16, 172.74, 96.73), (133.59, 34.84, 179.46, 67.79), (153.42, 59.65, 163.44, 66.16)] | [0.504, 0.367, 0.3161] | [face, hair, eye] | face . eye . hair . |
3 | coco_minitrain_25k/images/val2017/000000460929.jpg | beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap | [] | [] | [] | face . eye . hair . |
4 | coco_minitrain_25k/images/val2017/000000181796.jpg | bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass | [] | [] | [] | face . eye . hair . |
Not all images contain "face", "eye", and "hair". Let's remove the rows with no detections and keep only the rows with detections.
df = df[df['grounding_dino_labels'].astype(bool)]  # an empty list is falsy, so rows with no detections are dropped
And plot the results.
plot_annotations(df,
image_col='filename',
tags_col='custom_prompt',
bbox_col='grounding_dino_bboxes',
scores_col='grounding_dino_scores',
labels_col='grounding_dino_labels',
num_rows=10
)
2. Inference on a single image
fastdup provides an easy way to load the Grounding DINO model and run inference.
Suppose we have the following image and would like to run inference with the Grounding DINO model.
from IPython.display import Image
Image("coco_minitrain_25k/images/val2017/000000449996.jpg")
You'll have to import the module and provide it with an image-text input pair.
from fastdup.models_grounding_dino import GroundingDINO
model = GroundingDINO()
results = model.run_inference(image_path="coco_minitrain_25k/images/val2017/000000449996.jpg",
text_prompt="air field . airliner . plane . airport . airport runway . airport terminal . jet . land . park . raceway . sky . tarmac . terminal",
box_threshold=0.3,
text_threshold=0.25
)
Note: Text prompts must be separated with `" . "`.
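If your objects of interest live in a Python list, a plain join produces a correctly separated prompt:

# Build a " . "-separated text prompt from a list of terms.
objects_of_interest = ["face", "eye", "hair"]
text_prompt = " . ".join(objects_of_interest)  # 'face . eye . hair'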
By default, fastdup uses the smaller variant of Grounding DINO (Swin-T backbone).
The `results` variable contains a `dict` with labels, scores, and bounding boxes.
{'labels': ['sky',
'airport terminal',
'plane',
'airliner',
'jet',
'airport terminal',
'jet',
'tarmac'],
'scores': [0.5281, 0.3444, 0.3824, 0.4883, 0.386, 0.3005, 0.3512, 0.3034],
'boxes': [(1.47, 1.45, 638.46, 241.38),
(329.36, 291.55, 468.11, 319.69),
(142.03, 247.3, 261.97, 296.54),
(443.6, 111.93, 495.47, 130.84),
(113.85, 290.28, 246.56, 340.23),
(518.36, 291.7, 638.48, 324.26),
(391.59, 271.71, 465.11, 295.5),
(2.34, 277.73, 637.63, 425.31)]}
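Since `results` is a plain `dict`, you can post-process it with ordinary Python. For instance, here's a small sketch that keeps only detections above a stricter confidence threshold:

# Keep only detections scoring at least 0.4.
threshold = 0.4
for label, score, box in zip(results['labels'], results['scores'], results['boxes']):
    if score >= threshold:
        print(f"{label}: {score:.2f} at {box}")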
Let's plot the image and results using the `annotate_image` convenience function.
from fastdup.models_utils import annotate_image
annotate_image("coco_minitrain_25k/images/val2017/000000449996.jpg", results)
You can optionally load the larger variant of Grounding DINO (Swin-B backbone) from the official Grounding DINO repo.
Download the weights and config into your local directory and pass them as arguments to the `GroundingDINO` constructor.
model = GroundingDINO(model_config="GroundingDINO_SwinB_cfg.py",
model_weights="groundingdino_swinb_cogcoor.pth")
Convert Annotations to COCO Format
Once the enrichment is complete, you can also conveniently export the DataFrame into the COCO .json annotation format. For now, only the bounding boxes and labels are exported. Masks will be added in a future release.
from fastdup.models_utils import export_to_coco
export_to_coco(df,
bbox_col='grounding_dino_bboxes',
label_col='grounding_dino_labels',
json_filename='grounding_dino_annot_coco_format.json'
)
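Assuming the exported file follows the standard COCO layout (top-level "images", "annotations", and "categories" keys), you can run a quick sanity check on it:

import json

# Load the exported annotations and print a summary.
with open('grounding_dino_annot_coco_format.json') as f:
    coco = json.load(f)
print(len(coco['images']), 'images,',
      len(coco['annotations']), 'annotations,',
      len(coco['categories']), 'categories')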
Wrap Up
In this tutorial, we showed how you can run zero-shot image detection models to enrich your dataset.
This notebook is Part 2 of the dataset enrichment notebook series where we utilize various zero-shot models to enrich datasets.
- Part 1 - Dataset Enrichment with Zero-Shot Classification Models
- Part 2 - Dataset Enrichment with Zero-Shot Detection Models
- Part 3 - Dataset Enrichment with Zero-Shot Segmentation Models
Next Up
Try out the Google Colab and Kaggle notebooks to reproduce this example.
Also, check out Part 3 of the series where we explore how to enrich your dataset with segmentation masks using zero-shot segmentation models. See you there!
Questions about this tutorial? Reach out to us on our Slack channel!
VL Profiler - A faster and easier way to diagnose and visualize dataset issues
The team behind fastdup also recently launched VL Profiler, a no-code cloud-based platform that lets you leverage fastdup in the browser.
VL Profiler lets you find:
- Duplicates/near-duplicates.
- Outliers.
- Mislabels.
- Non-useful images.
Here's a highlight of the issues found in the RVL-CDIP test dataset on the VL Profiler.
Free Usage
Use VL Profiler for free to analyze issues on your dataset with up to 1,000,000 images.
Not convinced yet?
Interact with a collection of datasets like ImageNet-21K, COCO, and DeepFashion here.
No sign-ups needed.