Hugging Face Datasets
Analyze any computer vision dataset from Hugging Face Datasets. In this tutorial, we analyze an image classification dataset for duplicates/near-duplicates, outliers, and potential mislabels.
Installation
First, let's install the necessary packages to run this tutorial.
- fastdup - Analyze issues in the dataset.
- datasets - Pull datasets from Hugging Face Datasets.
pip install fastdup datasets
Now, test the installation. If there's no error message, we are ready to go.
import fastdup
fastdup.__version__
'1.41'
Info: There are over 50,000 openly available datasets on Hugging Face Datasets. Some datasets are gated and can only be downloaded with user credentials.
Load Dataset
The Hugging Face datasets package provides an easy interface to load any dataset from the Hugging Face platform. On top of the package, fastdup provides a wrapper class, FastdupHFDataset, as a connector to ensure the datasets package works seamlessly within fastdup.
The FastdupHFDataset class works the same way as the load_dataset function. You can import the wrapper class and specify the name of the Hugging Face Datasets repository as the first argument.
In this example, we load the Tiny ImageNet dataset, whose train split contains 100,000 images across 200 classes, downsized to 64×64 color images. Each class has 500 training images, 50 validation images, and 50 test images.
In the following code, we load the train split of the Tiny ImageNet dataset.
from fastdup.datasets import FastdupHFDataset
dataset = FastdupHFDataset("zh-plus/tiny-imagenet")
Tip: Optional parameters for the FastdupHFDataset class:
- split - Which split to download. Default: 'train'.
- img_key - The key for the dataset column containing images. Default: 'image'.
- label_key - The key for the dataset column containing labels. Default: 'label'.
- cache_dir - Where to cache the downloaded dataset. Default: '/root/.cache/huggingface/datasets/'.
- jpg_save_dir - Which folder to store the jpg images. Default: 'jpg_images'.
- reconvert_jpg - Flag to force reconversion of images from .parquet to .jpg. Default: False.
See implementation for details.
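As a rough illustration of the optional parameters listed above, the sketch below passes them explicitly with their documented defaults. The helper names build_options and load_tiny_imagenet are hypothetical and not part of fastdup. The fastdup import is deferred so that the helpers can be defined without the package installed; cache_dir is left out so that the library default applies.

```python
# Defaults copied from the parameter list above (cache_dir intentionally omitted).
DEFAULTS = {
    "split": "train",
    "img_key": "image",
    "label_key": "label",
    "jpg_save_dir": "jpg_images",
    "reconvert_jpg": False,
}


def build_options(**overrides):
    """Hypothetical helper: merge caller overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS) - {"cache_dir"}
    if unknown:
        raise TypeError(f"Unknown option(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def load_tiny_imagenet(**overrides):
    """Hypothetical helper: load Tiny ImageNet with explicit keyword arguments."""
    # Imported lazily; calling this function requires fastdup and network access.
    from fastdup.datasets import FastdupHFDataset

    return FastdupHFDataset("zh-plus/tiny-imagenet", **build_options(**overrides))
```

For example, load_tiny_imagenet(reconvert_jpg=True) would force the .parquet to .jpg conversion to run again on an already cached dataset.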
Now, let's inspect the dataset object.
dataset
Dataset({
features: ['image', 'label'],
num_rows: 100000
})
Get the first element of the dataset.
dataset[0]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64>,
'label': 0}
Get the PIL image of the first element.
dataset[0]['image']

Get the label of the first element.
dataset[0]['label']
0
Info: You can also confirm the image and label of the first element on the dataset page.
Run fastdup
Once loaded, we can analyze the dataset in fastdup by passing in two properties of the dataset object:
- dataset.img_dir - Returns the folder where the jpg images are saved.
- dataset.annotations - Returns a DataFrame of image filenames and class labels.
dataset.img_dir
/root/.cache/huggingface/datasets/tiny-imagenet/jpg_images
dataset.annotations
| | filename | label |
---|---|---|
0 | /root/.cache/huggingface/datasets/tiny-imagenet/jpg_images/38/19443.jpg | 38 |
1 | /root/.cache/huggingface/datasets/tiny-imagenet/jpg_images/38/19127.jpg | 38 |
2 | /root/.cache/huggingface/datasets/tiny-imagenet/jpg_images/38/19199.jpg | 38 |
3 | /root/.cache/huggingface/datasets/tiny-imagenet/jpg_images/38/19271.jpg | 38 |
4 | /root/.cache/huggingface/datasets/tiny-imagenet/jpg_images/38/19213.jpg | 38 |
Let's run fastdup and pass in dataset.img_dir and dataset.annotations as arguments.
fd = fastdup.create(input_dir=dataset.img_dir)
fd.run(annotations=dataset.annotations)
Warning: fastdup create() without work_dir argument, output is stored in a folder named work_dir in your current working path.
FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-09-19 21:40:45 [INFO] Going to loop over dir /tmp/tmpjb5y9uvf.csv
2023-09-19 21:40:45 [INFO] Found total 100000 images to run on, 100000 train, 0 test, name list 100000, counter 100000
2023-09-19 21:44:25 [INFO] Found total 100000 images to run onmated: 0 Minutes
Finished histogram 38.760
Finished bucket sort 38.971
2023-09-19 21:44:45 [INFO] 20456) Finished write_index() NN model
2023-09-19 21:44:45 [INFO] Stored nn model index file work_dir/nnf.index
2023-09-19 21:44:56 [INFO] Total time took 251595 ms
2023-09-19 21:44:56 [INFO] Found a total of 40 fully identical images (d>0.990), which are 0.02 % of total graph edges
2023-09-19 21:44:56 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 % of total graph edges
2023-09-19 21:44:56 [INFO] Found a total of 11083 above threshold images (d>0.900), which are 5.54 % of total graph edges
2023-09-19 21:44:56 [INFO] Found a total of 10001 outlier images (d<0.050), which are 5.00 % of total graph edges
2023-09-19 21:44:56 [INFO] Min distance found 0.601 max distance 1.000
2023-09-19 21:44:56 [INFO] Running connected components for ccthreshold 0.960000
.0
########################################################################################
Dataset Analysis Summary:
Dataset contains 100000 images
Valid images are 100.00% (100,000) of the data, invalid are 0.00% (0) of the data
Similarity: 0.05% (46) belong to 1 similarity clusters (components).
99.95% (99,954) images do not belong to any similarity cluster.
Largest cluster has 4 (0.00%) images.
For a detailed analysis, use `.connected_components()`
(similarity threshold used is 0.9, connected component threshold used is 0.96).
Outliers: 6.34% (6,344) of images are possible outliers, and fall in the bottom 5.00% of similarity values.
For a detailed list of outliers, use `.outliers()`.
########################################################################################
Would you like to see awesome visualizations for some of the most popular academic datasets?
Click here to see and learn more: https://app.visual-layer.com/vl-datasets?utm_source=fastdup
########################################################################################
0
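The percentages in the log are expressed relative to graph edges rather than images. A quick check, assuming the similarity graph holds two edges per image (200,000 in total; this figure is inferred from the numbers in the log, not taken from fastdup's documentation), reproduces them:

```python
# Counts copied from the log output above.
num_images = 100_000
edges_per_image = 2  # assumption inferred from the log, not a documented value
total_edges = num_images * edges_per_image

counts = {
    "fully identical (d>0.990)": 40,
    "above threshold (d>0.900)": 11_083,
    "outliers (d<0.050)": 10_001,
}

# Percentage of graph edges for each category, rounded as in the log.
percentages = {
    name: round(100 * count / total_edges, 2) for name, count in counts.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f} % of total graph edges")
```

Under that assumption the three values come out as 0.02, 5.54, and 5.00, matching the log lines.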
Once completed, we can visualize the issues in fastdup galleries.
Duplicates
Let's visualize the duplicates in a gallery.
fd.vis.duplicates_gallery()

To get a detailed DataFrame of the duplicates/near-duplicates found, use the similarity method.
similarity_df = fd.similarity()
similarity_df.head()
| | from | to | distance | filename_from | label_from | split_from | index_x | error_code_from | is_valid_from | fd_index_from | filename_to | label_to | split_to | index_y | error_code_to | is_valid_to | fd_index_to |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 67443 | 26670 | 1.0 | images_dir/images/banana/89460.jpg | banana | train | 67443 | VALID | True | 67443 | images_dir/images/orange/99613.jpg | orange | train | 26670 | VALID | True | 26670 |
1 | 26670 | 67443 | 1.0 | images_dir/images/orange/99613.jpg | orange | train | 26670 | VALID | True | 26670 | images_dir/images/banana/89460.jpg | banana | train | 67443 | VALID | True | 67443 |
2 | 26553 | 27990 | 1.0 | images_dir/images/orange/99746.jpg | orange | train | 26553 | VALID | True | 26553 | images_dir/images/lemon/88551.jpg | lemon | train | 27990 | VALID | True | 27990 |
3 | 29903 | 12251 | 1.0 | images_dir/images/brain-coral/6866.jpg | brain-coral | train | 29903 | VALID | True | 29903 | images_dir/images/coral-reef/95198.jpg | coral-reef | train | 12251 | VALID | True | 12251 |
4 | 13861 | 81041 | 1.0 | images_dir/images/mashed-potato/87457.jpg | mashed-potato | train | 13861 | VALID | True | 13861 | images_dir/images/meat-loaf_meatloaf/90376.jpg | meat-loaf_meatloaf | train | 81041 | VALID | True | 81041 |
We can get the number of duplicates/near-duplicates by filtering on the distance score. Despite its name, distance here behaves as a similarity score: 1.0 means an exact copy, and lower values mean less similar images.
near_duplicates = similarity_df[similarity_df["distance"] >= 0.99]
near_duplicates = near_duplicates[["distance","filename_from", "filename_to", "label_from", "label_to"]]
len(near_duplicates)
40
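Note that the similarity DataFrame can list a pair in both directions (rows 0 and 1 of the table above mirror each other), so a plain row count may count some pairs twice. A minimal sketch of collapsing mirrored rows into unique pairs follows. It uses a small hand-made DataFrame with the same from, to, and distance columns as a stand-in for the real fd.similarity() output; the values are illustrative.

```python
import pandas as pd

# Toy stand-in for fd.similarity(); the first two rows mirror each other.
similarity_df = pd.DataFrame(
    {
        "from": [67443, 26670, 26553, 29903],
        "to": [26670, 67443, 27990, 12251],
        "distance": [1.0, 1.0, 1.0, 0.95],
    }
)

near_duplicates = similarity_df[similarity_df["distance"] >= 0.99]

# Order-independent pair key: the smaller index first, the larger second,
# so that (a, b) and (b, a) map to the same key.
pair_lo = near_duplicates[["from", "to"]].min(axis=1)
pair_hi = near_duplicates[["from", "to"]].max(axis=1)

unique_pairs = (
    near_duplicates.assign(pair_lo=pair_lo, pair_hi=pair_hi)
    .drop_duplicates(subset=["pair_lo", "pair_hi"])
    .drop(columns=["pair_lo", "pair_hi"])
)

print(len(near_duplicates), "rows ->", len(unique_pairs), "unique pairs")
```

On the toy data this keeps one row for the 67443/26670 pair and one for the 26553/27990 pair. Whether every row in the real output has a mirrored counterpart depends on the nearest-neighbor graph, so check before assuming the row count is exactly twice the number of pairs.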
Outliers
fd.vis.outliers_gallery()

outliers_df = fd.outliers()
outliers_df.head()
| | outlier | nearest | distance | filename_outlier | label_outlier | split_outlier | index_x | error_code_outlier | is_valid_outlier | fd_index_outlier | filename_nearest | label_nearest | split_nearest | index_y | error_code_nearest | is_valid_nearest | fd_index_nearest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49827 | 53704 | 0.600712 | images_dir/images/slug/99254.jpg | slug | train | 49827 | VALID | True | 49827 | images_dir/images/grasshopper_hopper/17431.jpg | grasshopper_hopper | train | 53704 | VALID | True | 53704 |
1 | 43915 | 70359 | 0.620984 | images_dir/images/fountain/47232.jpg | fountain | train | 43915 | VALID | True | 43915 | images_dir/images/syringe/74220.jpg | syringe | train | 70359 | VALID | True | 70359 |
2 | 60394 | 70399 | 0.639867 | images_dir/images/jellyfish/6152.jpg | jellyfish | train | 60394 | VALID | True | 60394 | images_dir/images/syringe/74452.jpg | syringe | train | 70399 | VALID | True | 70399 |
3 | 41058 | 93188 | 0.641278 | images_dir/images/walking-stick_walkingstick_stick-insect/17626.jpg | walking-stick_walkingstick_stick-insect | train | 41058 | VALID | True | 41058 | images_dir/images/cockroach_roach/18144.jpg | cockroach_roach | train | 93188 | VALID | True | 93188 |
4 | 64552 | 64680 | 0.648504 | images_dir/images/goldfish_Carassius-auratus/497.jpg | goldfish_Carassius-auratus | train | 64552 | VALID | True | 64552 | images_dir/images/goldfish_Carassius-auratus/118.jpg | goldfish_Carassius-auratus | train | 64680 | VALID | True | 64680 |
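In the rows above, the images with the lowest distance to their nearest neighbor come first, which makes them the strongest outlier candidates. A minimal sketch of building a review list from such a table is shown below. The DataFrame is a toy stand-in for fd.outliers() with values copied from the rows above, and the cutoff of 0.64 is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for fd.outliers(); values copied from the table above.
outliers_df = pd.DataFrame(
    {
        "filename_outlier": [
            "images_dir/images/slug/99254.jpg",
            "images_dir/images/fountain/47232.jpg",
            "images_dir/images/jellyfish/6152.jpg",
            "images_dir/images/goldfish_Carassius-auratus/497.jpg",
        ],
        "label_outlier": [
            "slug",
            "fountain",
            "jellyfish",
            "goldfish_Carassius-auratus",
        ],
        "distance": [0.600712, 0.620984, 0.639867, 0.648504],
    }
)

CUTOFF = 0.64  # hypothetical threshold; tune it after inspecting the gallery

# Keep only images whose nearest neighbor is less similar than the cutoff,
# most isolated first.
to_review = (
    outliers_df[outliers_df["distance"] < CUTOFF]
    .sort_values("distance")
    .reset_index(drop=True)
)
review_list = to_review["filename_outlier"].tolist()

print(to_review[["filename_outlier", "label_outlier", "distance"]])
```

An outlier is not necessarily a bad image; the list is a starting point for manual inspection rather than an automatic removal list.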
Mislabels
fd.vis.similarity_gallery(slice='diff')
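A tabular complement to this gallery is to look for highly similar pairs whose labels disagree. The sketch below does this on a toy DataFrame standing in for fd.similarity(); the first two rows are taken from the similarity table shown earlier, the third row is made up to show a pair with matching labels, and the 0.99 cutoff is an arbitrary choice. A label disagreement only flags a candidate for manual review, it does not say which of the two labels is wrong.

```python
import pandas as pd

# Toy stand-in for fd.similarity(); the last row is illustrative only.
similarity_df = pd.DataFrame(
    {
        "filename_from": [
            "images_dir/images/banana/89460.jpg",
            "images_dir/images/orange/99746.jpg",
            "images_dir/images/lemon/00001.jpg",
        ],
        "filename_to": [
            "images_dir/images/orange/99613.jpg",
            "images_dir/images/lemon/88551.jpg",
            "images_dir/images/lemon/00002.jpg",
        ],
        "label_from": ["banana", "orange", "lemon"],
        "label_to": ["orange", "lemon", "lemon"],
        "distance": [1.0, 1.0, 1.0],
    }
)

is_similar = similarity_df["distance"] >= 0.99
labels_differ = similarity_df["label_from"] != similarity_df["label_to"]

# Near-identical images with different labels: at least one label is suspect.
mislabel_candidates = similarity_df[is_similar & labels_differ]

print(mislabel_candidates[["filename_from", "label_from", "filename_to", "label_to"]])
```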

Wrap Up
That's it! We've just conveniently surfaced many issues with this dataset by running fastdup. By taking care of dataset quality issues, we hope this will help you train better models.
Questions about this tutorial? Reach out to us on our Slack channel!
VL Profiler - A faster and easier way to diagnose and visualize dataset issues
The team behind fastdup also recently launched VL Profiler, a no-code cloud-based platform that lets you leverage fastdup in the browser.
VL Profiler lets you find:
- Duplicates/near-duplicates.
- Outliers.
- Mislabels.
- Non-useful images.
Here's a highlight of the issues found in the RVL-CDIP test dataset on the VL Profiler.
Free Usage: Use VL Profiler for free to analyze issues on your dataset with up to 1,000,000 images.
Not convinced yet?
Interact with a collection of datasets like ImageNet-21K, COCO, and DeepFashion here.
No sign-ups needed.
