Analyzing an Image Classification Dataset

In this tutorial, we use fastdup to analyze a labeled image classification dataset for potential issues.

By the end of this tutorial you'll learn how to:

  • Load and format annotations in fastdup.
  • Compare labels of similar images.
  • Visualize a subset of the dataset using its labels.

Setting Up

You can follow along with this tutorial by running this notebook on Google Colab.

First, install fastdup with:

pip install fastdup

And verify the installation.

import fastdup
fastdup.__version__

This tutorial was run with fastdup version 0.906.

Download Dataset

We will be analyzing the Imagenette dataset, a 10-class ImageNet subset from fast.ai.

Imagenette consists of 10 classes from the original ImageNet dataset and contains roughly 13k images in total.

Download and extract the dataset:

!wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz  
!tar -xzf imagenette2-160.tgz
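
If you prefer to stay in pure Python (for example, outside a notebook), a minimal equivalent sketch of the download and extraction step looks like this:

import urllib.request
import tarfile

# Download the archive (same URL as above) and extract it into the current directory
url = "https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz"
urllib.request.urlretrieve(url, "imagenette2-160.tgz")

with tarfile.open("imagenette2-160.tgz", "r:gz") as tar:
    tar.extractall()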

Once done, you should have a folder with the following structure:

./imagenette2-160/
├── noisy_imagenette.csv
├── train
│   ├── n01440764 
│   ├── n02102040 
│   ├── n02979186 
│   ├── n03000684 
│   ├── n03028079 
│   ├── n03394916 
│   ├── n03417042 
│   ├── n03425413 
│   ├── n03445777 
│   └── n03888257 
└── val
    ├── n01440764 
    ├── n02102040 
    ├── n02979186 
    ├── n03000684 
    ├── n03028079 
    ├── n03394916 
    ├── n03417042 
    ├── n03425413
    ├── n03445777 
    └── n03888257 

📘

Naming

  • noisy_imagenette.csv - A .csv file with the labels.
  • train/ - Train images.
  • val/ - Validation images.

Load & Format Annotations

We'll use pandas to load and format the annotations.

import pandas as pd

data_dir = 'imagenette2-160/'
csv_path = 'imagenette2-160/noisy_imagenette.csv'

As ImageNet uses synset codes for its classes, we'll map them to human-readable labels for easier analysis (source):

label_map = {
    'n02979186': 'cassette_player', 
    'n03417042': 'garbage_truck', 
    'n01440764': 'tench', 
    'n02102040': 'English_springer', 
    'n03028079': 'church',
    'n03888257': 'parachute', 
    'n03394916': 'French_horn', 
    'n03000684': 'chain_saw', 
    'n03445777': 'golf_ball', 
    'n03425413': 'gas_pump'
}

Let's load the annotations.

df_annot = pd.read_csv(csv_path)
df_annot.head(3)
   | path | noisy_labels_0 | noisy_labels_1 | noisy_labels_5 | noisy_labels_25 | noisy_labels_50 | is_valid
 0 | train/n02979186/n02979186_9036.JPEG | n02979186 | n02979186 | n02979186 | n02979186 | n02979186 | False
 1 | train/n02979186/n02979186_11957.JPEG | n02979186 | n02979186 | n02979186 | n02979186 | n03000684 | False
 2 | train/n02979186/n02979186_9715.JPEG | n02979186 | n02979186 | n02979186 | n03417042 | n03000684 | False

Transform the annotations into the format expected by fastdup.

# take relevant columns
df_annot = df_annot[['path', 'noisy_labels_0']]

# rename columns to fastdup's column names
df_annot = df_annot.rename({'noisy_labels_0': 'label', 'path': 'img_filename'}, axis='columns')

# create split column
df_annot['split'] = df_annot['img_filename'].apply(lambda x: x.split("/")[0])

# map label ids to regular labels
df_annot['label'] = df_annot['label'].map(label_map)

# show formatted annotations
df_annot.head()
   | img_filename | label | split
 0 | train/n02979186/n02979186_9036.JPEG | cassette_player | train
 1 | train/n02979186/n02979186_11957.JPEG | cassette_player | train
 2 | train/n02979186/n02979186_9715.JPEG | cassette_player | train
 3 | train/n02979186/n02979186_21736.JPEG | cassette_player | train
 4 | train/n02979186/ILSVRC2012_val_00046953.JPEG | cassette_player | train
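
As a quick sanity check (an extra step, not required by fastdup), we can count how many images of each label ended up in each split using plain pandas:

# Number of images per split and label
df_annot.groupby('split')['label'].value_counts()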

Run fastdup

With the annotations and folders in the right format, let's run fastdup and analyze the dataset.

import fastdup
work_dir = 'fastdup_imagenette'

fd = fastdup.create(work_dir=work_dir, input_dir=data_dir) 
fd.run(annotations=df_annot, ccthreshold=0.9, threshold=0.8)

📘

Info

  • ccthreshold - Similarity threshold used when grouping images into connected components (clusters). Default is 0.96.
  • threshold - Similarity threshold used when building the similarity graph. Default is 0.9.

Get a summary of the run.

fd.summary()
Dataset Analysis Summary: 

    Dataset contains 13394 images
    Valid images are 100.00% (13,394) of the data, invalid are 0.00% (0) of the data
    Similarity:  2.73% (366) belong to 20 similarity clusters (components).
    97.27% (13,028) images do not belong to any similarity cluster.
    Largest cluster has 40 (0.30%) images.
    For a detailed analysis, use `.connected_components()`
(similarity threshold used is 0.8, connected component threshold used is 0.9).

    Outliers: 6.20% (830) of images are possible outliers, and fall in the bottom 5.00% of similarity values.
    For a detailed list of outliers, use `.outliers()`.

🚧

830 possible outliers

From the summary, 830 possible outliers were found. Let's inspect them further!

Outliers

Let's visualize the outliers in the dataset.

fd.vis.outliers_gallery()

📘

Label information

Note the label information in the outliers report.

👍

Anomalies?

Can you spot any anomalies from the report?

Hint - parachute and golf ball.

To get more details on the outliers, run:

fd.outliers().head()
   | index | outlier | nearest | distance | img_filename_outlier | label_outlier | split_outlier | error_code_outlier | is_valid_outlier | img_filename_nearest | label_nearest | split_nearest | error_code_nearest | is_valid_nearest
 0 | 1338 | 12009 | 1757 | 0.469904 | val/n03417042/n03417042_29412.JPEG | garbage_truck | val | VALID | True | train/n02102040/n02102040_7256.JPEG | English_springer | train | VALID | True
 1 | 1336 | 2664 | 9763 | 0.476124 | train/n02979186/n02979186_3967.JPEG | cassette_player | train | VALID | True | val/n01440764/n01440764_710.JPEG | tench | val | VALID | True
 2 | 1335 | 2727 | 1571 | 0.476313 | train/n02979186/n02979186_5424.JPEG | cassette_player | train | VALID | True | train/n02102040/n02102040_536.JPEG | English_springer | train | VALID | True
 3 | 1333 | 12172 | 1817 | 0.479290 | val/n03417042/n03417042_91.JPEG | garbage_truck | val | VALID | True | train/n02102040/n02102040_7868.JPEG | English_springer | train | VALID | True
 4 | 1332 | 1981 | 10098 | 0.479516 | train/n02979186/n02979186_10387.JPEG | cassette_player | train | VALID | True | val/n02102040/n02102040_5272.JPEG | English_springer | val | VALID | True
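
Since fd.outliers() returns a plain pandas DataFrame with the columns shown above, we can slice it further. As a sketch (this breakdown is our own addition, not a built-in fastdup report), here is one way to see which labels contribute the most outliers and which outlier/nearest pairs disagree on their label:

df_outliers = fd.outliers()

# Which labels appear most often among the outlier images?
print(df_outliers['label_outlier'].value_counts())

# Outliers whose nearest image carries a different label -- worth a manual look
mismatched = df_outliers[df_outliers['label_outlier'] != df_outliers['label_nearest']]
print(mismatched[['img_filename_outlier', 'label_outlier', 'label_nearest', 'distance']].head())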

Similarity Gallery

Besides outliers, we can also find possible mislabels by comparing each image's label to the labels of its most similar images in the dataset.

This can be done with:

fd.vis.similarity_gallery() 

To get more detailed information, run:

fd.similarity().head(5)
   | from | to | distance | img_filename_from | label_from | split_from | error_code_from | is_valid_from | img_filename_to | label_to | split_to | error_code_to | is_valid_to
 0 | 11521 | 5390 | 0.968786 | val/n03394916/n03394916_30631.JPEG | French_horn | val | VALID | True | train/n03394916/n03394916_44127.JPEG | French_horn | train | VALID | True
 1 | 5390 | 11521 | 0.968786 | train/n03394916/n03394916_44127.JPEG | French_horn | train | VALID | True | val/n03394916/n03394916_30631.JPEG | French_horn | val | VALID | True
 2 | 12914 | 7715 | 0.962458 | val/n03445777/n03445777_6882.JPEG | golf_ball | val | VALID | True | train/n03445777/n03445777_13918.JPEG | golf_ball | train | VALID | True
 3 | 7715 | 12914 | 0.962458 | train/n03445777/n03445777_13918.JPEG | golf_ball | train | VALID | True | val/n03445777/n03445777_6882.JPEG | golf_ball | val | VALID | True
 4 | 1117 | 1404 | 0.953837 | train/n02102040/n02102040_1564.JPEG | English_springer | train | VALID | True | train/n02102040/n02102040_3837.JPEG | English_springer | train | VALID | True
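
The gallery above already highlights suspicious pairs, but we can also mine the fd.similarity() DataFrame directly. A small sketch: keep highly similar pairs whose labels disagree, as these are natural mislabel candidates. The 0.9 cutoff below is an illustrative choice, not a fastdup default.

df_sim = fd.similarity()

# Very similar image pairs with conflicting labels are possible mislabels;
# the 0.9 similarity cutoff is an illustrative choice, not a fastdup default
mislabel_candidates = df_sim[
    (df_sim['distance'] > 0.9) & (df_sim['label_from'] != df_sim['label_to'])
]
mislabel_candidates[['img_filename_from', 'label_from',
                     'img_filename_to', 'label_to', 'distance']].head()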

Duplicates

Let's also check for duplicate pairs.

fd.vis.duplicates_gallery()
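
If you want duplicate candidates as data rather than a gallery, one possible sketch is to take the highest-scoring pairs from fd.similarity(); the 0.99 cutoff below is an illustrative assumption, not a fastdup default:

df_sim = fd.similarity()

# Treat near-maximal similarity as "probably the same image"
dup_candidates = df_sim[df_sim['distance'] >= 0.99]
dup_candidates[['img_filename_from', 'img_filename_to', 'distance']].head()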

Image Clusters

Next, let's visualize clusters of visually similar images.

fd.vis.component_gallery(num_images=5)

You can also visualize clusters with specific labels using the slice parameter. For example, let's visualize clusters with the chain_saw label:

fd.vis.component_gallery(slice='chain_saw')
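
The run summary also pointed at .connected_components() for a cluster-level view. The exact return type and column names can vary between fastdup versions, so treat the following as a sketch and inspect the columns before slicing; the 'component_id' column name is an assumption:

# Cluster-level details behind the component gallery
cc = fd.connected_components()

# Some versions return a tuple of DataFrames, others a single DataFrame
df_cc = cc[0] if isinstance(cc, tuple) else cc
print(df_cc.columns)

# Assuming one row per image with a 'component_id' column, cluster sizes are:
print(df_cc.groupby('component_id').size().sort_values(ascending=False).head())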

Summary

👍

We've now added several label-specific analyses on top of the raw dataset.

Building on the label-free dataset analysis from the Cleaning and preparing a dataset tutorial, we've seen how to slice the data further and visualize specific classes of interest.


What’s Next

Next, we will dive into a dataset labeled with bounding boxes, first analyzing the distribution and individual classes, and then finding outliers and possible mislabels, in preparation for training a model.