Cleaning Image Dataset

This tutorial shows how to clean an image dataset by acting on the issues that fastdup finds.

By the end of the tutorial you'll learn how to:

  • Find various dataset issues with fastdup.
  • Collect problematic images for further action.

Setting Up

You can follow along with this tutorial by running this notebook on Google Colab.

🚧

Google Colab free tier

Running this tutorial on Google Colab is possible, but it may take a while to complete because of the limited computing resources in the free tier.

We recommend running this tutorial on your local machine, Google Colab Pro or equivalent.

If you're running this tutorial on your local machine, install fastdup with:

pip install fastdup

To verify the installation, run:

import fastdup
fastdup.__version__

This tutorial runs on version 0.906.
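
If you would rather check the installed version without importing the package, the Python standard library can read it from the package metadata. The snippet below is a small sketch; the helper name `installed_version` is our own and is not part of fastdup.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if fastdup is installed, otherwise None.
print(installed_version("fastdup"))
```

A result of `None` means the package is not installed in the current environment.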

For a detailed list of installation options and supported platforms, see our installation guide.

Download Dataset

For demonstration, we use the food-101 dataset, which consists of 101 food classes with 1,000 images per class.

Download and extract the dataset by running:

!wget http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz
!tar xzf food-101.tar.gz

Once done, you should have a food-101/images folder which contains the images.
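
To confirm that the archive extracted as expected, you can count the images in each class folder. This is a minimal sketch that assumes the layout is food-101/images/<class_name>/<image>.jpg; the helper name `count_images_per_class` is ours.

```python
from collections import Counter
from pathlib import Path


def count_images_per_class(images_root, extension=".jpg"):
    """Count files with the given extension in each immediate subfolder."""
    counts = Counter()
    for path in Path(images_root).glob(f"*/*{extension}"):
        counts[path.parent.name] += 1
    return counts


root = Path("food-101/images")
if root.is_dir():  # only runs if the dataset has been extracted here
    counts = count_images_per_class(root)
    print(f"{len(counts)} classes, {sum(counts.values())} images")
```

If the download is complete, the totals should match the class and image counts described above.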

📘

Why this dataset?

We use the food-101 dataset in this tutorial because of the general availability of the dataset to the public.

Bear in mind that this is a highly curated dataset, so we may not find as many issues as we would in a non-curated dataset.

Feel free to swap out this dataset for your own!

Run fastdup

With the folder set in place, let's run fastdup:

import fastdup  
fd = fastdup.create(work_dir="fastdup_food101_work_dir/",
                    input_dir="food-101/images/")
fd.run(ccthreshold=0.9) 

📘

Parameters

  • work_dir - Path to store the artifacts generated from the run.
  • input_dir - Path to the images.
  • ccthreshold - The cluster threshold parameter. Controls the minimal distance for clustering. Defaults to 0.96. The best value of ccthreshold varies depending on the use case and data.
    • A higher threshold clusters only images that are highly similar, resulting in fewer images per cluster.
    • A lower threshold clusters less similar images together. Clusters have more diversity and a larger possible difference between images in the cluster.
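
To build intuition for how the threshold changes cluster sizes, here is a toy sketch that links two images whenever their similarity score meets the threshold and then collects the connected groups. It is an illustration only and is not fastdup's implementation; the function name, the filenames, and the scores are all made up.

```python
def cluster_by_threshold(pairs, threshold):
    """Group items into connected components using pairs scoring >= threshold.

    pairs: iterable of (item_a, item_b, similarity) tuples.
    Returns a sorted list of sorted clusters with at least two members.
    """
    parent = {}

    def find(item):
        # Follow parent links to the root, shortening the path as we go.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for a, b, similarity in pairs:
        if similarity >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical similarity scores between four images.
toy_pairs = [
    ("a.jpg", "b.jpg", 0.99),
    ("b.jpg", "c.jpg", 0.93),
    ("c.jpg", "d.jpg", 0.91),
]

print(cluster_by_threshold(toy_pairs, threshold=0.96))
# [['a.jpg', 'b.jpg']]
print(cluster_by_threshold(toy_pairs, threshold=0.90))
# [['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']]
```

With the higher threshold only the near-identical pair is grouped; with the lower one, the chain of weaker links pulls all four images into a single, more diverse cluster.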

👍

Reduce run time on free tier of Google Colab

If you're running this tutorial on the free tier of Google Colab, we recommend running the analysis on a subset of the dataset instead of the entire dataset. This reduces the time you wait for the run to complete.

You can specify the number of images to run on with the num_images argument in fd.run. For example, fd.run(num_images=40000) runs only on 40,000 images in the dataset.

Once the run completes you can get a summary of the run with:

fd.summary()

which outputs:

########################################################################################

Dataset Analysis Summary: 

    Dataset contains 40000 images
    Valid images are 100.00% (40,000) of the data, invalid are 0.00% (0) of the data
    Similarity:  1.26% (504) belong to 17 similarity clusters (components).
    98.74% (39,496) images do not belong to any similarity cluster.
    Largest cluster has 30 (0.07%) images.
    For a detailed analysis, use `.connected_components()`
(similarity threshold used is 0.9, connected component threshold used is 0.9).

    Outliers: 6.02% (2,409) of images are possible outliers, and fall in the bottom 5.00% of similarity values.
    For a detailed list of outliers, use `.outliers()`.
['Dataset contains 40000 images',
 'Valid images are 100.00% (40,000) of the data, invalid are 0.00% (0) of the data',
 'Similarity:  1.26% (504) belong to 17 similarity clusters (components).',
 '98.74% (39,496) images do not belong to any similarity cluster.',
 'Largest cluster has 30 (0.07%) images.',
 'For a detailed analysis, use `.connected_components()`\n(similarity threshold used is 0.9, connected component threshold used is 0.9).\n',
 'Outliers: 6.02% (2,409) of images are possible outliers, and fall in the bottom 5.00% of similarity values.',
 'For a detailed list of outliers, use `.outliers()`.']

Broken Images

Similar to the previous tutorial, let's start with the low-hanging fruit of finding corrupted images:

fd.invalid_instances()

which outputs:

img_filename fastdup_id error_code is_valid

📘

No broken images!

The output shows no broken images. So we are good to go here.

However, if there are broken images present (like in the previous tutorial), you'd see something like the following:

img_filename fastdup_id error_code is_valid
0 Abyssinian_34.jpg 135 ERROR_ZERO_SIZE_FILE False
1 Egyptian_Mau_139.jpg 2240 ERROR_ZERO_SIZE_FILE False
5 Egyptian_Mau_191.jpg 2293 ERROR_ZERO_SIZE_FILE False
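
The ERROR_ZERO_SIZE_FILE code in the example above suggests files that are 0 bytes long. As an independent cross-check, you can look for such files with the standard library alone. This sketch is not how fastdup detects broken images, and the helper name `find_zero_size_files` is ours.

```python
from pathlib import Path


def find_zero_size_files(images_root):
    """Return paths (relative to images_root) of files that are 0 bytes long."""
    root = Path(images_root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_size == 0
    )


# Example call, assuming the dataset folder used in this tutorial:
# find_zero_size_files("food-101/images")
```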

List of Broken Images

To get a list of broken images run:

broken_images = fd.invalid_instances()
list_of_broken_images = broken_images['img_filename'].to_list()
list_of_broken_images

Since we did not have any broken images, the output of the above code is:

[]

If fastdup encounters broken images, the output of the above snippet would look something like:

['Abyssinian_34.jpg',
 'Egyptian_Mau_139.jpg',
 'Egyptian_Mau_145.jpg']

👍

Tips

You can store this list somewhere for further action. You might want to move the files, delete them, or relabel them.
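
One cautious way to act on such a list is to move the files into a separate folder instead of deleting them, so the step can be undone. The sketch below assumes the filenames are relative to the input folder; if your list contains full paths, adjust accordingly. The helper name and the quarantine folder are hypothetical.

```python
import shutil
from pathlib import Path


def quarantine_files(filenames, images_root, quarantine_root):
    """Move the listed files out of images_root, preserving subfolders.

    Returns the list of filenames that were actually moved.
    """
    images_root = Path(images_root)
    quarantine_root = Path(quarantine_root)
    moved = []
    for name in filenames:
        source = images_root / name
        if not source.is_file():
            continue  # already moved, or never existed
        target = quarantine_root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(target))
        moved.append(name)
    return moved


# Example call (paths are assumptions, not part of the tutorial):
# quarantine_files(list_of_broken_images, "food-101/images", "food-101/quarantine")
```

Because the subfolder structure is preserved, moving the files back restores the dataset to its original state.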

Duplicates

Let's visualize duplicate image pairs with:

fd.vis.duplicates_gallery(num_images=5)

which outputs:

👍

Tips

  • Setting num_images=5 shows a gallery of 5 duplicate pairs. Change this value to view more or fewer.
  • Running fd.vis.duplicates_gallery also saves the resulting duplicates.html file into fastdup_food101_work_dir/gallery/.
  • A Distance of 1.0 indicates that the image is an exact copy.

Image Clusters

You can visualize image clusters with:

fd.vis.component_gallery(num_images=5)

which outputs:

List of Duplicates

Now let's single out all duplicates and near-duplicates using the connected components function:

connected_components_df , _ = fd.connected_components()
connected_components_df.head()

which outputs:

fastdup_id component_id sum count mean_distance min_distance max_distance img_filename error_code is_valid
0 0 0 0.0 0.0 0.0 0.0 0.0 apple_pie/1005649.jpg VALID True
1 1 1 0.0 0.0 0.0 0.0 0.0 apple_pie/1011328.jpg VALID True
2 2 2 0.0 0.0 0.0 0.0 0.0 apple_pie/101251.jpg VALID True
3 3 3 0.0 0.0 0.0 0.0 0.0 apple_pie/1014775.jpg VALID True
4 4 4 0.0 0.0 0.0 0.0 0.0 apple_pie/1026328.jpg VALID True

Let's now write a utility function to get the clusters:

# a function to group connected components
def get_clusters(df, sort_by='count', min_count=2, ascending=False):
    # columns to aggregate
    agg_dict = {'img_filename': list, 'mean_distance': max, 'count': len}

    if 'label' in df.columns:
        agg_dict['label'] = list
    
    # filter by count
    df = df[df['count'] >= min_count]
    
    # group and aggregate columns
    grouped_df = df.groupby('component_id').agg(agg_dict)
    
    # sort
    grouped_df = grouped_df.sort_values(by=[sort_by], ascending=ascending)
    return grouped_df

And run it:

clusters_df = get_clusters(connected_components_df)
clusters_df.head()

which outputs:

img_filename mean_distance count
component_id
23830 [clam_chowder/1072684.jpg, clam_chowder/1113834.jpg, clam_chowder/1322415.jpg, clam_chowder/1437241.jpg, clam_chowder/2113399.jpg, clam_chowder/2140703.jpg, clam_chowder/2248997.jpg, clam_chowder/2361787.jpg, clam_chowder/2398168.jpg, clam_chowder/2542800.jpg, clam_chowder/2685745.jpg, clam_chowder/2770581.jpg, clam_chowder/3914755.jpg, clam_chowder/546975.jpg, clam_chowder/75800.jpg, clam_chowder/854517.jpg] 0.9163 16
31637 [dumplings/1045500.jpg, dumplings/140004.jpg, dumplings/1630799.jpg, dumplings/1695231.jpg, dumplings/1848359.jpg, dumplings/1872410.jpg, dumplings/1918394.jpg, dumplings/2524385.jpg, dumplings/3683752.jpg, dumplings/3739057.jpg, dumplings/3781725.jpg, dumplings/468796.jpg] 0.9302 12
31767 [dumplings/1450685.jpg, dumplings/1564985.jpg, dumplings/2500721.jpg, dumplings/2600333.jpg, dumplings/2606645.jpg, dumplings/2675187.jpg, dumplings/3030550.jpg, dumplings/3242297.jpg, dumplings/3532122.jpg, dumplings/625116.jpg] 0.9127 10
31760 [dumplings/1433645.jpg, dumplings/1813271.jpg, dumplings/1881086.jpg, dumplings/1998135.jpg, dumplings/2229749.jpg, dumplings/2561548.jpg, dumplings/2750447.jpg, dumplings/3363745.jpg, dumplings/834049.jpg] 0.9119 9
31699 [dumplings/1228546.jpg, dumplings/1270308.jpg, dumplings/231028.jpg, dumplings/2373653.jpg, dumplings/2571523.jpg, dumplings/263589.jpg, dumplings/2909040.jpg, dumplings/2950605.jpg, dumplings/3191742.jpg] 0.9180 9

The output above shows the components (clusters) with the most duplicates and near-duplicates.
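
To make the grouping step concrete, here is a dependency-free sketch of the same idea on toy rows: collect filenames per component, drop groups that are too small, and sort the largest first. It filters on the group size it computes itself rather than on a precomputed count column, and the rows are invented for illustration.

```python
from collections import defaultdict


def group_by_component(rows, min_count=2):
    """Group filenames by component_id and keep groups with >= min_count files.

    rows: iterable of (component_id, img_filename) tuples.
    Returns (component_id, filenames) pairs, largest group first.
    """
    groups = defaultdict(list)
    for component_id, img_filename in rows:
        groups[component_id].append(img_filename)
    clusters = [
        (cid, files) for cid, files in groups.items() if len(files) >= min_count
    ]
    return sorted(clusters, key=lambda item: len(item[1]), reverse=True)


# Hypothetical rows; the component ids and filenames are made up.
toy_rows = [
    (7, "soup/1.jpg"),
    (7, "soup/2.jpg"),
    (7, "soup/3.jpg"),
    (9, "cake/1.jpg"),
    (12, "rice/1.jpg"),
    (12, "rice/2.jpg"),
]
print(group_by_component(toy_rows))
# [(7, ['soup/1.jpg', 'soup/2.jpg', 'soup/3.jpg']), (12, ['rice/1.jpg', 'rice/2.jpg'])]
```

Component 9 contains a single image, so it is not a duplicate cluster and is dropped.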

Now let's keep one image from each cluster and remove the rest:

# Keep the first image of each cluster and mark the rest as duplicates
cluster_images_to_keep = []
list_of_duplicates = []

for cluster_file_list in clusters_df.img_filename:
    # keep first file, discard rest
    keep = cluster_file_list[0]
    discard = cluster_file_list[1:]
    
    cluster_images_to_keep.append(keep)
    list_of_duplicates.extend(discard)

print(f"Found {len(set(list_of_duplicates))} highly similar images to discard")

outputs:

Found 610 highly similar images to discard

Inspecting list_of_duplicates:

list_of_duplicates

outputs:

['clam_chowder/1113834.jpg',
 'clam_chowder/1322415.jpg',
 'clam_chowder/1437241.jpg',
 'clam_chowder/2113399.jpg',
 'clam_chowder/2140703.jpg',
 'clam_chowder/2248997.jpg',
 'clam_chowder/2361787.jpg',
 'clam_chowder/2398168.jpg',
 'clam_chowder/2542800.jpg',
 'clam_chowder/2685745.jpg',
 'clam_chowder/2770581.jpg',
 'clam_chowder/3914755.jpg',
 'clam_chowder/546975.jpg',
 'clam_chowder/75800.jpg',
 'clam_chowder/854517.jpg',
 'dumplings/140004.jpg',
 'dumplings/1630799.jpg',
 'dumplings/1695231.jpg',
 'dumplings/1848359.jpg',
 'dumplings/1872410.jpg',
 'dumplings/1918394.jpg',
 'dumplings/2524385.jpg',
 'dumplings/3683752.jpg',
 'dumplings/3739057.jpg',
 'dumplings/3781725.jpg',
 'dumplings/468796.jpg',
 'dumplings/1564985.jpg',
 'dumplings/2500721.jpg',
 'dumplings/2600333.jpg',
 'dumplings/2606645.jpg',
 'dumplings/2675187.jpg',
 'dumplings/3030550.jpg',
 'dumplings/3242297.jpg',
 'dumplings/3532122.jpg',
 'dumplings/625116.jpg',
 'dumplings/1813271.jpg',
 'dumplings/1881086.jpg',
 'dumplings/1998135.jpg',
 'dumplings/2229749.jpg',
 'dumplings/2561548.jpg',
 'dumplings/2750447.jpg',
 'dumplings/3363745.jpg',
 'dumplings/834049.jpg',
 'dumplings/1270308.jpg',
 'dumplings/231028.jpg',
 'dumplings/2373653.jpg',
 'dumplings/2571523.jpg',
 'dumplings/263589.jpg',
 'dumplings/2909040.jpg',
 'dumplings/2950605.jpg',
 'dumplings/3191742.jpg',
 'dumplings/1276808.jpg',
 'dumplings/1308246.jpg',
 'dumplings/1598923.jpg',
 'dumplings/2546897.jpg',
 'dumplings/2630977.jpg',
 'dumplings/263764.jpg',
 'dumplings/3412861.jpg',
 'dumplings/646942.jpg',
 'edamame/1714523.jpg',
 'edamame/2204418.jpg',
 'edamame/2483789.jpg',
 'edamame/2670224.jpg',
 'edamame/3432193.jpg',
 'edamame/3666348.jpg',
 'edamame/3788141.jpg',
 'edamame/825581.jpg',
 'clam_chowder/1945594.jpg',
 'clam_chowder/2508514.jpg',
 'clam_chowder/2673628.jpg',
 'clam_chowder/2789238.jpg',
 'clam_chowder/3549975.jpg',
 'clam_chowder/758162.jpg',
 'clam_chowder/903815.jpg',
 'clam_chowder/2509774.jpg',
 'clam_chowder/2676197.jpg',
 'clam_chowder/2992252.jpg',
 'clam_chowder/3264840.jpg',
 'clam_chowder/3395059.jpg',
 'clam_chowder/906900.jpg',
 'clam_chowder/967946.jpg',
 'clam_chowder/1511884.jpg',
 'clam_chowder/2426821.jpg',
 'clam_chowder/2476027.jpg',
 'clam_chowder/2869301.jpg',
 'clam_chowder/2897057.jpg',
 'clam_chowder/3291238.jpg',
 'clam_chowder/3830343.jpg',
 'dumplings/1563646.jpg',
 'dumplings/1897260.jpg',
 'dumplings/2444294.jpg',
 'dumplings/2951551.jpg',
 'dumplings/3101737.jpg',
 'dumplings/310672.jpg',
 'dumplings/3279575.jpg',
 'creme_brulee/1207812.jpg',
 'creme_brulee/1742194.jpg',
 'creme_brulee/1816938.jpg',
 'creme_brulee/312639.jpg',
 'creme_brulee/480234.jpg',
 'creme_brulee/59534.jpg',
 'club_sandwich/1318118.jpg',
 'club_sandwich/1775789.jpg',
 'club_sandwich/1886101.jpg',
 'club_sandwich/2778614.jpg',
 'club_sandwich/3106065.jpg',
 'club_sandwich/588478.jpg',
 'edamame/1086703.jpg',
 'edamame/2040753.jpg',
 'edamame/2390868.jpg',
 'edamame/3325153.jpg',
 'edamame/3520889.jpg',
 'edamame/677508.jpg',
 'club_sandwich/1840706.jpg',
 'club_sandwich/2272423.jpg',
 'club_sandwich/3526250.jpg',
 'club_sandwich/3646665.jpg',
 'club_sandwich/3664710.jpg',
 'dumplings/2380724.jpg',
 'dumplings/2707946.jpg',
 'dumplings/587831.jpg',
 'dumplings/876327.jpg',
 'dumplings/937912.jpg',
 'dumplings/1545564.jpg',
 'dumplings/1848509.jpg',
 'dumplings/3359158.jpg',
 'dumplings/3619519.jpg',
 'dumplings/3686831.jpg',
 'beignets/2399174.jpg',
 'beignets/2683786.jpg',
 'beignets/3520470.jpg',
 'beignets/595743.jpg',
 'beignets/832877.jpg',
 'edamame/2778957.jpg',
 'edamame/3519994.jpg',
 'edamame/3546677.jpg',
 'edamame/579614.jpg',
 'edamame/846598.jpg',
 'clam_chowder/3137773.jpg',
 'clam_chowder/686716.jpg',
 'clam_chowder/762499.jpg',
 'clam_chowder/777422.jpg',
 'clam_chowder/804904.jpg',
 'clam_chowder/3073323.jpg',
 'clam_chowder/3142771.jpg',
 'clam_chowder/3228022.jpg',
 'clam_chowder/513498.jpg',
 'bruschetta/2018603.jpg',
 'bruschetta/2229245.jpg',
 'bruschetta/261311.jpg',
 'bruschetta/3743680.jpg',
 'breakfast_burrito/1058434.jpg',
 'caesar_salad/2599756.jpg',
 'caesar_salad/520391.jpg',
 'eggs_benedict/2066348.jpg',
 'clam_chowder/2894611.jpg',
 'clam_chowder/596255.jpg',
 'clam_chowder/907742.jpg',
 'clam_chowder/947484.jpg',
 'edamame/2588718.jpg',
 'edamame/2847124.jpg',
 'edamame/3558096.jpg',
 'edamame/601042.jpg',
 'bruschetta/243736.jpg',
 'bruschetta/3805917.jpg',
 'bruschetta/3836578.jpg',
 'bruschetta/3896592.jpg',
 'clam_chowder/2396225.jpg',
 'clam_chowder/2603953.jpg',
 'clam_chowder/3689947.jpg',
 'clam_chowder/655847.jpg',
 'club_sandwich/1413794.jpg',
 'club_sandwich/1811271.jpg',
 'club_sandwich/2163422.jpg',
 'club_sandwich/3543955.jpg',
 'dumplings/1370046.jpg',
 'dumplings/1510091.jpg',
 'dumplings/2531851.jpg',
 'edamame/3185310.jpg',
 'edamame/3569901.jpg',
 'edamame/965396.jpg',
 'edamame/2975349.jpg',
 'edamame/3152528.jpg',
 'edamame/3301986.jpg',
 'bibimbap/1615665.jpg',
 'bibimbap/2795629.jpg',
 'bibimbap/964368.jpg',
 'bibimbap/2519286.jpg',
 'bibimbap/2572183.jpg',
 'bibimbap/3003579.jpg',
 'caesar_salad/3673948.jpg',
 'caesar_salad/620905.jpg',
 'caesar_salad/709638.jpg',
 'dumplings/2975772.jpg',
 'dumplings/3888349.jpg',
 'dumplings/599168.jpg',
 'bibimbap/2534963.jpg',
 'bibimbap/3571528.jpg',
 'bibimbap/3627919.jpg',
 'edamame/3213278.jpg',
 'edamame/3634423.jpg',
 'edamame/3920329.jpg',
 'dumplings/322034.jpg',
 'dumplings/3428971.jpg',
 'dumplings/432.jpg',
 'chicken_quesadilla/3004094.jpg',
 'chicken_quesadilla/3779974.jpg',
 'dumplings/2736144.jpg',
 'dumplings/3430692.jpg',
 'caesar_salad/3402604.jpg',
 'caesar_salad/3703325.jpg',
 'clam_chowder/1942294.jpg',
 'clam_chowder/2027156.jpg',
 'breakfast_burrito/662423.jpg',
 'breakfast_burrito/662424.jpg',
 'churros/3303373.jpg',
 'churros/3303522.jpg',
 'chicken_curry/2701143.jpg',
 'chicken_curry/882723.jpg',
 'clam_chowder/1063260.jpg',
 'clam_chowder/3024138.jpg',
 'bibimbap/2041700.jpg',
 'bibimbap/2346855.jpg',
 'edamame/1622192.jpg',
 'edamame/561133.jpg',
 'beignets/1728932.jpg',
 'beignets/1751352.jpg',
 'dumplings/2770853.jpg',
 'dumplings/625233.jpg',
 'chocolate_cake/51717.jpg',
 'chocolate_cake/55122.jpg',
 'bruschetta/1890619.jpg',
 'bruschetta/3462434.jpg',
 'edamame/1620027.jpg',
 'edamame/2916151.jpg',
 'dumplings/521153.jpg',
 'dumplings/882708.jpg',
 'edamame/1144040.jpg',
 'edamame/1225330.jpg',
 'edamame/684483.jpg',
 'edamame/952423.jpg',
 'crab_cakes/2780621.jpg',
 'crab_cakes/2780623.jpg',
 'edamame/3253578.jpg',
 'edamame/3620419.jpg',
 'croque_madame/3163125.jpg',
 'croque_madame/3865436.jpg',
 'dumplings/2182931.jpg',
 'dumplings/3458910.jpg',
 'dumplings/1146384.jpg',
 'dumplings/2108794.jpg',
 'creme_brulee/2418653.jpg',
 'creme_brulee/3684311.jpg',
 'dumplings/3537145.jpg',
 'dumplings/808822.jpg',
 'dumplings/231024.jpg',
 'dumplings/35818.jpg',
 'croque_madame/3224280.jpg',
 'croque_madame/3288700.jpg',
 'croque_madame/2598646.jpg',
 'croque_madame/3036159.jpg',
 'falafel/3370784.jpg',
 'falafel/438562.jpg',
 'dumplings/1557735.jpg',
 'dumplings/3635848.jpg',
 'escargots/637187.jpg',
 'escargots/637188.jpg',
 'croque_madame/157692.jpg',
 'croque_madame/290729.jpg',
 'ceviche/2796501.jpg',
 'ceviche/895716.jpg',
 'donuts/1774835.jpg',
 'donuts/2563686.jpg',
 'edamame/3243030.jpg',
 'edamame/3313851.jpg',
 'chicken_quesadilla/535532.jpg',
 'chicken_quesadilla/535546.jpg',
 'eggs_benedict/1972975.jpg',
 'eggs_benedict/2528340.jpg',
 'dumplings/3424747.jpg',
 'dumplings/55070.jpg',
 'edamame/3028728.jpg',
 'edamame/3112981.jpg',
 'eggs_benedict/3225684.jpg',
 'eggs_benedict/535020.jpg',
 'beef_tartare/1361899.jpg',
 'beef_tartare/3437886.jpg',
 'clam_chowder/2862215.jpg',
 'clam_chowder/795839.jpg',
 'croque_madame/3776229.jpg',
 'croque_madame/3873257.jpg',
 'clam_chowder/2641960.jpg',
 'clam_chowder/3289212.jpg',
 'donuts/2117632.jpg',
 'deviled_eggs/3281495.jpg',
 'donuts/3089074.jpg',
 'deviled_eggs/3902179.jpg',
 'deviled_eggs/3058137.jpg',
 'deviled_eggs/584369.jpg',
 'donuts/3124075.jpg',
 'deviled_eggs/3246571.jpg',
 'donuts/1954438.jpg',
 'deviled_eggs/3806337.jpg',
 'deviled_eggs/2671994.jpg',
 'apple_pie/1469191.jpg',
 'deviled_eggs/3491525.jpg',
 'croque_madame/2555777.jpg',
 'croque_madame/1870619.jpg',
 'croque_madame/3322423.jpg',
 'croque_madame/2269229.jpg',
 'croque_madame/1306940.jpg',
 'croque_madame/1497073.jpg',
 'creme_brulee/3245776.jpg',
 'creme_brulee/3155386.jpg',
 'creme_brulee/3487185.jpg',
 'creme_brulee/3054304.jpg',
 'creme_brulee/332369.jpg',
 'creme_brulee/2610691.jpg',
 'creme_brulee/2680133.jpg',
 'creme_brulee/2262132.jpg',
 'creme_brulee/2602002.jpg',
 'creme_brulee/2085820.jpg',
 'creme_brulee/2376691.jpg',
 'creme_brulee/722718.jpg',
 'croque_madame/611043.jpg',
 'croque_madame/691718.jpg',
 'deviled_eggs/2178531.jpg',
 'croque_madame/2962203.jpg',
 'deviled_eggs/1923965.jpg',
 'deviled_eggs/1721209.jpg',
 'deviled_eggs/1619934.jpg',
 'deviled_eggs/1568041.jpg',
 'deviled_eggs/1527126.jpg',
 'deviled_eggs/1378330.jpg',
 'donuts/2512789.jpg',
 'deviled_eggs/3021655.jpg',
 'cup_cakes/556378.jpg',
 'cup_cakes/1493261.jpg',
 'cup_cakes/1082593.jpg',
 'croque_madame/914187.jpg',
 'croque_madame/878201.jpg',
 'croque_madame/580678.jpg',
 'croque_madame/880779.jpg',
 'croque_madame/392709.jpg',
 'croque_madame/3414159.jpg',
 'donuts/2499239.jpg',
 'dumplings/2084607.jpg',
 'donuts/861022.jpg',
 'edamame/3840513.jpg',
 'falafel/1206667.jpg',
 'escargots/563386.jpg',
 'escargots/3688869.jpg',
 'escargots/3468449.jpg',
 'escargots/2667969.jpg',
 'escargots/2646994.jpg',
 'escargots/3004581.jpg',
 'escargots/2211156.jpg',
 'escargots/1637284.jpg',
 'escargots/2491502.jpg',
 'eggs_benedict/901333.jpg',
 'eggs_benedict/3238266.jpg',
 'eggs_benedict/3574668.jpg',
 'eggs_benedict/721876.jpg',
 'edamame/979556.jpg',
 'edamame/587222.jpg',
 'edamame/453226.jpg',
 'falafel/295629.jpg',
 'falafel/2505830.jpg',
 'falafel/3086998.jpg',
 'foie_gras/2870358.jpg',
 'foie_gras/459507.jpg',
 'foie_gras/3382988.jpg',
 'foie_gras/3029045.jpg',
 'foie_gras/3105826.jpg',
 'foie_gras/2857159.jpg',
 'foie_gras/2291174.jpg',
 'foie_gras/1721540.jpg',
 'foie_gras/21278.jpg',
 'falafel/3464997.jpg',
 'foie_gras/1051567.jpg',
 'filet_mignon/734006.jpg',
 'filet_mignon/646511.jpg',
 'filet_mignon/1666949.jpg',
 'falafel/3882357.jpg',
 'falafel/3789344.jpg',
 'falafel/3001734.jpg',
 'edamame/3831507.jpg',
 'edamame/667469.jpg',
 'dumplings/2106100.jpg',
 'edamame/2900759.jpg',
 'edamame/1659005.jpg',
 'dumplings/955413.jpg',
 'dumplings/774604.jpg',
 'dumplings/6201.jpg',
 'dumplings/663266.jpg',
 'dumplings/2545565.jpg',
 'dumplings/633367.jpg',
 'dumplings/28220.jpg',
 'dumplings/856176.jpg',
 'dumplings/267852.jpg',
 'dumplings/2800182.jpg',
 'dumplings/3554779.jpg',
 'dumplings/180290.jpg',
 'dumplings/231026.jpg',
 'dumplings/3153246.jpg',
 'dumplings/2932420.jpg',
 'dumplings/2942258.jpg',
 'edamame/1346107.jpg',
 'edamame/488373.jpg',
 'edamame/2977649.jpg',
 'edamame/2473555.jpg',
 'edamame/2803276.jpg',
 'edamame/3041151.jpg',
 'edamame/2708664.jpg',
 'edamame/2230705.jpg',
 'edamame/2545734.jpg',
 'edamame/3119358.jpg',
 'edamame/336171.jpg',
 'edamame/2574083.jpg',
 'edamame/2558511.jpg',
 'edamame/804283.jpg',
 'edamame/1969958.jpg',
 'edamame/864875.jpg',
 'edamame/2499082.jpg',
 'edamame/2302171.jpg',
 'edamame/1821106.jpg',
 'edamame/2157980.jpg',
 'creme_brulee/1888025.jpg',
 'clam_chowder/390727.jpg',
 'crab_cakes/2194081.jpg',
 'bibimbap/1809239.jpg',
 'bibimbap/1792799.jpg',
 'bibimbap/628343.jpg',
 'bibimbap/892182.jpg',
 'beignets/727595.jpg',
 'beignets/3573964.jpg',
 'beignets/492391.jpg',
 'beignets/2706264.jpg',
 'donuts/708597.jpg',
 'beignets/518797.jpg',
 'beignets/2004832.jpg',
 'beignets/1997437.jpg',
 'beignets/2735628.jpg',
 'beignets/935415.jpg',
 'beignets/1428238.jpg',
 'beignets/3873758.jpg',
 'beet_salad/3268468.jpg',
 'beet_salad/2671983.jpg',
 'beet_salad/374126.jpg',
 'beet_salad/1855829.jpg',
 'bibimbap/2499871.jpg',
 'bibimbap/574280.jpg',
 'bruschetta/3838937.jpg',
 'bibimbap/913532.jpg',
 'bruschetta/619290.jpg',
 'bruschetta/3696492.jpg',
 'bruschetta/3711344.jpg',
 'bruschetta/2161394.jpg',
 'caprese_salad/2730842.jpg',
 'breakfast_burrito/931734.jpg',
 'breakfast_burrito/491065.jpg',
 'ceviche/1205283.jpg',
 'bruschetta/711623.jpg',
 'bread_pudding/502700.jpg',
 'bibimbap/3884378.jpg',
 'bibimbap/3611974.jpg',
 'bibimbap/890594.jpg',
 'bibimbap/3670923.jpg',
 'bibimbap/495544.jpg',
 'bibimbap/2988372.jpg',
 'bibimbap/3096950.jpg',
 'bibimbap/3837493.jpg',
 'bibimbap/2399561.jpg',
 'beet_salad/1404312.jpg',
 'beef_tartare/50036.jpg',
 'beef_tartare/3646367.jpg',
 'beef_tartare/97478.jpg',
 'baklava/3518558.jpg',
 'baklava/3158786.jpg',
 'baklava/2209150.jpg',
 'baklava/2015716.jpg',
 'baklava/2186251.jpg',
 'baklava/1458610.jpg',
 'baklava/1150170.jpg',
 'baby_back_ribs/620997.jpg',
 'baby_back_ribs/3265047.jpg',
 'baby_back_ribs/3620137.jpg',
 'baby_back_ribs/3142431.jpg',
 'baby_back_ribs/3125728.jpg',
 'filet_mignon/2427308.jpg',
 'baby_back_ribs/801284.jpg',
 'baby_back_ribs/2306066.jpg',
 'baby_back_ribs/2129884.jpg',
 'apple_pie/839845.jpg',
 'apple_pie/3670966.jpg',
 'apple_pie/3324492.jpg',
 'beef_carpaccio/885771.jpg',
 'beef_carpaccio/679379.jpg',
 'beef_carpaccio/2290534.jpg',
 'beet_salad/686615.jpg',
 'beef_tartare/3191961.jpg',
 'beef_tartare/3185389.jpg',
 'beef_tartare/2561385.jpg',
 'beef_tartare/2426755.jpg',
 'beef_tartare/2030974.jpg',
 'beef_tartare/1562966.jpg',
 'beef_tartare/3722200.jpg',
 'beef_tartare/2038606.jpg',
 'beef_carpaccio/721638.jpg',
 'beef_carpaccio/2434359.jpg',
 'beef_carpaccio/3323355.jpg',
 'beef_carpaccio/3252686.jpg',
 'beef_carpaccio/3289048.jpg',
 'beef_carpaccio/3394009.jpg',
 'beef_carpaccio/2035002.jpg',
 'beef_carpaccio/1926900.jpg',
 'beef_carpaccio/1801501.jpg',
 'beef_carpaccio/2907748.jpg',
 'bruschetta/3387732.jpg',
 'bruschetta/3790099.jpg',
 'crab_cakes/814716.jpg',
 'clam_chowder/2385341.jpg',
 'clam_chowder/2742139.jpg',
 'clam_chowder/2754706.jpg',
 'clam_chowder/2148133.jpg',
 'clam_chowder/3126055.jpg',
 'clam_chowder/3045182.jpg',
 'clam_chowder/3457812.jpg',
 'clam_chowder/780765.jpg',
 'clam_chowder/3588064.jpg',
 'clam_chowder/1783836.jpg',
 'deviled_eggs/1999024.jpg',
 'clam_chowder/2762472.jpg',
 'clam_chowder/2553830.jpg',
 'clam_chowder/1871262.jpg',
 'clam_chowder/3572725.jpg',
 'churros/644700.jpg',
 'churros/2617186.jpg',
 'churros/1683636.jpg',
 'chocolate_mousse/2048018.jpg',
 'chocolate_mousse/2673864.jpg',
 'clam_chowder/2503659.jpg',
 'clam_chowder/3031443.jpg',
 'caesar_salad/728727.jpg',
 'clam_chowder/924933.jpg',
 'crab_cakes/20845.jpg',
 'crab_cakes/1460400.jpg',
 'crab_cakes/2885408.jpg',
 'fish_and_chips/359280.jpg',
 'club_sandwich/3845195.jpg',
 'club_sandwich/3550782.jpg',
 'club_sandwich/1377451.jpg',
 'club_sandwich/2856571.jpg',
 'club_sandwich/1380208.jpg',
 'club_sandwich/736461.jpg',
 'clam_chowder/948137.jpg',
 'clam_chowder/9768.jpg',
 'clam_chowder/511201.jpg',
 'apple_pie/1487150.jpg',
 'clam_chowder/3322877.jpg',
 'clam_chowder/3508063.jpg',
 'clam_chowder/3307340.jpg',
 'clam_chowder/3115414.jpg',
 'clam_chowder/2961270.jpg',
 'chicken_wings/834669.jpg',
 'chicken_wings/811798.jpg',
 'chicken_wings/3108137.jpg',
 'chicken_quesadilla/2223295.jpg',
 'cannoli/3846450.jpg',
 'cannoli/2295498.jpg',
 'cannoli/1982944.jpg',
 'cannoli/1357678.jpg',
 'eggs_benedict/1010197.jpg',
 'caesar_salad/3791298.jpg',
 'caesar_salad/3627251.jpg',
 'caesar_salad/3325086.jpg',
 'caesar_salad/3381505.jpg',
 'eggs_benedict/2748311.jpg',
 'caesar_salad/2707518.jpg',
 'caesar_salad/2683955.jpg',
 'caesar_salad/2319739.jpg',
 'caesar_salad/3912473.jpg',
 'caesar_salad/3315261.jpg',
 'caesar_salad/3479395.jpg',
 'chicken_quesadilla/2025030.jpg',
 'caesar_salad/3637443.jpg',
 'caesar_salad/2874871.jpg',
 'caprese_salad/992553.jpg',
 'caprese_salad/1473449.jpg',
 'caprese_salad/1411082.jpg',
 'ceviche/2523261.jpg',
 'chicken_quesadilla/1590716.jpg',
 'chicken_curry/70091.jpg',
 'chicken_curry/3496679.jpg',
 'chicken_curry/2617143.jpg',
 'cheese_plate/618425.jpg',
 'cheese_plate/3545251.jpg',
 'cheese_plate/3026695.jpg',
 'eggs_benedict/158871.jpg',
 'ceviche/172529.jpg',
 'caprese_salad/3753434.jpg',
 'carrot_cake/527702.jpg',
 'carrot_cake/3768473.jpg',
 'carrot_cake/3512754.jpg',
 'carrot_cake/3374621.jpg',
 'carrot_cake/3889387.jpg',
 'caprese_salad/763201.jpg',
 'caprese_salad/3289013.jpg',
 'caprese_salad/87213.jpg',
 'foie_gras/79314.jpg']
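
A list this long is easier to review and act on once it is saved to disk. The following sketch writes the unique filenames to a plain text file, one per line, and reads them back. The helper names and the output filename are hypothetical.

```python
from pathlib import Path


def save_file_list(filenames, output_path):
    """Write unique filenames to a text file, one per line, in first-seen order.

    Returns the number of filenames written.
    """
    unique = list(dict.fromkeys(filenames))  # de-duplicate, keep order
    text = "".join(f"{name}\n" for name in unique)
    Path(output_path).write_text(text, encoding="utf-8")
    return len(unique)


def load_file_list(input_path):
    """Read back a list written by save_file_list."""
    return Path(input_path).read_text(encoding="utf-8").splitlines()


# Example call (the filename is an assumption):
# save_file_list(list_of_duplicates, "duplicates_to_review.txt")
```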

Additional reading: https://visual-layer.readme.io/docs/v02xx-api#remove_duplicates

Outliers

Now, visualize the outliers with:

fd.vis.outliers_gallery(num_images=5)

which outputs:

👍

Tips

  • A lower Distance value indicates that the image is more different from the others in the dataset. Hence, the lower the Distance value, the more likely the image is an outlier.
  • Since each image's label is the name of its parent folder, we can already spot a couple of outliers in this gallery. For example, the last image in the gallery is labeled dumplings but does not contain any dumplings.

List of Outliers

Let's first get the outliers DataFrame:

outlier_df = fd.outliers()
outlier_df.head()

Which outputs:

index outlier nearest distance img_filename_outlier error_code_outlier is_valid_outlier img_filename_nearest error_code_nearest is_valid_nearest
0 3999 9797 27221 0.295020 breakfast_burrito/462294.jpg VALID True creme_brulee/1661605.jpg VALID True
1 3997 21410 37470 0.556575 chocolate_cake/2518457.jpg VALID True filet_mignon/2685908.jpg VALID True
2 3995 11063 16727 0.563040 caesar_salad/1303023.jpg VALID True cheesecake/358018.jpg VALID True
3 3994 21885 2669 0.564055 chocolate_cake/577717.jpg VALID True baklava/3363412.jpg VALID True
4 3992 32123 31207 0.578329 dumplings/1339572.jpg VALID True donuts/1750980.jpg VALID True

Let's treat all images with distance < 0.68 as outliers:

list_of_outliers = outlier_df[outlier_df.distance < 0.68].img_filename_outlier.tolist()
list_of_outliers

Outputs:

['breakfast_burrito/462294.jpg',  
 'chocolate_cake/2518457.jpg',  
 'caesar_salad/1303023.jpg',  
 'chocolate_cake/577717.jpg',  
 'dumplings/1339572.jpg',  
 'bibimbap/2594394.jpg',  
 'ceviche/2363511.jpg',  
 'churros/2327883.jpg',  
 'chicken_wings/693809.jpg',  
 'foie_gras/3776193.jpg',  
 'chicken_curry/2523126.jpg',  
 'churros/1440917.jpg',  
 'creme_brulee/1661605.jpg',  
 'apple_pie/484038.jpg',  
 'foie_gras/33258.jpg',  
 'cheesecake/2160930.jpg',  
 'cheesecake/1955517.jpg',  
 'chicken_curry/789540.jpg',  
 'cup_cakes/451074.jpg',  
 'cup_cakes/1005580.jpg',  
 'bread_pudding/1375816.jpg',  
 'chocolate_mousse/2177988.jpg',  
 'bruschetta/1883187.jpg',  
 'chocolate_cake/3600589.jpg',  
 'apple_pie/236966.jpg',  
 'caprese_salad/2719211.jpg',  
 'bibimbap/3230839.jpg',  
 'apple_pie/2008772.jpg',  
 'edamame/2979095.jpg',  
 'fish_and_chips/1566646.jpg',  
 'cup_cakes/601989.jpg',  
 'filet_mignon/2685908.jpg',  
 'baklava/3236360.jpg',  
 'baby_back_ribs/1676135.jpg',  
 'cup_cakes/2590269.jpg',  
 'chocolate_cake/2814515.jpg',  
 'churros/1972000.jpg',  
 'clam_chowder/759125.jpg',  
 'falafel/2585154.jpg',  
 'cup_cakes/630654.jpg',  
 'baklava/1553505.jpg',  
 'chocolate_cake/1749296.jpg',  
 'beignets/3506219.jpg',  
 'cheesecake/811556.jpg',  
 'chocolate_cake/1646662.jpg',  
 'donuts/921183.jpg',  
 'donuts/3316195.jpg',  
 'foie_gras/235773.jpg',  
 'churros/2550886.jpg',  
 'filet_mignon/2685899.jpg',  
 'chocolate_cake/2479257.jpg',  
 'beet_salad/1456898.jpg',  
 'cheesecake/2465886.jpg',  
 'churros/1658982.jpg',  
 'creme_brulee/107007.jpg',  
 'churros/3690003.jpg',  
 'chocolate_cake/1244445.jpg',  
 'apple_pie/755031.jpg',  
 'deviled_eggs/2854885.jpg',  
 'cannoli/3300725.jpg',  
 'churros/3169818.jpg',  
 'donuts/794976.jpg',  
 'cannoli/1070382.jpg',  
 'beet_salad/1643533.jpg',  
 'chocolate_mousse/2048999.jpg',  
 'churros/2741606.jpg',  
 'beignets/726875.jpg',  
 'chocolate_mousse/2287892.jpg',  
 'filet_mignon/3030737.jpg',  
 'fish_and_chips/876010.jpg',  
 'churros/1944265.jpg',  
 'cheese_plate/3119696.jpg',  
 'donuts/456541.jpg',  
 'churros/962826.jpg',  
 'churros/679673.jpg',  
 'donuts/1452592.jpg',  
 'donuts/3347684.jpg',  
 'baklava/3278527.jpg',  
 'bread_pudding/2585974.jpg',  
 'beef_tartare/913291.jpg',  
 'creme_brulee/1138671.jpg',  
 'chocolate_mousse/3604313.jpg',  
 'chocolate_mousse/1320051.jpg',  
 'chocolate_cake/985141.jpg',  
 'chocolate_cake/51412.jpg',  
 'cheesecake/2617496.jpg',  
 'club_sandwich/1127992.jpg',  
 'escargots/3406878.jpg',  
 'carrot_cake/580925.jpg',  
 'chocolate_cake/2174801.jpg',  
 'chicken_curry/889805.jpg',  
 'chocolate_cake/2067510.jpg',  
 'creme_brulee/202057.jpg',  
 'caprese_salad/2298180.jpg',  
 'chocolate_mousse/2688431.jpg',  
 'chocolate_mousse/2616372.jpg',  
 'chocolate_cake/771009.jpg',  
 'churros/1995090.jpg',  
 'breakfast_burrito/1229548.jpg',  
 'donuts/1167771.jpg',  
 'baby_back_ribs/2083106.jpg',  
 'bibimbap/2011447.jpg',  
 'churros/1977745.jpg',  
 'churros/3447996.jpg',  
 'chocolate_cake/3169533.jpg',  
 'donuts/1828646.jpg',  
 'baklava/1452465.jpg',  
 'chocolate_cake/2280321.jpg',  
 'beignets/3568316.jpg',  
 'beef_tartare/1054197.jpg',  
 'cup_cakes/3691610.jpg']
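
The 0.68 cutoff is a judgment call, so it can help to see how many images different cutoffs would flag before settling on one. This sketch counts values strictly below each candidate threshold, using only the five distances visible in the outlier_df.head() output; the helper name `count_below` is ours.

```python
def count_below(distances, thresholds):
    """Map each threshold to the number of distances strictly below it."""
    return {t: sum(d < t for d in distances) for t in thresholds}


# The five distance values shown in the outlier_df.head() output.
head_distances = [0.295020, 0.556575, 0.563040, 0.564055, 0.578329]

print(count_below(head_distances, thresholds=[0.30, 0.56, 0.68]))
# {0.3: 1, 0.56: 2, 0.68: 5}
```

The same function should work on the full column, for example `count_below(outlier_df["distance"], [0.5, 0.6, 0.68])`, since it only needs an iterable of numbers.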

Dark, Bright, Blurry Images

Similar to the previous tutorial, we can visualize the dark, bright, and blurry images with:

fd.vis.stats_gallery(metric='dark')
fd.vis.stats_gallery(metric='bright')
fd.vis.stats_gallery(metric='blur')

List of Dark Images

Get a DataFrame of image statistics:

stats_df = fd.img_stats()

We treat an image as dark if its mean value is below 13:

dark_images = stats_df[stats_df['mean'] < 13]  
dark_images

Outputs:

fastdup_id img_w img_h unique blur mean min max stdv file_size contrast img_filename error_code is_valid
3090 3090 512 306 0 535.7338 5.5205 0.0 255.0 15.3110 27433 1.0 beef_carpaccio/1259270.jpg VALID True
9797 9797 511 512 0 9.0875 1.8431 0.0 30.0 1.0524 8693 1.0 breakfast_burrito/462294.jpg VALID True

To get a list of the dark images:

list_of_dark_images = dark_images['img_filename'].to_list()
list_of_dark_images

Which outputs:

['beef_carpaccio/1259270.jpg', 'breakfast_burrito/462294.jpg']
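
To see why a low mean points to a dark image, here is a toy sketch. It assumes the mean column is the average pixel intensity on a 0 to 255 scale, which the min and max columns in the output suggest; how fastdup combines colour channels is not covered here. The pixel grids are made up.

```python
from statistics import fmean


def mean_intensity(pixels):
    """Average pixel value of a 2D grid of grayscale intensities (0-255)."""
    return fmean(value for row in pixels for value in row)


def is_dark(pixels, threshold=13):
    """True when the average intensity falls below the threshold."""
    return mean_intensity(pixels) < threshold


# Hypothetical 2x3 grayscale images.
mostly_black = [[0, 2, 5], [1, 30, 4]]
well_lit = [[120, 130, 125], [140, 110, 135]]

print(mean_intensity(mostly_black), is_dark(mostly_black))  # 7.0 True
print(is_dark(well_lit))  # False
```

A few bright pixels do not rescue an image whose average stays near zero, which is why the first dark image in the table can have a max of 255 and still be flagged.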

List of Bright Images

Similarly, we treat an image as bright if its mean value is above 220.5:

bright_images = stats_df[stats_df['mean'] > 220.5]
bright_images.head()

Outputs:

fastdup_id img_w img_h unique blur mean min max stdv file_size contrast img_filename error_code is_valid
81 81 512 512 0 538.6821 225.8266 0.0 255.0 32.2799 32229 1.0 apple_pie/1289014.jpg VALID True
436 436 512 512 0 1245.6737 220.9703 0.0 255.0 40.3034 40344 1.0 apple_pie/2601590.jpg VALID True
589 589 512 512 0 1468.0642 227.5742 0.0 255.0 41.6247 50437 1.0 apple_pie/2997124.jpg VALID True
933 933 512 512 0 554.9135 232.6887 0.0 255.0 41.5226 41395 1.0 apple_pie/817552.jpg VALID True
1115 1115 512 512 0 1219.0579 230.7839 0.0 255.0 32.7307 52154 1.0 baby_back_ribs/1395570.jpg VALID True

Get a list of bright images:

list_of_bright_images = bright_images['img_filename'].to_list()
list_of_bright_images

Outputs:

['apple_pie/1289014.jpg',
 'apple_pie/2601590.jpg',
 'apple_pie/2997124.jpg',
 'apple_pie/817552.jpg',
 'baby_back_ribs/1395570.jpg',
 'baby_back_ribs/3841100.jpg',
 'baklava/1542333.jpg',
 'baklava/2229944.jpg',
 'baklava/2663954.jpg',
 'beef_carpaccio/1364391.jpg',
 'beef_carpaccio/1713850.jpg',
 'beef_carpaccio/1990775.jpg',
 'beef_carpaccio/3169022.jpg',
 'beef_carpaccio/872076.jpg',
 'beef_tartare/1282738.jpg',
 'beef_tartare/1720794.jpg',
 'beef_tartare/3603995.jpg',
 'beef_tartare/717367.jpg',
 'beignets/1688450.jpg',
 'beignets/3723694.jpg',
 'beignets/529117.jpg',
 'bread_pudding/1256062.jpg',
 'bread_pudding/3660360.jpg',
 'bread_pudding/3716756.jpg',
 'breakfast_burrito/2840993.jpg',
 'breakfast_burrito/3635548.jpg',
 'bruschetta/1346725.jpg',
 'bruschetta/2275519.jpg',
 'bruschetta/3269901.jpg',
 'bruschetta/770721.jpg',
 'caesar_salad/2039808.jpg',
 'caesar_salad/2761224.jpg',
 'cannoli/1237436.jpg',
 'cannoli/1793781.jpg',
 'cannoli/2799600.jpg',
 'cannoli/2821147.jpg',
 'cannoli/421018.jpg',
 'caprese_salad/2126956.jpg',
 'carrot_cake/1932607.jpg',
 'cheesecake/1325649.jpg',
 'cheesecake/2572821.jpg',
 'cheese_plate/1842697.jpg',
 'cheese_plate/3159443.jpg',
 'chicken_curry/2051444.jpg',
 'chicken_curry/2590404.jpg',
 'chicken_curry/3144187.jpg',
 'chicken_curry/3679727.jpg',
 'chicken_quesadilla/2786630.jpg',
 'chicken_quesadilla/3362240.jpg',
 'chicken_wings/2693829.jpg',
 'chocolate_cake/2584547.jpg',
 'churros/1572415.jpg',
 'churros/2151645.jpg',
 'churros/706007.jpg',
 'clam_chowder/1455612.jpg',
 'clam_chowder/172055.jpg',
 'clam_chowder/2009191.jpg',
 'clam_chowder/2054906.jpg',
 'clam_chowder/673650.jpg',
 'crab_cakes/445057.jpg',
 'crab_cakes/761280.jpg',
 'creme_brulee/1306834.jpg',
 'creme_brulee/1658062.jpg',
 'creme_brulee/2318273.jpg',
 'creme_brulee/2506003.jpg',
 'creme_brulee/3900789.jpg',
 'creme_brulee/392008.jpg',
 'creme_brulee/730057.jpg',
 'creme_brulee/849220.jpg',
 'croque_madame/3079934.jpg',
 'croque_madame/3484037.jpg',
 'deviled_eggs/1276764.jpg',
 'deviled_eggs/2218705.jpg',
 'deviled_eggs/50398.jpg',
 'donuts/2036733.jpg',
 'dumplings/1141514.jpg',
 'dumplings/1483996.jpg',
 'dumplings/2174768.jpg',
 'eggs_benedict/2492820.jpg',
 'falafel/2437617.jpg',
 'falafel/366728.jpg',
 'filet_mignon/103497.jpg',
 'filet_mignon/1841480.jpg',
 'foie_gras/1044237.jpg',
 'foie_gras/139942.jpg',
 'foie_gras/2721736.jpg',
 'foie_gras/302051.jpg',
 'foie_gras/3267247.jpg',
 'foie_gras/35694.jpg',
 'foie_gras/3917886.jpg',
 'foie_gras/583722.jpg',
 'foie_gras/71445.jpg',
 'foie_gras/71461.jpg']

List of Blurry Images

Similarly, we treat an image as blurry if its blur value is below 50:

blurry_images = stats_df[stats_df['blur'] < 50]
blurry_images.head()

Outputs:

index  fastdup_id  img_w  img_h  unique  blur     mean      min   max    stdv     file_size  contrast  img_filename         error_code  is_valid
2123   2123        512    512    0       41.3781  116.2239  2.0   198.0  30.7362  25479      0.9800    baklava/1413667.jpg  VALID       True
2768   2768        384    512    0       45.5609  102.8172  40.0  226.0  37.5435  18740      0.6992    baklava/3681797.jpg  VALID       True
2829   2829        512    512    0       38.8840  214.0211  71.0  255.0  25.0954  23869      0.5644    baklava/3877397.jpg  VALID       True
2918   2918        512    512    0       47.2227  138.1602  0.0   255.0  24.5464  22279      1.0000    baklava/683225.jpg   VALID       True
6924   6924        384    512    0       45.9959  75.9253   0.0   176.0  23.9812  19018      1.0000    beignets/726875.jpg  VALID       True

Get a list of blurry images:

list_of_blurry_images = blurry_images['img_filename'].to_list()
list_of_blurry_images

Outputs:

['baklava/1413667.jpg',
 'baklava/3681797.jpg',
 'baklava/3877397.jpg',
 'baklava/683225.jpg',
 'beignets/726875.jpg',
 'bread_pudding/444890.jpg',
 'breakfast_burrito/462294.jpg',
 'carrot_cake/345630.jpg',
 'chocolate_mousse/1653769.jpg',
 'clam_chowder/1472641.jpg',
 'clam_chowder/2250407.jpg',
 'clam_chowder/908590.jpg',
 'dumplings/2174768.jpg']
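
The dark, bright, and blurry filters all follow the same pattern: compare one column of the statistics DataFrame against a cutoff and collect the matching filenames. Here is a minimal sketch of that pattern as a reusable helper. The function name filter_by_threshold and the small hand-built DataFrame are assumptions for illustration (the rows reuse a few values from the outputs in this section). In the notebook you would pass the DataFrame returned by fd.img_stats() instead.

```python
import pandas as pd

def filter_by_threshold(stats_df, column, threshold, below=True):
    """Return filenames whose `column` value is below (or above) `threshold`.

    Hypothetical helper for illustration, not part of fastdup.
    """
    if below:
        mask = stats_df[column] < threshold
    else:
        mask = stats_df[column] > threshold
    return stats_df.loc[mask, "img_filename"].to_list()

# Toy stand-in for fd.img_stats(), built from a few rows shown in this section.
toy_stats_df = pd.DataFrame({
    "img_filename": [
        "beef_carpaccio/1259270.jpg",
        "breakfast_burrito/462294.jpg",
        "apple_pie/1289014.jpg",
        "baklava/1413667.jpg",
    ],
    "mean": [5.5205, 1.8431, 225.8266, 116.2239],
    "blur": [535.7338, 9.0875, 538.6821, 41.3781],
})

# Same cutoffs as in the tutorial: mean < 13, mean > 220.5, blur < 50.
dark = filter_by_threshold(toy_stats_df, "mean", 13, below=True)
bright = filter_by_threshold(toy_stats_df, "mean", 220.5, below=False)
blurry = filter_by_threshold(toy_stats_df, "blur", 50, below=True)

print(dark)
print(bright)
print(blurry)
```

Keeping the cutoffs as arguments makes it easy to try other values, since suitable thresholds depend on the dataset.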

Summary

In this tutorial, we found 825 unique images with potential issues.

print(f"Broken: {len(list_of_broken_images)}")
print(f"Duplicates: {len(list_of_duplicates)}")
print(f"Outliers: {len(list_of_outliers)}")
print(f"Dark: {len(list_of_dark_images)}")
print(f"Bright: {len(list_of_bright_images)}")
print(f"Blurry: {len(list_of_blurry_images)}")

problem_images = list_of_duplicates + list_of_broken_images + list_of_outliers + list_of_dark_images + list_of_bright_images + list_of_blurry_images

print(f"Total unique images: {len(set(problem_images))}")

Outputs:

Broken: 0
Duplicates: 610
Outliers: 111
Dark: 2
Bright: 93
Blurry: 13
Total unique images: 825
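
The per-category counts add up to 829, while the number of unique images is 825, so a few images are flagged in more than one category. The following sketch shows one way to find them with collections.Counter. The short lists are illustrative stand-ins (an assumption) that reuse filenames from the outputs in this tutorial. In the notebook you would use the full list_of_* variables.

```python
from collections import Counter

# Illustrative stand-ins for the full lists built earlier in the tutorial.
categories = {
    "dark": ["beef_carpaccio/1259270.jpg", "breakfast_burrito/462294.jpg"],
    "bright": ["apple_pie/1289014.jpg", "dumplings/2174768.jpg"],
    "blurry": [
        "breakfast_burrito/462294.jpg",
        "dumplings/2174768.jpg",
        "baklava/1413667.jpg",
    ],
}

# Count in how many categories each filename appears. set() guards against
# a filename that is repeated within a single list.
counts = Counter(name for names in categories.values() for name in set(names))

# Keep filenames flagged by more than one category, with those categories.
overlaps = {
    name: [cat for cat, names in categories.items() if name in names]
    for name, n in counts.items()
    if n > 1
}

for name in sorted(overlaps):
    print(name, overlaps[name])
```

Images that show up in several categories are often good candidates to review first.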

In this tutorial, we've seen how to use fastdup to analyze an image dataset for potential problems such as broken images, duplicates, outliers, and dark, bright, or blurry images.

For each problem, we collected a list of filenames for further action. Depending on your use case, you might delete the images, relabel them, or simply move them elsewhere.
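
As one example of acting on these lists, the sketch below moves flagged images into a separate folder while keeping their class subfolders. The helper name move_images and the quarantine folder are hypothetical, and the demo runs on a temporary directory with a fake file so no real data is touched. It assumes the filenames are relative to the image folder, as the outputs in this tutorial suggest.

```python
import shutil
import tempfile
from pathlib import Path

def move_images(filenames, src_root, dst_root):
    """Move each relative filename from src_root to dst_root, keeping subfolders.

    Hypothetical helper, not part of fastdup. Missing files are skipped.
    Returns the filenames that were actually moved.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    moved = []
    for name in sorted(set(filenames)):  # de-duplicate before moving
        src = src_root / name
        if not src.is_file():
            continue
        dst = dst_root / name
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        moved.append(name)
    return moved

# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"
    (images / "baklava").mkdir(parents=True)
    (images / "baklava" / "1413667.jpg").write_bytes(b"fake image bytes")

    moved = move_images(
        ["baklava/1413667.jpg", "baklava/1413667.jpg", "baklava/missing.jpg"],
        src_root=images,
        dst_root=Path(tmp) / "quarantine",
    )
    still_in_source = (images / "baklava" / "1413667.jpg").exists()
    in_quarantine = (Path(tmp) / "quarantine" / "baklava" / "1413667.jpg").exists()

print(moved)
```

Moving rather than deleting keeps the step reversible, which is useful while you are still tuning thresholds.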

👍

TLDR

In this tutorial we learned how to:

  • Find various dataset issues with fastdup.
  • Collect problematic images for further action.