How should duplicate images be defined, and how should the threshold be selected?
Thank you for your project. I recently worked on a batch of data extracted from video captured by several fixed cameras (for object detection tasks; frames were sampled at random dates and times, with large intervals between them). Visualization shows that this batch has a highly consistent background, with only a few differing targets (people, motor vehicles, non-motor vehicles). In fastdup, the similarity score for these images is usually above 0.9.

I suspect the repeated background will affect the model: it may learn background-specific distribution information, which would reduce generalization. (Sorry, I haven't had time to run a comparison experiment yet; I may be able to test this within a week.) Have you ever done a similar experiment? Perhaps the definition of duplicate images needs to be extended a bit for object detection tasks?

I also ran imagedups as a comparison (https://github.com/chinalu/imagedups, with default parameters). Of the 20,000 images, it identified 515 as duplicates and removed them. The top 515 in fastdup scored 0.95 or more. How should I choose the threshold? In other experiments, such as OCR and license plate detection data, I found that 0.95 is not a good threshold.
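
To make the threshold question concrete, here is a minimal sketch of one possible approach: picking the cutoff so that it flags the same number of images as a reference tool did, then checking how the flagged count changes at other cutoffs. It assumes the similarity scores are available as a plain Python list. The scores are made-up values, and the helper names `threshold_for_top_k` and `count_flagged` are hypothetical; they are not part of fastdup or imagedups.

```python
def threshold_for_top_k(scores, k):
    """Return the lowest score among the k highest scores."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    return sorted(scores, reverse=True)[k - 1]


def count_flagged(scores, threshold):
    """Count how many scores are at or above the threshold."""
    return sum(1 for s in scores if s >= threshold)


# Hypothetical similarity scores, one per image (made-up values).
scores = [0.99, 0.97, 0.96, 0.95, 0.93, 0.91, 0.90, 0.72]

# Cutoff that flags as many images as a reference tool flagged (here, 3).
cutoff = threshold_for_top_k(scores, 3)
print("cutoff:", cutoff, "flagged:", count_flagged(scores, cutoff))

# Sweep a few candidate thresholds to see how sensitive the count is.
for t in (0.90, 0.95, 0.98):
    print(t, count_flagged(scores, t))
```

A cutoff chosen this way only matches the reference tool's count, not necessarily the same images, so the flagged pairs would still need a visual check per dataset.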
Looking forward to your reply, thank you.