I wrote a Python script using OpenCV to compare 7k pictures amongst themselves and flag duplicates, but even on my PC:
6 core AMD Ryzen 5 3600X (12 logical ones),
Win10,
16GB RAM,
1TB SSD,
it takes too long. I added threading to make it go faster, but even so it took about 10 hours to check 49 pictures against the 7k images. So now I am on to the next idea: speed it up by recompiling OpenCV to use my NVIDIA GTX 1660 Ti, because it has more cores. I am using this guide but with more recent versions: CUDA Toolkit 11.8 and OpenCV 4.7.
After using CMake to do the configuration, I got to step 12.6: “Right-click ALL_BUILD and click Build”.
This step fails with the error “DIFFERENT_SIZES_EXTRA: undeclared identifier”.
Maybe you don't even need GPU support for your task, but a less naive algorithm!
Say you have 40 new images and want to find out if those are already in your 7k database:
Use pHash to generate 64-bit (8-byte) signatures from your database images (this will take a while, but you only need to do it once). Put those into a dict with the signature as key and the filepath as value, and serialize it to disk.
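A minimal sketch of that one-time step, assuming OpenCV's img_hash module (needs the opencv-contrib-python package) for the pHash; the "database" folder and "signatures.pkl" names are just placeholders for the example:

```python
import glob
import os
import pickle

import cv2

hasher = cv2.img_hash.PHash_create()
signatures = {}  # 8-byte pHash signature -> filepath

# hash every image in the database folder once
for path in glob.glob(os.path.join("database", "*.jpg")):
    img = cv2.imread(path)
    if img is None:
        continue  # unreadable file, skip it
    sig = hasher.compute(img).tobytes()  # 64-bit hash, usable as a dict key
    signatures[sig] = path

# serialize the lookup table to disk
with open("signatures.pkl", "wb") as f:
    pickle.dump(signatures, f)
```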
For new images, compute a pHash signature again and compare it to those in the dict (a matter of microseconds). If it's already there, discard the image; otherwise save it to the database folder and add its signature/filepath to the dict (and serialize the dict again).
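And a sketch of the lookup step, under the same assumptions (a hypothetical "new_images" folder holds the incoming pictures):

```python
import glob
import os
import pickle
import shutil

import cv2

hasher = cv2.img_hash.PHash_create()

# load the dict built in the previous step
with open("signatures.pkl", "rb") as f:
    signatures = pickle.load(f)

for path in glob.glob(os.path.join("new_images", "*.jpg")):
    img = cv2.imread(path)
    if img is None:
        continue
    sig = hasher.compute(img).tobytes()
    if sig in signatures:  # exact pHash match -> treat as duplicate
        print("duplicate:", path, "->", signatures[sig])
        continue
    dest = os.path.join("database", os.path.basename(path))
    shutil.copy2(path, dest)  # keep the new image
    signatures[sig] = dest

# serialize the updated dict once at the end
with open("signatures.pkl", "wb") as f:
    pickle.dump(signatures, f)
```

Note that an exact key lookup only catches bit-identical hashes; if slightly altered copies (resized, re-encoded) should also count as duplicates, you would instead compare the raw hash arrays with a small Hamming-distance threshold (the img_hash classes have a compare() method for that), which is a linear scan but still only over 7k 8-byte values.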
See, I'm regularly checking my web browser cache for images (fodder for machine learning), extracting like 1k images from there, and checking those against a ~200k (and ever-growing!) image database. Works like a charm!