I have a large dataset of photos and I’m trying to find duplicates. The duplicates I’m dealing with have the exact same resolution (dimensions) and look like exact copies to me - I can’t tell them apart. They are, however, saved with slightly different compression options (different JPEG quality, JPEG vs PNG, etc.).
I’ve used OpenCV’s PHash, which detects those really well. Unfortunately it also gives me a lot of false positives - images that are very similar yet different - for example two photos that were taken in quick succession, so the person in the photo has a slightly different facial expression. I would like to avoid those visually different images being detected as duplicates.
So I thought I would use PHash to quickly find all similar images, then use something more sensitive to tell the “real” duplicates apart. I only chose image pairs that have the exact same resulting PHash (Hamming distance = 0). Then I tried to calculate how different the two photos are using 2 different approaches: root mean square error (RMSE) and OpenCV’s TM_SQDIFF_NORMED. Unfortunately both give me very unreliable and often counter-intuitive results.
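For reference, the "Hamming distance = 0" pre-filter can be sketched as below. This is a minimal sketch, not my actual pipeline code: it assumes each hash has already been copied out of the OpenCV `Mat` into a `byte[]` (8 bytes per image for PHash), and the names `PHashCompare`, `HammingDistance` and `IsCandidateDuplicate` are made up for illustration.

```csharp
using System;
using System.Numerics;

static class PHashCompare
{
    // Counts the bits that differ between two equal-length hashes.
    public static int HammingDistance(ReadOnlySpan<byte> hashA, ReadOnlySpan<byte> hashB)
    {
        if (hashA.Length != hashB.Length)
            throw new ArgumentException("Hashes must have the same length.");

        int distance = 0;
        for (int i = 0; i < hashA.Length; i++)
        {
            // XOR leaves a 1 in every bit position where the two bytes disagree.
            distance += BitOperations.PopCount((uint)(hashA[i] ^ hashB[i]));
        }
        return distance;
    }

    // The pre-filter described above: only identical hashes count as candidates.
    public static bool IsCandidateDuplicate(ReadOnlySpan<byte> hashA, ReadOnlySpan<byte> hashB)
        => HammingDistance(hashA, hashB) == 0;
}
```

Only pairs for which `IsCandidateDuplicate` returns `true` go on to the second, more sensitive comparison.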
Here’s my reproduction of the issue using Lenna. I have 3 photos:
Photo A. Original PNG saved as JPEG:
Photo B. Same as A, but saved with maximum JPEG quality:
Photo C. Same as A, but badly edited to change the facial expression:
All 3 images have the same PHash. If I flick between A and B quickly in my photo viewer, I cannot tell the difference at all. In A and C however the difference is immediately noticeable. Unfortunately, both RMSE and Sqdiff give me the opposite result, by an order of magnitude:
| | A & B | A & C |
|---|---|---|
| RMSE | 0.011931 | 0.003523 |
| Sqdiff | 0.000465 | 0.000040 |
So both give me a much smaller error when comparing the 2 visually different images A & C. My guess is that this is due to the noise that JPEG compression adds in A vs B, which is not perceived by the human eye. I tried to visualise the difference by subtracting the two images and I get the following results:
|A-B| looks completely black unless I really increase the brightness - only then do I see a cloud of dots making up the noise. |A-C|, however, clearly shows the edited part (lips in the centre-bottom).
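To illustrate my guess with numbers, here is a small synthetic sketch. It does not use the Lenna images: the 512x512 mid-gray buffer, the ±3-level "noise" and the 16x16 block brightened by 40 levels are all made-up values, and `RmseDilutionDemo` is a hypothetical helper. It only shows that a global RMSE can rank faint noise spread over every pixel above a strong but small local change.

```csharp
using System;

static class RmseDilutionDemo
{
    // RMSE over two equally sized 8-bit grayscale buffers, scaled to [0, 1].
    public static double Rmse(byte[] a, byte[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Buffers must have the same length.");

        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = (a[i] - b[i]) / 255.0;
            sum += d * d;
        }
        return Math.Sqrt(sum / a.Length);
    }

    // Largest per-pixel difference, in 8-bit levels.
    public static int MaxAbsDiff(byte[] a, byte[] b)
    {
        int max = 0;
        for (int i = 0; i < a.Length; i++)
            max = Math.Max(max, Math.Abs(a[i] - b[i]));
        return max;
    }

    // Builds a synthetic reference image plus two variants (all values are made up).
    public static (byte[] Reference, byte[] Noisy, byte[] Edited) MakeImages()
    {
        const int size = 512;
        var reference = new byte[size * size];
        Array.Fill(reference, (byte)128);

        // "Compression noise": every pixel is off by 3 levels, alternating sign.
        var noisy = new byte[reference.Length];
        for (int i = 0; i < noisy.Length; i++)
            noisy[i] = (byte)(reference[i] + (i % 2 == 0 ? 3 : -3));

        // "Local edit": a single 16x16 block is brightened by 40 levels.
        var edited = (byte[])reference.Clone();
        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++)
                edited[y * size + x] = (byte)(reference[y * size + x] + 40);

        return (reference, noisy, edited);
    }
}
```

With these numbers, `Rmse(Reference, Noisy)` is 3/255 ≈ 0.0118 while `Rmse(Reference, Edited)` is 1.25/255 ≈ 0.0049, even though `MaxAbsDiff` is 3 for the noisy copy and 40 for the edited one. The ordering is the same as in my table above: the edit is averaged away over the whole image.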
My question is - how do I properly quantify the difference that I see and tell these two cases apart reliably?
Here’s the code I used to calculate RMSE and Sqdiff using EmguCV:
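using System;
using System.Linq;
using Emgu.CV;
using Emgu.CV.CvEnum;

// Normalized sum of squared differences. For two images of the same size
// the result matrix is 1x1, so only element (0, 0) is read.
double GetSqdiffNormed(Mat img1, Mat img2)
{
    using var imgRes = new Mat();
    CvInvoke.MatchTemplate(img1, img2, imgRes, TemplateMatchingType.SqdiffNormed);
    return (float)imgRes.GetData().GetValue(0, 0);
}

// Root mean square error over all pixels and channels, with values scaled to [0, 1].
double GetRMSE(Mat img1, Mat img2)
{
    // Convert into temporaries so the caller's Mats are not rescaled in place on every call.
    using var f1 = new Mat();
    using var f2 = new Mat();
    img1.ConvertTo(f1, DepthType.Cv32F, 1.0 / 255);
    img2.ConvertTo(f2, DepthType.Cv32F, 1.0 / 255);
    using var imgRes = new Mat();
    CvInvoke.Subtract(f1, f2, imgRes);
    CvInvoke.Multiply(imgRes, imgRes, imgRes); // element-wise square of the difference
    var sum = CvInvoke.Sum(imgRes).ToArray().Sum(); // add up the per-channel sums
    return Math.Sqrt(sum / (imgRes.Width * imgRes.Height * imgRes.NumberOfChannels));
}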