There are so many factors here that it’s impossible to give an estimate for the accuracy you should expect. For example, how was the intrinsic calibration done, and do you have any information on the quality / score (e.g. the RMS reprojection error) from the calibration? Do you know if the intrinsic calibration was done with your use case in mind, or maybe for a situation with laxer requirements? Based on the information you have provided, I think the results you are getting are pretty reasonable.
To start with, I’d think about what you know about the data, and try to figure out how the known errors should affect your reprojection estimates. You say 2-3 pixels between the drone view and a perfectly georeferenced view. I’m not sure I understand exactly what you mean (there are too many unknowns about how this is measured, the nature of the “perfect” georeferenced image, etc.), but my instinct is that this will govern your best-case reprojection error. Without knowing more, my gut says you’ll never do better than 2-3 pixels, and likely will do worse.
You also mentioned that the elevation data has a few meters of uncertainty. I would want to know how big a few meters is in pixels, at the distance and with the zoom level being used. You also mentioned that the distance from the drone to the target can be kilometers, and the FOV can be 2 deg. Using 2km distance and a 2deg (full-angle) FOV, I get about 70 meters of visible height (2 × 2000 m × tan(1°), i.e. roughly 35 meters from the image center to either edge). So a 1 meter error in elevation could result in a ~1.4% displacement in the image. For an image that is 1024 pixels tall, that’s about 15 pixels per meter, so a few meters of uncertainty could easily mean 30 pixels or more.
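Here is a minimal sketch of that meters-to-pixels conversion, in case you want to plug in your own distance and zoom. It assumes the FOV figure is the full vertical angle and that the error is perpendicular to the line of sight (the worst case for an elevation error); the function name and parameters are just made up for illustration. Under those assumptions, 2 km and 2 deg give about 35 m from the image center to either edge, so roughly 70 m across the full image height, or about 15 pixels per meter in a 1024-pixel-tall image.

```python
import math


def elevation_error_in_pixels(distance_m, fov_deg, image_height_px, error_m):
    """Approximate pixel displacement caused by a metric error at the target.

    Assumptions (not from the original data): fov_deg is the FULL vertical
    field of view, and the error is perpendicular to the line of sight.
    Returns (displacement in pixels, visible height in meters).
    """
    # Height of the scene covered by the image at the given distance.
    visible_height_m = 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)
    # Ground sampling: how many meters one pixel spans at that distance.
    meters_per_pixel = visible_height_m / image_height_px
    return error_m / meters_per_pixel, visible_height_m


pixels, height = elevation_error_in_pixels(2000.0, 2.0, 1024, 1.0)
print(f"visible height: {height:.1f} m, 1 m error: {pixels:.1f} px")
```

If the elevation error is mostly along the line of sight rather than across it, the displacement will be smaller, so treat this as an upper bound rather than a prediction.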
My instinct is that your intrinsics aren’t reliable, and frankly I wouldn’t trust any of the data without some proof / assurances.
A few suggestions:
Plot your data as error vectors instead of two sets of points (that way it’s easier to see how the error changes across the image) and look for structure - maybe this will provide some clues on where the error is coming from. To me it looks like there might be some sort of scaling and/or rotation. I see some pretty big disparities in the lower left of the image, but much less disparity in other areas. I also see some areas where the apparent corresponding points (I’m looking at the “Predicted extrinsics” images, in the sparse area around 800,500) don’t seem to have consistent error. Some are fairly close, others are much further apart, and certain pairs of points seem to suggest a scale difference. I think plotting error vectors could really help here.
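To put a number on the "scaling and/or rotation" impression, something like the sketch below might help. It computes the per-point error vectors and fits a 2D similarity transform (scale, rotation, shift) from the projected points to the observed ones by least squares. The function names and the synthetic data are my own, only numpy is used, and this is a diagnostic rather than a fix.

```python
import numpy as np


def error_vectors(projected, observed):
    """Per-point reprojection error vectors (observed minus projected), shape (N, 2)."""
    return np.asarray(observed, float) - np.asarray(projected, float)


def fit_similarity(projected, observed):
    """Least-squares scale, rotation (degrees) and shift mapping projected -> observed."""
    p = np.asarray(projected, float)
    q = np.asarray(observed, float)
    n = len(p)
    # Model: q = [[a, -b], [b, a]] @ p + [tx, ty], which is linear in (a, b, tx, ty).
    A = np.zeros((2 * n, 4))
    A[0::2] = np.column_stack([p[:, 0], -p[:, 1], np.ones(n), np.zeros(n)])  # x rows
    A[1::2] = np.column_stack([p[:, 1], p[:, 0], np.zeros(n), np.ones(n)])   # y rows
    a, b, tx, ty = np.linalg.lstsq(A, q.reshape(-1), rcond=None)[0]
    return float(np.hypot(a, b)), float(np.degrees(np.arctan2(b, a))), np.array([tx, ty])


# Synthetic stand-in for real data: 1% scale error, 0.5 deg rotation, small shift.
rng = np.random.default_rng(0)
projected = rng.uniform([0, 0], [1280, 720], size=(40, 2))
theta = np.radians(0.5)
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
observed = 1.01 * projected @ rot.T + np.array([2.0, -1.0])

vectors = error_vectors(projected, observed)
scale, rotation_deg, shift = fit_similarity(projected, observed)
print(f"scale={scale:.4f} rotation={rotation_deg:.3f} deg shift={shift}")
```

The `vectors` array is what you would draw as arrows at the projected locations (for example with matplotlib’s `quiver`). A fitted scale noticeably different from 1 would be consistent with a focal length problem, though other causes can produce a similar pattern.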
You said that RANSAC returned better results (about 4 pixels of error), but seemed to keep points from one plane and discard the others (at least that’s what I understood). This seems like a big clue to me. When you project the 3D points to your image using the RANSAC pose, do the points on the plane it kept project well, while the others project with significant error? And if you use only points from the other planar structure, does solvePnP give similarly good results for that plane (and again error for the first plane)? If so, that might mean something is wrong with your intrinsics.
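A rough way to run that per-plane check is sketched below: project the 3D points with a given pose and intrinsics, then report the RMS error separately for each group of points. The projection here is an ideal pinhole model with no lens distortion (an assumption on my part, to keep the example numpy-only; with real data you would use whatever projection includes your distortion coefficients), and the group labels and numbers are made up.

```python
import numpy as np


def project_pinhole(points_world, K, R, t):
    """Project Nx3 world points with an ideal pinhole model (no lens distortion)."""
    cam = np.asarray(points_world, float) @ np.asarray(R, float).T + np.asarray(t, float)
    uvw = cam @ np.asarray(K, float).T
    return uvw[:, :2] / uvw[:, 2:3]


def rms_error_by_group(projected, observed, labels):
    """RMS reprojection error in pixels for each label (e.g. one label per plane)."""
    err = np.linalg.norm(np.asarray(observed, float) - np.asarray(projected, float), axis=1)
    labels = np.asarray(labels)
    return {
        str(name): float(np.sqrt(np.mean(err[labels == name] ** 2)))
        for name in np.unique(labels)
    }


# Made-up example: two groups of points, one of which is off by (3, 4) pixels.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 10.0])
points = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 2], [1, 2, 3], [1, 0, 4]], float)
labels = ["plane_a", "plane_a", "plane_a", "plane_b", "plane_b", "plane_b"]

projected = project_pinhole(points, K, R, t)
observed = projected.copy()
observed[3:] += np.array([3.0, 4.0])
per_group = rms_error_by_group(projected, observed, labels)
print(per_group)
```

If one plane fits to a pixel or two and the other is consistently off, and the roles swap when you estimate the pose from the other plane, that points at the camera model rather than at the correspondences.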
I would suggest hand-selecting a subset of image points and corresponding world points from your data, being sure to include multiple points from each of the two planes, as well as points from other parts of the data. Focus on points that are clearly imaged and whose image location and corresponding world point you have high confidence in. With this data you can run cv::calibrateCamera and should be able to get intrinsics from a single image (because the 3D data is not all coplanar, and I’m assuming the z distance of the points in the camera frame varies across your points). I’d try to get at least 30 points total, spread across as much of the image as you can. Project your 3D points based on the calibrated intrinsics and extrinsics returned by cv::calibrateCamera. Do the projected points match better than what you are currently getting? Compare the provided intrinsics with the ones you just calibrated. How different are they (focal length in particular, but also image center and distortion coefficients)?
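One practical note: as far as I know, cv::calibrateCamera needs an initial camera matrix (the CALIB_USE_INTRINSIC_GUESS flag) when the object points are not planar, so you would seed it with the provided intrinsics. As an independent cross-check that doesn’t rely on that seed, the sketch below estimates a projection matrix from the same 3D-2D correspondences with a plain DLT and pulls the intrinsics out with an RQ decomposition. To be clear, this is not what cv::calibrateCamera does: there is no distortion model and no nonlinear refinement, all function names are mine, and the data is synthetic.

```python
import numpy as np


def _normalizing_transform(pts):
    """Similarity transform that centers pts and scales their mean distance to sqrt(dim)."""
    dim = pts.shape[1]
    centroid = pts.mean(axis=0)
    scale = np.sqrt(dim) / np.linalg.norm(pts - centroid, axis=1).mean()
    T = np.eye(dim + 1)
    T[:dim, :dim] *= scale
    T[:dim, dim] = -scale * centroid
    return T


def estimate_projection_dlt(world_pts, image_pts):
    """3x4 projection matrix (up to scale) from >= 6 non-coplanar correspondences."""
    X = np.asarray(world_pts, float)
    x = np.asarray(image_pts, float)
    T3, T2 = _normalizing_transform(X), _normalizing_transform(x)
    Xn = np.column_stack([X, np.ones(len(X))]) @ T3.T
    xn = np.column_stack([x, np.ones(len(x))]) @ T2.T
    rows = []
    for Xh, (u, v, _) in zip(Xn, xn):
        # Each correspondence gives two linear equations in the 12 entries of P.
        rows.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
        rows.append(np.concatenate([np.zeros(4), Xh, -v * Xh]))
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    Pn = Vt[-1].reshape(3, 4)            # null vector = normalized projection matrix
    return np.linalg.inv(T2) @ Pn @ T3   # undo the normalization


def intrinsics_from_projection(P):
    """Split P[:, :3] = K @ R via RQ decomposition; return K scaled so K[2, 2] = 1."""
    flip = np.flipud(np.eye(3))
    _, R_ = np.linalg.qr((flip @ P[:, :3]).T)
    K = flip @ R_.T @ flip                 # upper-triangular factor
    K = K @ np.diag(np.sign(np.diag(K)))   # force a positive diagonal
    return K / K[2, 2]


# Synthetic, noise-free check: 30 non-coplanar points seen by a known camera.
rng = np.random.default_rng(0)
K_true = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
a = np.radians(10.0)
R_true = np.array([[np.cos(a), 0.0, np.sin(a)], [0.0, 1.0, 0.0], [-np.sin(a), 0.0, np.cos(a)]])
t_true = np.array([0.5, -0.2, 20.0])
world = rng.uniform(-5.0, 5.0, size=(30, 3))
cam = world @ R_true.T + t_true
image = (cam @ K_true.T)[:, :2] / cam[:, 2:3]

K_est = intrinsics_from_projection(estimate_projection_dlt(world, image))
print(np.round(K_est, 3))
```

With real, noisy points and a very narrow field of view I would expect the focal length from a single image to be poorly constrained, so treat a moderate disagreement with the provided intrinsics as a hint to investigate rather than as proof.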
Maybe your problem is elsewhere, but my bet is that your intrinsics aren’t accurate enough for what you are trying to accomplish.