High reprojection error (~10 px) using solvePnP


I am currently using solvePnP to retrieve the position of a drone camera. However, the solver fails to find a good solution: I end up with a camera pose that leads to high reprojection error (~10 pixels):

The scene consists of a landscape with two roughly planar parts and a ramp between them:

I also tried solvePnPRansac; it eliminated most points from one of the planes, even though the final solution had a lower reprojection error (~4 px). But since it only keeps points from a single plane, the solution it finds is not that good and fails to capture the geometry of the scene.

Is there anything I can do to improve the solution?


  1. How did you calibrate the intrinsics, and what is the reprojection score from that process? Does the drone have a fixed lens, or does it have some sort of auto-focus (or even zoom) capability?

  2. How did you get the ground truth points, and how accurate are they?

  3. Was the drone in motion when it captured the image? Does the camera have a global shutter or rolling shutter?

For best results:

  1. Physically lock down the optics and then calibrate the intrinsics. Don’t proceed until you get good reprojection error from the intrinsics cal.
  2. Ensure you have accurate ground truth values. Consider restricting the points you use to ones with known good ground truth values and which can be found in the image reliably/repeatably.
  3. Minimize any imaging effects (like rolling shutter / camera movement).

Post input images, images showing matchpoint detection, etc.

Unfortunately I can’t post the matched image at the moment.

The intrinsics are obtained from the drone system. It can zoom (variable focal length over time), but I have this data too. I cannot calibrate the intrinsics myself (I only have access to old metadata). The camera is very zoomed in, as the drone can be kilometers away from the target point, resulting in a small FOV (~2°).

The XY (absolute) location and pixel location are obtained by doing keypoint matching between the drone view and another view (accuracy of 2-3 pixels) which is already (perfectly) georeferenced, and the Z altitude is obtained from a terrain elevation model (since every point is located on the ground). The elevation model itself can have a few meters of uncertainty.
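Roughly, the correspondences are assembled like this (a toy sketch; `dem_lookup` stands in for the real elevation-model query, and the match values are made up):

```python
import numpy as np

# Toy sketch of assembling 2D-3D correspondences: pixel locations in the drone
# image matched to XY ground coordinates in the georeferenced view, with Z
# sampled from a terrain elevation model.  dem_lookup is a hypothetical
# stand-in for the real DEM query.
def dem_lookup(x, y):
    return 0.01 * x + 0.02 * y  # toy sloping terrain

matches = [  # (u_px, v_px, x_world, y_world) pairs from keypoint matching
    (120.0, 340.0, 500.0, 1200.0),
    (640.0, 310.0, 560.0, 1215.0),
    (980.0, 450.0, 620.0, 1190.0),
]
image_pts = np.array([(u, v) for u, v, _, _ in matches])
object_pts = np.array([(x, y, dem_lookup(x, y)) for _, _, x, y in matches])
```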

Yes, the drone was moving, and the camera was also moving with respect to the drone itself (a few degrees of azimuth/elevation). I have no information about the shutter.

My question is rather: what accuracy should I expect with such a framework? It seems that the conditions are pretty bad for getting good accuracy.

I also have a situation where I have the ground truth camera position, but when I reproject the 3D points I collected using that position, the result seems translated, so the issue probably comes from the 3D data…

There are so many factors here that it’s impossible to give an estimate for the accuracy you should expect. For example, how was the intrinsic calibration done, and do you have any information on the quality/score during calibration? Do you know if the intrinsic calibration was done with your use case in mind, or maybe for a situation with more lax requirements? Based on the information you have provided, I think the results you are getting are pretty reasonable.

To start with, I’d think about what you know about the data, and try to figure out how the known errors should affect your reprojection estimates. You say 2-3 pixels between the drone view and a perfectly georeferenced view - I’m not sure I understand what you mean, but my instinct is that this will govern your best case reprojection error. Does it mean your best case is 2-3 pixels of reprojection error, or something else? I’m not sure - too many unknowns about the way this is measured, the nature of the “perfect” georeferenced other image, etc. Without knowing more, my gut says you’ll never do better than 2-3 pixels, and likely will do worse.

You also mentioned that the elevation data has a few meters of uncertainty. I would want to know how big a few meters is in pixels, at the distance and with the zoom level being used. You also mentioned that the distance from the drone to the target can be kilometers, and the FOV can be 2 deg. Using 2 km distance and 2 deg full FOV, I get about 70 meters of visible height (2 × 2000 m × tan(1°) ≈ 70 m). So a 1 meter error in elevation would result in a ~1.4% displacement in the image. For an image that is 1024 pixels tall, that’s about 15 pixels.
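Working the numbers explicitly (a quick sketch; the 2 km range, 2 deg full FOV, and 1024 px image height are the assumptions from the discussion above):

```python
import math

# Back-of-envelope: how big is a 1 m elevation error in pixels?
# Assumed numbers: 2 km range, 2 deg full vertical FOV, 1024 px image height.
range_m = 2000.0
fov_deg = 2.0
image_height_px = 1024

# Visible vertical extent at that range.
visible_m = 2.0 * range_m * math.tan(math.radians(fov_deg / 2.0))  # ~70 m

px_per_m = image_height_px / visible_m
err_px = 1.0 * px_per_m  # image displacement for a 1 m elevation error
print(f"visible extent: {visible_m:.1f} m, 1 m error -> {err_px:.1f} px")
```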

My instinct is that your intrinsics aren’t reliable, and frankly I wouldn’t trust any of the data without some proof / assurances.

A few suggestions:

Plot your data as error vectors (instead of two sets of points - that way it’s easier to see how the error changes across the image) and look for structure - maybe this will provide some clues on where the error is coming from. To me it looks like there might be some sort of scaling and/or rotation. I see some pretty big disparities in the lower left of the image, but much less disparity in other areas. I also see some areas where the apparent corresponding points (I’m looking at the “Predicted extrinsics” images, in the sparse area around 800,500) don’t seem to be consistent in the error they have. Some are fairly close, others are much further apart, and certain pairs of points seem to suggest a scale difference. I think plotting error vectors could really help here.
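A sketch of that error-vector plot using matplotlib’s quiver (the `observed` and `projected` arrays are placeholders for your matched pixels and solvePnP reprojections; here a toy constant error field stands in):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, writes straight to a file
import matplotlib.pyplot as plt

# Draw per-point reprojection error as vectors anchored at the observed pixel
# locations, so scaling/rotation structure in the error field becomes visible.
# Placeholder data: 40 random pixel locations with a constant (5, -3) px error.
observed = np.random.default_rng(2).uniform(0, 1024, (40, 2))
projected = observed + np.array([5.0, -3.0])

err = projected - observed
plt.quiver(observed[:, 0], observed[:, 1], err[:, 0], err[:, 1],
           angles="xy", scale_units="xy", scale=0.2)  # vectors drawn 5x size
plt.gca().invert_yaxis()  # match image coordinates (y down)
plt.title("reprojection error vectors")
plt.savefig("error_vectors.png")
```

If the vectors all swirl around a point you likely have a rotation error; if they point radially outward or inward, suspect a scale (focal length) error.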

You said that RANSAC returned better results (about 4 pixels of error), but seemed to keep points from one plane and discard the others (that’s what I understood.) This seems like a big clue to me. When you project the 3D points to your image, do the planar points project well, and the others project with significant error? And if you use points from the other planar structure, does solvePnP give similarly good results for that plane (and again error for the other plane)? If so, that might mean something is wrong with your intrinsics.

I would suggest hand-selecting a subset of image points and corresponding world points from your data, being sure to include multiple points from each of the two planes, as well as points from the other parts of the data. Focus on points that are clearly imaged and whose locations and corresponding world points you have high confidence in. With this data you can run cv::calibrateCamera and should be able to get intrinsics from a single image (because the 3D data is not all coplanar, and I’m assuming the z distance of the points in the camera frame varies across your points). I’d try to get at least 30 points total, spread across as much of the image as you can. Project your 3D points based on the calibrated intrinsics and extrinsics returned by cv::calibrateCamera. Do the projected points match better than what you are currently getting? Compare the provided intrinsics with the ones you just calibrated. How different are they (focal length in particular, but also image center and distortion coefficients)?

Maybe your problem is elsewhere, but my bet is that your intrinsics aren’t accurate enough for what you are trying to accomplish.
