solvePnP sensitive to input

Hi, I am having a problem with solvePnP producing drastically different results given essentially the same input. The function I am testing is as follows:

import cv2
import numpy as np

def pose_estimation(image_points, object_points, camera_matrix, dist_coeffs=np.zeros((4, 1))):
    # Debug prints so the exact inputs can be inspected (and copied)
    print(image_points)
    print(object_points)
    print(camera_matrix)
    print(dist_coeffs)
    res = cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs)

    image_points_ref = np.array([[413.00486363, 217.74341906],
       [413.15650142, 220.25746512],
       [407.77718261, 220.02347556],
       [407.6519472 , 220.75548504],
       [396.93019839, 220.29270531],
       [396.14034602, 219.52147414],
       [390.79016357, 219.29020324],
       [391.74411038, 220.06815504],
       [418.80503339, 216.85660952],
       [418.69262068, 216.17264099],
       [374.06361682, 214.25686018],
       [375.19012835, 214.98173447],
       [422.9617814 , 216.32101234],
       [371.79359441, 214.12534426]])
    
    # Confirm the hard-coded copy matches the live input (to print precision)
    assert np.linalg.norm(image_points - image_points_ref) < 1e-7
    print(f"Diff between image_points {np.linalg.norm(image_points - image_points_ref)}")
    assert type(image_points) == type(image_points_ref)
    assert image_points.dtype == image_points_ref.dtype

    res_ref = cv2.solvePnP(object_points, image_points_ref, camera_matrix, dist_coeffs)
    print(res)
    print(res_ref)
    return res

The only difference between the two calls to solvePnP is that the second uses image_points_ref, which I copied from the terminal after printing image_points, instead of image_points itself. However, the two calls generate drastically different results. The first call (which is close to the ground truth) returns

(True, array([[2.55860807],
       [2.46640239],
       [2.34886843]]), array([[ 225.93103637],
       [ -31.06022884],
       [3151.231523  ]]))

and the second one returns

(True, array([[ 1.20061489],
       [ 1.27656775],
       [-1.21488959]]), array([[ 39.15125973],
       [ 21.73380486],
       [242.05403987]]))

I am wondering if it is normal to get such different results from such a small difference in the input. For reference, I am on Ubuntu 20.04 with Python 3.8, and I see this with both opencv-python 4.7.0.72 and 4.8.0.76. The number of image/object points is 14. The full output from this function is shown below.

[[413.00486363 217.74341906]
 [413.15650142 220.25746512]
 [407.77718261 220.02347556]
 [407.6519472  220.75548504]
 [396.93019839 220.29270531]
 [396.14034602 219.52147414]
 [390.79016357 219.29020324]
 [391.74411038 220.06815504]
 [418.80503339 216.85660952]
 [418.69262068 216.17264099]
 [374.06361682 214.25686018]
 [375.19012835 214.98173447]
 [422.9617814  216.32101234]
 [371.79359441 214.12534426]]
[[-1221.370483    16.052534     0.      ]
 [-1279.224854    16.947235     5.      ]
 [-1279.349731     8.911615     5.      ]
 [-1221.505737     8.033512     5.      ]
 [-1221.43811     -8.496282     5.      ]
 [-1279.302002    -8.493725     5.      ]
 [-1279.315796   -16.504263     5.      ]
 [-1221.462402   -16.498976     5.      ]
 [-1520.81        26.1257       5.      ]
 [-1559.122925    26.101082     5.      ]
 [-1559.157471   -30.753305     5.      ]
 [-1520.886353   -30.761044     5.      ]
 [-1561.039063    31.5222       5.      ]
 [-1561.039795   -33.577713     5.      ]]
[[1.25322156e+03 0.00000000e+00 3.20500000e+02]
 [0.00000000e+00 1.25322156e+03 2.40500000e+02]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00]]
[[0.]
 [0.]
 [0.]
 [0.]]
Diff between image_points 1.7516194186395812e-08
(True, array([[2.55860807],
       [2.46640239],
       [2.34886843]]), array([[ 225.93103637],
       [ -31.06022884],
       [3151.231523  ]]))
(True, array([[ 1.20061489],
       [ 1.27656775],
       [-1.21488959]]), array([[ 39.15125973],
       [ 21.73380486],
       [242.05403987]]))


I plotted your data (top: 3D data with Z dropped, bottom: 2D data). I picked x/y axis scales that were close to equal, so the relative spacing/distances are approximately correct.
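
In case it helps, the plots came from something along these lines (a minimal sketch, assuming matplotlib and the object_points / image_points arrays from your post):

import matplotlib.pyplot as plt

def plot_correspondences(object_points, image_points):
    # Top: 3D object points with Z dropped; bottom: detected 2D image points
    fig, (ax_obj, ax_img) = plt.subplots(2, 1, figsize=(8, 6))
    ax_obj.scatter(object_points[:, 0], object_points[:, 1])
    ax_obj.set_title("3D object points (Z ignored)")
    ax_obj.set_aspect("equal")  # near-equal scales so relative distances read correctly
    ax_img.scatter(image_points[:, 0], image_points[:, 1])
    ax_img.set_title("2D image points")
    ax_img.set_aspect("equal")
    plt.tight_layout()
    plt.show()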

The first thing I noticed is that the Z value is 5 for all of your points except the first one (which is 0). Is this correct?

The second thing I noticed is that your camera matrix looks made up (not calibrated), based on the 320.5 / 240.5 values. (BTW, I think the correct values would be 319.5 / 239.5, but I'm not certain.) Since you are (presumably) using a guesstimated camera matrix, I'm suspicious of your focal length as well. This might not be enough to wreck things entirely, but maybe. Also, you have no lens distortion; again, you might get away with this if your lens truly doesn't have much distortion, but if it is a high-distortion lens you'll want to calibrate it.
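
For reference, a guessed pinhole matrix is usually assembled from the image size and an assumed focal length, something like this (a sketch; the 640x480 size is inferred from your cx/cy values, and the focal length is whatever value you assumed, not a calibrated one):

import numpy as np

w, h = 640, 480   # image size implied by your principal point values
f = 1253.22       # assumed focal length in pixels, not a calibrated value

camera_matrix = np.array([
    [f,   0.0, (w - 1) / 2.0],  # cx = 319.5 for a 640-pixel-wide image
    [0.0, f,   (h - 1) / 2.0],  # cy = 239.5 for a 480-pixel-tall image
    [0.0, 0.0, 1.0],
])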

I also annotated the images with two points, A and B. Supposedly point A in the first image (3D) projects to the point labeled A in the second image, and likewise for B. This doesn't make sense to me, and the overall structures of the two data sets don't really look like they are related by a perspective projection.

I suspect your data is out of order, so there is no good mapping from the 3D points to the 2D points using any camera matrix. For example, I see a cluster of 4 point pairs in the 3D data; I would expect to see a similar cluster in the 2D data as well, but I don't.

Assuming this is right (that your point correspondences aren’t actually correspondences), fix that and try again - I bet it will be a lot better.

There's also a bigger lesson: when you get results that don't make sense, add code that annotates your images in a way that lets you visually verify what is going on. For example, using the input image you started from, do the following for each detected image point (a sketch follows the list):

  1. Draw a circle on the image where you detected a feature. Draw it large enough that it doesn't obscure the feature, but not huge; maybe a 5-10 pixel radius.
  2. Project the corresponding 3D point using the recovered camera pose and camera intrinsics. Draw a circle (different color) on the same image at the projected location. Draw a line connecting the two points; this line preserves the relationship between them (one is the detected point, the other is the projection of the corresponding 3D point using the calibration results).
  3. Optionally, keep track of the distance between the detected and projected image points so you can calculate a reprojection error. (I thought this was computed for you automatically by the solvePnP call, but maybe not.) This number is super helpful because you can quickly look at it and know whether you have good results or batshit crazy results. (You appear to have the latter, which usually points to a usage or data problem.)
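
Put together, the three steps might look like this (a rough sketch, assuming img is the image the features were detected in, plus the arrays from your post; solvePnP itself only returns rvec/tvec, so the reprojection error is computed by hand here):

import cv2
import numpy as np

ok, rvec, tvec = cv2.solvePnP(object_points, image_points,
                              camera_matrix, dist_coeffs)

# Project the 3D points back into the image with the recovered pose
projected, _ = cv2.projectPoints(object_points, rvec, tvec,
                                 camera_matrix, dist_coeffs)
projected = projected.reshape(-1, 2)

for detected, reproj in zip(image_points, projected):
    p_det = tuple(map(int, np.round(detected)))
    p_rep = tuple(map(int, np.round(reproj)))
    cv2.circle(img, p_det, 7, (0, 255, 0), 1)    # detected feature (green)
    cv2.circle(img, p_rep, 7, (0, 0, 255), 1)    # reprojected 3D point (red)
    cv2.line(img, p_det, p_rep, (255, 0, 0), 1)  # keep the pairing visible (blue)

# Mean reprojection error: a few pixels or less suggests a sane solution
err = np.linalg.norm(image_points - projected, axis=1).mean()
print(f"mean reprojection error: {err:.2f} px")
cv2.imwrite("annotated.png", img)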

This step has been so important for me in tracking down data errors just like what you are seeing. It’s also very helpful in showing other types of error - for example, once you have fixed your data issue, you can get a visual idea of how much error you are introducing by not calibrating the distortion.

Good luck.

Oh, in case it isn't clear: the reason you are getting such different results from input data that is only slightly different is that your data is so broken that there isn't any good solution, only a bunch of terrible ones. The solver ends up choosing the best of the worst. Slight changes in your input can move you a long way through the parameter space and land you on a slightly different version of the "best of the worst" solution. If that makes sense.

Hi Steve_in_Denver, thank you so much for your patient and insightful answer. The Z value you pointed out is indeed wrong, and it was breaking the dataset. After fixing it I am able to get the correct result. I will be more careful about this next time.