Can't parse tvecs from pose estimation using SolvePnP

Hello, I’m working on a project to determine the pose of a 3D array of markers, shown below.
Screen Shot 2022-09-29 at 12.47.17 AM

Camera matrix:
[[627.44372826   0.         354.49962245]
 [  0.         625.94738517 235.52154125]
 [  0.           0.           1.        ]]

Distortion coefficients:
[[-0.00227971  0.0592749   0.00193853 -0.00431801 -0.05491578]]

Object points: np.array([(-25, -22, 0), (-25, 22, 0), (0, 0, -15), (25, -22, 0), (25, 22, 0) ], dtype=np.float32)

I’m able to successfully detect the image points using SimpleBlobDetector and pass them into solvePnPRansac in the same order as the object points.

When I inspect the output translation vector, however, the x, y, and z values do not seem to be correct. Specifically, the Z value is off by about 5×. Edit: after fine-tuning my HSV mask, the output tvec Z distance is about 0.5× the actual distance.

I calibrated my camera using a chessboard, with the object points scaled by the measured square size (21.8 mm):

objp = np.zeros((6*9, 3), np.float32)
objp[:, :2] = np.mgrid[0:9, 0:6].T.reshape(-1, 2) * 21.8

Can someone give me an assessment of my approach and/or validate what I’m seeing? I’m just using a ruler to estimate the distances, and I’m not getting valid numbers.

Thank you.

this is an image of the marker-set mask at roughly 300 mm from the center of the camera (measured with a ruler). I added labels to verify that SimpleBlobDetector was finding the markers correctly.

The output tvec from solvePnPRansac is: [[265.90126038] [ 84.38081316] [185.9831118 ]]

So based on this, it’s looking like the actual Z distance is about 1.6× greater than the Z output of solvePnPRansac.

MRE please.

I’ll need either the image points, or an image and method to get image points from it.

then I need to know what your 25, 22, 15 values are. millimeters, checkerboard squares? you aren’t saying.

thanks for the link. Will comply.

#  order of points: top_left, bottom_left, middle, top_right, bottom_right, in pixels
img_points = np.array([
 [ 77.372375,  65.8471  ],
 [ 72.73076 , 227.59946 ],
 [151.1125  , 142.49625 ],
 [262.51657 ,  60.58985 ],
 [261.18253 , 230.10602 ]])
#  order of points: top_left, bottom_left, middle, top_right, bottom_right in mm
obj_points_in_millimeters = np.array([
 [-25., -22.,   0.],
 [-25.,  22.,   0.],
 [  0.,   0., -15.],
 [ 25., -22.,   0.],
 [ 25.,  22.,   0.]])
camera_matrix = np.array([
[627.44372826,   0.        , 354.49962245],
[  0.        , 625.94738517, 235.52154125],
[  0.        ,   0.        ,   1.        ]])

distortion_matrix = np.array([-0.00227971, 0.0592749, 0.00193853, -0.00431801, -0.05491578])

img of markers at 300 mm.
Screen Shot 2022-09-29 at 12.38.22 PM

I’m expecting the tvec z value (or sqrt(x**2 + y**2 + z**2)) to be close to 300, but instead getting:

tvec = np.array([
 [265.90126038],
 [ 84.38081316],
 [185.9831118 ]])
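As a quick sanity check on that expectation, the straight-line distance implied by the tvec reported earlier in the thread can be computed directly (plain NumPy arithmetic):

```python
import numpy as np

# tvec from solvePnPRansac, as posted above (mm)
tvec = np.array([[265.90126038], [84.38081316], [185.9831118]])

z_distance = float(tvec[2])               # distance along the optical axis
euclidean = float(np.linalg.norm(tvec))   # straight-line camera-to-target distance

print(z_distance)  # ~186 mm, vs. ~300 mm measured with the ruler
print(euclidean)   # ~335 mm
```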

  • this target is a rectangle
  • at 300 mm distance
  • 50 by 44 mm
  • appears roughly as 187 by 166 pixels
  • and your camera is supposed to have a focal length of ~626

let’s check that focal length.

50/300 * f = 187  =>  f = 1122
44/300 * f = 166  =>  f = 1132

and since we’re looking at the target slightly sideways, I’d take the height to be more reliable, so you’re closer to 1130 than 1120.

so, give or take some error, your focal length/calibration is junk. or something else doesn’t add up.
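The check above is just the pinhole relation pixels = (size / distance) · f, rearranged to solve for f. As a sketch, using the numbers from the bullet points:

```python
# implied focal length from apparent size: f = pixels * distance / size
target_w_mm, target_h_mm = 50.0, 44.0      # physical target size
distance_mm = 300.0                         # ruler-measured distance
apparent_w_px, apparent_h_px = 187.0, 166.0  # apparent size in the image

f_from_width = apparent_w_px * distance_mm / target_w_mm    # 1122.0
f_from_height = apparent_h_px * distance_mm / target_h_mm   # ~1131.8

print(f_from_width, f_from_height)  # both far from the calibrated ~626
```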

why is your camera picture 454 by 304? why was it earlier, annotated, 768 by 446?

how did you calibrate? I’m asking because what you said earlier isn’t telling me much, and showing nothing. a sample from the calibration pictures would be useful. you can add one per post, or use imgur or gdrive or something like that. sometimes, a single picture already reveals all the relevant issues. sometimes all need to be assessed.

for calibration I used about 40-50 chessboard images and the code below:

import glob

import cv2 as cv
import numpy as np

# list of calibration image paths (path pattern assumed)
images = glob.glob('calibration/*.jpg')

# termination criteria for sub-pixel corner refinement
criteria = (cv.TERM_CRITERIA_EPS + cv.TERM_CRITERIA_MAX_ITER, 30, 0.001)
# prepare object points, like (0,0,0), (1,0,0), (2,0,0) ...., (8,5,0)
objp = np.zeros((6*9, 3), np.float32)
# scale for size of chessboard squares, measured to be 21.8 mm
objp[:, :2] = np.mgrid[0:9, 0:6].T.reshape(-1, 2) * 21.8
# Arrays to store object points and image points from all the images.
objpoints = []  # 3d points in real world space
imgpoints = []  # 2d points in image plane

for fname in images:
    img = cv.imread(fname)
    gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
    # Find the chessboard corners
    ret, corners = cv.findChessboardCorners(gray, (9, 6), None)

    # If found, add object points and image points (after refining them)
    if ret:
        corners2 = cv.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
        objpoints.append(objp)
        imgpoints.append(corners2)
        # Draw and display the corners
        cv.drawChessboardCorners(img, (9, 6), corners2, ret)
        cv.imshow('img', img)
        cv.waitKey(100)

ret, cmtx, dist, rvecs, tvecs = cv.calibrateCamera(objpoints, imgpoints, gray.shape[::-1], None, None)

why is your camera picture 454 by 304? why was it earlier, annotated, 768 by 446?
Sorry about that, I cropped my face out. I can resend the image at its true size without my mug in it.

Here is a link to gdrive with a sample of the images i used for calibration: OpenCV - Google Drive

I’ll trust the calibration code. at a glance, it looks legit, and I don’t feel like running that. there’s no way to get that wrong and not have it blow up spectacularly, and that didn’t happen. by the way, for monocular intrinsic calibration, the object points of the checkerboard need no scaling.

I’m seeing a good variety of poses for the board… without oblique poses, the focal length may have a huge margin of error, but you’ve got those, so it’s not that type of issue. and the squares, in proportion to other identifiable objects, look close enough to 21.8 mm.

pictures are 720 by 480. guessing the field of view from the pictures, that would make me expect the 626 focal length to be true. 1130 would fit a 1280 by 720 camera better, or a camera with narrower FoV (zoomed in).
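For what it’s worth, the two candidate focal lengths imply very different horizontal fields of view at a 720 px image width (plain trigonometry, assuming the principal point sits near the image center):

```python
import math

def hfov_deg(f_px, width_px=720):
    # horizontal field of view of a pinhole camera: 2 * atan(w / (2f))
    return math.degrees(2 * math.atan(width_px / (2 * f_px)))

print(round(hfov_deg(626.0), 1))   # ~60 deg: plausible for a typical webcam
print(round(hfov_deg(1130.0), 1))  # ~35 deg: a noticeably narrower, zoomed-in view
```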

can I assume that you don’t zoom or scale at any point, neither optically nor digitally? did you take the calibration pictures in a different way from the pictures for pose estimation? webcams have “modes”. still mode and video mode may have differently sized crop regions of the sensor data.

the size of the (3D-printed?) target looks plausible too, in relation to your fingers. maybe someone stretched the ruler?

oh, maybe it’s the way the photos were taken. I used the Mac’s built-in Photo Booth for the still calibration images, and whatever OpenCV uses when I call cap = cv.VideoCapture(0) for the pose-estimation frames.

Thanks for looking at this. I’ll keep trying to see if maybe the issue is the differences in modes for the cameras.

@crackwitz it was totally an issue with the calibration images. I redid the calibration images using an opencv script (don’t know why I didn’t do this in the first place) and it’s now dead-on accurate. Thank you so much.