Do I have a misunderstanding of solvePnP?

I’ve been working on using solvePnP to get the location of my detected object all day and now into the night :slight_smile: and I’m starting to feel like maybe I have a big misunderstanding. I have a picture of my object that I took with my calibrated camera, and a tflite model that detects all of my 2D points. I plot those points on the image as little white circles. I also have a set of 3D points with units in meters, which I got from my accurate 3D CAD model of the object. Their 0,0,0 reference point is just the center of the CAD model, not the place they would hypothetically be in the picture. The 2D and 3D points correspond 1:1 to each other. I have a calibrated camera matrix with all values in pixels (I know the pixel size but don’t convert it to meters).

Now I run these three through solvePNP and I get an rvec and a tvec. If I use projectPoints and plot those with red circles then the red and white circles almost exactly match up over the top of my object in the picture.

I’ve spent most of my day trying to figure out how to convert these into a position and rotation for my model in Unity, but to no avail. Now I just tried to simplify things and looked at the Tvec: its z component is always negative, which would put the object behind my camera, and I do not understand why.

So I wondered:
Am I misunderstanding how to use solvePnP? I thought that if I took a picture and I know the 3D points of the model, then solvePnP would tell me how to rotate and translate the object so that it would be in the right position in the picture. Do I have this wrong? Do my 3D object points actually have to be what the points would be in that picture? I thought the point of solvePnP was to help me figure that out, but maybe I’m lost here.

Should I change the units on my camera matrix to be in meters?

This is my code:

import cv2
import numpy as np

camera_matrix = np.array(
    [[777.173004, 0, 638.1196159],
     [0, 778.566312, 336.5782565],
     [0, 0, 1]], dtype="double")

#dist_coeffs = np.array([-0.050800878,-0.107825331,0.000861304,-0.00044001])
dist_coeffs = np.array([])

(success, rotation_vector, translation_vector) = cv2.solvePnP(model_points, image_points, camera_matrix,
                                                              dist_coeffs, flags=cv2.SOLVEPNP_ITERATIVE)

#(success, rotation_vector, translation_vector, inliers) = cv2.solvePnPRansac(model_points, image_points, camera_matrix,
#                                                              dist_coeffs, flags=cv2.SOLVEPNP_ITERATIVE)
print("Rotation Vector:\n" + str(rotation_vector))

print("Translation Vector:\n" + str(translation_vector))

(repro_points, jacobian) = cv2.projectPoints(model_points, rotation_vector,
                                                 translation_vector, camera_matrix, dist_coeffs)

original_img = cv2.imread(r"{}".format("IMG_20210301_193529.jpg"))
#original_img = cv2.resize(original_img, (192 , 192))
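# draw the reprojected points
for p in repro_points:
    cv2.circle(original_img, (int(p[0][0]), int(p[0][1])), 3, (255, 255, 255), -1)
    print(str(p[0][0]) + "-" + str(p[0][1]))

# draw the detected 2D input points
for p in image_points:
    cv2.circle(original_img, (int(p[0]), int(p[1])), 3, (0, 0, 255), -1)
cv2.circle(original_img, (100, 100), 100, (255, 255, 255), -1)
cv2.imshow("og", original_img)
cv2.waitKey(0)  # keep the window open until a key is pressed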

and my Tvec comes out as:
Translation Vector:
[[ 0.0896325 ]


I’d like to see some data.

it would also be nice to have some approximate numbers of what you expect to come out of this. say an expected tvec and data that should have roughly produced that result.

Hi, I can say you understood solvePnP correctly. Your Tvec is odd, so I agree with @crackwitz: more info is needed to pinpoint the error, if any. It would also be useful to know how many correct correspondences you are getting, i.e. how many 2D points are correctly associated with your 3D model.

That said, I’ll add one comment about units: meters are for the 3D space, hence for your model. Pixels are for the projective space, so for your projected points. Don’t even try to use meters in your 2D reference system. Pixels are just fine.

Just to clarify: use whatever unit you want for your model, just don’t think of your 2D space in meters, but in a virtual unit like pixels.
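To illustrate the point about units, here is a minimal sketch of the pinhole projection using the camera matrix values posted above. The 3D point is made up for illustration, and `project` is a hypothetical helper, not an OpenCV function. It shows that the pixel result does not depend on whether the 3D side is expressed in meters or millimeters, as long as the model and the translation use the same unit.

```python
# Camera intrinsics from the thread (focal lengths and principal point, in pixels).
FX, FY = 777.173004, 778.566312
CX, CY = 638.1196159, 336.5782565


def project(point_cam):
    """Pinhole projection of a 3D point given in camera coordinates (no distortion)."""
    x, y, z = point_cam
    return (FX * x / z + CX, FY * y / z + CY)


# Hypothetical point roughly 0.4 m in front of the camera, in meters...
p_m = (0.05, -0.02, 0.40)
# ...and the same point expressed in millimeters.
p_mm = tuple(1000.0 * c for c in p_m)

uv_m = project(p_m)
uv_mm = project(p_mm)

# Both print the same pixel position (up to floating-point rounding).
print(uv_m)
print(uv_mm)
```

The unit of the model points only decides the unit of the Tvec that comes back; the camera matrix stays in pixels either way.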

Sure, here’s a bunch of data. I can make more if needed.

#2D input array
image_points = np.array([
(642.5, 534.08333333),
(642.5, 491.75),
(601.25, 438.83333333),
(628.75, 396.5),
(711.25, 396.5),
(752.5, 460.),
(780., 417.66666667),
(725., 407.08333333),
(560., 481.16666667),
(628.75, 417.66666667),
(628.75, 396.5),
(766.25, 544.66666667),
(780., 449.41666667),
(793.75, 438.83333333)
], dtype="double")

I measured the distance between a pair of these points and the real object
and the distances match.

#3D model output array
model_points = np.array([
(0.01029588, -0.07928756, 0.2148521),
(0.01039588, -0.04288756, 0.2152521),
(-0.05040412, -0.03658756, 0.2656521),
(-0.05760413, -0.03818756, 0.3450521),
(0.003995877, -0.05208756, 0.3811522),
(0.07219588, -0.03498756, 0.2654521),
(0.07729588, -0.03768756, 0.3433521),
(0.01839588, -0.05188756, 0.3827521),
(-0.06990413, -0.07928756, 0.2287521),
(-0.06490412, -0.06758755, 0.3334521),
(-0.06620412, -0.04538756, 0.3401521),
(0.08859587, -0.07928756, 0.2331521),
(0.08429588, -0.06738756, 0.3314521),
(0.08589587, -0.04548756, 0.3374521)
])

#the camera matrix in pixels
camera_matrix = np.array(
[[777.173004, 0, 638.1196159],
[0, 778.566312, 336.5782565],
[0, 0, 1]], dtype="double")

PNP Start

#The output Rotation Vector
[[ 0.11160132]

#The output Translation Vector:
[[ 0.0896325 ]

#The Rotation Matrix:
[[-0.91550667 0.00429232 -0.4022799 ]
[-0.15739624 0.91641548 0.36797975]
[ 0.37023501 0.40020526 -0.83830888]]

#The Euler angles I get
[[ 154.48037976]
[ -21.73011189]

I measured the distance from the camera to the object when I took the picture and it is about 40 cm. That is pretty close to the -0.36 m shown in the Tvec; just the sign is swapped.

Here is a picture of my detected object, the white circles are the 2D input image points and the red are the reprojected points from cv2.projectPoints. I masked out the actual product in photoshop.

As a test in Unity I placed my object and camera facing each other at 0,0,0. Then I dropped in the jpg picture above. Next I set my camera vFoV to 50.2, the vFoV from the camera datasheet. To get the two to line up I had to use the following settings for the model’s rotation and position:

So I expected to get a translation and rotation out of this that matched what I needed to do by hand in Unity but for sure I am making some mistakes :slight_smile:

“new users can only put one embedded item per post :)”

Thank you. I think that all of my 3D and 2D points are correctly associated with one another. Of course I could be wrong. But when I reproject the points back into the image everything is pretty close. I added a picture of the output in this thread.

the Tvec’s length is 0.407 so that agrees with your measurement.

apart from the -.37, there’s also a -.15 and .09.

  • can you make those vanish if you move your camera such that the object’s origin is in the center of the picture?

  • now move the camera such that the object’s origin sits in the UPPER half of the picture (and horizontally in the center). what’s the Tvec now?

  • move the object’s origin to sit in the right half of the picture, vertically centered. Tvec?
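A minimal sketch of what these experiments should show, assuming the usual OpenCV camera frame (x right, y down, z forward) and the camera matrix from the thread. The Tvec is the position of the model origin in camera coordinates, so its x and y decide where the origin lands in the image. `project_origin` is a hypothetical helper, and the test translations are made up; the last vector uses the rounded Tvec components quoted above.

```python
import math

# Camera intrinsics from the thread, in pixels.
FX, FY = 777.173004, 778.566312
CX, CY = 638.1196159, 336.5782565


def project_origin(tvec):
    """Pixel where the model origin lands; tvec is that origin in camera coordinates."""
    tx, ty, tz = tvec
    return (FX * tx / tz + CX, FY * ty / tz + CY)


# Origin straight ahead of the camera: lands on the principal point.
centered = project_origin((0.0, 0.0, 0.4))
# Negative y (with positive z): origin appears in the upper half of the image.
upper_half = project_origin((0.0, -0.1, 0.4))
# Positive x (with positive z): origin appears in the right half of the image.
right_half = project_origin((0.1, 0.0, 0.4))

# Rounded Tvec components quoted in the thread; the length is the camera-to-origin distance.
tvec_rounded = (0.09, -0.15, -0.37)
distance = math.sqrt(sum(c * c for c in tvec_rounded))

print(centered, upper_half, right_half)
print(distance)
```

With the rounded components the distance comes out near 0.41 m, in line with the measured 40 cm.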

it’s possible that the algorithm assumes a strange camera coordinate system… x right, y up, z coming at you from out of the screen. that would surprise me though because everything assumes x right y down z far.

try the same experiments with the object’s rotation relative to the camera. Rvecs are awkward constructions but try and see if you can get it to line up with a principal axis and have an expected value like pi/4 for 45 degrees and such.
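For reference, an rvec is an axis-angle vector: its direction is the rotation axis and its length is the angle in radians. The sketch below converts one to a rotation matrix with Rodrigues' formula in plain NumPy (in OpenCV itself, `cv2.Rodrigues` does this conversion). `rvec_to_matrix` is a hypothetical helper name used only for illustration.

```python
import numpy as np


def rvec_to_matrix(rvec):
    """Convert an axis-angle vector to a 3x3 rotation matrix (Rodrigues' formula)."""
    rvec = np.asarray(rvec, dtype=float).reshape(3)
    theta = np.linalg.norm(rvec)          # rotation angle in radians
    if theta < 1e-12:
        return np.eye(3)                  # no rotation
    k = rvec / theta                      # unit rotation axis
    # Cross-product (skew-symmetric) matrix of the axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


# An rvec of (0, 0, pi/4) is a 45 degree rotation about the z axis.
R_z45 = rvec_to_matrix([0.0, 0.0, np.pi / 4])
print(R_z45)
```

So a 45 degree rotation about a principal axis should show up as a single rvec component of about 0.785, with the other two near zero.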

perhaps there’s ambiguity in the algorithm that isn’t compensated for. someone said it might be giving you values that sit behind the camera’s image plane. if it was just the Tvec I’d say take its negative and that ought to be good, but I have no intuition for how to handle the rotation part (no matter if it’s an rvec or rotation matrix).
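As a rough check of the "behind the image plane" idea, the sketch below applies the rotation matrix posted earlier and the rounded Tvec components (the full Tvec is truncated in the post, so the rounding is an assumption) to the first model point. It then shows that negating the whole camera-space point gives the same pixel, and that the rotation that would go with a negated Tvec, `-R`, has determinant -1 and so is not a rotation at all.

```python
import numpy as np

# Camera matrix and rotation matrix as posted in the thread.
K = np.array([[777.173004, 0.0, 638.1196159],
              [0.0, 778.566312, 336.5782565],
              [0.0, 0.0, 1.0]])
R = np.array([[-0.91550667, 0.00429232, -0.4022799],
              [-0.15739624, 0.91641548, 0.36797975],
              [0.37023501, 0.40020526, -0.83830888]])
# Rounded Tvec components quoted in the replies (assumption: the exact values differ slightly).
t = np.array([0.09, -0.15, -0.37])
# First 3D model point from the thread; its detected pixel is (642.5, 534.08).
X = np.array([0.01029588, -0.07928756, 0.2148521])


def to_pixel(p_cam):
    """Pinhole projection of a camera-space point (no distortion)."""
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]


p_cam = R @ X + t                 # model point in camera coordinates
pixel = to_pixel(p_cam)           # lands close to the detected point
pixel_flipped = to_pixel(-p_cam)  # same pixel for the point mirrored through the camera center

print(p_cam[2])                   # negative depth: behind the camera
print(pixel, pixel_flipped)
print(np.linalg.det(R), np.linalg.det(-R))
```

This suggests why negating only the Tvec is not enough: the pose that puts the points in front of the camera is `(-R, -t)`, and `-R` is a reflection, which a rotation vector cannot express.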

One wild thought: if the model has a symmetric point cloud, maybe the correspondences come from a mirrored version of the model, and PnP finds it behind the camera, thus the negative z.

Alright I found one mistake that I made in the 3D object points. I used Unity to generate them for my model by placing the model at 0,0,0 and using the world coordinate system. When I instead use the local coordinate system things work better.
Now I get model points like this:
model_points = np.array([
(0, -0.0401, -0.0045),
(0.0001, -0.0037, -0.0041),
(-0.0607, 0.0026, 0.0463),
(-0.0679, 0.001, 0.1257),
(-0.0063, -0.0129, 0.1618),
(0.0619, 0.0042, 0.0461),
(0.067, 0.0015, 0.124),
(0.0081, -0.0127, 0.1634),
(-0.0802, -0.0401, 0.0094),
(-0.0752, -0.0284, 0.1141),
(-0.0765, -0.0062, 0.1208),
(0.0783, -0.0401, 0.0138),
(0.074, -0.0282, 0.1121),
(0.0756, -0.0063, 0.1181)
])

And my new Tvec is:
Translation Vector:
Rotation Matrix:
[[-0.91550665 0.00429236 -0.40227994]
[-0.15739626 0.91641544 0.36797983]
[ 0.37023505 0.40020534 -0.83830883]]

My signs are still wrong and my rotations are not quite right.
When I input this into unity I get this which has the object behind the camera.

But if I negate everything I get this which is much better.

The rotation is still wrong but if I add 180 to Y and negate x and z, it looks pretty good.

Not sure why though :wink:

Also it looks like rotating the model_points 180 around the x then z axis also fixes the problem.

rmatrix = np.array([
    (-1, 0, 0),
    (0, 1, 0),
    (0,  0, 1)
])

model_points = np.matmul(model_points, rmatrix)

rmatrix = np.array([
     (-1, 0, 0),
     (0, -1, 0),
     (0,  0, 1)
])

model_points = np.matmul(model_points, rmatrix)
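It may help to look at what these two matrices actually do. The quick check below (variable names are mine) computes their determinants and their combined effect. A 180 degree rotation about x would be diag(1, -1, -1); the first matrix here is diag(-1, 1, 1), which has determinant -1 and is a mirror across the YZ plane rather than a rotation.

```python
import numpy as np

first = np.diag([-1.0, 1.0, 1.0])     # first matrix applied to model_points
second = np.diag([-1.0, -1.0, 1.0])   # second matrix applied to model_points
rot_x_180 = np.diag([1.0, -1.0, -1.0])  # what a true 180 degree rotation about x looks like

# model_points @ first @ second is the same as model_points @ combined.
combined = first @ second

det_first = np.linalg.det(first)        # -1: a reflection (negates x)
det_second = np.linalg.det(second)      # +1: a 180 degree rotation about z
det_combined = np.linalg.det(combined)  # -1: net effect is a reflection

print(combined)
print(det_first, det_second, det_combined)
```

The net effect is diag(1, -1, 1), i.e. the model's y axis is mirrored. One possible explanation, which the thread does not confirm, is handedness: Unity's coordinate system is left-handed while OpenCV's camera frame is right-handed, so points exported from Unity may be mirrored relative to what solvePnP expects.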


That’s great!

I strongly believe you have a mirrored object, and that’s why it is projected behind the camera. If you can, I suggest trying with a mirrored model, not a rotated one.

Sorry to necromance this thread but I found something else interesting about this problem. I ported my Python code to C++ for use on an Android device. To test it I use the same input image and the same tflite model, which gives me the same set of input points in the same order.

The rotation and translation vectors of the Python code only match the C++ code when I do not apply the two extra matrix rotations. So for some reason, with the same model and inputs, the C++ and Python results are coming out different for me.

@eric_engineer : I am a bit late for this thread, but I figured out what is happening.
Actually there is an instability in the solvePnP algorithm: it has no way of restricting the search space in camera coordinates, i.e. of requiring the tvec z-component to always be greater than 0.

Even when doing AFLW head pose estimation with MPII reference face landmarks, on Python > 3.7 and OpenCV > 1.1.0, one encounters these “unphysical” solutions. When we solve PnP, the 2D coordinates are computed using the perspective projection, which is X/Z and Y/Z of the point’s 3D coordinates (after the rotation and translation are applied). Now, if $X \rightarrow -X$, $Y \rightarrow -Y$, $Z \rightarrow -Z$, the 2D coordinates remain the same, but we end up with a negative tvec z-component.
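The sign flip described here ties in with the mirrored-model suggestion earlier in the thread. The sketch below is a synthetic illustration, not the author's data pipeline: the "true" pose is made up, and only the camera matrix and a few model points are taken from the thread. It shows that a mirrored copy of the model can be fitted exactly by a proper rotation, but only with every point at negative depth.

```python
import numpy as np

K = np.array([[777.173004, 0.0, 638.1196159],
              [0.0, 778.566312, 336.5782565],
              [0.0, 0.0, 1.0]])


def project(points, R, t):
    """Return pixel coordinates and camera-space depths for row-vector points."""
    cam = points @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:], cam[:, 2]


# A few of the local-coordinate model points from the thread.
model = np.array([(0.0, -0.0401, -0.0045),
                  (0.0001, -0.0037, -0.0041),
                  (-0.0607, 0.0026, 0.0463),
                  (-0.0679, 0.001, 0.1257)])

# Hypothetical true pose: small rotation about y, object 0.4 m in front of the camera.
angle = 0.3
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([0.05, -0.02, 0.4])

# Mirror the model (negate y), as a handedness mix-up might do.
S = np.diag([1.0, -1.0, 1.0])
mirrored = model @ S

# Pose that fits the mirrored model: still a proper rotation, but behind the camera.
R_flip = -R_true @ S
t_flip = -t_true

pixels_true, depth_true = project(model, R_true, t_true)
pixels_flip, depth_flip = project(mirrored, R_flip, t_flip)

print(np.allclose(pixels_true, pixels_flip))  # identical reprojection
print(np.linalg.det(R_flip))                  # proper rotation
print(depth_true, depth_flip)                 # positive vs. negative depths
```

Under these assumptions, a solver restricted to proper rotations gets a perfect reprojection from the mirrored model only through the negative-z pose, which would match the symptoms reported in this thread.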

A hack for such a case, as you discovered earlier, is to rotate the world coordinates before solving. Alternatively, one can translate the world coordinates along the negative z-axis in the world reference frame. After running PnP again and adjusting for the earlier transformation, you will get the intended values.
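The "adjusting for the earlier transformation" step is just composition of rigid transforms. A minimal sketch, assuming the model points were pre-transformed as X' = A X + b before solvePnP and that solvePnP returned the pose (R', t') for X'. The helper name and all numeric values below are hypothetical.

```python
import numpy as np


def undo_model_transform(R_prime, t_prime, A, b):
    """Pose for the original points X, given the pose for X' = A @ X + b.

    Follows from R' (A X + b) + t' = (R' A) X + (R' b + t').
    """
    R = R_prime @ A
    t = R_prime @ b + t_prime
    return R, t


# Hypothetical pre-transform: 180 degree rotation about x, then a shift along -z.
A = np.diag([1.0, -1.0, -1.0])
b = np.array([0.0, 0.0, -0.5])

# Hypothetical pose found for the transformed points.
angle = 0.3
R_prime = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(angle), 0.0, np.cos(angle)]])
t_prime = np.array([0.05, -0.02, 0.9])

R, t = undo_model_transform(R_prime, t_prime, A, b)

# Both routes put an original model point at the same place in camera coordinates.
X = np.array([0.0619, 0.0042, 0.0461])
via_transformed = R_prime @ (A @ X + b) + t_prime
via_original = R @ X + t
print(via_transformed, via_original)
```

Note that `R` is only a valid rotation if `A` is one; if `A` is a mirror, the composed matrix has determinant -1 and cannot be converted back to an rvec.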