Problems with solvePnPRansac

Hi everyone,
I’m trying to implement a SLAM package using OpenCV, but I have some problems with solvePnPRansac (and solvePnP too). I’m using OpenCV 3.4.14: solvePnPRansac never finds any inliers, and solvePnP always returns wrong results.
I have an Intel RealSense depth camera (D435 – Intel® RealSense™ Depth and Tracking Cameras) and I’m trying to reconstruct the pose of the camera. The workflow is:

  • capture frame at time t-1
  • capture frame at time t
  • extract keypoints (ORB, limited to 500), compute ORB descriptors and match them (with a robust matcher as in this example: opencv/RobustMatcher.cpp at 3.4 · opencv/opencv · GitHub)
  • reconstruct the 3D coordinates of the keypoints at time t-1 from the depth image
  • use the 3D keypoints at time t-1 and the matching 2D keypoints at time t to solve the PnP problem.

The code of my function is

 bool poseEstimation3d2d(const Frame::Ptr &current_frame, const Frame::Ptr &last_frame, std::vector<cv::DMatch> &good_matches,
                         double depth_scale, int min_depth, int max_depth, const cv::Mat_<double> &K, const cv::Mat_<double> &D,
                         cv::Mat &t, cv::Mat &R)
 {
     std::vector<cv::Point3d> pts_3d;
     std::vector<cv::Point2d> pts_2d;

     // For each match, check whether the keypoint at time t-1 has a valid depth
     // (otherwise discard it), then reconstruct the 3D position of the keypoint at time t-1.
     for (cv::DMatch &m : good_matches)
     {
         cv::Point2f kp = last_frame->keypoints[m.trainIdx].pt;
         unsigned short d = last_frame->depth.ptr<unsigned short>(int(kp.y))[int(kp.x)];
         if (d <= min_depth || d >= max_depth)
             continue; // invalid depth: discard this match
         double depth = d * depth_scale;

         cv::Point3d p3;
         image_utils::point2dTo3d(kp, depth, K, p3);
         pts_3d.push_back(p3);
         pts_2d.push_back(current_frame->keypoints[m.queryIdx].pt);
     }

     // Solve PnP only if at least MIN_PNP_POINTS (6) correspondences were found.
     if (pts_3d.size() >= MIN_PNP_POINTS)
     {
         cv::Mat r, inliers;
         cv::solvePnPRansac(pts_3d, pts_2d, K, D, r, t, false, 100, 8.0F, 0.99, inliers);
         cv::Rodrigues(r, R);
         return true;
     }
     return false;
 }

where depth_scale transforms the depth measure from mm to meters (0.001), min_depth and max_depth are the depth limits of the camera (110 mm–10000 mm, i.e. 11 cm–10 m), K is the camera intrinsic matrix, D is the vector of distortion coefficients, MIN_PNP_POINTS = 6, and the function point2dTo3d is

 void point2dTo3d(const cv::Point2f &point2, double depth, const cv::Mat_<double> &K, cv::Point3d &point3)
 {
     point3.x = (point2.x - K(0, 2)) * depth / K(0, 0); // (x - cx) * depth / fx
     point3.y = (point2.y - K(1, 2)) * depth / K(1, 1); // (y - cy) * depth / fy
     point3.z = depth;
 }

The method solvePnPRansac always returns false, with 0 inliers. I tried extracting more keypoints (2000), more RANSAC iterations (500), and lower confidence (0.95 and 0.8). In addition, the returned t and R are totally wrong.
These are the values printed for the first three steps with the camera fixed:

t=[-2.521665 -6.966232 71734.524631]
R = [0.463265 0.040670 -0.885286]
    [-0.883486 0.099604 -0.457747]
    [0.069562 0.994196 0.082074]
t=[ -8.086195 -22.291806 433778.025662]
R=[ -0.910810 0.067253 0.407311]
  [ -0.387687 0.199697 -0.899900]
  [ -0.141859 -0.977547 -0.155813]

t=[ -23.359800 -64.562378 1107890.188865]
R=[ 0.480159 0.164383 -0.861641]
  [ -0.819301 0.434948 -0.373585]
  [ 0.313358 0.885324 0.343523] 

I tried solvePnP instead of solvePnPRansac but the values are similar and they change at every iteration, with the camera always fixed in the same spot. I also tried using float instead of double in K, D and the cv::Points, but the situation is the same.

I know that t and R are not the movement made by the camera in real-world coordinates, as explained here: Dissecting the Camera Matrix, Part 2: The Extrinsic Matrix ← but with a fixed camera I cannot understand why they change at every iteration and why solvePnPRansac never finds inliers.
Sorry for the long question, but I wanted to explain the problem clearly. Can anyone help me, please?

I’d suggest debugging the yield of feature descriptors (how many do you get) and the matching step (how many matches).

also PLEASE use a decent descriptor such as AKAZE or SIFT.

make sure you use Lowe’s ratio test to filter the matches. you’ll find that in one of the tutorials in the docs: OpenCV: Feature Matching

thank you for the reply. I changed ORB to AKAZE, and I’m implementing Lowe’s ratio test for feature matching (with constant = 0.7).
I’ve also changed a thing in feature detection: I now use a mask where depth is not valid, so keypoints cannot be detected in those regions. Before, I handled this by discarding keypoints for which the depth was invalid (it’s the commented part in the code below).
The descriptors extracted are typically around 450-480 and the good matches around 430. Using the function drawMatches() I visualize the matches and they seem pretty accurate (not visibly different from ORB).

But when I use solvePnPRansac there are always 0 inliers and the values of t and R are similar to those previously posted. In particular the t vector is totally wrong (an example: t = [-5.703914, -14.053492, 25970638.551577]).

This is my function

bool poseEstimation3d2d(const Frame::Ptr &current_frame, const Frame::Ptr &last_frame,
                        std::vector<cv::DMatch> &good_matches,
                        const Camera::Ptr &camera_info, cv::Mat &t, cv::Mat &R)
{
    std::vector<cv::Point3d> pts_3d;
    std::vector<cv::Point2d> pts_2d;

    for (cv::DMatch &m : good_matches)
    {
        cv::Point2f kp = last_frame->keypoints[m.trainIdx].pt;
        unsigned short d = last_frame->depth.ptr<unsigned short>(int(kp.y))[int(kp.x)];
        if (d <= camera_info->min_depth_mm || d >= camera_info->max_depth_mm)
            continue; // discard keypoints with invalid depth
        double depth = d * camera_info->depth_scale;

        cv::Point3d pt3;
        coordinate_conversions::point2dTo3d(kp, depth, camera_info->K, pt3);
        pts_3d.push_back(pt3);
        pts_2d.push_back(current_frame->keypoints[m.queryIdx].pt);
    }

    if (pts_3d.size() >= MIN_PNP_POINTS)
    {
        cv::Mat r, inliers;
        bool solved = cv::solvePnPRansac(pts_3d, pts_2d, camera_info->K, camera_info->D, r, t, false,
                                         100, 8.0F, 0.99, inliers, cv::SOLVEPNP_EPNP);
        cv::Rodrigues(r, R);
        return solved;
    }
    return false;
}

where I changed a few things, but it’s almost the same as the previously posted code. I thought it could be a problem of type conversion (float/double and vice versa), so I now make every necessary cast explicit, but the result doesn’t change.
The camera is fixed on a tripod on the table, so I don’t understand why, frame by frame, t and R change and inliers are never found.

Edit: I also tried with Lowe’s constant = 0.3 and 0.1; the good matches decrease but remain around 300, and the result of solvePnPRansac doesn’t change.

why do you use solvePnPRansac? you don’t have actual “model” points. to me this looks like structure from motion and that function isn’t suitable for that task.

you also have to realize that feature descriptors DO NOT maintain any order. nor do these methods guarantee that they’ll find the same points again in the next picture. solvePnPRansac does require both model and image points to be in order (to form matching pairs).

Ok, thank you. I didn’t know the second part about ordering. For the first part: I have a depth camera, so thanks to the depth information I thought I could reconstruct the 3D coordinates of the points. Am I wrong?

if you have a depth camera, you ALREADY have the 3d coordinates of (almost) every pixel. consult the documentation for your camera as to what the values in the depth image mean and how to calculate cartesian coordinates from these.

if and only if you have a disparity map (that’s not a depth map!), look at reprojectImageTo3D, which is a method that’s used on disparity maps to produce coordinates for every pixel.

if you wanted to do SLAM, you’d fuse point clouds from multiple frames over time. one algorithm for that is called “iterative closest point”. it aligns two clouds.

SLAM is a lot more complex than I care to explain here. you’ll need to read up on it. I can tell you that solvePnPRansac is absolutely the wrong thing, it doesn’t apply at all here. it will not solve your problem.


Your matches are perfect. You can tell from the image there are no outliers, or if there are, they are very few.

One thing that puzzles me: are all those matches reaching solvePnPRansac? Log pts_3d.size() before solvePnPRansac and check that there are 200+ points feeding into PnP, not 6. The strange poses you get can be caused by very few matches in PnP.

Keep in mind depth doesn’t have the same resolution as the pixel x,y coordinates. I would try increasing the reprojection error allowed in solvePnPRansac: first with a big error (like 100), then lowering it and counting the inliers until they reach zero. Few matches in PnP lead to big pose error.

About rotation and translation: Rotations are always in radians, and in PnP translations are in real world scale: the same scale your 3D points use.

If your camera is fixed, it doesn’t translate, it can only rotate, so you should get a near-zero translation. If it doesn’t rotate either, the rotation should also be near zero.

Thank you for the quick replies. I have already read a lot about SLAM, and I know the ICP algorithm, but I wanted to try 3d-2d motion estimation with PnP as explained here.
I also read this book about SLAM in C++ (slambook-en/slambook-en.pdf at master · gaoxiang12/slambook-en · GitHub) and in chapter 7 it uses similar code with solvePnP, as you can see here: slambook2/pose_estimation_3d2d.cpp at master · gaoxiang12/slambook2 · GitHub. So I think the method is not wrong, but I’m going wrong in some part of the process.
I will also try 3d-3d pose estimation with ICP, hoping to achieve a better result. Thank you!

ah, I see now how they use this method and why it’d involve RANSAC. my bad!

Hi, thank you for the reply.
As you can see from the screenshot below, the depth and the gray images have the same dimensions (I print rows and cols).

Edit: uploading the file probably decreased the image quality; the depth image is quite accurate, even though here it seems totally black.

Visibly, they seem not perfectly aligned, even though I’m using the Intel RealSense library with the parameter align_depth set to True, which aligns the rgb and the depth images as explained here: GitHub - IntelRealSense/realsense-ros: Intel(R) RealSense(TM) ROS Wrapper for D400 series, SR300 Camera and T265 Tracking Module

align_depth : If set to true, will publish additional topics with all the images aligned to the depth image.
The topics are of the form: /camera/aligned_depth_to_color/image_raw etc

The depth values are single-channel unsigned shorts representing the distance in mm, so I transform them to meters as shown in the code above, multiplying by camera_info->depth_scale, which is 0.001.
In the console I log the descriptors extracted, the descriptors matched and the vector of Point3d (p3f in the console) given to solvePnPRansac; they are around 250. I then tried increasing the reprojection error to 100, and also 200, setting a lower confidence (0.8) and more iterations (500), but there are always 0 inliers. I’ve been trying to understand for several weeks why solvePnP returns bad results and I’m losing hope.

I’m into solving this problem. Here is my proposal: do some checks, and if they fail, share your data with me and I’ll test it on my computer. This means some work for you, so do it or not, as you wish.

First check this
Calibration is often an issue, especially distortion coefficients.

As a quick check, please generate an undistorted image and look whether it seems normal or clearly wrong.

There is one problem in your code. You are supplying solvePnPRansac with K and D (distortion coefficients); the function applies them only to the 2D points, not to the 3D points. So the training keypoints are never undistorted. The pt3 points must be undistorted before adding depth info: coordinate_conversions::point2dTo3d is not applying undistortion.

Share data
If the problem continues, persist the 3D-2D matches into a YAML or JSON file using OpenCV: XML/YAML Persistence. Upload it to Drive, Dropbox, GitHub or whatever and post the download link.

Use some self-describing names, like

  • pts_3d
  • pts_2d (matching pts_3d)
  • K
  • D

Thank you for the reply, but the distortion coefficients are all 0 in the D400 series (rs2_intrinsics coeffs[] all 0 by default · Issue #1430 · IntelRealSense/librealsense · GitHub); I also printed them and they are all 0. I think the camera is well calibrated, because using it with other tools, such as RTAB-Map, it works well.

I think I’ve solved the problem. To check whether the problem was only in the 3d-2d pose estimation, I tried findEssentialMat and recoverPose to do 2d-2d pose estimation. In this case, the translation and rotation vectors were quite accurate with the same matches, so I understood that I was doing something wrong in solvePnPRansac. In fact, passing cv::Mat() instead of the D vector of all zeros, the r and t vectors, with the camera fixed, are around 0. I don’t know why; passing an array of all zeros probably makes this function go wrong, even if according to the documentation it should be the same thing.
From the documentation of solvePnPRansac here: OpenCV: Camera Calibration and 3D Reconstruction

distCoeffs: Input vector of distortion coefficients (k1,k2,p1,p2[,k3[,k4,k5,k6[,s1,s2,s3,s4[,τx,τy]]]]) of 4, 5, 8, 12 or 14 elements. If the vector is NULL/empty, the zero distortion coefficients are assumed.

Edit: yes, now it finds inliers!! Thanks for the help!



Great job!

I believe when you give an empty Mat, the function knows undistortion is not needed. But if you give a zero array, the function needlessly calls undistortion with that array. And then, by black magic, it doesn’t work.