SolvePnP keeps crashing for unknown reasons

Hello everyone,

I am writing this post since I am encountering an issue with solvePnP and solvePnPRansac which I am having a hard time debugging.

So I have been following this tutorial (Head Pose Estimation using OpenCV and Dlib | LearnOpenCV) to learn about the implementation of solvePnP in OpenCV. The code works on my side using the given image points, 3D points, camera matrix and distortion coefficients. It also works when I use my own camera matrix and distortion coefficients. So this works:

    // 2D image points
    std::vector<cv::Point2d> image_points;
    image_points.push_back(cv::Point2d(359, 391));     // Nose tip
    image_points.push_back(cv::Point2d(399, 561));     // Chin
    image_points.push_back(cv::Point2d(337, 297));     // Left eye left corner
    image_points.push_back(cv::Point2d(513, 301));      // Right eye right corner
    image_points.push_back(cv::Point2d(345, 465));    // Left Mouth corner
    image_points.push_back(cv::Point2d(453, 469));    // Right mouth corner
    cv::Mat pts = cv::Mat(image_points.size(), 2, CV_64F, image_points.data());

    // 3D model points.
    std::vector<cv::Point3d> model_points;
    model_points.push_back(cv::Point3d(0.0f, 0.0f, 0.0f));                   // Nose tip
    model_points.push_back(cv::Point3d(0.0f, -330.0f, -65.0f));          // Chin
    model_points.push_back(cv::Point3d(-225.0f, 170.0f, -135.0f));       // Left eye left corner
    model_points.push_back(cv::Point3d(225.0f, 170.0f, -135.0f));        // Right eye right corner
    model_points.push_back(cv::Point3d(-150.0f, -150.0f, -125.0f));      // Left Mouth corner
    model_points.push_back(cv::Point3d(150.0f, -150.0f, -125.0f));       // Right mouth corner
    cv::Mat mpts = cv::Mat(model_points.size(), 3, CV_64F, model_points.data());

    std::cout << "Pts\n" << pts << "\n";
    std::cout << "3dPts\n" << mpts << "\n";
    std::cout << "Camera Matrix:\n" << cam_mat << "\n" ;
    std::cout << "Distortion:\n" << dist_coeff << "\n" ;
    // Output rotation and translation
    cv::Mat rvec(3, 1, CV_64FC1, cv::Scalar::all(0));
    cv::Mat tvec(3, 1, CV_64FC1, cv::Scalar::all(0));

    // Solvepnp
    //cv::solvePnPRansac(mpts, pts, cam_mat, dist_coeff, rvec, tvec);
    cv::solvePnP(mpts, pts, cam_mat, dist_coeff, rvec, tvec);

    // reprojecting
    std::vector<cv::Point2d> res;
    cv::projectPoints(model_points, rvec, tvec, cam_mat, dist_coeff, res);
    for(int i=0; i < res.size(); i++)
        cv::circle(frame, res[i], 3, cv::Scalar(0, 255, 0), -1);

But here is the catch: when I use my own image points or 3D points, the program crashes at the first call of solvePnP. Here is the data:

        std::vector<cv::Point3d> model_points;
        model_points.push_back(cv::Point3d(1.7f, 0.0f, 0.0f));
        model_points.push_back(cv::Point3d(-1.7f, 0.0f, 0.0f));
        model_points.push_back(cv::Point3d(0.0f, 1.7f, 0.0f));
        model_points.push_back(cv::Point3d(0.0f, -1.7f, 0.0f));
        model_points.push_back(cv::Point3d(2.2f, 0.0f, 3.3f));
        model_points.push_back(cv::Point3d(-2.2f, 0.0f, 3.3f));
        model_points.push_back(cv::Point3d(0.0f, 2.2f, 3.3f));
        model_points.push_back(cv::Point3d(0.0f, -2.2f, 3.3f));
        model_points.push_back(cv::Point3d(2.6f, 0.0f, 6.5f));
        model_points.push_back(cv::Point3d(-2.6f, 0.0f, 6.5f));
        model_points.push_back(cv::Point3d(0.0f, 2.6f, 6.5f));
        model_points.push_back(cv::Point3d(2.0f, 0.4f, 9.7f));
        model_points.push_back(cv::Point3d(2.0f, 0.4f, 9.7f));
        cv::Mat mpts = cv::Mat(model_points.size(), 3, CV_64F, model_points.data());
The program then aborts with:

terminate called without an active exception
Aborted (core dumped)

When replacing both image points and 3D points it also crashes, and when replacing only image points it crashes too. I verified my data; it does not contain any NaN or near-infinity values.

Here is the console output from the prints when I replace both image points and 3D points with my data:

Pts
[6, 696;
 17, 601;
 82, 671;
 255, 711]
3dPts
[1.700000047683716, 0, 0;
 -1.700000047683716, 0, 0;
 0, 1.700000047683716, 0;
 0, -1.700000047683716, 0;
 2.200000047683716, 0, 3.299999952316284;
 -2.200000047683716, 0, 3.299999952316284;
 0, 2.200000047683716, 3.299999952316284;
 0, -2.200000047683716, 3.299999952316284;
 2.599999904632568, 0, 6.5;
 -2.599999904632568, 0, 6.5;
 0, 2.599999904632568, 6.5;
 2, 0.4000000059604645, 9.699999809265137;
 2, 0.4000000059604645, 9.699999809265137]
Camera Matrix:
[1583.8995, 0, 990.13501;
 0, 1585.4629, 620.50098;
 0, 0, 1]
Distortion:
[-0.37840971, 0.18232454, -0.0064165886, 0.001675527, -0.017335379]
terminate called without an active exception
Aborted (core dumped)

You might notice that I put my points into a Mat, since vectors can apparently cause issues for solvePnPRansac: opencv - SolvePnP works unstable and crashes - Stack Overflow
solvePnPRansac also crashes. I already recompiled OpenCV to an earlier version (now 3.4.14); this did not fix the error.
I know silent crashes like this are not supposed to happen and this is likely a data type issue, but I cannot find it.

Thanks for your help!

model and image points have to be the same amount.

and afaik they also have to pair up so you can’t just give image points in a random order.
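
To make that concrete, here is a minimal sketch of the interface contract (the pixel coordinates are placeholders; the camera matrix and distortion values are the ones printed above): only the markers you actually detected go in, and the two vectors must be index-aligned.

    #include <opencv2/opencv.hpp>
    #include <iostream>

    // Minimal sketch: imagePoints[i] must be the observation of modelPoints[i].
    // Occluded markers are simply left out of BOTH lists (placeholder pixel values).
    int main()
    {
        std::vector<cv::Point3d> modelPoints = {
            { 1.7, 0.0, 0.0 }, { -1.7, 0.0, 0.0 }, { 0.0, 1.7, 0.0 },
            { 2.2, 0.0, 3.3 }, { -2.2, 0.0, 3.3 }, { 0.0, 2.2, 3.3 }
        };
        std::vector<cv::Point2d> imagePoints = {
            { 640, 420 }, { 600, 500 }, { 620, 380 },
            { 700, 430 }, { 660, 510 }, { 680, 390 }
        };

        cv::Mat camMat = (cv::Mat_<double>(3, 3) <<
            1583.8995, 0, 990.13501,
            0, 1585.4629, 620.50098,
            0, 0, 1);
        cv::Mat distCoeffs = (cv::Mat_<double>(5, 1) <<
            -0.37840971, 0.18232454, -0.0064165886, 0.001675527, -0.017335379);

        cv::Mat rvec, tvec;
        bool ok = cv::solvePnP(modelPoints, imagePoints, camMat, distCoeffs, rvec, tvec);
        std::cout << "ok=" << ok << "\nrvec=" << rvec << "\ntvec=" << tvec << std::endl;
        return 0;
    }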

Oh ok so same amount and same order.

This poses another issue then: how can I know which 3D model point corresponds to which 2D image point?
Also I have occlusion, since my points are placed all around a cylinder: I can sometimes see 4 points, sometimes 6, so the amount is never the same.

It seems like there are no tutorials of people achieving this with OpenCV; however, I have seen demonstrations of it working. Is solvePnP not the right approach for pose estimation of my cylinder?

you get points from a facial landmark model, right? those have identity. it’s not a set, it’s a list.

ah, now you reveal what you’re dealing with… keep going. saying “a cylinder” is far far from sufficient. don’t expect people to be mind readers.

I’m sorry I did not explain my problem. So I will detail it here. I have a screwdriver on which I put green markers (so almost a cylinder).

I am able to extract these markers using the right HSV values and create a mask to retrieve all of them.

Using this mask I find the image contours, highlighted in blue on the picture (don’t mind the red dot). I then create an array of cv::Point corresponding to each contour center point on the image. So these are my image points.
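
For reference, a minimal sketch of that extraction step (the HSV bounds are placeholders to tune for your green tape):

    #include <opencv2/opencv.hpp>

    // Sketch of the marker extraction described above: HSV threshold -> mask ->
    // contours -> centroid of each contour. The HSV bounds are placeholders.
    std::vector<cv::Point2d> findMarkerCenters(const cv::Mat& frameBGR)
    {
        cv::Mat hsv, mask;
        cv::cvtColor(frameBGR, hsv, cv::COLOR_BGR2HSV);
        cv::inRange(hsv, cv::Scalar(40, 80, 80), cv::Scalar(85, 255, 255), mask);

        std::vector<std::vector<cv::Point>> contours;
        cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

        std::vector<cv::Point2d> centers;
        for (const auto& c : contours)
        {
            cv::Moments m = cv::moments(c);
            if (m.m00 > 1e-3)                       // skip degenerate contours
                centers.emplace_back(m.m10 / m.m00, m.m01 / m.m00);
        }
        return centers;
    }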

Now I want the pose estimation of my screwdriver using these image points, knowing their 3D layout in the screwdriver's own frame. This is possible using fiducial markers like Aruco since they have a unique ID, but Aruco is planar pose estimation, and they don't work at long distances. The whole point is to achieve pose estimation using only the data I described (or not much more).

okay, I think I can work with that.

you can recover a pose from a partial set. if you can figure out the identities of the points you do have, you will know what model points to pair them with, and what model points to omit.

it’s good that you have relatively few markers, not hundreds or more. this means it’s feasible to try all combinations brute-force. brute force will be costly still, and it can be tricky to figure out if a pairing makes sense (the recovered pose is sensible).
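
A tiny sketch of one way to check whether a candidate pairing makes sense: solve with it, reproject, and look at the error (a wrong assignment usually reprojects badly). The helper name is just illustrative.

    #include <opencv2/opencv.hpp>
    #include <cmath>
    #include <limits>

    // Score one candidate pairing (modelSubset[i] <-> imagePts[i], at least 4 pairs)
    // by its mean reprojection error in pixels.
    double pairingError(const std::vector<cv::Point3d>& modelSubset,
                        const std::vector<cv::Point2d>& imagePts,
                        const cv::Mat& camMat, const cv::Mat& distCoeffs)
    {
        cv::Mat rvec, tvec;
        if (!cv::solvePnP(modelSubset, imagePts, camMat, distCoeffs,
                          rvec, tvec, false, cv::SOLVEPNP_EPNP))
            return std::numeric_limits<double>::max();

        std::vector<cv::Point2d> reproj;
        cv::projectPoints(modelSubset, rvec, tvec, camMat, distCoeffs, reproj);

        double err = 0.0;
        for (size_t i = 0; i < imagePts.size(); ++i)
            err += std::hypot(imagePts[i].x - reproj[i].x, imagePts[i].y - reproj[i].y);
        return err / imagePts.size();
    }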

am I guessing right that the tip-most two rings of markers don’t rotate? or is that the chuck and they will rotate relative to the drill’s body?

practical tip: try retroreflectors, available as sticky tape. give the camera some IR light (and make sure it doesn’t have an IR filter). retroreflectors will light up just perfectly.

another practical tip: laser cutter or foil cutter. maybe not every copyshop will have a foil cutter but t-shirt printing shops probably do, or know who does.

you would benefit from giving these markers some disambiguation. use different shapes that are easy to tell apart from the contour. circle, triangle, square, 3/4/5-pointed star (if that’s still clear to see)… a hole in the middle adds another bit of information.
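
A sketch of how the contour alone can already give you that bit of identity (the epsilon factor is a typical starting value, not a tuned constant):

    #include <opencv2/opencv.hpp>
    #include <string>

    // Sketch: tell simple marker shapes apart by counting polygon corners.
    std::string classifyMarker(const std::vector<cv::Point>& contour)
    {
        std::vector<cv::Point> approx;
        double eps = 0.04 * cv::arcLength(contour, true);   // rough starting value
        cv::approxPolyDP(contour, approx, eps, true);

        if (approx.size() == 3) return "triangle";
        if (approx.size() == 4) return "square";
        if (approx.size() > 6)  return "circle-ish";        // many corners = round
        return "other";
    }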

maybe apply more structure. apply strips (lengthwise) that you can see, then put a few shapes on one strip in different orders (shapes would be cut out, i.e. black). you can tell a strip’s identity from the sequence of shapes on it… and that gives those individual shapes identity too. you’ll often be able to see the whole strip so I think that’s feasible.

Indeed, I saw a demonstration of such a concept where the company was using IR reflective markers; some use IR emitters directly. I also saw a student project that used retroreflectors. IR is definitely the way to go when seeking robust point detection.

I'm still working with my humble colored tape pieces, which don't work so badly. I'm just trying to make the math work for now. So indeed, if I want to match the points together I need some more information. Circular fiducial markers do exist (https://hal.archives-ouvertes.fr/hal-01420665/document), but just like Aruco I'm worried about the detection quality at long distances. The goal is to make the markers as small as possible.

So I'm trying to solve my issue with as little information as possible. I'm thinking of 3 ways right now:

  • When the program starts, tell it which blobs can be seen. Tracking the blobs continuously could be sufficient, but I'm afraid this would not be very robust.

  • Another way is to have markers, but only a few. One Aruco marker can tell the whole object rotation and direction. Having one Aruco visible at all times should be sufficient to know which point is where afterwards.

  • Register several images of the object at different rotations, just like training data in deep learning, except this only serves as a basis to know which point is where. We try to match the current image against our image bank, and once we find the closest one we can start anticipating the points (I don't know if I'm being clear here).

But anyway, I have my answer: solvePnP needs more information than just 2 sets of points. Thanks for your replies! :slight_smile:

I don’t necessarily mean “circular fiducials”. I literally mean circles (and other simple shapes that are distinguished by shape/number of corners). if some existing fiducial scheme suits you, go ahead. that would solve all the implementation work.

for simple shapes, this is the idea for distinguishing them:

for prototyping, you should cheat and use arucos and a close-up view (or high res camera). easy detection and identification. with the aruco module, you can even define a “board” which is an arbitrary 3D arrangement of markers. as long as you have one marker (which comes with pose…), it will figure out the pose of the whole arrangement.
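
Assuming the aruco contrib module is available, a sketch of the board idea (the dictionary choice, marker layout and function name are only illustrative):

    #include <opencv2/opencv.hpp>
    #include <opencv2/aruco.hpp>
    #include <iostream>

    // Sketch: a "board" is an arbitrary 3D arrangement of markers; objPoints holds
    // the 4 corner coordinates of each marker in the object frame, ids the marker
    // id glued at that spot. One visible marker is enough to recover the board pose.
    void estimateBoardPose(const cv::Mat& frame,
                           const cv::Mat& camMat, const cv::Mat& distCoeffs,
                           const std::vector<std::vector<cv::Point3f>>& objPoints,
                           const std::vector<int>& ids)
    {
        cv::Ptr<cv::aruco::Dictionary> dict =
            cv::aruco::getPredefinedDictionary(cv::aruco::DICT_4X4_50);
        cv::Ptr<cv::aruco::Board> board = cv::aruco::Board::create(objPoints, dict, ids);

        std::vector<int> detectedIds;
        std::vector<std::vector<cv::Point2f>> corners;
        cv::aruco::detectMarkers(frame, dict, corners, detectedIds);

        cv::Vec3d rvec, tvec;
        int used = cv::aruco::estimatePoseBoard(corners, detectedIds, board,
                                                camMat, distCoeffs, rvec, tvec);
        if (used > 0)
            std::cout << "pose from " << used << " marker(s): rvec=" << rvec
                      << " tvec=" << tvec << std::endl;
    }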

you can then replace aruco with simple shapes for which you still have to figure out the identity.

separate problems. to build a bridge, first throw a rope over the river.

you could train a neural network on a lot of views and known angles of your drill. there is research on this. it’s not easy and I don’t know if it’s good enough for industrial uses. industry loves reliable approaches and results.

Well, I already made an application which uses Aruco boards, but they are huge and cannot be stuck on the object like a sticker since they must be planar.

About the neural network, I have one training right now. However, a discussion with some of the maintainers of the repository I use led to the conclusion that my drill is hard to learn, because it has a very complex shape and it is not convex. 3 days did not suffice, so I'm trying a full week now. :smiling_face_with_tear:

The thing with shape detection is that in the tutorials the shapes are always seen from a perfectly perpendicular view. I am pretty much certain that a star highly deformed by perspective, plus the deformation induced by the object surface not being planar, will be horrible in terms of reliability. I could indeed use a few shapes as direction indicators for which point is where, since I don't need their pose, only their direction. That would certainly be smaller than Aruco markers and require less processing time.

I could also use other colors for different markers, but that would increase the image processing time, and I’m trying to make it work on a Jetson nano.

I think the color thing might be a good idea though: I can put one orange marker on each side, so that only one is always visible, then detect where the green markers are relative to this orange marker. That will give me the direction the drill is facing, and also an estimation of where is up and where is down. I think I can make something work with this. Or even better, use circles instead of squares; I know there is a function to determine the circularity of a shape in OpenCV, though that would be less reliable.
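
For reference, the usual circularity measure computed from a contour (this is also what SimpleBlobDetector's filterByCircularity is based on):

    #include <opencv2/opencv.hpp>

    // Sketch: circularity = 4*pi*area / perimeter^2, 1.0 for a perfect circle,
    // noticeably lower for squares, triangles and elongated blobs.
    double circularity(const std::vector<cv::Point>& contour)
    {
        double area      = cv::contourArea(contour);
        double perimeter = cv::arcLength(contour, true);
        if (perimeter <= 0.0)
            return 0.0;
        return 4.0 * CV_PI * area / (perimeter * perimeter);
    }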

I’ll post progress as I work on it. :slight_smile:

Well, I'm not sure if I should be happy or disappointed. On one hand I can make solvePnPRansac return a pretty coherent result, but on the other hand it is very noisy and moves a lot even when I am not touching the drill.

So the point here is to use an orange marker to know which direction the drill is facing, and to use the aligned green markers to detect lines of 3 dots (or 2 dots if one marker is not extracted properly; it still works). Using the orange marker position and the direction of the drill, I can infer where the top is (red line) from the angle about the centroid (green dot). My algorithm is not perfect yet, but the relations between 2D and 3D points are correct.

Here are 2 different results from a video capture (I did not move the drill between the screenshots):

I’ll keep working on this to improve pose estimation.

I haven’t read through your posts in detail so I don’t know all of the things you have tried, but I do understand what you are trying to accomplish and the related challenges.

While you might not be able to find OpenCV tutorials on how to accomplish this, a lot of research has been done on this topic. If you are comfortable reading research papers and applying the techniques described, I think that would be your best bet.

This one is very old, but something I’m familiar with so I just grabbed it:
https://www.researchgate.net/publication/2923089_Temporal_Registration_using_a_Kalman_Filter_for_Augmented_Reality_Applications

Reading it and the papers it references would be a good start.
Edit: Also maybe look at the papers that cite that one (there are only 4, which is an indication that the paper I linked isn’t particularly novel or good, but it might still be helpful)
https://scholar.google.com/scholar?cites=5324559098754170416&as_sdt=4005&sciodt=0,6&hl=en

Here are a few tips related to the issues you raised:

  1. The question “how do I know which image point maps to which 3D point” is known as the “correspondence problem” - so using that as a search term will be helpful.
  2. A lot of work has been done on tracking features in an image sequence. Tracking in this case means finding a feature (in your case the tape markers) in one image and finding the same feature in subsequent images in a video sequence. Using the search term “feature tracking” will be helpful.
  3. Read up on image features. I think you will want to search for “feature descriptors” - most of what you will find will relate to identifying/tracking inherent features of an object (not necessarily how to design a feature to be trackable) but it will be good information. You might come up with a better way to design your features, or you might find that you can use the features already present in your tool (corners, labels, screw holes, other markers)
  4. A RANSAC approach seems like it might be a good way to get your initial pose (see the sketch after this list). I'm thinking something along the lines of:
    Detect all the features visible in the image
    Pick a subset of the image points (I think you will want 4, but might be able to get away with 3) and randomly assign 3D points from your model data.
    Estimate the camera pose and then see how well the 3D model points map to the remaining image features.
    Repeat until your reprojection error is acceptable.
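
A rough sketch of that loop, under the assumption that you only need a coarse initial pose (the iteration count, subset size and function name are made up):

    #include <opencv2/opencv.hpp>
    #include <algorithm>
    #include <cmath>
    #include <limits>
    #include <random>

    struct PoseHypothesis { cv::Mat rvec, tvec; double error; };

    // Sketch: repeatedly pick 4 detected points, hypothesise which model points
    // they are, solve, and keep the hypothesis whose reprojected model best
    // explains ALL detections.
    PoseHypothesis findInitialPose(const std::vector<cv::Point2d>& detected,
                                   const std::vector<cv::Point3d>& model,
                                   const cv::Mat& camMat, const cv::Mat& distCoeffs,
                                   int iterations = 2000)
    {
        PoseHypothesis best;
        best.error = std::numeric_limits<double>::max();
        if (detected.size() < 4 || model.size() < 4)
            return best;

        std::mt19937 rng{std::random_device{}()};
        for (int it = 0; it < iterations; ++it)
        {
            // random subset of image points and a random assignment of model points
            std::vector<cv::Point2d> img(detected);
            std::vector<cv::Point3d> mod(model);
            std::shuffle(img.begin(), img.end(), rng);
            std::shuffle(mod.begin(), mod.end(), rng);
            img.resize(4);
            mod.resize(4);

            cv::Mat rvec, tvec;
            if (!cv::solvePnP(mod, img, camMat, distCoeffs, rvec, tvec,
                              false, cv::SOLVEPNP_EPNP))
                continue;

            // score: distance from each detection to the nearest reprojected model point
            std::vector<cv::Point2d> proj;
            cv::projectPoints(model, rvec, tvec, camMat, distCoeffs, proj);
            double err = 0.0;
            for (const auto& d : detected)
            {
                double nearest = std::numeric_limits<double>::max();
                for (const auto& p : proj)
                    nearest = std::min(nearest, std::hypot(d.x - p.x, d.y - p.y));
                err += nearest;
            }

            if (err < best.error)
            {
                best.rvec  = rvec.clone();
                best.tvec  = tvec.clone();
                best.error = err;
            }
        }
        return best;
    }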

Once you get an initial pose that you believe to be correct, tracking the features from one frame to the next (and therefore retaining the image->model correspondence) lets you just call solvePnP on your full set of data. You will have to dynamically add points as they become visible (but you will know where to look for them in the image) and remove points as they become occluded.

To make this more robust (and to smooth out the jitters you complained about) some sort of filtering will be needed. Once you get to this point, I would suggest looking into the Kalman filter. The idea behind the Kalman filter is basically to combine measured data with predicted data to generate a better estimate than either would provide individually.
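
A minimal sketch of that idea applied to the translation vector from solvePnP (constant-velocity model; the class name and noise covariances are placeholders to tune):

    #include <opencv2/opencv.hpp>

    // Sketch: smooth the tvec from solvePnP with a constant-velocity Kalman filter.
    // State = [x y z vx vy vz], measurement = [x y z].
    class TranslationFilter
    {
    public:
        TranslationFilter() : kf(6, 3)
        {
            kf.transitionMatrix = (cv::Mat_<float>(6, 6) <<
                1,0,0, 1,0,0,
                0,1,0, 0,1,0,
                0,0,1, 0,0,1,
                0,0,0, 1,0,0,
                0,0,0, 0,1,0,
                0,0,0, 0,0,1);
            cv::setIdentity(kf.measurementMatrix);                        // [I | 0]
            cv::setIdentity(kf.processNoiseCov,     cv::Scalar::all(1e-4));
            cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-2));
            cv::setIdentity(kf.errorCovPost,        cv::Scalar::all(1));
        }

        // feed the raw tvec (3x1 CV_64F from solvePnP), get the filtered position back
        cv::Point3f update(const cv::Mat& tvec)
        {
            kf.predict();
            cv::Mat meas;
            tvec.convertTo(meas, CV_32F);
            cv::Mat post = kf.correct(meas);
            return cv::Point3f(post.at<float>(0), post.at<float>(1), post.at<float>(2));
        }

    private:
        cv::KalmanFilter kf;
    };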

Most or all of what I am discussing can be found in the paper linked above.

Good luck with this. This is an ambitious project and is likely much more involved than you originally anticipated, which is true of just about every vision project in history.

You might want to look into SIFT and SURF as well.

Thanks for your really complete message.

I'm trying to make something as generic as possible, so I do not wish to track features on a particular object. I did make a few tests with ORB, SIFT and SURF, but I feel like they can't be reliable on a 3D object, and they are also very costly in terms of computation time.
I could indeed try various pose estimates and evaluate the reprojection first to get an idea of the pose, and thus better knowledge of the positions of my points in the next frame. This is an interesting idea, and it might also help me get rid of the orange dots.
I also have not tried tweaking the solvePnPRansac parameters yet.

Well, this was one day's worth of work; I am not particularly familiar with image processing but I'm willing to learn. Thanks again for the post. :slight_smile: