About the state of the art on solvePnP correspondence problem

Hello,

I have been trying to use solvePnPRansac to find the pose of an object from a single monocular camera (see my post), but I am facing the correspondence problem. So far I only know of two ways of solving this: SoftPOSIT and brute-forcing solvePnP.

Considering that SoftPOSIT is pretty old and not widely used, I have been trying to brute-force solvePnPRansac.
So I take 4 points from my image and 4 random points from my model, then as a first filter I compare the squared distances between the reprojected points and the original image points. I keep the 20 best guesses.
With the 20 best guesses from solvePnPRansac, I then apply a second filter to decide which guess is the best, this time comparing all visible 2D points with all reprojected points, not only 4.
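
For reference, here is a rough Python sketch of what I am doing (untested and simplified; names like model_points, image_points, K and dist are placeholders, and I use plain solvePnP with the P3P flag here just for the sketch):

```python
import itertools
import cv2
import numpy as np

def brute_force_pose(model_points, image_points, K, dist, keep=20):
    """model_points: (N, 3) array of 3D marker positions, image_points: (M, 2) array of detected 2D points."""
    img4 = image_points[:4].astype(np.float32)           # 4 points picked from the image
    candidates = []
    for idx in itertools.permutations(range(len(model_points)), 4):
        obj4 = model_points[list(idx)].astype(np.float32)
        try:
            ok, rvec, tvec = cv2.solvePnP(obj4, img4, K, dist, flags=cv2.SOLVEPNP_P3P)
        except cv2.error:
            continue                                      # degenerate configuration
        if not ok:
            continue
        proj, _ = cv2.projectPoints(obj4, rvec, tvec, K, dist)
        err = np.sum((proj.reshape(-1, 2) - img4) ** 2)   # first filter: squared distances
        candidates.append((err, idx, rvec, tvec))
    candidates.sort(key=lambda c: c[0])
    best = candidates[:keep]                              # keep the 20 best guesses

    def full_error(rvec, tvec):
        # second filter: reproject *all* model points and compare against
        # *all* detected 2D points (nearest-neighbour distances)
        proj, _ = cv2.projectPoints(model_points.astype(np.float32), rvec, tvec, K, dist)
        proj = proj.reshape(-1, 2)
        d = np.linalg.norm(image_points[:, None, :] - proj[None, :, :], axis=2)
        return np.sum(d.min(axis=1) ** 2)

    return min(best, key=lambda c: full_error(c[2], c[3]))
```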

Some results (red = point position, yellow = reprojected position):

Here is an aberration (happening about 50% of the time):

So for 10 model points, the complexity of the brute-force method is 10!/(10-4)! = 10*9*8*7 = 5040, so I make 5040 calls to solvePnPRansac every frame. That is bad, but I know I can improve it with a tracker later.

The main issue is that my results are not stable at all. Does anyone know how I can improve my brute-force accuracy, or whether there are other solutions for monocular pose estimation of a 3D object that don't involve tags or AI?

I can’t really tell what you are showing in the pictures. Are the yellow points in the “50% of the time” image supposed to project to one of the visible markers, or are they a marker that is occluded?

When you say your results are not stable at all, do you mean that you will get significantly different pose estimations from one frame to the next (true erroneous results) or that there is a lot of jitter, but the results frame-to-frame are still reasonable estimates?

If it’s the former (I think this is what is going on?) I would look at the erroneous results (and where the model points project to) and try to understand why the algorithm chose that pose (a bad one) over the correct one. You might need more points.

If it’s the latter I would do a few things:

  1. Examine the jitter in the image points. Things like corners are a lot easier to get a consistent location for than more generic features.
  2. Once you have chosen a pose that fits your data well, call solvePnP with all of the point correspondences (within some amount of reprojection error, filtering for occluded points, etc.). You might already be doing this, but it's not clear from your description. The point here is to not just use the result from solvePnPRansac() directly (it sounds like you are calling it with just 4 points) - instead, call solvePnP (or solvePnPRansac) again with a larger set of points once you have a good estimate (and therefore correspondences).
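
Something like this is what I have in mind for that second step - just a minimal sketch (untested, and the names obj_all, img_all, rvec0, tvec0 are placeholders), assuming you already have the correspondences and an initial pose from the 4-point solve:

```python
import cv2
import numpy as np

def refine_pose(obj_all, img_all, K, dist, rvec0, tvec0):
    """Re-solve over all matched points, seeded with the 4-point estimate.

    obj_all / img_all: matched 3D / 2D points for every visible marker,
    K / dist: camera intrinsics, rvec0 / tvec0: pose from the 4-point solve.
    """
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(obj_all, dtype=np.float32),
        np.asarray(img_all, dtype=np.float32),
        K, dist,
        rvec=rvec0, tvec=tvec0,
        useExtrinsicGuess=True,           # start from the 4-point estimate
        flags=cv2.SOLVEPNP_ITERATIVE,     # iterative refinement over all correspondences
    )
    return rvec, tvec
```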

Am I understanding correctly that you are calling solvePnPRansac with just 4 points each time? If so, I think there might be a misunderstanding about what it does. The idea behind solvePnPRansac is that it works well with noisy data by picking a small subset of points to estimate the pose and then validating that estimate by projecting the remaining points. Initial estimates that result in low reprojection error for the rest of the data set are considered good estimates, and ones with high reprojection error are bad estimates (typically because the image locations had error). The key thing to remember is that solvePnPRansac needs to have point correspondences from the beginning - it doesn't try to assign the correspondences from two unordered lists. Also, solvePnPRansac() is typically used on larger data sets - calling it with 4 points is kind of a degenerate case.

My suggestion would be to call SolvePnP for your 4 point data sets, and then call solvePnPRansac on the full data set once you have determined your correspondences.

(To be fair, I don't think it really matters that you are using solvePnPRansac on small data sets - I believe it will just default to solvePnP behavior - but I'm suggesting you call solvePnP directly to be clear that you aren't expecting anything magic to happen with solvePnPRansac.)

Also regarding performance: you don't have to search the full space each time. For example, if you had 20 points it wouldn't be feasible to brute-force it, but you could apply your method and stop once your best reprojection error estimate has stabilized. Say, do it for 20 iterations to start with, and then stop once your best reprojection error hasn't improved in 10 iterations. (There are certainly more sophisticated / robust / valid ways to do this, but something like this is easy and will get you started.)
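
As a rough illustration of the stopping rule (not a polished algorithm - evaluate here is just a stand-in for your solvePnP-plus-reprojection-error step):

```python
import itertools
import random

def randomized_search(n_model_points, evaluate, min_iters=20, patience=10):
    """Randomly try 4-point assignments until the best error stops improving.

    evaluate(indices) is assumed to run solvePnP on that ordered 4-point subset
    and return the total reprojection error over all detected points.
    """
    best_err, best_idx = float("inf"), None
    since_improvement = 0
    for it in itertools.count():
        idx = random.sample(range(n_model_points), 4)   # random ordered 4-point guess
        err = evaluate(idx)
        if err < best_err:
            best_err, best_idx, since_improvement = err, idx, 0
        else:
            since_improvement += 1
        # run at least 20 iterations, then stop once the best error
        # hasn't improved for 10 iterations in a row
        if it + 1 >= min_iters and since_improvement >= patience:
            return best_idx, best_err
```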

Also, once you have your correspondences you will be able to use the previous image locations to start with better guesses, etc. If you get to the point of using a Kalman filter you will be able to track your correspondences frame-to-frame and reduce jitter.

I can’t really tell what you are showing in the pictures. Are the yellow points in the “50% of the time” image supposed to project to one of the visible markers, or are they a marker that is occluded?

The red points are my detected points on the image.
The yellow points are the reprojections of all the model points into the image; they show the solution estimated by my brute-forced solvePnP.

When you say your results are not stable at all, do you mean that you will get significantly different pose estimations from one frame to the next (true erroneous results) or that there is a lot of jitter, but the results frame-to-frame are still reasonable estimates?

It is the former. I already have 10 model points and evaluate 4 of them, which is supposed to give a unique solution. I would rather not evaluate 5 points, because having 5 visible at once would be rare unless I put a tremendous number of markers all around my object, and it would also considerably increase the computation time.

My suggestion would be to call SolvePnP for your 4 point data sets, and then call solvePnPRansac on the full data set once you have determined your correspondences.

Well, I have not noticed a particular difference between solvePnP and solvePnPRansac in the final result, so I chose solvePnPRansac since it is 2-3 times faster in my case (about 30 ms instead of 85 ms). I will use solvePnP to be sure.

Also regarding performance: You don’t have to search the full space each time. For example if you had 20 points it wouldn’t be possible to brute force it, but you could apply your method and stop once your best reprojection error estimate had stabilized. Say do it for 20 iterations to start with, and then stop once your best reprojection error hasn’t improved in 10 iterations. (There are certainly more sophisticated / robust / valid ways to do this, but something like this is easy and will get you started)

I did not get this part. There is only one solution, so I can only brute-force all the configurations. What do you mean by iteration?
I do not search the full space. As I said, I try all configurations of 4 3D points against 4 chosen 2D points found in my image. I don't try with 10 points when I have 10 2D points; that would not be feasible.

Ex:
I have 5 2D points on my image called a b c d e. I choose arbitrarily a b c d.
I have 10 3D points corresponding to my markers location called 0 1 2 3 4 5 6 7 8 9.

Now I try every configuration of 3D points, so I try:
(a, b, c, d) - (0, 1, 2, 3)
(a, b, c, d) - (0, 1, 2, 4)
(a, b, c, d) - (0, 1, 2, 5)
(a, b, c, d) - (0, 1, 2, 6)

In the end I evaluate the reprojection and find that, for example, (a, b, c, d) - (4, 7, 2, 0) is the best match.
However, there are several best matches. I try to discriminate between them using the last unused point e, checking whether one match has a reprojected point that corresponds well to it. That match is my solution.

I could instead find the best of all the best matches by using solvePnPRansac with all my data points, as you said, rather than only 4. However, like my previous solution of using the last point e, this only works when more than 4 2D points are visible.

I understand now. What you are doing makes sense, and you do have to do the full search to be sure - I was imagining a situation where you randomly chose the model points and image points each time. As long as the 4 image points you choose are high quality (represent actual markers and have good image location estimates), this should work fine.

I want to stress that this probably isn't the best way to do this, and I have to imagine there are much better approaches in the literature. For example, if you could use a single ArUco marker you could get a decent initial pose estimate, which you could use to turn your brute-force search into an informed search. Maybe you can't use an ArUco marker, but you could use differently colored markers. Maybe most are blue but you have 2 or 3 red ones (with a guarantee that at least one red one is visible). If you see a red one, you know it has to be one of those few from your data set, and you can therefore reduce your search space.

Your 3D model points have to be accurate, too. How are you measuring those points? How are you processing the image to get good / consistent image locations? Finding a blob and computing the center of mass? Is your point localization invariant to scale/perspective changes? Could you use features with corners? Maybe use circles (which project to the image as ellipses), fit the ellipse, and account for the perspective distortion to get an estimate of the center (I don't think the center of mass is perspective invariant, so you can't just use that), etc.
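
For the circle idea, a minimal sketch of what I mean, assuming you can threshold the markers into a binary mask (the ellipse center is only an approximation of the true projected circle center, but it is usually more stable than a blob center of mass):

```python
import cv2
import numpy as np

def marker_centers(mask):
    """mask: binary uint8 image where the circular markers are white."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    centers = []
    for c in contours:
        if len(c) < 5:                      # cv2.fitEllipse needs at least 5 points
            continue
        (cx, cy), (w, h), angle = cv2.fitEllipse(c)
        centers.append((cx, cy))
    return np.array(centers, dtype=np.float32)
```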

You have made a lot of progress quickly, which is encouraging. If you are willing to keep improving / refining / learning / applying you are probably going to be able to get good results.

“Multiple View Geometry” by Richard Hartley and Andrew Zisserman is one of my favorite books for understanding cameras / calibration / pose estimation, etc. Don't be misled by the “Multiple View” part - it covers a lot of the foundational material before it gets into multi-view geometry (homographies, camera models, projective geometry, etc.), and I have found it very helpful. Very useful book, but it is pretty focused, so you will probably want others. Gonzalez/Woods is (or was) the go-to book for image processing. “Three-Dimensional Computer Vision” and “The Geometry of Multiple Images” (both Olivier Faugeras) are well regarded (pretty math heavy/theoretical, as I recall). As I have mentioned, academic papers are a good bet for getting ideas for solving the practical / engineering issues.

Good luck.

You are very much right to make a point about point extraction quality, as I currently use colored squares (just some tape) on which I compute the geometric center. I'll switch to smaller, circular shapes; it can only get better.
I tried colored markers for determining point correspondence previously, but it was not robust. However, you are right that I can greatly reduce the search complexity if I can see a known unique marker.
My 3D model points are not exceptional either. I think I will 3D print an object with premade locations to fill in with my markers, so the dimensions will be well known (I used a simple ruler previously).

I am not a big book reader, to be honest, but I do read plenty of research papers. I saw a pretty interesting paper called “SoftPOSIT Enhancements for Monocular Camera Spacecraft Pose Estimation”, a more modern take on SoftPOSIT. However, there is no code provided, so I will have to implement it myself.

Thanks for your answers. :slight_smile:

So I made this, a piece with well-known 3D points and smaller markers:

I could not see any improvement. It seems like you cannot brute-force solvePnP because there are too many equally good best matches.

So I guess I will move on to SoftPOSIT, or find a way to get my correspondences as I did before, but in a more robust and generic way.