I’m working with solvePnP and looking into other PnP methods from academia, and I’m wondering if I’m just not understanding how it works.

So for pnp, the problem is trying to solve for your pose or position in your environment. You need to have the parameters for your specific camera, and then as input the 2D image coordinates as well as the 3D coordinates that go with the 2D ones. What I may be confusing myself on is I see other forum posts speaking of using pnp to get your 2D-3D correspondences, but shouldn’t it be the other way around? You’re using pnp which should mean you already have the 2D-3D correspondences as input right?

And for getting these 2D-3D correspondences, is this normally done manually with you measuring the 3D object physically, or is there some other method that I just don’t know about.

Another thing I would like to know is, what are the limitations of using PnP? I’ve read papers about a number of other PnP methods out there, but I have not seen anyone state what the limitations are: whether it’s that your points lie on a plane, how far apart these points can be, or what distances PnP can be used for (such as using it indoors/outdoors).

So for pnp, the problem is trying to solve for your pose or position in your environment. You need to have the parameters for your specific camera, and then as input the 2D image coordinates as well as the 3D coordinates that go with the 2D ones. What I may be confusing myself on is I see other forum posts speaking of using pnp to get your 2D-3D correspondences, but shouldn’t it be the other way around? You’re using pnp which should mean you already have the 2D-3D correspondences as input right?

solvePnP does indeed need to be given the 2D and 3D points in the same order, so yes, with the correct correspondences. The 2D points are the points in your image. The 3D points are the points on your object, in the coordinate frame of your choice.

And for getting these 2D-3D correspondences, is this normally done manually with you measuring the 3D object physically, or is there some other method that I just don’t know about.

Putting the right correspondences in your program by hand is easy, since you know which point in the image corresponds to which of your 3D points, but the correspondence problem is hard to automate.
I have been trying to brute-force solvePnP to get my object pose without knowing the correspondences, and I had mixed results. I am going to make another post about this.
Another method, which I have not tested, is SoftPOSIT, but I have no idea whether it works well or not.

Another thing I would like to know is, what are the limitations of using PnP? I’ve read papers about a number of other PnP methods out there, but I have not seen anyone state what the limitations are: whether it’s that your points lie on a plane, how far apart these points can be, or what distances PnP can be used for (such as using it indoors/outdoors).

solvePnP is pretty good in my opinion. The limitations are that you need at least 4 points (there is a method which uses 3 points, but I have not tried it yet; feedback is appreciated), and the arrangement of your points should not have any symmetry, otherwise it might get confused. The distance does not matter: solvePnP does not extract the points from your image itself, it only processes them to return a pose. It is up to you to provide it the right points with the right correspondences. If you think your point extraction is inaccurate, you can use solvePnPRansac.
For more information, see the documentation: OpenCV: Camera Calibration and 3D Reconstruction

My understanding is that the 3-point version (you will see it called P3P, I think, as in a special case of PnP where N=3) has an ambiguity (2 solutions): the one you want (where the points are in front of the camera) and the one you don’t want (where the points are projected as if they were behind the camera, but can still be mathematically projected through to the image plane).

Actually, it looks like there can be up to 4 possible solutions to the 3-point variant.

It’s also worth noting that there are other degenerate cases where even 10 or 20 points wouldn’t be sufficient, for example if all of the points lie on a line in the world. Furthermore, it is important to have well-calibrated intrinsics for solvePnP to work.
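A quick sanity check for the collinear case is to look at the rank of the centered 3D points: rank 1 means everything is on a line, rank 2 means everything is on a plane, rank 3 means a general 3D arrangement. The helper name and tolerance below are my own choices, not anything from OpenCV.

```python
import numpy as np

def point_spread_rank(points_3d, tol=1e-9):
    """Rank of the centered point cloud: 1 = line, 2 = plane, 3 = general 3D."""
    pts = np.asarray(points_3d, dtype=float)
    centered = pts - pts.mean(axis=0)
    return int(np.linalg.matrix_rank(centered, tol=tol))

t = np.arange(10.0)
on_a_line = np.stack([t, 2 * t, 3 * t], axis=1)           # 10 points, all collinear
on_a_plane = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0],
                       [1, 1, 0], [2, 1, 0]], dtype=float)  # all at z = 0
general = np.vstack([on_a_plane, [[0.5, 0.5, 1.0]]])        # one point off the plane

print(point_spread_rank(on_a_line))   # collinear: degenerate for PnP
print(point_spread_rank(on_a_plane))  # coplanar: usable, but a special case
print(point_spread_rank(general))     # general 3D arrangement
```

With real measurements the points are never exactly collinear, so in practice you would pick a tolerance relative to the size of your object rather than 1e-9.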

Yes, you are understanding this correctly. You are supposed to have the correspondences ahead of time. The idea being discussed in the other posts is how you could bootstrap the correspondences by randomly assigning them (for 4 points), calling solvePnP, and then using the result to project the remaining 3D model points. If the remaining 3D model points project “close” to the identified features in the image, then your random assignment of correspondences might be correct. If the reprojection error is high, your random assignment was probably wrong and you should try again with a new random guess. This is just an idea for bootstrapping a good initial pose and set of correspondences, and there are probably many other, better approaches.

I don’t know how it’s implemented but I assumed it was just a least-squares problem, so SVD or QR decomposition. Also I’m old and not up to date with recent developments, so maybe it’s something fancier.

I can tell you that I use it regularly and in shipping products and it provides really good results. The main considerations for using it:

You need calibrated and stable intrinsics. If your lens isn’t stable or you are changing zoom levels or focus, then you should have a way to account for it. If your optics have high distortion, you either need to use an advanced distortion model (rational works really well, but fisheye might be better in some cases), or, if you are using the standard k1,k2,[k3] model, restrict the image points you use to the region covered by the original calibration data set (meaning if you calibrated with points that were all within 700 pixels of the image center, don’t use image points beyond a 700 pixel radius, because the basic model does not extrapolate well).
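The radius restriction is easy to apply as a mask before calling solvePnP. A small sketch, where the helper name, the image center, and the sample points are all made up and the 700 pixel limit is the example figure from above:

```python
import numpy as np

def within_calibrated_radius(image_points, center, max_radius_px):
    """Boolean mask of image points no farther than max_radius_px from center."""
    pts = np.asarray(image_points, dtype=float)
    radii = np.linalg.norm(pts - np.asarray(center, dtype=float), axis=1)
    return radii <= max_radius_px

# Hypothetical 1920x1080 image with the center taken as (960, 540).
image_points = np.array([[960.0, 540.0],
                         [1500.0, 540.0],
                         [1900.0, 1000.0],
                         [300.0, 100.0]])
mask = within_calibrated_radius(image_points, center=(960.0, 540.0), max_radius_px=700.0)
print(mask)

# Apply the same mask to both arrays so the correspondences stay aligned:
# object_points_kept = object_points[mask]; image_points_kept = image_points[mask]
```

Strictly speaking the distortion is centered on the principal point from your calibration (cx, cy) rather than the geometric image center, so that may be the better value to pass as the center.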

Your world reference points can’t be a degenerate case: not all on a line in world coords, and really it’s best if you have points in various places in the image. For example, a few points on the left side, right side, top and bottom are probably better than a few hundred points all clustered near the top right.

Your point localization in image space must be good. If you can design the points you are finding (as opposed to relying on features that exist), chessboard-style corners work well, especially when using cv::cornerSubPix() to refine the estimates.

Use a lot of points plus a RANSAC approach (or some other method to filter outliers). I routinely get a 5x improvement in reprojection error by filtering outliers. You have to be careful that you aren’t just improving the results by removing points (because you can get zero reprojection error if you filter all but 4 points), but in the real world there are almost always outliers, and removing them will improve your estimate significantly. solvePnPRansac is your friend.

As far as how applicable it is to various real-world situations, it’s valid and applicable at all scales. If you have good calibration (intrinsics) and accurate 3D data, you should be able to get similar pose estimation at any scale (given the right optics for your situation, etc.). The key is that if you are using the camera to look at a 4"x4" area, a 1 pixel reprojection error will be very small in 3D world space, say 0.005", but the same 1 pixel reprojection error will represent a much larger real-world error if you are imaging something that is 4 miles x 4 miles in area.
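To put rough numbers on that, assume the field of view is spread across 800 pixels (my assumption, chosen so that the 4" case works out to the 0.005" per pixel mentioned above). The same arithmetic for a 4 mile wide scene:

```python
INCHES_PER_MILE = 63360
IMAGE_WIDTH_PX = 800  # assumed, so that 4 in / 800 px = 0.005 in per pixel

def world_size_per_pixel(field_width, image_width_px):
    """Approximate world distance covered by one pixel, in the units of field_width."""
    return field_width / image_width_px

close_up_in = world_size_per_pixel(4.0, IMAGE_WIDTH_PX)
wide_area_in = world_size_per_pixel(4.0 * INCHES_PER_MILE, IMAGE_WIDTH_PX)

print(f"4 in scene:   1 px is about {close_up_in:.3f} in")
print(f"4 mile scene: 1 px is about {wide_area_in:.1f} in ({wide_area_in / 12:.1f} ft)")
```

So under that assumption a 1 pixel error goes from a few thousandths of an inch to roughly 26 feet, even though the pose estimation itself is doing equally well in pixel terms.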