Inconsistent results with `recoverPose`

I would like to estimate the pose of the camera between frames in a video. I care more about the relative rotation between frames than the translation. Unfortunately, I am getting inconsistent results and I do not know where to start debugging. Perhaps you can help.

The process to recover the pose essentially consists of these steps.

  1. Camera calibration. I get an error of about 0.03, which I think is ok.
  2. Read two consecutive frames. I convert both to grayscale and undistort them with the camera parameters found during calibration.
  3. Find SIFT features in each frame, match them with the FLANN matcher, and retain only the closest matches (a simplified sketch of steps 2 and 3 follows below).
  4. Find the essential matrix (cv.findEssentialMat(...)) and then recover the pose (cv.recoverPose(...)).
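For reference, a simplified sketch of steps 2 and 3 (frame0, frame1, camera_matrix, and dist_coeffs stand in for the outputs of the earlier steps; this is not my exact code):

import cv2 as cv

# Step 2: grayscale + undistort with the calibration parameters
gray0 = cv.cvtColor(frame0, cv.COLOR_BGR2GRAY)
gray1 = cv.cvtColor(frame1, cv.COLOR_BGR2GRAY)
und0 = cv.undistort(gray0, camera_matrix, dist_coeffs)
und1 = cv.undistort(gray1, camera_matrix, dist_coeffs)

# Step 3: SIFT features, FLANN matching, keep only the closest matches
sift = cv.SIFT_create()
keypoints0, descriptors0 = sift.detectAndCompute(und0, None)
keypoints1, descriptors1 = sift.detectAndCompute(und1, None)

matcher = cv.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
matches = matcher.knnMatch(descriptors0, descriptors1, k=2)
good_matches = [m for m, n in matches if m.distance < 0.6 * n.distance]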

What I expect

The estimated camera pose for these two frames (visually identical)

is the identity matrix for the rotation R:

[[ 1. -0. -0.]
 [ 0.  1.  0.]
 [ 0. -0.  1.]]

and a translation vector [-0.58 0.58 -0.58]

What I do not expect

But when I process two other frames (also visually identical to the previous ones, which I cannot upload due to the limitations on first-time users), I get a rotation matrix R

[[-0.96 -0.25 -0.1 ]
 [-0.25  0.71  0.66]
 [-0.1   0.66 -0.75]]

and a translation vector [ 0.14 -0.92 -0.36]

Despite the frames being almost identical, sometimes the camera pose estimate is quite off. There is nothing inherently strange about this specific pair of frames I have attached. Every time I rerun the algorithm, some frames are processed correctly and others (almost identical) are not. The features in all frames are dense and seem to match quite well.

Do you know what could cause this issue?

python                    3.11.5 
opencv-contrib-python     4.8.1.78

Some observations and suggestions:

  1. A 0.03 reprojection error is excellent - how many input images were used to achieve this result? If only a few, you might consider using more (say 8-12) - your score might get worse, but the resulting parameters might actually be more accurate / correct. (I’m a bit skeptical of your 0.03 score.)
  2. The foreground of your image looks noisy / not very structured, so it seems likely you are getting matching features between the two frames that aren’t actually from the same scene position. What happens when you filter these matchpoints out of the calculation? (Just for testing purposes, maybe do something like eliminating all points with a Y image coordinate > 1000, for example - see the sketch after this list.)
  3. How about trying a RANSAC approach for the findEssentialMat call?
  4. Since you are applying this to video, you might want to consider using a Kalman filter. (This doesn’t really apply to your current situation, which you need to address first, but it will probably become relevant once you are computing the pose continually for a sequence of frames.)
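For item 2, a rough sketch of what I mean (keypoints0/keypoints1 and good_matches here are hypothetical names for your SIFT keypoints and filtered matches, and the 1000-pixel cutoff is just a placeholder):

# Quick-and-dirty test: drop any match whose keypoint falls in the noisy
# foreground region (here, everything below image row 1000 in either frame).
Y_CUTOFF = 1000  # hypothetical cutoff; adjust to where the "floor" starts

filtered_matches = [
    m for m in good_matches
    if keypoints0[m.queryIdx].pt[1] <= Y_CUTOFF
    and keypoints1[m.trainIdx].pt[1] <= Y_CUTOFF
]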

OP used the same picture for both views, so a perfect identity transform with perfect reprojection was expected. OP, however, got results that are not an identity transform.

I interpreted the 0.03 reprojection error as being from calibrateCamera(), not findEssentialMat(). Upon re-reading, it’s not clear to me what the 0.03 value represents.

As for the results he’s getting from findEssentialMat(), my read is that in some cases he gets what he expects (an identity matrix), but with other image pairs he gets something else.

If I understand correctly, OP is taking two sequential frames with no camera movement. So the images are nearly identical, but not the same actual image (the difference being sensor noise etc.). My hunch is that step 3 is where the problem lies (SIFT + FLANN + “keep the ones that are closer”), and that the matchpoint set contains errors. (The foreground of the image just looks problematic to me. OP does say that the features seem to match quite well, but unfortunately can’t post the second image.)

If it were me, I’d be scrutinizing the matches from a “bad” image pair and / or running findEssentialMat on subsets of the matches.

Hi all, thank you for getting back to me. I will try to clarify some of the confusion.

The camera is a webcam (Logitech C920). The 0.03 is the reprojection error from cv.projectPoints() right after I calibrated the camera. I followed the tutorial OpenCV: Camera Calibration. I used 18 pictures of a grid taken indoors. The grid pattern is this one, and the pics of the grid look like the one attached.

Perhaps I need more variation?
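For reference, I compute that error roughly the way the tutorial does it (objpoints, imgpoints, rvecs, and tvecs follow the tutorial's naming and come out of the corner detection and cv.calibrateCamera steps):

# Mean reprojection error over all calibration images
total_error = 0
for i in range(len(objpoints)):
    projected, _ = cv.projectPoints(
        objpoints[i], rvecs[i], tvecs[i], camera_matrix, dist_coeffs
    )
    error = cv.norm(imgpoints[i], projected, cv.NORM_L2) / len(projected)
    total_error += error

print("mean reprojection error:", total_error / len(objpoints))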

Correct. I am working with a test video, but in the beginning of this video the camera is still. I am comparing each frame with the one right before it. The frames are basically identical apart from some sensor noise.

I retain the good matches using this procedure that I found in the OpenCV doc (unfortunately new forum users cannot post more than 1 link)

matches = matcher.knnMatch(descriptors0, descriptors1, k=2)

# Lowe's ratio test: keep a match only if it is clearly better
# than the second-best candidate
good_matches = []
for m, n in matches:
    if m.distance < 0.6 * n.distance:
        good_matches.append(m)

I will try what you suggested. I will also clean up the code so I can post it. Thank you both!

Here’s a video showing the comparison between consecutive frames:
results.mp4 - Google Drive. Note: original video is 60 fps but this one I uploaded is at 1 fps for convenience.

I changed to the brute-force matching algorithm to keep it simple and have fewer parameters to tune:

matcher = cv.BFMatcher()
matches = matcher.knnMatch(descriptors0, descriptors1, k=2)

good_matches = []
for m, n in matches:
    if m.distance < 0.7 * n.distance:
        good_matches.append(m)

Then the camera pose is estimated with:

def camera_pose(keypoints0, keypoints1, matches, camera_matrix):

    points0 = [keypoints0[i.queryIdx].pt for i in matches]
    points1 = [keypoints1[i.trainIdx].pt for i in matches]

    points0 = np.asarray(points0)
    points1 = np.asarray(points1)

    # Estimate the essential matrix with RANSAC; mask_inliers flags the
    # correspondences consistent with the model.
    E, mask_inliers = cv.findEssentialMat(
        points1=points0,
        points2=points1,
        cameraMatrix=camera_matrix,
        method=cv.RANSAC,
        prob=0.99,  # default 0.999
        threshold=1.0,  # default 1.0
    )

    # The full point arrays are passed on; the RANSAC mask tells
    # recoverPose which correspondences to use.
    inliers0 = np.asarray(points0)
    inliers1 = np.asarray(points1)

    # recoverPose picks, among the four (R, t) decompositions of E,
    # the one that places the points in front of both cameras.
    _, R, t, _ = cv.recoverPose(
        E=E,
        points1=inliers0,
        points2=inliers1,
        cameraMatrix=camera_matrix,
        mask=mask_inliers,
    )

    return (R, t)

In the video I overlay the value of R on the frames. Sometimes it is as expected, [[1, 0, 0], [0, 1, 0], [0, 0, 1]], but often it is not.

The code is here in case it helps: test-camera-pose (github.com)

Some additional comments:

  1. The Logitech C920 uses a voice coil focusing system, I believe, and I have not had great luck with the stability of voice coil focus optics. If possible, I would suggest getting a camera that has a mechanical (and lockable) focus. If you must use that camera, at a minimum ensure that the focus is in manual/absolute mode. I use the V4L2 backend directly with my cameras and the relevant control is V4L2_CID_FOCUS_ABSOLUTE - hopefully OpenCV supports this setting. Disable autofocus and set an absolute value that gives you acceptable focus, and always use that value for all operations. The problem is that if you use autofocus (or varying absolute focus values), the focal length of the lens changes, and possibly the image center and distortion as well. Better yet, pick the manual focus value you like and then carefully epoxy the lens to the housing so that it is physically restrained from moving. To be clear, I’m not saying that this is why you are getting inconsistent results in your test, but I do think it will cause problems at some point if you don’t address it. (And maybe you are already aware of this and have dealt with it…)

  2. Your calibration images look OK to me, but I would want more points closer to the corners, especially if you intend to use the corners of the image for any of your calculations. I would suggest looking at the Charuco calibration process (Charuco = chessboard + aruco). The aruco markers allow individual chessboard corners to be identified when only part of the pattern is visible, which is helpful because you can get points closer to the edges / corners of your image since you don’t have to see the full target. The code is a little different, but there are tutorials available and it really isn’t that hard to manage. There are online Charuco target generators as well. (Again, I’m not saying that this is your current issue, but I do believe you will get a better camera calibration this way - the score might be higher, but the actual accuracy will be better, particularly if you need accuracy at the edges / corners of your image.)

  3. I can’t really comment on the way you are filtering the points, but as I understand it you end up with a set of image correspondences, and because the images are essentially the same it is trivial to validate those correspondences: the distance between image point 1 and image point 2 should be approximately zero for every correspondence, so if you find ones where it isn’t, dig deeper. (Of course this test is only possible because your images are the same, and isn’t something you could apply once there is motion in your video sequence - the point is to use it to identify the problem so you can fix it, not to rely on it as a way to filter points further.) When I encounter a situation like this, I’m likely to compute the Euclidean distance between the image points, compute some basic statistics on the set (mean, standard deviation), and then flag correspondences that are, say, 2 or 3 standard deviations above the mean; see the sketch after this list. I’d then draw those points on the image pair and inspect the image to see if I can make sense of what happened. But I’m getting ahead of myself.
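Something along these lines (untested, and assuming points0/points1 are the N x 2 arrays you build in camera_pose):

import numpy as np

# Displacement of each correspondence; for two nearly identical frames
# these should all be close to zero.
d = np.linalg.norm(points0 - points1, axis=1)
mean, std = d.mean(), d.std()
suspect = np.where(d > mean + 2 * std)[0]  # 2-sigma flag; 3-sigma also reasonable

print(f"displacement px: mean={mean:.2f} std={std:.2f} max={d.max():.2f}, "
      f"{len(suspect)} suspect correspondences")
# next step: draw points0[suspect] / points1[suspect] on the frames and inspect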

I really don’t have much experience in this domain, but my gut says that you are getting “good” matches (the descriptors match well) from scene points that don’t match well at all, and this is polluting your data and giving bad results. The surface that your robot is standing on - maybe it’s carpet or something similar - looks to me like it would not provide very good / uniquely identifiable features, but rather would provide a lot of opportunity for matching one noisy / unstructured area with some other noisy / unstructured area, especially when you add image noise into the mix. I don’t think I’d want to rely on any of those points that come from the “floor” area, especially once you start moving the camera around and seeing them from different views.

Closing thought
How did you choose 0.6 as your “close enough” value? Have you tried something more restrictive, say 0.1? Maybe consider varying this value so that you are only keeping the best 30 matches?

(the video you posted is not publicly accessible)

Excellent observation. I did have autofocus on. I am using this camera for testing purposes and am not planning to use it long term. I have now retaken some pics/videos with autofocus disabled. A quick search revealed that OpenCV can be compiled to use the V4L2 backend. Good to know.
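In case it is useful to others, disabling it from OpenCV should look something like this (untested on my side; whether the properties actually take effect depends on the backend and driver):

import cv2 as cv

cap = cv.VideoCapture(0, cv.CAP_V4L2)  # V4L2 backend on Linux
cap.set(cv.CAP_PROP_AUTOFOCUS, 0)      # turn off autofocus
cap.set(cv.CAP_PROP_FOCUS, 0)          # fix an absolute focus value (0 seems to be infinity on my C920)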

Thanks, I will try this next time.

This is the part I am most unsure about to be honest. I followed this tutorial OpenCV: Feature Matching and the Ratio test is brought up in the section Brute-Force Matching with SIFT Descriptors and Ratio Test. To simplify, I am now following the simpler approach described in the section Brute-Force Matching with ORB Descriptors. That is, I am using BFMatcher.match() to find the best matches and then I am keeping the 100 closest ones. This makes a bit more sense to me.

matcher = cv.BFMatcher(normType=cv.NORM_L2, crossCheck=True)

matches = matcher.match(descriptors0, descriptors1)
matches = sorted(matches, key=lambda x: x.distance)
matches = matches[:100]  # keep only the 100 closest matches

EDIT: I found the justification for the ratio test. It comes from the SIFT paper, and there is a shorter description in the official OpenCV: Introduction to SIFT tutorial, paragraph 5 (Keypoint Matching). From what I understand, with a KNN matcher and k=2, you discard a match when the best and second-best candidates have nearly the same distance, since such ambiguity is usually a symptom of noise.

The focus of the camera is set to infinity; I recalibrated it and recorded a new video. The results are OK most of the time, but sometimes they are still quite off. Perhaps this is the expected performance. I just wonder how it will work in a real-world test, where videos may be much noisier and surroundings much more cluttered and dynamic. Looking at the video, it seems that the keypoints are matched correctly even when the rotation matrix is way off.

Video here results.mp4 - Google Drive (Hopefully accessible to anyone now)

Based on the video it does look like most/all of the matchpoints are valid, even when you get bad results. I would want to be 100% certain of that, though, and looking at the image pair and trying to follow the lines visually is hard. How about plotting points0 and points1 on a single image (connecting each pair with a line, drawing an error vector, or similar)? Or just look for cases where points0[i] - points1[i] is big (say more than a few pixels).

It doesn't seem like this is your issue - you have a lot of high-quality matches and you are using RANSAC in the findEssentialMat call (so I'd expect outliers to be effectively handled), but it is still worth checking. Also, what does your mask_inliers look like? How many of the original points were retained?
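A quick sketch of what I have in mind, reusing the variables from your camera_pose function (frame0 here is a hypothetical name for the grayscale frame your keypoints came from):

import cv2 as cv
import numpy as np

# Draw each correspondence as a dot plus a displacement vector on one frame.
vis = cv.cvtColor(frame0, cv.COLOR_GRAY2BGR)
for p0, p1 in zip(points0, points1):
    x0, y0 = int(round(p0[0])), int(round(p0[1]))
    x1, y1 = int(round(p1[0])), int(round(p1[1]))
    cv.circle(vis, (x0, y0), 2, (0, 255, 0), -1)
    cv.line(vis, (x0, y0), (x1, y1), (0, 0, 255), 1)
cv.imwrite("displacements.png", vis)

# How many correspondences did RANSAC keep in findEssentialMat?
print("inliers:", int(np.count_nonzero(mask_inliers)), "of", len(points0))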

It’s a good time for me to reiterate that I don’t have much experience with the functions you are using. I do have a lot of experience doing very similar things, but I’m relying on intuition and general experience, so bear that in mind.

You are right that you won’t have perfect data once you start using this in “real world” situations. My point of focusing on the matchpoint quality isn’t because you need to eliminate outliers (findEssentialMat with RANSAC should handle that), but because knowing the quality of the data going in might help shine a light on why the error is happening. Maybe.

The next thing I would look at is whether the error you are seeing is introduced when E is computed or if it shows up only with the recoverPose call. So in addition to watching the R and t matrix, maybe augment your image with E as well. What if, for some reason, mask_inliers isn’t getting populated or used in recoverPose? (just an example - I’m not betting on that being the issue)

I think it’s safe to assume that even after you figure out and fix what is going on, you will have to deal with bad results in real-world situations. Hopefully these bad results are spurious errors and don’t persist over many frames. When you get to this point you’ll need to detect/filter those errors out somehow. There are some quick-and-dirty ways you might approach that problem, but I suggest you look into the Kalman filter.
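To give you a flavor of it, here is a toy sketch using OpenCV's cv.KalmanFilter to smooth a single angle (say, yaw extracted from R); the noise values are made up and would need tuning, and angle_per_frame is just a placeholder for your raw per-frame estimates:

import numpy as np
import cv2 as cv

# Constant-velocity model: state = [angle, angular rate], measurement = [angle]
kf = cv.KalmanFilter(2, 1)
dt = 1 / 60.0  # frame interval at 60 fps
kf.transitionMatrix = np.array([[1, dt], [0, 1]], dtype=np.float32)
kf.measurementMatrix = np.array([[1, 0]], dtype=np.float32)
kf.processNoiseCov = 1e-4 * np.eye(2, dtype=np.float32)
kf.measurementNoiseCov = np.array([[1e-2]], dtype=np.float32)

for measured_angle in angle_per_frame:  # raw per-frame estimates (placeholder)
    kf.predict()
    smoothed = kf.correct(np.array([[measured_angle]], dtype=np.float32))
    print(float(smoothed[0, 0]))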

Best of luck.

Hi Steve, first I want to thank you for your help. You have been clear, patient, and pedagogic.

I have processed a new video (changing platforms, as the forum no longer seems to accept links to Google Drive): results

In this one, I use only the best 100 matches (ranked by their distance) and I draw connections only for the keypoint pairs whose displacement is more than 2 standard deviations above the average. Most of the time I have well-distributed and well-matched features.

I noticed that the problem may actually start with the estimation of the essential matrix E. I estimate it with RANSAC and the default parameters (changing those does not have a significant effect). In the video, the matrix on top is R and the one at the bottom is E; both keep changing at every frame.

I am definitely confused. I thought the camera pose estimate would be easy in these quasi-static conditions. Do you know of an alternative method to estimate the camera pose, by any chance?

I briefly reviewed section 9.6.2, Extraction of cameras from the essential matrix, of Multiple View Geometry by Hartley / Zisserman. I was reminded that:

For a given essential matrix E…and first camera matrix P=[I | 0], there are four possible choices for the second camera matrix P’

Two of the solutions differ simply by negation of the translation vector; the other two differ by a 180 degree rotation about the line joining the two camera centers.

I don’t know how OpenCV handles this ambiguity in the recoverPose() call, or whether this could be the source of your error, but it definitely raises some questions. I have to wonder if the whole approach is flawed, at least with two nearly identical images. No rotation or translation between the two cameras - is this a degenerate / unstable case? (Since E = [t]x R, a zero translation would make E essentially zero, so any estimate of it would be driven mostly by noise.)
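If you want to look at that ambiguity directly, OpenCV exposes the decomposition; a quick sketch:

import numpy as np
import cv2 as cv

# The four candidate poses contained in one essential matrix:
# [R1 | t], [R1 | -t], [R2 | t], [R2 | -t].
R1, R2, t = cv.decomposeEssentialMat(E)
for R in (R1, R2):
    for sign in (1, -1):
        print(np.round(R, 2), np.round(sign * t.ravel(), 2))
# recoverPose() is supposed to pick the combination that passes the
# cheirality check (points in front of both cameras).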

I’m truly getting out of my depth here, and certainly there are others on the forum with a better understanding than I have. Maybe they will jump in to help.

A few thoughts:

  1. Have you undistorted the image points prior to calling findEssentialMat()? If not, try that (see the sketch after this list).
  2. Have you tried to use an image pair that is not identical, but instead has movement between them? Say a simple translation, or rotation. Maybe take a sequence of images from one pose, and then move the camera and take a second sequence. Randomly pick an image from the first sequence and pair it with a random image from the second sequence. Are your results more stable?
  3. You ask if I know of other ways to recover the pose. I do, but I’m not sure they are what you are looking for. I am assuming you are only looking for camera motion in a relative sense - how does the camera pose in frame n compare to the camera pose in frame m. Most of the work I do involves calibrating the pose in an absolute sense - where is the camera in some reference frame that I care about. This takes 3D ground truth points and corresponding image points from the camera you are working with.
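For item 1, if your features were detected on the raw (distorted) frames, something like this before findEssentialMat should do it (dist_coeffs being the distortion coefficients from your calibration; keeping P=camera_matrix leaves the points in pixel coordinates):

import numpy as np
import cv2 as cv

# Undistort the matched pixel coordinates before estimating E.
points0_ud = cv.undistortPoints(
    points0.reshape(-1, 1, 2).astype(np.float32),
    camera_matrix, dist_coeffs, P=camera_matrix,
).reshape(-1, 2)
points1_ud = cv.undistortPoints(
    points1.reshape(-1, 1, 2).astype(np.float32),
    camera_matrix, dist_coeffs, P=camera_matrix,
).reshape(-1, 2)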

To summarize:
Make sure you are using undistorted image points. Try to understand the 4 solution ambiguity and whether it is contributing to your problem. Try to run your algorithm with images from two different views to avoid the possibility of a degenerate / unstable case. Maybe try a scene with more depth variation. Consider buying the book referenced above - if you read and understand everything in chapter 9, I’ll be asking you questions.