Non-planar stereo camera calibration (Two cameras with a physical Z-offset)

Hi all,

I’m planning on creating a stereo camera that will have two camera modules on flexes. They will have different Z-offsets from each other, but will otherwise be pointed in the same direction (parallel optical axes).

Will I have trouble with extrinsic calibration / disparity doing this? They will have about 2cm difference in Z-offset.

I can’t predict that; I don’t have experience with such setups.

imagine the epipolar lines. calibration would probably have to severely contort the views.

the involved implementations make some assumptions. I am not sure that what you are doing falls within those assumptions.

you’d best grab all the books on this that you can, and do literature research for papers.

Thanks.

Hmm… wondering if there are assumptions in stereoRectify that make this doable.

I’m getting an issue where the calibration seems to pretty severely crop the usable area.

block matching requires epipolar lines to be parallel and horizontal.

imagine the line connecting both camera origins. the epilines all run along that direction, or rather, they meet in the vanishing point of that connecting line, which is the epipole.

for cameras that sit beside each other, so no Z difference, that line crosses the viewing direction at right angles. it has no vanishing point in the image, so the epilines are parallel in image space too.

if the cameras have a Z difference (but otherwise face the same direction), that connecting line will have a vanishing point in each view. it might be off-screen or it might be in view. in any case, that is where the epilines in both views meet. they are not parallel in image space, and that is not what block matching requires.
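
to make that concrete, here’s a quick sketch (made-up intrinsics, not anyone’s real calibration) that projects each camera’s center into the other view. with no Z difference the projection divides by zero, i.e. the epipole sits at infinity and the epilines stay parallel; with a Z difference you get a finite pixel location where all the epilines converge.

```python
import numpy as np

# sketch with made-up numbers: where is the epipole for a rig with a pure
# translation between the cameras? OpenCV's convention is x2 = R @ x1 + T,
# so with R = I the second camera's center, seen from the first camera, is -T.

K = np.array([[800.0,   0.0, 320.0],   # assumed identical intrinsics
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # same orientation, only an offset
T = np.array([0.06, 0.0, 0.01])        # 6 cm baseline, 1 cm Z difference

def epipole(K, C):
    """project a camera center C (given in this camera's frame) to pixels."""
    e = K @ C
    return e[:2] / e[2]                # divides by zero when C[2] == 0

print("epipole in view 1:", epipole(K, -R.T @ T))  # cam 2 as seen by cam 1
print("epipole in view 2:", epipole(K, T))         # cam 1 as seen by cam 2
```

with these numbers both epipoles land around x = 5120 px on a 640 px wide image: far off-screen, but finite, so the epilines measurably converge.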

stereoRectify’s task is to warp a view such that those lines become parallel. that function does NOT expect such intentional shenanigans. it expects nearly parallel epilines. as I said above, warping those views is going to require serious contortions and likely overwhelms the function’s abilities. the implicit assumption is that you aren’t doing any weird stuff. what you are doing is weird stuff, not a regular stereo rig.

imagine warping those views. “virtual” cameras would have to be constructed that face at right angles to the connecting line, i.e. the camera matrices are calculated to be panning sideways. and then the view pyramid/frustum has to be skewed to keep the actual image content within it. that means the new views won’t have their principal points in the middle of the image content, but somewhere off to the side.

so, yes, books and such. do not expect the library to be helpful here. the functions, even for what they’re intended for, are severely under-documented and contain magic that is hard to predict. one piece of magic is that cursed “alpha” parameter adjusting the crop. more magic happens in all the numerical optimizations that throw matrices at you that are close to inexplicable. at least that was my experience last time, when I was new to all of this, hadn’t had any classes on the topic, and thought it couldn’t be that hard, the library would guide me.
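
if you want to poke at the alpha magic directly, a sweep like this (again with made-up stand-in numbers) at least makes the cropping visible through the valid-pixel ROIs that stereoRectify reports:

```python
import numpy as np
import cv2

# sketch: sweep alpha and print the valid-pixel ROIs plus the principal
# points of the new projection matrices. all numbers are stand-ins.

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)
R = np.eye(3)
T = np.array([0.06, 0.0, 0.01])        # baseline plus the troublesome Z offset
size = (640, 480)

for alpha in (0.0, 0.5, 0.8, 0.9, 1.0):
    R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
        K, dist, K, dist, size, R, T, alpha=alpha)
    print(f"alpha={alpha}: roi1={roi1}, roi2={roi2}, "
          f"pp1=({P1[0, 2]:.1f}, {P1[1, 2]:.1f})")
```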


a sketch of the situation

not as illustrative as it could be. some angles happen to be equal, causing the views’ left edges to coincide with the optical axes, because I didn’t plan the pen strokes.

Thank you. It sounds like this is very much unexplored territory… “here be dragons”.

And yeah… the alpha parameter is one of the many things that is very confusing. Different values give me results that I can’t really explain: 0.8 vs 0.9 vs 1.0…

It sounds like I’m setting myself up for pain trying to do this.

I’m playing around setting stuff up in Blender to generate synthetic data, but so far not having tremendous amounts of luck getting it to behave.

I can’t speak to the capabilities of stereoRectify() or other OpenCV functions, but I don’t think there is anything fundamentally wrong with your setup. I’ve always thought of the “two cameras in the same plane” arrangement as a special case / simplification of the general case, and I would expect stereo to work for a very wide range of camera configurations, including Z differences, significant rotation between the cameras (non-parallel optical axes), etc.

In the discussions I’m familiar with, the epipoles (where the camera center of one camera projects to in the other camera’s image) are usually within the bounds of the image (probably just for the sake of clarity / ease, as there is no such requirement). (See images from “Multiple View Geometry” by Hartley/Zisserman below.)

Later in the same book (section 11.12, Image Rectification), they go on to say:

“This section gives a method for image rectification, the process of resampling pairs of stereo images taken from widely differing viewpoints in order to produce a pair of “matched epipolar projections”. These are projections in which the epipolar lines run parallel with the x-axis and match up between views, and consequently disparities between the images are in the x-direction only, i.e. there is no y disparity.”

The points here are:

  1. Works for widely differing viewpoints
  2. The purpose of rectifying the images is to simplify the search for matching features. You can also search for matching features by searching along epipolar lines in the original (non-rectified) images, but it’s just easier to search along horizontal lines in the image.

Again, I can’t speak to the implementation of the OpenCV stereoRectify() algorithm, but I have always assumed it would work just fine in the general case.
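
For what it’s worth, OpenCV also exposes the Hartley-style route the book describes (rectifying homographies estimated from correspondences alone, mapping the epipoles to infinity) as stereoRectifyUncalibrated. Here’s a rough sketch with placeholder filenames, and no claim that it behaves well in the extreme configurations discussed above:

```python
import numpy as np
import cv2

# Sketch: uncalibrated rectification from matched features.
# "left.png" / "right.png" are placeholder filenames.
imgL = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
imgR = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(2000)
kL, dL = orb.detectAndCompute(imgL, None)
kR, dR = orb.detectAndCompute(imgR, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(dL, dR)

ptsL = np.float32([kL[m.queryIdx].pt for m in matches])
ptsR = np.float32([kR[m.trainIdx].pt for m in matches])

F, inliers = cv2.findFundamentalMat(ptsL, ptsR, cv2.FM_RANSAC, 1.0, 0.999)
keep = inliers.ravel() == 1
ok, H1, H2 = cv2.stereoRectifyUncalibrated(
    ptsL[keep], ptsR[keep], F, imgL.shape[::-1])

if ok:
    # After warping, matched epipolar lines run horizontally, so
    # disparities are x-only ("matched epipolar projections").
    rectL = cv2.warpPerspective(imgL, H1, imgL.shape[::-1])
    rectR = cv2.warpPerspective(imgR, H2, imgR.shape[::-1])
```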

I would recommend picking up a copy of “Multiple View Geometry” by Hartley/Zisserman.



I’m gonna go off on what seems like a tangent. my perception of the situation probably errs on the side of thinking it worse than it is. there are, however, clear limits to the math and the practice.

you stated Z difference of “2 cm”, but how much baseline/IPD?

yes, that is for illustration purposes, “not to scale” I’d say.

mathematically, those points are always “in view” because there is no reason the field of view can’t approach 180 degrees.

in the case of the general multi-view situation (SfM etc.), it’s fine to have those points practically in view, and it even has advantages. optical axes crossing at right angles would give the math the best possible condition.

for block matching, which is a stereo situation, it’s actually misleading, a bad situation to have. by that I primarily mean Z differences (“side-eyed”) but also severely “cross-eyed” setups. these only differ in which eye goes “cross” in what direction. the more “cross”, the worse.

the epipole in view… that means you could see one of your eyes with your other eye (remove nose to demonstrate). that is how severely cross-eyed this is. imagine trying to rectify such a view. imagine the homography. you have that vanishing point in view, and now you’re supposed to push it off to the side (“maps the epipole to a point at infinity”) and produce a top-down view (parallel epilines).

imagine a front-facing camera in a car, on a straight road. the vanishing point of the road is in view. now try homographing that to a top-down view.

the situations are equivalent.
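
here’s a toy version of that with made-up coordinates: map a trapezoid of road (edges converging toward the vanishing point) to a top-down rectangle, and watch what the homography does near the vanishing point.

```python
import numpy as np
import cv2

# toy numbers, not from any real camera: the road trapezoid converges
# toward a vanishing point just above y = 300; pixels near it get
# stretched without bound by the top-down warp.
src = np.float32([[260, 300], [380, 300],    # far edge of the road
                  [80, 470], [560, 470]])    # near edge, bottom of frame
dst = np.float32([[0, 0], [200, 0],
                  [0, 600], [200, 600]])     # bird's-eye rectangle

H = cv2.getPerspectiveTransform(src, dst)
print(H)
# topdown = cv2.warpPerspective(frame, H, (200, 600))  # 'frame' is a placeholder
```

the closer the source points get to the vanishing point, the more absurd the warp. that is exactly the trouble with an epipole in view.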

stealing pics from elsewhere:

so YES, if the goal is block matching for stereo vision, then avoid any such situations.

supposing the starting situation is not as severe, you can work with that. the newly calculated views (camera matrices) still have some properties to be aware of (principal point relative to image bounds, …). it’s not like you’d still look at the target straight on. if those were your eyes, your foveae would be out of work very quickly; you’d look at the target peripherally.

slight Z differences, as in Fig. 9.3 in Steve’s post, can be corrected, but that is still a materially (if slightly) worse starting situation than no Z difference. it’s a correction of a suboptimal situation, not simply a normalization/transform of a fine situation.

I’m thumbing through the book for any discussion of block matching for stereo vision. there is a little bit written on the geometry of it in section 11.12, but not enough to help you practically or convey any intuition. that is left as an exercise to the reader.

yet the example images in the book don’t show severe cases. they show some in-plane rotation and a bit of cross-eyeing, but nothing where the epipole comes near being in view.

that’s what I take issue with. it’s a math book only, not helping with practical aspects. the book makes those claims without actually demonstrating them. do not just buy into that. the printed word does not win arguments by virtue of having been printed; the arguments have to convince. I believe I demonstrated what positions on this situation are even entertainable.

part from the book:

“Since the application of arbitrary 2D projective transformations may distort the image substantially, the method for finding the pair of transformations subjects the images to a minimal distortion.”

that is waffling. non sequitur. a prime example of my disdain for “academic” writing, the antithesis to technical writing. that phrase alone should shake anyone reading the paragraph into critical reception. the transform is determined (in non-degenerate cases), i.e. the perspective part of it. there is no “minimal distortion” solution to choose from any solution space. any degrees of freedom left are inconsequential to quality: zoom/scale/stretch, translation, and rotation. at most, one should caution to pick those degrees of freedom pragmatically (i.e. not scaling down to thumbnail size or squashing comically).


Thank you to both for your inputs.

I’ve been trying with variations of this sort of scene (note: some of the pics below are from slightly different arrangements of objects):

(I output L / R images without the R / B effect – this is just for viz)

In this example, they’re set to 25 mm focal length cameras, 0.06 m apart on the X axis and 0.01 m apart on the Z axis.

If there’s no Z displacement, I get reasonable(?) results such as this:

If there is Z displacement, things rapidly go to hell. The rectification breaks completely.

I get stuff like this:

alpha doesn’t really help much.
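
For reference, a boiled-down sketch of what I’m feeding to stereoRectify (not my exact script; the pixel focal length assumes a 36 mm-wide sensor at 640 px, and the sign of T depends on which camera you call “left”):

```python
import numpy as np
import cv2

# boiled-down sketch: compare no Z offset against the 1 cm offset.
# fx assumes a 36 mm-wide sensor at 640 px -- match your own Blender settings.
w, h = 640, 480
fx = 25.0 / 36.0 * w                    # 25 mm lens on a 36 mm-wide sensor
K = np.array([[fx, 0, w / 2], [0, fx, h / 2], [0, 0, 1]])
dist = np.zeros(5)
R = np.eye(3)                           # parallel optical axes

for tz in (0.0, 0.01):
    T = np.array([-0.06, 0.0, -tz])     # sign convention: x2 = R @ x1 + T
    R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
        K, dist, K, dist, (w, h), R, T, alpha=1.0)
    print(f"tz={tz}: roi1={roi1}\nR1=\n{R1}")
```

Comparing the two runs shows how much the ROIs and rectifying rotations shift for just 1 cm of Z offset.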

I’ve been exploring ZNCC and SAD, and my implementation seems to be a bit better. I’m able to get something usable (note that there was a mat with texture on it below the apple, hence the unusual patterning):

But it’s slow as heck (all Python, and horribly un-optimized code).
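
For anyone who finds this later: one common way to get pure-Python block matching out of per-pixel loops (a sketch, not my current implementation) is to loop only over disparities, shift the right image, and let a box filter do the window sums:

```python
import numpy as np
import cv2

def sad_disparity(left, right, max_disp=64, block=9):
    """Vectorized SAD sketch: one loop over disparities, window sums via a
    box filter, winner-take-all at the end. The left border (x < d) is biased
    because the shifted image is zero-padded there."""
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    h, w = left.shape
    cost = np.empty((max_disp, h, w), dtype=np.float32)
    for d in range(max_disp):
        shifted = np.zeros_like(right)
        shifted[:, d:] = right[:, :w - d]                 # shift right view by d
        ad = np.abs(left - shifted)                       # per-pixel cost
        cost[d] = cv2.boxFilter(ad, -1, (block, block))   # windowed SAD
    return np.argmin(cost, axis=0).astype(np.float32)     # disparity map
```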

I can agree that the in-plane translation-in-x-only configuration is good for many stereo vision applications, but primarily because it simplifies finding correspondences. You get to use block matching because there aren’t scale or rotation differences between views. This is no small thing! But it’s also a constrained setup that doesn’t work in all cases.

For example, if you want higher precision depth measurements, you have some options including narrower FOV and wider baseline. But a wider baseline and/or narrower FOV means a larger “dead zone” where you can’t get depth measurements. A reasonable and workable solution is to increase the baseline and rotate one or both of the cameras so you cover the volume you are interested in. Yes, you probably have to give up block matching, but there is nothing inherently wrong with this setup.

The example you gave of a front facing camera on a car is, I think, an extreme case. The image was transformed to match an overhead view, so a 90 degree rotation between the two views/cameras. A 90 degree rotation might be too much, but what about 45 degrees, 30 degrees etc?

I’m not trying to get into an argument about this, but I wouldn’t want someone to find this thread and think that stereo vision only works in highly constrained configurations.

When you are testing different configurations, are you calibrating the rig (calling cv::stereoCalibrate), or are you manually setting the parameters based on the physical setup? If you are calibrating it each time, do you get good scores and do the results match what you expect?

I’m surprised that a 1cm Z displacement would be so disruptive and I’m curious why the rectification breaks completely. I wouldn’t expect it to be so brittle, and I would want to understand what is going on and why.
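
In case it’s useful, the sanity check I have in mind looks roughly like this (the argument names are placeholders for your detected chessboard corners and per-camera calibrations):

```python
import cv2

def check_rig(objpoints, imgpts_left, imgpts_right, K1, d1, K2, d2, image_size):
    """Placeholder-named sketch: objpoints is a list of (N, 3) board
    coordinates, imgpts_* the matching detected corners for each view."""
    rms, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
        objpoints, imgpts_left, imgpts_right,
        K1, d1, K2, d2, image_size,
        flags=cv2.CALIB_FIX_INTRINSIC)      # keep the mono calibrations fixed
    print("RMS reprojection error:", rms)   # well under 1 px suggests a healthy rig
    print("recovered T:", T.ravel())        # expect ~0.06 m X and ~0.01 m Z
    return R, T
```

If the RMS is low and T matches the physical offsets you built, the calibration itself is fine and the trouble is squarely in rectification.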