How to reconstruct 3D coordinates from a stereo camera pair?

Hi, I have a stereo camera pair whose only offset is along the x-axis, so the cameras are at the same height and their view axes are parallel to each other and to the ground.

My goal is to reconstruct 3D points of elements lying on the ground.

However, I am very confused about how to proceed.

My planned procedure is as follows:

  1. Calibrate the stereo pair with cv.stereoCalibrate()
  2. Use cv.stereoRectify() to obtain both projection matrices
  3. Use a neural network to detect objects of interest in both images and use the centers of the bounding boxes as reference points
  4. Use cv.triangulatePoints() to obtain 3D real-world points, in meters relative to the camera, as X, Y and Z coordinates

I am very new to computer vision and this is the basic plan I managed to put together during my research.

Is this the proper way to proceed? Or am I missing something?

Thank you very much


that would work.

usually people compute a “dense” per-pixel disparity map, convert that into a depth map, and work with that. it’s an expensive process, unless done in hardware (then the hardware pays the price). a common algorithm for this is called “block matching”.

running DNN inference is also expensive. you have to decide if two inferences are cheaper than one inference and one block matching run.

taking a bounding box in each eye and hoping to throw that back into the scene can work but you’ll have to be careful with the geometry.

imagine a picture frame sitting around your object in the scene. in each view, you aren’t getting a frontal view of that picture frame, but a slightly side view, so it’s not an upright rectangle but a trapezoid or something like that. I mean… working with the bounding boxes as they are… is too simple.

without much thought I’d take the vertical center line of each bounding box, each represents a plane into the scene, and they should intersect in the vertical axis going through the object. that’s probably a good start. then you could take the widths of the bounding boxes and figure a radius from that, around the axis.
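
For a rectified pair, intersecting the two center-line planes reduces to a disparity calculation on the boxes' horizontal centers. A minimal sketch under that assumption, with boxes given as `(x, y, width, height)` in rectified pixels; `locate_from_boxes` and all the numbers in the usage are hypothetical.

```python
def locate_from_boxes(box_left, box_right, focal_px, cx, baseline_m):
    """Estimate lateral offset X, depth Z and a rough radius from two bounding boxes.

    Assumes both boxes come from rectified images that share the focal length
    focal_px (pixels) and principal point column cx.
    """
    x_left = box_left[0] + box_left[2] / 2.0      # vertical center line, left view
    x_right = box_right[0] + box_right[2] / 2.0   # vertical center line, right view
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("non-positive disparity: boxes do not match a point in front of the cameras")
    depth = focal_px * baseline_m / disparity     # Z along the view axis
    lateral = (x_left - cx) * depth / focal_px    # X relative to the left camera
    mean_width_px = (box_left[2] + box_right[2]) / 2.0
    radius = 0.5 * mean_width_px * depth / focal_px
    return lateral, depth, radius

# Made-up boxes: centers at x=420 (left) and x=372 (right), both 80 px wide.
X, Z, r = locate_from_boxes((380, 200, 80, 120), (332, 200, 80, 120),
                            focal_px=800.0, cx=320.0, baseline_m=0.12)
print(X, Z, r)
```

The radius is only a crude size estimate, since it treats the box width as the width of a flat object facing the camera at depth Z.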

or you could consider a “pyramid” going through each bounding box, and intersect those two pyramids to get a volume in space.

Hey, thank you for your quick reply. Yeah, I forgot that matching exists. Of course I can run inference on one image and match the points in the other one.

Furthermore, I would need to run undistortPoints() on the points I detected, right?

if you run block matching for a depth map, you’ll undistort and rectify both eyes anyway, so you can run inference on these pictures and that’s it.

otherwise you do need undistortPoints.

Now I am confused. How come they are already undistorted and rectified?

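And do I understand you correctly that I either need to run inference on both images, or run inference on one and compute block matching?

they aren’t, not until you do it yourself. undistortion and rectification are a prerequisite step for block matching, so if you go that route you will have produced those images anyway.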

and yes that’s what I said.