Stereo camera - how small an object can be detected?


I’m interested in the depth resolution of stereo cameras.

Imagine a camera that “looks” at a flat surface: it is placed parallel to the surface and mounted, e.g., 100 cm above it.

Does anyone have experience with how small an object can be detected? For example, if I put a small coin on the flat surface, can we expect to see the change in the 3D point cloud (color map)?

you need a good book on the matter. look for “Multiple View Geometry in Computer Vision” by Hartley and Zisserman.

most of the math is similar triangles, intercept theorem, etc

practical resolution/precision/accuracy entirely depends on everything involved in calculating a depth map from a stereo pair of images. more is involved than you might guess.

you seem to be asking about Z-resolution specifically.

  • let’s say the image is 1920 pixels wide, with a 70 degree horizontal field of view
  • in practice (block matching) you can localize features at sub-pixel resolution but let’s go with full pixel resolution
  • 1 meter away
  • 63.5 mm baseline for the stereo pair
\newcommand{\mm}{~\mathrm{mm}} \newcommand{\px}{~\mathrm{px}} \begin{align*} \tan \left( \frac{70^\circ}{2} \right) \cdot f_x &= \frac{1920\px}{2} \\ f_x &= 1371.02\px \end{align*}

fundamental constant of your camera (and chosen resolution).
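
as a quick check of that number, here is a minimal python sketch of the same pinhole relation. the function name `focal_length_px` is made up for illustration, and it assumes an ideal pinhole camera with no lens distortion.

```python
import math


def focal_length_px(image_width_px: float, hfov_deg: float) -> float:
    """Focal length in pixels of an ideal pinhole camera (hypothetical helper).

    Half the image width subtends half the horizontal field of view, so
    tan(hfov / 2) * f_x = width / 2.
    """
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)


f_x = focal_length_px(1920, 70.0)
print(f"f_x = {f_x:.2f} px")
```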

both cameras stare at infinity. their optical axes pierce the wall. those intersection points are 63.5 mm apart. how many pixels apart do those points appear (at a meter away)?

\begin{align*} \frac{63.5\mm}{1000\mm} \cdot f_x &= x \\ x &= 87.06\px \end{align*}

that’s in pixels of disparity. in other words, at a meter distance, 63.5 mm are equivalent to 87 pixels.
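
the same step as a python sketch: for a rectified, parallel stereo pair, the disparity of a point at depth Z is f_x * B / Z. `disparity_px` is a hypothetical helper name, and the focal length is the rounded value from the example.

```python
def disparity_px(f_px: float, baseline_mm: float, depth_mm: float) -> float:
    """Disparity in pixels of a point at depth_mm (rectified, parallel cameras assumed)."""
    return f_px * baseline_mm / depth_mm


F_X = 1371.02     # px, from the 1920 px / 70 degree example
BASELINE = 63.5   # mm

d = disparity_px(F_X, BASELINE, 1000.0)
print(f"disparity at 1 m: {d:.2f} px")
```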

I’m gonna reuse x a lot. read it as a question mark.

okay, so let’s add/subtract one pixel from that and see where the pixel goes on that wall:

\begin{align*} \frac{x}{(87-1)\px} &= \frac{63.5\mm}{87\px} \\ x &= \frac{63.5\mm}{87\px} \cdot (87-1)\px \\ x &= 62.77\mm \\ \\ \frac{x}{(87+1)\px} &= \frac{63.5\mm}{87\px} \\ x &= 64.23\mm \end{align*}

and now we figure out where those rays would intersect. a drawing of a bunch of triangles would help here but… eh

\begin{align*} \frac{62.77\mm}{1000\mm} &= \frac{63.5\mm}{z_1} \\ z_1 &= 1011.63\mm \\ \\ \frac{64.23\mm}{1000\mm} &= \frac{63.5\mm}{z_2} \\ z_2 &= 988.64\mm \end{align*}

so with that setup, for a full-pixel movement at 1 meter, you’d get ± 11-12 mm of depth.
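
the two-step triangle argument collapses into Z = f_x * B / d, so the numbers can be cross-checked in a few lines. this is only a sketch with made-up names (`depth_mm`, `z_far`, `z_near`). it keeps the unrounded disparity of about 87.06 px instead of 87, so the last digit can differ slightly from the hand calculation. the first-order step Z² / (f_x · B) comes from differentiating Z = f_x * B / d and is a general approximation, nothing specific to this setup.

```python
def depth_mm(f_px: float, baseline_mm: float, disparity_px: float) -> float:
    """Depth of a point from its disparity (rectified, parallel cameras assumed)."""
    return f_px * baseline_mm / disparity_px


F_X = 1371.02     # px, from the 1920 px / 70 degree example
BASELINE = 63.5   # mm
Z = 1000.0        # mm, distance to the surface

d = F_X * BASELINE / Z                     # disparity at 1 m, about 87 px
z_far = depth_mm(F_X, BASELINE, d - 1.0)   # one pixel less disparity: farther away
z_near = depth_mm(F_X, BASELINE, d + 1.0)  # one pixel more disparity: closer

# first-order approximation of the depth change per pixel of disparity
step_approx = Z ** 2 / (F_X * BASELINE)

print(f"far:  {z_far:.2f} mm (+{z_far - Z:.2f})")
print(f"near: {z_near:.2f} mm (-{Z - z_near:.2f})")
print(f"approx step: {step_approx:.2f} mm per pixel")
```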

another example: OpenCV’s (CPU) stereo module assumes 4 bits of subpixel resolution, i.e. steps of 1/16 pixel. in that case you can expect to get less than a millimeter of z resolution…

but that doesn’t mean you’ll get it. that’s just the best case.
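
to put a number on that best case: OpenCV’s block matchers return disparity as fixed-point integers scaled by 16, so the smallest representable step is 1/16 px. a rough sketch, using the first-order approximation dZ ≈ Z² / (f_x · B) · Δd (which follows from differentiating Z = f_x · B / d). variable names are mine, and the raw value 1393 is just a hypothetical matcher output for this setup.

```python
F_X = 1371.02     # px, from the 1920 px / 70 degree example
BASELINE = 63.5   # mm
Z = 1000.0        # mm

SUBPIXEL_STEP = 1.0 / 16.0   # 4 fractional bits

# depth change for one full pixel and for one 1/16 px step of disparity
dz_full = Z ** 2 / (F_X * BASELINE)
dz_sub = dz_full * SUBPIXEL_STEP

# a raw fixed-point disparity has to be divided by 16 to get pixels
raw_disparity = 1393                   # hypothetical raw value
disparity_px = raw_disparity / 16.0    # 87.0625 px

print(f"full-pixel step: {dz_full:.2f} mm")
print(f"1/16 px step:    {dz_sub:.2f} mm")
```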

you can improve the situation by using a wider baseline, or by getting closer to the object, or with a higher resolution camera, or with a narrower field of view (zoom lens?).
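
a small sketch to see how much each of those knobs buys you. `depth_step_mm` is a hypothetical helper that combines the pinhole focal length with the first-order step Z² / (f_x · B) per pixel of disparity. the alternative values (127 mm baseline, 0.5 m, 3840 px, 35 degrees) are arbitrary picks for comparison, not recommendations.

```python
import math


def depth_step_mm(width_px, hfov_deg, baseline_mm, depth_mm, disparity_step_px=1.0):
    """Approximate depth change per disparity step (first-order, ideal pinhole)."""
    f_px = (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    return depth_mm ** 2 / (f_px * baseline_mm) * disparity_step_px


base = dict(width_px=1920, hfov_deg=70.0, baseline_mm=63.5, depth_mm=1000.0)

# each variant changes exactly one parameter of the reference setup
variants = {
    "reference": {},
    "2x baseline": {"baseline_mm": 127.0},
    "half the distance": {"depth_mm": 500.0},
    "2x resolution": {"width_px": 3840},
    "half the field of view": {"hfov_deg": 35.0},
}

steps = {name: depth_step_mm(**{**base, **change}) for name, change in variants.items()}
for name, step in steps.items():
    print(f"{name:>24}: {step:6.2f} mm per pixel of disparity")
```

distance enters squared, so halving it helps the most. baseline and resolution enter linearly, and the field of view enters through the tangent of its half angle.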

maybe this gives a bit of intuition:

@crackwitz thank you for the detailed explanation.

One thing which I don’t understand is “subpixel” resolution. Could you write a little bit more about it?

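import numpy as np
import cv2 as cv

# ten tiles of 100 px width, each with a blob (20 px tall, 40 px wide)
# that sits 8 px lower than in the previous tile
im = np.zeros((100, 1000), np.float32)

for k in range(10):
    tile = im[:, 100*k:100*(k+1)]  # a view, so writing to it writes into im
    tile[8*k:8*k+20, 30:70] = 1.0

# shrink 10x with area averaging, then enlarge again so single pixels are visible
downscaled = cv.resize(im, (100, 10), interpolation=cv.INTER_AREA)
upscaled = cv.resize(downscaled, (1000, 100), interpolation=cv.INTER_NEAREST)

cv.imshow("upscaled", upscaled ** (1/2.4)) # this is for gamma correction, ignore it
cv.waitKey()  # keep the window open until a key is pressed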

for this blob’s position you could give integer coordinates, but really you could also give coordinates that have meaningful fractional values, i.e. subpixel position.

in this case, the y coordinates would be

>>> np.arange(0, 10) * 8 / 10
array([0. , 0.8, 1.6, 2.4, 3.2, 4. , 4.8, 5.6, 6.4, 7.2])
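
one way to actually recover such fractional positions from the low-resolution image is an intensity-weighted centroid. the sketch below rebuilds the blob image with numpy only; a reshape-and-mean stands in for the area-averaging downscale, which is my assumption for this exact integer 10x shrink. it then estimates each blob’s top edge from the 10-row image. all names are made up for the example. as far as I know, stereo block matchers usually get subpixel disparity by interpolating the matching cost rather than by centroids, so this only shows that the gray values of coarse pixels carry fractional position information.

```python
import numpy as np

# same layout as the demo: ten tiles, blob shifted down by 8 px per tile
im = np.zeros((100, 1000), np.float64)
for k in range(10):
    im[8*k:8*k+20, 100*k+30:100*k+70] = 1.0

# 10x downscale by averaging each 10x10 block -> shape (10, 100)
small = im.reshape(10, 10, 100, 10).mean(axis=(1, 3))

row_centers = np.arange(small.shape[0]) + 0.5   # y of each pixel center

tops = []
for k in range(10):
    tile = small[:, 10*k:10*(k+1)]        # one tile of the small image
    profile = tile.sum(axis=1)            # brightness per row
    centroid = (row_centers * profile).sum() / profile.sum()
    tops.append(centroid - 1.0)           # blob is 2 small pixels tall
tops = np.array(tops)

print(np.round(tops, 3))
```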