What is the arbitrary scaling factor s in camera calibration?

I’m trying to figure out what size an object at distance d from the camera would have in pixels.

In my understanding this is exactly what the camera matrix should do. Given a 3D point (x, y, d), it should project that point onto the UV image plane as (u, v). But in the documentation an “arbitrary scaling factor” s is introduced, which seems to throw away all the useful information that I put into the calibration, meaning the physical coordinates of my chessboard corners in mm.

To my understanding, mathematically, if OpenCV didn’t normalize by that factor s, all the relevant information would fall out of the calibration process, including the physical sensor size etc. Why is all of that thrown away? And if it’s not, can I somehow retrieve that factor s to do 3D reconstruction in physical units using the camera matrix?

Thanks!!

I’m now at a point where I’m pretty sure I cannot retrieve that normalization factor.
It makes me kind of angry; I feel like I have to reimplement the whole calibration process if I want an accurate projection matrix in physical units that is estimated on multiple images. Even though all of that is already implemented, tested and used by thousands of people. But no one ever thought we might need that data.

You wouldn’t even need to provide the values for cv.calibrationMatrixValues() externally; they’re all in these calibration functions … just not returned ._.

I’m just going to use the sensor size as given by the manufacturer to retrieve a physical projection matrix, since reimplementing the calibration is just too much work. Even though these hardcoded values are completely unnecessary (from a bird’s-eye view of the problem).

that represents the projection, i.e. the division by z. note how in that equation both sides are 3-dimensional vectors, and one of them has z=1, and both sides are equated. s represents the division. in fact s = z… of whatever’s on the RHS if you already performed the matrix-vector multiplication.

other times it’s called w.

(x, y, 1) * w = (xw, yw, w) = (X, Y, Z)

no. it merely “throws away” the z information. which is what a projection is supposed to do. because everything’s on the same plane now. and z isn’t actually gone, because you still have that, just not in that vector.

Well, I know what a projection is. Of course, when projecting a 3D point or a set of them onto a plane, the Z coordinate is lost; that’s not what I’m complaining about.

This is taken directly from the documentation; this is the s I’m referring to:

s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}

Let’s take a simple example to explain what I mean.

Suppose f is the focal length of my camera, s_x is the width and s_y the height of my sensor in millimeters, p_x is the width and p_y is the height of my image in pixels.

then this should follow (assuming no translation and rotation):

A * \begin{bmatrix}s_x \\ s_y \\ f\end{bmatrix} = \begin{bmatrix}p_x \\ p_y\end{bmatrix}

(where A is the camera matrix)

This is not the case, since the calibration matrix is, in addition to being normalized by z, normalized by s. As the documentation says, and as these lines of source code do.

Or in other words: when calibrating, it does not matter by which constant factor all of my world coordinates are multiplied; you always get the same camera matrix. However large my chessboard squares are in physical units, that size is factored out, leaving a calibration matrix only in terms of pixels (which would be fine, if that factor were returned anywhere).

If you want to understand this, you need to let go of some misconceptions, among them the math you already stated, and a bunch of your claims. And you need to accept that the confusion isn’t an issue with the math that’s in the docs, but primarily with the math you invent, and secondarily how you interpret the docs.

All the math in the docs, that you’ve quoted so far, is fine. Start from that. I’m hedging my bets, because the docs aren’t infallible. These parts of the docs are fine though.

The naming you come up with adds to the confusion. Camera matrices are called K, not A. and if you want to involve sensor size and resolution, I’d recommend doing that purely to calculate the focal length (as pixels).

focal length is two things: either a physical quantity (length), or a number of pixels.

f [pixels] = (focal length [mm]) / (pixel pitch [mm / pixel])
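
to make that conversion concrete, a minimal Python sketch (all numbers are made up for illustration):

```python
focal_length_mm = 4.0      # physical focal length (made-up number)
sensor_width_mm = 6.17     # physical sensor width (made-up number)
image_width_px = 4000      # image width in pixels (made-up number)

pixel_pitch_mm = sensor_width_mm / image_width_px    # mm per pixel
f_pixels = focal_length_mm / pixel_pitch_mm          # this is the f_x that lands in the camera matrix
print(f_pixels)   # ~2593 px
```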

The optical axis is the center. The view is a triangle that is similar to the triangle spanned between the lens origin and the sensor placed at the focal distance (physical distances). as such, if you want a point on the edge of the view, you have to place it a focal distance away (ok so far), but HALF the sensor width off the axis, not an entire sensor width away.
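
a quick numeric check of that “HALF the sensor width” point, again with made-up numbers, the principal point assumed at the image center, and pixel-center conventions ignored:

```python
focal_length_mm = 4.0      # made-up physical focal length
sensor_width_mm = 6.17     # made-up physical sensor width
image_width_px = 4000      # made-up image width

f_px = focal_length_mm * image_width_px / sensor_width_mm   # focal length in pixels
c_x = image_width_px / 2.0                                  # principal point assumed at the image center

# a point a focal length away, HALF a sensor width off the axis (physical units)
X, Z = sensor_width_mm / 2.0, focal_length_mm

u = c_x + f_px * (X / Z)
print(u)   # == image_width_px: that point lands on the right image edge
```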

That equation doesn’t even have the right shapes. Does not compute. the matrix is applied backwards (v M), needs to be applied the right way (M v). And I would recommend putting inputs on the RHS and the result on the LHS.

\begin{bmatrix}p_x \\ p_y\end{bmatrix} = K \cdot \begin{bmatrix}s_x \\ s_y \\ f\end{bmatrix}

Still needs projection, and a fix on the LHS for the shape.

s \cdot \begin{bmatrix}p_x \\ p_y \\ 1\end{bmatrix} = K \cdot \begin{bmatrix}s_x \\ s_y \\ f\end{bmatrix}

Just try to calculate the RHS. you’ll get some vector that’s not on z=1. divide by z. then it’s got z=1. that’s it. that’s what the s is there for.
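
a tiny numeric sketch of that, with made-up intrinsics and a made-up camera-frame point:

```python
import numpy as np

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])   # made-up camera matrix
P_cam = np.array([0.1, -0.05, 2.0])     # made-up (X_c, Y_c, Z_c)

rhs = K @ P_cam     # some vector that is not on z = 1
s = rhs[2]          # the "arbitrary" scale factor: it is just Z_c here (2.0)
uv1 = rhs / s       # divide by z -> now it is (u, v, 1)
print(s, uv1)
```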

Certainly the camera matrix isn’t being normalized by anything. It’s a constant, unaffected by what the camera sees. there can be no s and no z affecting it.

You’re linking to some code, the head of a loop, that contains some additions and multiplications, but no division (normalization). the loop following it goes over a 4-element thing and divides something, but I don’t immediately see the relation between that and this discussion.

That is true. that has a little to do with the s up there, but only insofar as it’s just another scale factor in that equation, which you would put between K and the [Xc, Yc, Zc] point, to signify that it applies to the geometry. mathematically, all the scale factors can be combined, but that’s obfuscation. The parts of the equation have individual meaning. The s is purely there as something expressing the projection to the z=1 plane, i.e. a division. Multiplying by a projection matrix (the camera matrix) is just the first part of a projection. The second is that division, which brings all the points onto the projection plane.

that part of the quote is fine, mostly. nothing is factored out because that thing you think exists, doesn’t. the distance information simply doesn’t exist in a picture.

that part isn’t fine.

that quantity does not exist. it cannot be obtained for math reasons. that information doesn’t exist in reality, and not in theory.

maybe what you want is pose estimation, of the calibration board in each calibration view. that requires a model of the object. if you give pose estimation a model that’s twice as big, it’ll put the object at twice the distance, given the same picture. that is “similar triangles”.
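
a hedged sketch of that “similar triangles” effect with cv2.solvePnP (intrinsics and board are made up, the picture is synthesized with cv2.projectPoints):

```python
import numpy as np
import cv2

# made-up intrinsics and a made-up 40 mm square board, 500 mm in front of the camera
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
dist = np.zeros(5)
square = np.array([[-20, -20, 0], [20, -20, 0],
                   [20, 20, 0], [-20, 20, 0]], dtype=np.float64)
rvec_true = np.zeros(3)
tvec_true = np.array([0.0, 0.0, 500.0])

# synthesize "the picture": project the board into the image
img_pts, _ = cv2.projectPoints(square, rvec_true, tvec_true, K, dist)

# pose estimation with the correct model: recovers ~500 mm distance
_, rvec1, tvec1 = cv2.solvePnP(square, img_pts, K, dist)

# same picture, model twice as big: the board gets placed twice as far away
_, rvec2, tvec2 = cv2.solvePnP(2 * square, img_pts, K, dist)

print(tvec1.ravel())   # ~[0, 0,  500]
print(tvec2.ravel())   # ~[0, 0, 1000]
```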


Thanks for the long answer and for correcting the mistakes I made in my question! (Edited my question to save readers the time.)

Camera matrices are called K, not A

They call the camera matrix A throughout the documentation that I linked to. I tried to adapt :smiley:

The main point of the discussion and of my question is:

that quantity does not exist. it cannot be obtained for math reasons. that information doesn’t exist in reality, and not in theory.

And if you’re so sure about that, I’ll trust you. I’d still be interested in a detailed mathematical explanation of why.

So this stackexchange answer is definitely completely wrong?

What you’re saying is that even though I have a calibrated camera and a z coordinate for my object, it is impossible to figure out the size of that object, because the camera matrix cannot provide that mapping?

that answer is fine.

what gave you the impression that it weren’t, or that I would think so? that’s not a rhetorical question. I’d like to know what made you think that. if you were just using some rhetorical device to goad me into responding, I don’t appreciate that.

I’m not saying that. that’s an entirely new situation, at least as I understood what you wrote so far.

yes.

yes.

no. the projection of the 3D point is the screen-space point (x, y, 1) or (x, y).

you misunderstand the point of s.

sensor size cannot be determined from intrinsic calibration, no matter “if opencv didn’t normalize by that factor s” or did.

Wait … you said that was true. So if that is true, then there does not exist any relation between the Z coordinate in mm and that camera matrix. If that is true, it would not matter to OpenCV whether I provide the Z coordinate in millimeters or in average zebra lengths. Both would probably be wrong. So then this stackoverflow answer would be wrong. I’m getting more and more confused the longer we talk ._.

Maybe this is the crux (from stackoverflow):

the z-component you obtain the 3D point in the camera coordinate frame

How do I obtain that Z coordinate in camera coordinates when I know it in millimeters? And I think this is also what could fall out of the calibration process: the conversion from physical coordinates to these camera coordinates, if I provide my world coordinates in physical units.

EDIT: this answers all my questions, I think.

the SO answer discussed “unprojection”, having 2D points and the object model (or size), and obtaining some 3D coordinates.

for a single point, that means projecting it onto a plane that’s the given distance away.

for several points making up an object, it involves pose estimation (fitting the model into the 2D point set, yielding the object’s pose in the camera frame).

without any distance or model size information, all you can do with a 2D point is to turn it into a ray.
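
as a sketch, with made-up intrinsics and a made-up pixel:

```python
import numpy as np

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])   # made-up camera matrix
u, v = 400.0, 300.0                     # made-up pixel

# without extra information, a pixel only gives you a ray direction in camera coordinates
ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # points along (X/Z, Y/Z, 1)

# only a known distance (or an object model) picks one 3D point on that ray
Z_c = 1500.0                 # e.g. in mm
point_cam = ray * Z_c
print(point_cam)             # [X_c, Y_c, Z_c], in the same units as Z_c
```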
