Hi Fuchs, let’s review the basics:
There is a 2D image space measured in pixels. Its points in homogeneous coordinates are 3-vectors, often with the last element equal to 1.
There is a 3D real space measured in meters or mm. Its points in homogeneous coordinates are 4-vectors, often with the last element equal to 1.
A homography (aka perspective transformation) is a 3x3 matrix mapping from 2D to 2D. The last element of the resulting vector often differs from 1, so you divide the whole vector by it to get back to pixel coordinates.
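As a minimal sketch of that normalization step, here is how a point could be pushed through a homography with NumPy. The matrix values and the name apply_homography are made up for illustration; they are not from your setup.

```python
import numpy as np

def apply_homography(H, xy):
    """Map a 2D pixel point through a 3x3 homography and normalize the result."""
    p = np.array([xy[0], xy[1], 1.0])  # homogeneous 3-vector, last element 1
    q = H @ p                          # last element is generally not 1 anymore
    return q[:2] / q[2]                # divide by the last element

# Hypothetical homography, chosen only for illustration
H_example = np.array([[1.0, 0.0,   10.0],
                      [0.0, 1.0,   20.0],
                      [0.0, 0.001,  1.0]])

mapped = apply_homography(H_example, (100.0, 200.0))
print(mapped)
```

Before the division the vector is (110, 220, 1.2); only after dividing by 1.2 do the first two elements mean pixels.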
A pose matrix (aka euclidean transformation, or rototranslation) is a 4x4 matrix mapping 3D points from one reference system to another. Usually it maps from an arbitrary “world” reference system to the camera reference system.
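A small sketch of how such a pose matrix can be assembled from a 3x3 rotation and a translation vector. The rotation, the translation and the helper name pose_matrix are assumptions picked for the example.

```python
import numpy as np

def pose_matrix(R, t):
    """Build a 4x4 rototranslation from a 3x3 rotation R and a 3-vector t."""
    T = np.eye(4)
    T[:3, :3] = R   # rotation block
    T[:3, 3] = t    # translation column
    return T        # bottom row stays (0, 0, 0, 1)

# Hypothetical pose: 90 degree rotation about Z, then 2 units along camera Z
R_example = np.array([[0.0, -1.0, 0.0],
                      [1.0,  0.0, 0.0],
                      [0.0,  0.0, 1.0]])
t_example = np.array([0.0, 0.0, 2.0])
T_example = pose_matrix(R_example, t_example)

point_world = np.array([1.0, 0.0, 0.0, 1.0])  # homogeneous 4-vector
point_cam = T_example @ point_world           # same point, camera reference system
print(point_cam)
```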
A projection matrix is a 3x4 matrix mapping 3D points from an arbitrary reference system in real space to 2D points in the image reference system. The projection matrix for 3D points already in the camera reference system can be constructed from the 3x3 camera matrix by appending a fourth column of zeros. The projection matrix for any other 3D reference system can be constructed by multiplying that one on the right by the 4x4 rototranslation matrix that maps from that reference system to the camera.
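Here is a rough sketch of that construction. The camera matrix K (focal length 800 px, principal point at 320, 240) and the pose are invented values, so treat the numbers as placeholders.

```python
import numpy as np

# Hypothetical 3x3 camera matrix (values are assumptions for illustration)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# 3x4 projection for points already in the camera reference system:
# the camera matrix with a fourth column of zeros appended
P_cam = np.hstack([K, np.zeros((3, 1))])

# Hypothetical world-to-camera pose: no rotation, 2 units along camera Z
T = np.eye(4)
T[:3, 3] = [0.0, 0.0, 2.0]

# 3x4 projection for points in the world reference system
P_world = P_cam @ T

X = np.array([0.5, 0.25, 0.0, 1.0])  # 3D world point, homogeneous
x = P_world @ X                      # homogeneous image point
pixel = x[:2] / x[2]                 # normalize to get pixels
print(pixel)
```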
Once you have your 3x4 projection matrix mapping from the world reference system in m or mm to the image coordinate system in pixels, you can get the 3x3 homography matrix by stripping the third column. This homography maps the XY plane (Z = 0) of the world reference system to the image.
Stripping the 3rd column looks like magic, but it isn't: that column is the one that multiplies Z, so if Z is always 0 in the 3D point, it contributes nothing and can be dropped.
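To see why the column can be dropped, here is a quick numerical check under assumed values (the camera matrix, the 30 degree tilt and the test point are all made up). It projects a point with Z = 0 through the full 3x4 matrix and through the stripped 3x3 homography and compares the results.

```python
import numpy as np

# Hypothetical camera matrix and world-to-camera pose (assumptions)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
c, s = np.cos(np.radians(30.0)), np.sin(np.radians(30.0))
T = np.eye(4)
T[:3, :3] = [[1.0, 0.0, 0.0],   # rotation of 30 degrees about X
             [0.0,   c,  -s],
             [0.0,   s,   c]]
T[:3, 3] = [0.0, 0.0, 5.0]      # plane origin 5 units in front of the camera

P = np.hstack([K, np.zeros((3, 1))]) @ T  # 3x4 projection matrix

# The "3rd column" is index 2 when counting from zero
H = np.delete(P, 2, axis=1)               # 3x3 homography for the plane Z = 0

X_world = np.array([0.3, -0.2, 0.0, 1.0]) # point on the XY plane, Z = 0
xy_plane = np.array([0.3, -0.2, 1.0])     # same point as a 2D homogeneous vector

via_P = P @ X_world
via_H = H @ xy_plane
print(np.allclose(via_P, via_H))
```

The two products agree because the dropped column was only ever multiplied by Z = 0. For points off the plane the homography no longer applies.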
I wish you luck, magic, and a solid knowledge of the underlying math.
Let me know if you get it working.