Mapping an image from one camera pose to another given a planar scene

I’m trying to map an image of a 3D object to another image taken from a different camera position. My assumption is that the pose change between the two cameras is small enough, and the object far enough from the camera, that I can treat the object as planar and thus use a planar homography.

I’m following the theory in Section 2.1 of Szeliski’s book, specifically the subsection “Mapping from one camera to another”. Following its notation, the projection from world to screen coordinates can be written as x_0\sim KE_0p, where K is the camera intrinsics matrix, E_0 is the camera extrinsics matrix (rotation and translation) for the first camera pose, and p is the 3D point location. This equation can be rearranged to recover the 3D point as p\sim E_0^{-1}K^{-1}x_0. This 3D point can then be projected into the second image using the second camera pose E_1:

x_1\sim KE_1p=KE_1E_0^{-1}K^{-1}x_0=M_{10}x_0

where M_{10} reduces to a 3x3 homography matrix for a planar scene by dropping its last row and column. The mapping equation then becomes x_1\sim H_{10}x_0, where H_{10} is the homography matrix and x_0, x_1 are the 2D homogeneous coordinates in the first and second image, respectively. Since the camera itself stays the same between the two poses (it is a virtual physical camera in Unity), I didn’t put a subscript on K.
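So, for a single pixel, the mapping amounts to multiplying by H_{10} and dividing by the third (homogeneous) coordinate. A tiny numeric sketch with a made-up H_{10}, just to show the normalization step I have in mind:

import numpy as np
H_10 = np.array([[1.0, 0.02, 5.0], [-0.02, 1.0, -3.0], [1e-4, 0.0, 1.0]])  # made-up homography, for illustration only
x0 = np.array([320.0, 240.0, 1.0])  # a pixel in the first image, in homogeneous coordinates
x1 = H_10 @ x0
x1 /= x1[2]  # normalize so the last coordinate is 1
print(x1[:2])  # corresponding pixel location in the second image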

I generated test data by inserting a 3D object into a scene in Unity, moving the camera slightly (recording the camera pose each time), and capturing the camera’s viewport to an image for each pose.

I’m using OpenCV’s warpPerspective function to apply the perspective transformation (the H_{10} above). Since I capture an image for each camera position on the Unity side, I can visually check whether warping the first image with the camera poses E_0, E_1 and the intrinsics matrix K gives me something close to the reference image (the second image).
In Python code, I do:

H = K @ E1 @ inv_E0 @ inv_K  # 3x3 homography
dst_img = cv2.warpPerspective(src_img, H, (w, h), flags=cv2.INTER_CUBIC)  # pass the interpolation mode via the flags argument
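For the visual check I then just blend and difference the warped image against the reference image, roughly like this (ref_img is simply my name for the capture from the second pose):

overlay = cv2.addWeighted(ref_img, 0.5, dst_img, 0.5, 0)  # 50/50 blend of reference and warped image
diff = cv2.absdiff(ref_img, dst_img)  # per-pixel absolute difference as a second check
cv2.imwrite("overlay.png", overlay)
cv2.imwrite("diff.png", diff)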

So far, though, I’m getting inaccurate results: the object ends up at a noticeably different position in the warped image than in the reference image. Therefore, I need help on a couple of aspects:

  1. I need a sanity check on the validity of my approach and assumptions, as described above. Do the equations make sense for this use case?
  2. I compute the elements of the intrinsics matrix K from the parameters of the physical camera in Unity (physical focal length f and sensor size s_x, s_y): the focal lengths in pixels are f_x=f*w/s_x and f_y=f*h/s_y, where w and h are the width and height of the image, respectively. Then, I build the matrix
K= \begin{pmatrix} f_x & 0 & 0 & 0\\ 0 & f_y & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}

However, since I need to compute K^{-1} and K is not a square matrix, I invert the leftmost 3x3 part of K and append the row [0, 0, 0] to the result to get a 4x3 matrix K^{-1}. Or in Python code:
inv_K = np.concatenate((np.linalg.inv(K[:3,:3]), [[0,0,0]]), axis=0)  # 4x3
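For completeness, this is roughly how I compute K end to end (the numeric values below are just example values; in my project they come from the Unity camera):

import numpy as np
f, s_x, s_y = 50.0, 36.0, 24.0  # physical focal length and sensor size in mm (example values)
w, h = 1920, 1080  # image width and height in pixels
f_x, f_y = f * w / s_x, f * h / s_y  # focal lengths in pixels
K = np.array([[f_x, 0, 0, 0], [0, f_y, 0, 0], [0, 0, 1, 0]], dtype=float)  # 3x4 intrinsics matrix
inv_K = np.concatenate((np.linalg.inv(K[:3, :3]), [[0, 0, 0]]), axis=0)  # 4x3 "inverse" as above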

Is this ok?

  3. For E_0 and E_1, I add the last row [0,0,0,1] to the 3x4 part [R|t] (rotation and translation) to get a 4x4 matrix. In Python code:
E = np.concatenate((R, t[:, np.newaxis]), axis=1)  # [R|t], 3x4
E = np.concatenate((E, [[0, 0, 0, 1]]), axis=0)  # append [0, 0, 0, 1] to get 4x4

Using this formulation, I get a 3x3 H_{10}, which is also the homography size that cv2.warpPerspective expects. Is it valid to manipulate the matrices E_0 and E_1 like this?
  4. I’m pretty sure that I need to convert between Unity’s left-handed coordinate system (y up) and OpenCV’s right-handed one (y down). However, I can’t figure out the right way to embed that conversion into the matrix formulations above. Can you help me here?
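To make clear what kind of conversion I mean: I imagine something like conjugating the extrinsics with an axis-flip matrix, but I don’t know whether this is the right form or where it belongs in the pipeline (E0_unity and E1_unity are just my names for the 4x4 poses recorded in Unity):

C = np.diag([1.0, -1.0, 1.0, 1.0])  # flip one axis (a guess; not sure which axis should flip)
E0 = C @ E0_unity @ C  # change of basis applied to the Unity extrinsics
E1 = C @ E1_unity @ C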

I did some background reading, but I’m pretty new to the topic, so excuse the very basic questions on some aspects and the long post. Thanks.

EDIT: I figured out that, weirdly, I get the same resulting image for any position \mathbf{t}, and only the rotations affect the result. Looking deeper into my calculations, I realized that multiplying KE_1 with E_0^{-1} cancels out the position, because E_0^{-1} has [0,0,0,1] in its last row, which is what gets multiplied by the elements of KE_1 where the translational terms are located. This makes me think that I may have something fundamentally wrong in the equation above or in the construction of my camera matrices. Any help here would be great.
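For reference, here is a minimal, self-contained reproduction of what I’m seeing (identity rotations and made-up focal lengths, just to show that the translation drops out of H):

import numpy as np
f_x, f_y = 1000.0, 1000.0  # made-up focal lengths in pixels
K = np.array([[f_x, 0, 0, 0], [0, f_y, 0, 0], [0, 0, 1, 0]], dtype=float)
inv_K = np.concatenate((np.linalg.inv(K[:3, :3]), [[0, 0, 0]]), axis=0)
inv_E0 = np.eye(4)  # first camera at the world origin, for simplicity
E1_a, E1_b = np.eye(4), np.eye(4)  # identity rotation in both poses
E1_a[:3, 3] = [0.1, 0.0, 0.0]  # small translation
E1_b[:3, 3] = [5.0, -2.0, 3.0]  # large translation
H_a = K @ E1_a @ inv_E0 @ inv_K
H_b = K @ E1_b @ inv_E0 @ inv_K
print(np.allclose(H_a, H_b))  # prints True: the translation has no effect on H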