Merge 2D image and depth map

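Homography is the wrong approach.

The depth view is taken from a different position than the visual view, and a homography can only relate two views when the scene is planar or the cameras share the same optical center.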

You need to project the visual view onto the point cloud/mesh, taking the transform between the two views (rotation and translation) and each camera's intrinsics into account.
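
A minimal sketch of that projection with NumPy, assuming pinhole cameras without lens distortion and depth stored in metres along the depth camera's z axis. The function name `colorize_depth` and all of its parameters (`K_depth`, `K_color`, `R`, `t`) are hypothetical names chosen for illustration, not part of any vendor SDK; the intrinsics and the depth-to-color transform would come from your own calibration.

```python
import numpy as np


def colorize_depth(depth, color, K_depth, K_color, R, t):
    """Look up a color for every depth pixel (hypothetical helper).

    depth   : (H, W) depth in metres, 0 where there is no reading (assumed)
    color   : (Hc, Wc, 3) image from the visual camera
    K_depth : 3x3 pinhole intrinsics of the depth camera
    K_color : 3x3 pinhole intrinsics of the visual camera
    R, t    : rotation (3x3) and translation (3,) that take points from
              depth-camera coordinates to visual-camera coordinates
    Returns (points, colors, valid), one row per depth pixel.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64).ravel()

    # Back-project every depth pixel to a 3D point in the depth camera frame.
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    pts_depth = (np.linalg.inv(K_depth) @ pix) * z

    # Move the points into the visual camera frame.
    pts_color = R @ pts_depth + np.asarray(t, dtype=np.float64).reshape(3, 1)
    zc = pts_color[2]
    valid = (z > 0) & (zc > 0)

    # Project into the visual image and round to the nearest pixel.
    proj = K_color @ pts_color
    uc = np.zeros(h * w, dtype=np.int64)
    vc = np.zeros(h * w, dtype=np.int64)
    uc[valid] = np.round(proj[0, valid] / zc[valid]).astype(np.int64)
    vc[valid] = np.round(proj[1, valid] / zc[valid]).astype(np.int64)

    # Points that land outside the visual image get no color.
    hc, wc = color.shape[:2]
    valid &= (uc >= 0) & (uc < wc) & (vc >= 0) & (vc < hc)

    colors = np.zeros((h * w, 3), dtype=color.dtype)
    colors[valid] = color[vc[valid], uc[valid]]
    return pts_depth.T, colors, valid
```

This does no occlusion handling: a point hidden from the visual camera still picks up the color of whatever is in front of it. It also uses nearest-pixel lookup rather than interpolation, which is usually good enough for a first check that the calibration is right.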

Their website looks like it has examples for that.