welcome.
you seem to have a good understanding of the situation. I guess the issue is in the details.
yes, opencv’s assumption for the camera frame is x right, y down (same as screen/image coordinates), z out of the camera, i.e. forward into the scene.
I’d recommend representing all transformations between coordinate frames as 4x4 matrices. build them from rvecs and tvecs if you have them. don’t use those on their own if you don’t have to. if you have to, pull them out of a 4x4 matrix (Rodrigues function does rvec/matrix conversion).
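a minimal sketch of building such a 4x4 matrix from an rvec/tvec pair. `to_matrix` and `rodrigues` are names I made up for this example, and the numbers are invented. in real code you'd call `cv2.Rodrigues(rvec)[0]` for the rotation part; the formula is written out in NumPy here only so the snippet runs without OpenCV installed.

```python
import numpy as np

def rodrigues(rvec):
    """Rotation vector (axis * angle) -> 3x3 rotation matrix.

    Hand-written stand-in for cv2.Rodrigues(rvec)[0].
    """
    rvec = np.asarray(rvec, dtype=float).reshape(3)
    theta = np.linalg.norm(rvec)          # rotation angle in radians
    if theta < 1e-12:
        return np.eye(3)                  # no rotation
    kx, ky, kz = rvec / theta             # unit rotation axis
    K = np.array([[0.0, -kz,  ky],
                  [ kz, 0.0, -kx],
                  [-ky,  kx, 0.0]])       # cross-product matrix of the axis
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def to_matrix(rvec, tvec):
    """Pack rvec and tvec into one 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = rodrigues(rvec)
    T[:3, 3] = np.asarray(tvec, dtype=float).reshape(3)  # accepts (3,) or (3,1)
    return T

# invented example: 90 degrees about z, shifted 1 unit along x
T = to_matrix([0.0, 0.0, np.pi / 2], [1.0, 0.0, 0.0])
print(T.round(3))
```

going the other way, `cv2.Rodrigues` also accepts a 3x3 rotation matrix (e.g. `T[:3, :3]`) and returns the rvec, and the tvec is just `T[:3, 3]`.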
those 4x4 matrices compose by matrix multiplication. the inverse of such a transformation is… the matrix inverse.
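to make that concrete, a small NumPy sketch with two invented transforms `A` and `B`. the helpers `rot_z`, `translate` and `invert_rigid` are hypothetical, written just for this example. `invert_rigid` uses the closed form for rigid transforms, which should agree with the general `np.linalg.inv`.

```python
import numpy as np

def rot_z(deg):
    """4x4 transform: rotation about the z axis by `deg` degrees."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def translate(x, y, z):
    """4x4 transform: pure translation."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

def invert_rigid(T):
    """Closed-form inverse of a rigid transform: [R | t]^-1 = [R^T | -R^T t]."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

A = translate(0, 0, 2)
B = translate(1, 0, 0) @ rot_z(90)

AB = A @ B                          # applied to a point: B first, then A
p = np.array([1.0, 0.0, 0.0, 1.0])  # homogeneous point
print(AB @ p)

AB_inv = np.linalg.inv(AB)          # undoes the whole chain
print(np.allclose(AB_inv, invert_rigid(B) @ invert_rigid(A)))
```

note the order flips when inverting a chain: the inverse of `A @ B` is `inv(B) @ inv(A)`.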
maybe keep a sketch of your coordinate frames and what matrices you have and which way they go. bad/no naming conventions is how people typically get in trouble with this stuff.
I’d recommend deriving all naming from the usual matrix notation, putting frame names on the left and right side. example: a robot-to-camera matrix could be denoted ᶜMᵣ, or cam_robot in code. it represents the robot’s pose in the camera frame, or equivalently maps points from the robot frame into the camera frame.
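here is how that naming tends to pay off in code. this is only a sketch: the numbers are made up, and the `gripper` frame is a hypothetical extra frame I added to show chaining.

```python
import numpy as np

# cam_robot: the robot frame expressed in the camera frame (invented numbers).
# rotation is 180 degrees about x, robot origin sits 1.5 units in front of the camera.
cam_robot = np.array([
    [1.0,  0.0,  0.0, 0.0],
    [0.0, -1.0,  0.0, 0.0],
    [0.0,  0.0, -1.0, 1.5],
    [0.0,  0.0,  0.0, 1.0],
])

# robot_gripper: hypothetical gripper pose in the robot frame (pure translation).
robot_gripper = np.eye(4)
robot_gripper[:3, 3] = [0.2, 0.0, 0.5]

# adjacent names must match, and they cancel: cam_[robot @ robot]_gripper
cam_gripper = cam_robot @ robot_gripper

# swapping the two names means inverting the matrix
robot_cam = np.linalg.inv(cam_robot)

# "pose of the robot in the camera frame": the robot origin lands on
# the translation column of cam_robot
origin_robot = np.array([0.0, 0.0, 0.0, 1.0])
p_cam = cam_robot @ origin_robot
print(p_cam[:3], cam_gripper[:3, 3])
```

if a product like `cam_robot @ cam_gripper` shows up, the mismatched inner names make the mistake visible before any numbers are involved.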
similar discussion: Need help with Matrix calculations - dynamic grasp point for robot - #14 by crackwitz