why do you call solvePnP? why do you not use aruco’s pose estimation?
that scale factor could be due to a bad/wrong focal length, or the entire camera matrix could be wrong. probably bad calibration. even if you think the calibration is good, it still might not be. beginners usually can't tell, even when they think they can.
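to illustrate why a wrong focal length shows up as a scale factor, here is a minimal sketch using the plain pinhole model for a marker facing the camera. the function name and all numbers are made up for illustration, they are not taken from your setup.

```python
def pinhole_distance(focal_px, marker_size_m, marker_side_px):
    """Approximate distance to a fronto-parallel marker (pinhole model).

    focal_px:       focal length in pixels (fx from the camera matrix)
    marker_size_m:  physical side length of the marker in meters
    marker_side_px: side length of the marker as seen in the image, in pixels
    """
    return focal_px * marker_size_m / marker_side_px


# Hypothetical values, chosen only for illustration.
true_f = 800.0    # focal length the camera really has
wrong_f = 1000.0  # focal length a bad calibration reports
size_m = 0.10     # 10 cm marker
side_px = 80.0    # marker side length measured in the image

z_true = pinhole_distance(true_f, size_m, side_px)
z_wrong = pinhole_distance(wrong_f, size_m, side_px)

# The distance error is the same ratio as the focal length error.
scale = z_wrong / z_true
print(f"true: {z_true:.3f} m, estimated: {z_wrong:.3f} m, scale: {scale:.3f}")
```

the estimated distance is off by the ratio `wrong_f / true_f`, so a constant scale factor in the translation is a typical sign of a wrong focal length.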
I see no calibration data (intrinsics) so that is impossible to validate.
I see no experimental setup where you place the marker at a known distance (measured with a ruler) and provide the original camera image along with the marker’s size.
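such an experiment lets you sanity-check the intrinsics by hand. a rough sketch, assuming the marker faces the camera and sits near the image center: compute the focal length that the ruler measurement implies and compare it to `fx` from the camera matrix. the helper names and the numbers below are hypothetical.

```python
def implied_focal_px(measured_distance_m, marker_size_m, marker_side_px):
    """Focal length (in pixels) implied by a ruler-measured setup.

    Rearranges the pinhole relation  side_px = f * size / distance.
    """
    return measured_distance_m * marker_side_px / marker_size_m


def relative_error(value, reference):
    """Signed relative deviation of value from reference."""
    return (value - reference) / reference


# Hypothetical measurements, for illustration only.
measured_distance_m = 0.50  # ruler distance from camera to marker
marker_size_m = 0.05        # printed marker side length
marker_side_px = 92.0       # marker side length in the original image
calibrated_fx = 1150.0      # fx taken from the calibration result

f_implied = implied_focal_px(measured_distance_m, marker_size_m, marker_side_px)
err = relative_error(calibrated_fx, f_implied)

print(f"implied f: {f_implied:.1f} px, calibrated fx: {calibrated_fx:.1f} px")
print(f"calibrated fx deviates by {err:+.1%}")
```

if the calibrated `fx` deviates from the implied value by more than a few percent, the calibration is the first thing to suspect. this is only a coarse check, it ignores lens distortion and marker tilt.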