Thanks @crackwitz! You’re right that I should map the person’s feet; that is what I wanted to try. My initial experiment used cv::findHomography(), though: I took a photo with my phone of a book lying on my desk, measured 4 to 6 points in that photo, and manually mapped them to the corresponding points in a 2D image of the book’s cover. I then transformed some other points from the photo, but the output positions were not very accurate.
That is why I’m now looking into camera calibration, which I hope will give me better results. Thanks for the suggestions about resizing/cropping. If I’m correct, some models simply resize the training data without taking care to preserve the aspect ratio or crop consistently. I normally look at how the model was trained and follow that exact approach for the input.
And thanks for the tip on how to perform calibration. To make sure I understand this completely: I must calibrate at the original video resolution, not on the resized/cropped images that I feed into my model. After running inference, I convert the detected bounding boxes back to the original resolution so that they match the resolution of the video stream.