I’m experimenting with computer vision and AI models to detect people in a video stream and then map their positions onto a 2D map.
Right now, I’m trying to find a solid way to convert the coordinates of the detected bounding boxes in the video stream to positions on this map. I initially tried using a homography, but the results weren’t very accurate, so now I’m digging into proper camera calibration to improve the mapping.
My plan is to use OpenCV to calibrate the camera: compute the intrinsic matrix, distortion coefficients, and extrinsic parameters. I’m using the standard approach: capturing several images of a printed (or on-screen) chessboard pattern and passing them into OpenCV’s calibration functions.
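For reference, here’s a minimal sketch of what I have in mind; the 9x6 pattern, 25 mm squares, and file names are just assumptions:

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
    const cv::Size patternSize(9, 6);  // inner corners; assumed, match your board
    const float squareSize = 25.0f;    // square edge in mm; assumed

    // 3D reference points of the board, on the Z = 0 plane
    std::vector<cv::Point3f> board;
    for (int y = 0; y < patternSize.height; ++y)
        for (int x = 0; x < patternSize.width; ++x)
            board.emplace_back(x * squareSize, y * squareSize, 0.0f);

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    cv::Size imageSize;

    for (int i = 0; i < 15; ++i) {  // hypothetical file names for ~15 views
        cv::Mat img = cv::imread("calib_" + std::to_string(i) + ".png",
                                 cv::IMREAD_GRAYSCALE);
        if (img.empty()) continue;
        imageSize = img.size();

        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, patternSize, corners)) {
            // refine corner locations to subpixel accuracy
            cv::cornerSubPix(img, corners, cv::Size(11, 11), cv::Size(-1, -1),
                             cv::TermCriteria(cv::TermCriteria::EPS +
                                              cv::TermCriteria::COUNT, 30, 0.001));
            imagePoints.push_back(corners);
            objectPoints.push_back(board);
        }
    }

    cv::Mat K, dist;
    std::vector<cv::Mat> rvecs, tvecs;  // extrinsics, one pair per view
    double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                     K, dist, rvecs, tvecs);
    std::cout << "RMS reprojection error: " << rms << "\nK =\n" << K << std::endl;
}
```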
That part is fairly straightforward, but I’m unsure about one thing. The AI models I’m using typically take a fixed input size, for example 640x640, and return detections in that same resolution. So my question is: should I calibrate the camera using the resized 640x640 stream, or should I use the original full-resolution video feed for calibration?
My gut says I should use 640x640 since that’s what the models see, but I’m not entirely sure if that’s correct.
I’ll have to make some assumptions, since “people in a video stream” could mean a lot of things.
homography
the only part of a person that lies in the ground plane is their feet, assuming the feet are on the ground. if you map anything, map that point, not the whole bounding box. only points on the ground plane will map correctly.
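something like this, assuming you already have a 3x3 homography H mapping the image’s ground plane to your map (e.g. from cv::findHomography):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// map the bottom-center of a detection box (a rough "feet" point) onto the
// map, given a 3x3 homography H from the image's ground plane to map
// coordinates (H computed beforehand, e.g. with cv::findHomography)
cv::Point2f feetToMap(const cv::Rect2f& box, const cv::Mat& H) {
    std::vector<cv::Point2f> feet{ { box.x + box.width * 0.5f,
                                     box.y + box.height } };
    std::vector<cv::Point2f> onMap;
    cv::perspectiveTransform(feet, onMap, H);
    return onMap[0];
}
```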
look for models that can take arbitrary-size inputs. they are called “fully convolutional”. they may still require each dimension to be a multiple of some number of pixels.
if you want to use a model with a fixed input size, or you don’t want to run on the full resolution image, just resize your actual image to fit. make sure not to stretch or squash the data. circles have to remain circles. that will require cropping or padding.
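a letterboxing sketch; the 640x640 target and the gray fill value are assumptions:

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>

// keep the aspect ratio: scale to fit the target, then pad the remainder.
// the scale and offsets are returned so detections can be mapped back later.
struct Letterbox { double scale; int padX; int padY; };

Letterbox letterbox(const cv::Mat& src, cv::Mat& dst,
                    const cv::Size& target = cv::Size(640, 640)) {
    double scale = std::min(target.width  / (double)src.cols,
                            target.height / (double)src.rows);
    cv::Size scaled((int)std::round(src.cols * scale),
                    (int)std::round(src.rows * scale));

    cv::Mat resized;
    cv::resize(src, resized, scaled, 0, 0, cv::INTER_LINEAR);

    // center the image, fill the border with a neutral gray
    int padX = (target.width  - scaled.width)  / 2;
    int padY = (target.height - scaled.height) / 2;
    cv::copyMakeBorder(resized, dst,
                       padY, target.height - scaled.height - padY,
                       padX, target.width  - scaled.width  - padX,
                       cv::BORDER_CONSTANT, cv::Scalar(114, 114, 114));
    return { scale, padX, padY };
}
```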
do not mess with the image for the purposes of calibration. calibrate on clean data. anything after that can be calculated to match any resizing operations.
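for example, if you calibrate on the native resolution and later scale the image uniformly by a factor s (plus some pad/crop offset), the camera matrix can be adjusted to match. a sketch, assuming the usual 3x3 CV_64F matrix from calibrateCamera:

```cpp
#include <opencv2/opencv.hpp>

// adjust intrinsics K (3x3, CV_64F) to match an image that was scaled
// uniformly by s and then shifted by (offsetX, offsetY) -- positive offsets
// for padding, negative for cropping. the distortion coefficients are
// defined in normalized coordinates and stay the same.
cv::Mat adjustIntrinsics(const cv::Mat& K, double s,
                         double offsetX, double offsetY) {
    cv::Mat Ks = K.clone();
    Ks.at<double>(0, 0) *= s;                                 // fx
    Ks.at<double>(1, 1) *= s;                                 // fy
    Ks.at<double>(0, 2) = Ks.at<double>(0, 2) * s + offsetX;  // cx
    Ks.at<double>(1, 2) = Ks.at<double>(1, 2) * s + offsetY;  // cy
    return Ks;
}
```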
Thanks @crackwitz! You’re right that I should map the person’s feet; that is indeed what I wanted to try. However, my initial experiment used cv::findHomography():
I took a photo with my phone of a book lying on my desk, measured 4 (to 6) coordinates in that image, and manually mapped them to a 2D cover image of the book. Then I tried to transform some other points from the photo, but the output positions were not very accurate.
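For illustration, the experiment looked roughly like this (all coordinates below are placeholders, not my real measurements):

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
    // hand-measured correspondences: book corners in the photo -> the same
    // corners in the flat cover image (all coordinates are placeholders)
    std::vector<cv::Point2f> photoPts = { {412, 318}, {893, 341},
                                          {871, 952}, {389, 927} };
    std::vector<cv::Point2f> coverPts = { {0, 0}, {600, 0},
                                          {600, 800}, {0, 800} };

    // with more than 4 points, RANSAC can reject bad measurements
    cv::Mat H = cv::findHomography(photoPts, coverPts, cv::RANSAC, 3.0);

    // transform a test point from the photo into cover coordinates
    std::vector<cv::Point2f> test = { {650, 640} }, out;
    cv::perspectiveTransform(test, out, H);
    std::cout << out[0] << std::endl;
}
```

Even with careful point measurements, I suspect part of the inaccuracy comes from my phone’s lens distortion, which a plain homography can’t model.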
Therefore I’m now looking into camera calibration, which I hope will give me better results. Thanks for the suggestions about resizing/cropping. If I’m correct, some training pipelines simply stretch the images to the input size without preserving the aspect ratio. I normally look at how the model was trained and follow that exact approach.
And thanks for the tip on how to perform calibration. To make sure I completely understand: I must calibrate using the original video resolution, not the resized/cropped images that I feed into my model. After running inference, I convert the detected bounding boxes back to the original resolution so that they match the video stream.
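If that’s right, undoing the resize step would look something like this (a sketch; scale, padX and padY are whatever the letterboxing step above produced):

```cpp
#include <opencv2/opencv.hpp>

// undo the letterbox transform: map a box detected in the model's fixed
// input (e.g. 640x640) back to the original video resolution
cv::Rect2f boxToOriginal(const cv::Rect2f& box, double scale,
                         int padX, int padY) {
    const float s = static_cast<float>(scale);
    return cv::Rect2f((box.x - padX) / s,
                      (box.y - padY) / s,
                      box.width  / s,
                      box.height / s);
}
```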