I have been thinking about how to go about solving this problem.

Suppose I have 1000 images, and in each image I am trying to determine the height of an object. I know the specifics of the camera that was used to take these photos: the sensor size, the lens used, the pixel pitch, and the focal length used for each photo. I also know how far the camera is from the object, and I know the pixel locations of the top and bottom of the object.

With these values I can do some rough math: compute the height of the object on the sensor, then scale by the distance-to-focal-length ratio to get a height estimate. I know that in some of the photos the object is not centered in the frame, which introduces additional estimation error.
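That rough math is the pinhole-camera similar-triangles relation: the object's span on the sensor is its pixel span times the pixel pitch, and real height = sensor span × distance / focal length. A minimal sketch, with all numeric values hypothetical:

```python
def estimate_height(top_px, bottom_px, pixel_pitch_mm, focal_mm, distance_mm):
    """Pinhole-model height estimate.

    top_px, bottom_px: vertical pixel coordinates of the object's extents
    pixel_pitch_mm:    physical size of one pixel on the sensor (mm)
    focal_mm:          focal length used for this photo (mm)
    distance_mm:       camera-to-object distance (mm)
    """
    sensor_span_mm = abs(bottom_px - top_px) * pixel_pitch_mm
    # similar triangles: real height / distance = sensor span / focal length
    return sensor_span_mm * distance_mm / focal_mm

# hypothetical example: 800 px span, 4.3 µm pixels, 50 mm lens, 20 m away
h_mm = estimate_height(100, 900, 0.0043, 50.0, 20000.0)
print(round(h_mm / 1000, 2), "m")  # prints: 1.38 m
```

This ignores lens distortion and assumes the object is roughly centered and parallel to the sensor plane; an off-center object adds a perspective term this simple ratio does not capture.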

I am exploring OpenCV to get more accuracy. However, I no longer have access to this camera to perform a camera calibration. Even if I did have access to it again, the camera-to-object distance differs per image, as does the focal length.

In these images there is a 5-foot stick next to the object, but in the future I would like a model that no longer needs it. Is there any way I can go about tackling this height estimation both with the reference stick and without it? Or is there another machine learning library that could help?

I should've clarified this further: each image is of a different pole, not 1000 images of the same pole. What I am really asking is whether this is possible under these conditions.
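For the images that do contain the stick, the estimate I had in mind reduces to a simple pixel ratio, valid when the stick stands at roughly the same distance from the camera as the pole so both share one pixels-per-foot scale. All pixel coordinates below are hypothetical:

```python
STICK_HEIGHT_FT = 5.0  # known reference length in the frame

def height_from_reference(obj_top, obj_bottom, stick_top, stick_bottom):
    """Scale the object's pixel span by the reference stick's known length.

    Assumes stick and object are at (approximately) the same depth,
    so one pixels-per-foot scale applies to both.
    """
    obj_px = abs(obj_bottom - obj_top)
    stick_px = abs(stick_bottom - stick_top)
    return obj_px / stick_px * STICK_HEIGHT_FT

# hypothetical pixel coordinates: pole spans 860 px, stick spans 250 px
print(height_from_reference(120, 980, 600, 850))  # prints: 17.2
```

This sidesteps the camera intrinsics entirely, which is why the no-stick case is the hard one.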

you need some reference length. it’s either the stick or the known distance.

there are some things you can substitute, but you need enough data as a whole.

if you have camera parameters, that’s valuable.

if you have a decent yard stick with marks, that’s valuable.

if the yard stick is laid right along the pole, that’s valuable.

if you can see the horizon, and assume the pole to be upright, that’s valuable.

don’t ask me for equations. I’d have to derive them from the geometry involved. given all the combinations of pieces of information, that could be a lot of math.

you could just assume no lens distortion. it’ll be good enough, most likely.

it’s near impossible to use such a data set for “autocalibration”. that’d require multiple pictures, from the same camera, of the same scene, which exhibits enough 3D information.