Object Tracking from a video and Getting 3-D Coordinates of objects

Can you explain what you're trying to achieve, and the context?

Usually you need a stereo camera (block matching) or multiple images (structure from motion, SfM) to estimate depth.
However, there are recent efforts to use CNNs on a single image, which might work in the kinds of scenes they were trained on (street traffic, indoor rooms).
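
To make the stereo case concrete, here is a minimal sketch of how a tracked pixel plus its disparity turns into a 3-D point. It assumes a rectified stereo pair and a simple pinhole model; the function name `pixel_to_3d` and all the calibration numbers are made up for illustration, so you would substitute the values from your own camera calibration.

```python
def pixel_to_3d(u, v, disparity, fx, fy, cx, cy, baseline):
    """Back-project pixel (u, v) of the left image into camera coordinates.

    Assumes a rectified stereo pair and a pinhole camera model:
      fx, fy    focal lengths in pixels
      cx, cy    principal point in pixels
      baseline  distance between the two cameras (result uses the same unit)
      disparity horizontal pixel shift between left and right image
    """
    if disparity <= 0:
        # Zero or negative disparity means no valid match (or a point at infinity).
        raise ValueError("disparity must be positive to recover depth")
    z = fx * baseline / disparity   # depth along the optical axis
    x = (u - cx) * z / fx           # offset to the right of the optical axis
    y = (v - cy) * z / fy           # offset below the optical axis
    return x, y, z


# Hypothetical calibration: 700 px focal length, 640x480 image, 12 cm baseline.
# (u, v) would be e.g. the centre of a tracked bounding box.
point = pixel_to_3d(u=420, v=240, disparity=21,
                    fx=700, fy=700, cx=320, cy=240, baseline=0.12)
print(point)  # roughly 0.57 m to the right, 0 m vertically, 4 m away
```

The disparity itself has to come from somewhere, e.g. a block matcher run on the rectified pair; with a single camera this formula does not apply, and you would need SfM or a learned depth estimate instead.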