The problem is called SLAM: Simultaneous localization and mapping - Wikipedia
You should use a stereo camera for that: with a calibrated baseline between the two lenses it gives you depth for every frame, so it already produces point clouds in real-world units.
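
To show why a stereo camera can hand you a point cloud directly, here is a minimal sketch of the usual rectified-stereo relation Z = f * B / d. The function names and the numbers (focal length 700 px, baseline 0.12 m, principal point 640/360) are made up for illustration and not taken from any particular camera or library.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def back_project(u, v, disparity_px, focal_px, baseline_m, cx, cy):
    """Turn one pixel (u, v) with a known disparity into a 3D point (x, y, z)."""
    z = depth_from_disparity(disparity_px, focal_px, baseline_m)
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)


# Hypothetical values: 700 px focal length, 12 cm baseline, 35 px disparity.
point = back_project(640.0, 360.0, 35.0, 700.0, 0.12, 640.0, 360.0)
print(point)
```

Doing this for every pixel with a valid disparity gives the point cloud for that frame, already in metres.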
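If you absolutely have to use a single (monocular) camera, you face a second problem, because depth then has to be recovered from the camera's own motion and comes out only up to an unknown scale: Structure from motion - Wikipedia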
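
One reason the monocular case is harder is scale ambiguity. The sketch below uses a plain pinhole projection (no rotation, no lens distortion, made-up intrinsics) to show that a scene three times larger, seen from a camera that moved three times as far, lands on the same pixels. The `project` helper and all values are hypothetical.

```python
def project(point, cam_pos, focal_px=700.0, cx=640.0, cy=360.0):
    """Pinhole projection of a 3D point seen from a camera at cam_pos (axes aligned)."""
    x = point[0] - cam_pos[0]
    y = point[1] - cam_pos[1]
    z = point[2] - cam_pos[2]
    return (focal_px * x / z + cx, focal_px * y / z + cy)


point = (1.0, 0.5, 4.0)
cam = (0.2, 0.0, 0.0)
scale = 3.0

small = project(point, cam)
# Scale both the scene and the camera translation by the same factor.
big = project(tuple(scale * c for c in point), tuple(scale * c for c in cam))

print(small, big)
```

Both projections agree, so images alone cannot tell the two setups apart. You need something extra (a known object size, an IMU, wheel odometry, or a second camera) to fix the scale.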