don’t be so vague. be concrete. show pictures instead of writing text.
assumptions:
- the “scene” is flat, like an optical mouse looking at a desk
- image plane is parallel to ground plane
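use optical flow, dense or sparse. start with “DIS” optical flow, in OpenCV. it is dense, so you get one motion vector per pixel. a simple statistic over those vectors (the mean, or the median if some vectors are outliers) gives you the translation between two frames, in pixels.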
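if you want stable results, don't just add up frame-to-frame translations, because their small errors accumulate into drift. you'll need some logic that calculates the displacement of each frame relative to a reference frame for as long as that displacement is small, and that only switches to a new reference frame once the displacement has become large enough to require one.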
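the reference-frame logic could look roughly like this. it is a sketch under assumptions: `ReferenceTracker`, `estimate` and `max_shift` are hypothetical names, and `estimate` stands for any function that returns the (dx, dy) displacement between two frames, for example one built on optical flow.

```python
import math


class ReferenceTracker:
    """Track position against a reference frame (hypothetical helper)."""

    def __init__(self, estimate, max_shift=20.0):
        self.estimate = estimate    # callable: (ref_frame, cur_frame) -> (dx, dy)
        self.max_shift = max_shift  # switch reference beyond this many pixels
        self.ref = None             # current reference frame
        self.anchor = (0.0, 0.0)    # accumulated position of the reference frame

    def update(self, frame):
        """Return the accumulated (x, y) position for this frame."""
        if self.ref is None:
            # the first frame becomes the reference, at the origin
            self.ref = frame
            return self.anchor
        # displacement is always measured against the reference frame,
        # not against the previous frame
        dx, dy = self.estimate(self.ref, frame)
        pos = (self.anchor[0] + dx, self.anchor[1] + dy)
        if math.hypot(dx, dy) > self.max_shift:
            # moved too far from the reference: make this frame the new one
            self.ref = frame
            self.anchor = pos
        return pos
```

estimation error only gets baked into the position when the reference changes, not on every frame. `max_shift` has to stay well below the image size so the two frames still overlap enough for the estimate to work.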
check out the question “How to determine average x + y motion of keypoints in video?”