Homography-based master/slave PTZ camera sync — is this the right approach?

I’ve been experimenting with syncing two PTZ cameras and I’d love some guidance from people who’ve done this for real.

The setup: two PTZ cameras pointed at a sports field. One I move around manually (call it the “master”); I want the second one (the “slave”) to automatically point at the same spot on the field. My current experiment computes a homography between the two cameras’ pan/tilt space and mirrors the master’s position onto the slave. It works sometimes, and I’m trying to understand why it breaks.

What I’ve run into:

  • I’m fitting the homography (8 DOF) from only 5 manually pointed landmarks. That is 10 equations for 8 unknowns, so the fit is barely overdetermined. In-sample reprojection error looks fine, but leave-one-out cross-validation shows it doesn’t generalize, which looks like classic overfitting.
  • I’m treating the homography as static, but zoom changes a PTZ camera’s intrinsics and pan/tilt changes its orientation (extrinsics), so I doubt one fixed matrix holds across the whole field.
  • Mechanical PTZ latency means the slave always lags the master.
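
To make the leave-one-out check in the first bullet concrete, here is a minimal sketch of that kind of test. It is not the actual code or data from this setup: the landmark values are synthetic, the function names (fit_homography, apply_homography, loo_errors) are made up for illustration, and it uses a plain NumPy DLT fit without coordinate normalization, where cv2.findHomography would be the usual OpenCV route.

```python
import numpy as np

def fit_homography(src, dst):
    """Plain DLT fit of a 3x3 homography from >= 4 point pairs (no normalization)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)

def apply_homography(H, pts):
    """Map Nx2 points through H, including the projective divide."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def loo_errors(src, dst):
    """Fit on all points but one, then measure the error on the held-out point."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    errs = []
    for i in range(len(src)):
        keep = np.arange(len(src)) != i
        H = fit_homography(src[keep], dst[keep])
        pred = apply_homography(H, src[i:i + 1])[0]
        errs.append(float(np.linalg.norm(pred - dst[i])))
    return errs

if __name__ == "__main__":
    # Synthetic (pan, tilt) landmarks in degrees; assumed values, not measured ones.
    master = [(-30.0, -5.0), (25.0, -8.0), (30.0, 10.0), (-20.0, 12.0), (5.0, -20.0)]
    H_true = np.array([[1.1, 0.05, 2.0], [-0.03, 0.95, -1.0], [0.0005, -0.0003, 1.0]])
    rng = np.random.default_rng(0)
    # Simulate about 0.3 degrees of manual pointing noise on the slave side.
    slave = apply_homography(H_true, master) + rng.normal(0.0, 0.3, size=(5, 2))
    H_all = fit_homography(master, slave)
    in_sample = np.linalg.norm(apply_homography(H_all, master) - slave, axis=1)
    print("in-sample error (deg):", in_sample.round(3))
    print("leave-one-out error (deg):", np.round(loo_errors(master, slave), 3))
```

With 5 landmarks, each leave-one-out fit uses exactly 4 points, so it interpolates them (noise included) and has no redundancy left. That is one way a small in-sample error and a large held-out error can coexist.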

Where I’m stuck:

  1. Is a pan/tilt-space homography even the right abstraction for master→slave PTZ? Or is it cleaner to map both cameras through a common ground-plane / world coordinate frame instead of camera-to-camera directly?
  2. Calibration: how many correspondences would you realistically use? Is “more points + RANSAC” enough, or do people re-estimate the homography dynamically as the cameras move?
  3. For dynamic re-estimation, I was thinking about auto-detecting field keypoints (yard lines, markers) with a keypoint/pose model to generate correspondences on the fly instead of pointing at landmarks by hand. Reasonable, or overkill?
  4. Latency: do you predict/lead the target (velocity extrapolation, Kalman filter), or just react?
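
To make question 1 concrete, here is a rough sketch of the ground-plane alternative. It rests on strong assumptions that may not hold on a real rig: an ideal pan/tilt head with a vertical pan axis, a flat field at z = 0, camera positions known in a shared world frame (metres), pan measured counter-clockwise from the +x axis, and tilt negative when looking down. All names here (ground_to_pan_tilt, pan_tilt_to_ground, master_to_slave, pan_zero_deg) are hypothetical.

```python
import math

def wrap_deg(angle):
    """Wrap an angle in degrees into [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def ground_to_pan_tilt(point_xy, cam_xyz, pan_zero_deg=0.0):
    """Pan/tilt (degrees) that aims a camera at a point on the z = 0 plane."""
    dx = point_xy[0] - cam_xyz[0]
    dy = point_xy[1] - cam_xyz[1]
    dz = -cam_xyz[2]  # the ground lies below the camera
    pan = wrap_deg(math.degrees(math.atan2(dy, dx)) - pan_zero_deg)
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return pan, tilt

def pan_tilt_to_ground(pan_deg, tilt_deg, cam_xyz, pan_zero_deg=0.0):
    """Intersect the viewing ray with z = 0; None if the ray never hits the ground."""
    if tilt_deg >= 0.0 or cam_xyz[2] <= 0.0:
        return None
    azimuth = math.radians(pan_deg + pan_zero_deg)
    ground_range = cam_xyz[2] / math.tan(math.radians(-tilt_deg))
    return (cam_xyz[0] + ground_range * math.cos(azimuth),
            cam_xyz[1] + ground_range * math.sin(azimuth))

def master_to_slave(pan_deg, tilt_deg, master_xyz, slave_xyz,
                    master_zero_deg=0.0, slave_zero_deg=0.0):
    """Master pan/tilt -> ground point -> slave pan/tilt."""
    target = pan_tilt_to_ground(pan_deg, tilt_deg, master_xyz, master_zero_deg)
    if target is None:
        return None
    return ground_to_pan_tilt(target, slave_xyz, slave_zero_deg)

if __name__ == "__main__":
    # Made-up camera positions (x, y, height) in metres.
    master_cam = (0.0, -20.0, 10.0)
    slave_cam = (50.0, -20.0, 8.0)
    print(master_to_slave(60.0, -15.0, master_cam, slave_cam))
```

In this formulation the calibration unknowns are each camera's position and pan offset rather than a single 8-DOF matrix, and zoom does not enter the pointing math at all.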

I’m doing this mostly in Python/OpenCV for the CV parts. Even a “you’re overthinking X” is welcome — trying to build the right intuition here. Thanks!

Where, in pictures and numbers rather than words, are the cameras placed?

What do the cameras see, in pictures, not words?