Estimating the 3D pose of an object from a setup with multiple mono cameras

I have a setup where two mono cameras (not a stereo camera) view the same region, and I have to estimate the 3D pose of a certain object (let’s say, a human).

As far as I understand, I have to perform the following steps:

  1. Detect the 2D bounding boxes of the person in each image with YOLO
  2. Compute keypoints (detector) and descriptors for each camera’s image
  3. Associate keypoint correspondences within the bounding boxes
  4. Calculate the essential matrix, which should give the transformation between the two cameras
    • Calibration of both cameras is possible.
  5. Perform triangulation or bundle adjustment (not sure here)
  6. Estimate the 3D pose of the person? (not sure yet)
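
To make step 5 concrete, here is a minimal sketch of how I understand the triangulation of a single matched keypoint. It assumes that each camera's centre and the back-projected ray direction of the keypoint are already known in a common frame (so calibration and the relative pose are available). It uses the simple midpoint method in plain Python, only for illustration; the name `triangulate_midpoint` and its arguments are hypothetical, not taken from any library.

```python
# Midpoint triangulation: an illustrative sketch, not a production method.
# Each camera contributes a ray with origin c (camera centre) and
# direction d (back-projected keypoint), both in a common world frame.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Return the midpoint of the shortest segment between two rays."""
    w = [p - q for p, q in zip(c1, c2)]          # vector from c2 to c1
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b                        # zero when rays are parallel
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel, cannot triangulate")
    s = (b * e - c * d) / denom                  # parameter along ray 1
    t = (a * e - b * d) / denom                  # parameter along ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)] # closest point on ray 1
    p2 = [ci + t * di for ci, di in zip(c2, d2)] # closest point on ray 2
    return [(u + v) / 2 for u, v in zip(p1, p2)]

# Assumed example: two cameras 1 unit apart, both looking at the same point.
point = triangulate_midpoint([0.0, 0.0, 0.0], [0.5, 0.2, 2.0],
                             [1.0, 0.0, 0.0], [-0.5, 0.2, 2.0])
```

With noisy matches the two rays will not intersect exactly, which is why the midpoint of the closest points is used here. As far as I can tell, bundle adjustment would then refine such points (and the camera poses) jointly.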

I would appreciate it if somebody could verify whether I am following the correct steps.