[Help] Struggling to detect pool pockets and place these pockets in live video

I am working on a master's thesis project: an augmented-reality training system for eight-ball pool. The goal is to detect the real pool-table state from an overhead phone camera, map balls and pockets into table-plane coordinates, and send that state to a Meta Quest 3 Unity application for 3D visualization.
I am currently stuck on live pool-pocket detection and stable table registration.

I would like to reliably compute the six pocket positions in table-plane coordinates. A maximum error of about 4 mm (0.4 cm) would be acceptable. I do not need perfect billiards physics on the computer-vision side of the project; I first need a stable 2D-to-table-plane mapping.
The Quest app uses a JSON configuration to model the table/environment. The PC-side Python/OpenCV pipeline is the source of truth. Quest only receives and visualizes the computed state.
Setup:
Camera: iPhone 16 Pro Max main camera
Sensor: reported Sony IMX903
iOS: 18.7.7
Capture software: DroidCam
Capture mode: configured for no/low compression, as far as I can tell
Camera mounting: as close as possible to directly above the table center, but it is an improvised solo setup, so it is not perfectly centered or perfectly perpendicular
Camera height: 2735 mm from the floor
Table playfield height: 810 mm
Table playfield size: 2450 x 1225 mm
Ball diameter: 57.15 mm
Pipeline: Python 3.12, OpenCV, NumPy, PyTorch, YOLOv5
GPU: mobile RTX 4070, 8 GB VRAM, CUDA acceleration
Runtime speed: usually around 15–20 FPS
I am using pix2pockets as the base detector/reference project.

The accuracy reported there is around 4 mm, excluding wrong measurements.

Current status of my pipeline:
Ball detection and classification mostly work. There are some duplicate detections and some misclassifications, but I plan to handle those in Unity/Quest with a user correction or conflict-resolution interface. My current blocker is not YOLO ball detection.
The blockers are:

  • Detecting the table reliably.
  • Computing a valid homography.
  • Detecting or placing the six pockets correctly.
  • Mapping all detections into stable table-plane coordinates.

On static images, I can get the result I want. I can also get success on recorded videos. The real problem is live phone capture. Slight disturbances such as shadows, small lighting changes, people walking near the table, or small frame-to-frame changes can break the homography or pocket calculation.
Sometimes a valid homography is found for a moment, but pocket detection does not lock/stabilize reliably.
Important detail:
My --debug-phone mode uses the same main live-detection pipeline as the normal phone mode, except that it runs offline without requiring the Quest receiver. The static-image debug mode is easier and more stable, but it does not expose the live capture instability.

The relevant table configuration is here:

The values I currently have are approximately:
The files associated with last_environment.json use green table cloth.
Repository/code:
Current project state:

The relevant directory is: PoolSimulatorComponents/CameraAnalysis
Testing files are here:

Here are the commands I use for development and debugging.
Static ball detection:
python detection.py --debug-detection --debug --debug-static --debug-offline

This runs static-image ball detection. The --debug-offline flag means no Quest 3 receiver is required. The image or folder can be passed with --debug-image, for example:
python detection.py --debug-detection --debug --debug-static --debug-offline --debug-image "/path/to/image-or-folder"

Static pocket visualization:
python detection.py --debug-pocket-display --debug --debug-static --debug-offline

This uses the static-image input path and visualizes detected/calculated pocket locations.

Static cue-stick visualization:

python detection.py --debug-cue --debug --debug-static --debug-offline

This visualizes the cue-stick detection, but this is not my current priority because I still do not have stable table-plane coordinates.

Recorded video input:

python detection.py --debug --debug-recorded --debug-video "/path/to/video-or-folder" --debug-offline

This replaces static-image input with recorded-video input.

Live phone input:

python detection.py --debug-pocket-display --debug --debug-phone --debug-offline

This connects to DroidCam and uses the live phone stream. This is where the instability appears.

Things I am considering:

  • Use normalized relative table coordinates instead of absolute image coordinates.
  • Add a QR marker as a stable origin, probably near one corner or outside the playfield.
  • Use a QR marker as (0, 0) and use its rotation to straighten the table coordinate system.
  • Use known camera height and measured table dimensions to constrain the homography.
  • Stop trying to detect pocket holes directly and instead detect the table boundary/rails, then place pockets from known geometry (a rough sketch of this, combined with temporal median filtering, follows this list).
  • Detect pocket candidates visually, but accept them only if their distances match expected table geometry.
  • Use temporal median filtering instead of frame-to-frame pocket locking.
  • Save every attempted pocket detection frame with overlays showing detected pockets, expected pockets, homography determinant, and pixel/mm error.
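
A rough sketch of what I mean by placing pockets from known geometry plus temporal median filtering (the homography direction, names, and window size are placeholders, not my actual code):

```python
import collections
import cv2
import numpy as np

# Playfield size from above (mm); pockets assumed at the four corners
# plus the midpoints of the two long rails.
TABLE_W, TABLE_H = 2450.0, 1225.0
POCKETS_MM = np.float32([
    [0, 0], [TABLE_W / 2, 0], [TABLE_W, 0],
    [0, TABLE_H], [TABLE_W / 2, TABLE_H], [TABLE_W, TABLE_H],
])

def pockets_in_image(H_table_to_img):
    """Project the six nominal pocket positions into image pixels.
    H_table_to_img: 3x3 homography mapping table-plane mm -> image px."""
    pts = POCKETS_MM.reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H_table_to_img).reshape(-1, 2)

# Temporal median over the last N per-frame estimates instead of
# locking onto a single frame's detection.
history = collections.deque(maxlen=30)

def stabilized_pockets(current_estimate):
    history.append(np.asarray(current_estimate, dtype=np.float32))
    return np.median(np.stack(history), axis=0)  # (6, 2) median positions
```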

Main questions:
What would be the correct robust computer-vision approach here?
Should I:
  • detect the playfield rectangle and derive the pocket positions from known geometry,
  • detect actual pocket openings visually,
  • use fiducial markers,
  • use QR calibration around the table,
  • or combine expected table geometry with local visual refinement?
How would you make this robust enough for live phone capture where the camera is fixed but not perfectly centered or perpendicular, and where lighting/shadows are not fully controlled?
Any advice about missing dependencies or assumptions would also help, especially around:

  • HSV masking,
  • homography validation,
  • temporal filtering,
  • table-edge detection,
  • pocket geometry validation,
  • marker placement, and
  • whether 4 mm accuracy is realistic in this setup.

Below you'll find the detections I already get on static images. (Please ignore the "not responding" window title in the screenshot.)
Thanks.

what does the advisor say? do they limit their advice to formalities/ceremony, or do they have subject matter expertise?

have you attended lectures on computer vision? what techniques did they teach?


for registration (homography), you could do feature matching. that’s “traditional”. come up with an ideal image (illustration) of the pool table, match the camera view against it. that model image needs to be very much in agreement with the actual table. “looks like a pool table” isn’t good enough. these things have specifications. look them up and base your model image on those. verify that the table matches the specs too. if it doesn’t, maybe it’s manufactured to some other spec, or it follows no spec at all.
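
a rough sketch of that matching step, using ORB since it's free and built into opencv (the model image, feature counts and thresholds are placeholders; a mostly featureless cloth may not give many matches, which is one more reason the model image needs the rails, diamonds and pockets in it):

```python
import cv2
import numpy as np

# hypothetical "ideal" top-down table image rendered to spec,
# e.g. 1 px = 1 mm, so table coordinates fall straight out of the model image
model = cv2.imread("table_model.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
kp_model, des_model = orb.detectAndCompute(model, None)

def register(frame_gray):
    """Estimate the homography camera frame -> model image, or None on failure."""
    kp_frame, des_frame = orb.detectAndCompute(frame_gray, None)
    if des_frame is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_frame, des_model)
    if len(matches) < 12:
        return None
    src = np.float32([kp_frame[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_model[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    if H is None or mask.sum() < 10:
        return None
    return H
```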

recompute the alignment every second or whatever interval feels sufficient, as a background thread, so it doesn’t interfere with per-frame processing. keep the homography as “state” and apply it to the per-frame processing. it’s fine to be “out of date” by a second or a few. neither the camera nor the table is liable to move all that much.
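
a sketch of that "recompute in the background, keep it as state" idea, reusing a register() function like the one above (names and the refresh interval are placeholders):

```python
import threading
import time

class HomographyState:
    """Holds the latest valid homography; a background thread refreshes it."""
    def __init__(self, interval_s=1.0):
        self.H = None
        self.latest_frame = None
        self._lock = threading.Lock()
        self._interval = interval_s
        threading.Thread(target=self._worker, daemon=True).start()

    def submit_frame(self, frame_gray):
        with self._lock:
            self.latest_frame = frame_gray

    def _worker(self):
        while True:
            time.sleep(self._interval)
            with self._lock:
                frame = self.latest_frame
            if frame is None:
                continue
            H = register(frame)          # e.g. the ORB sketch above
            if H is not None:
                with self._lock:
                    self.H = H           # keep the old H if this frame failed

    def current(self):
        with self._lock:
            return self.H
```

the per-frame loop just calls submit_frame() and reads current(); it never blocks on registration.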

if you see that feature matching alone is jittery, you can improve registration via ECC refinement between camera view and ideal image. if you need to track fine movements in real time cheaply (I doubt that’ll be needed), you could run a correlation tracker (MOSSE) against patches located on the four corners of the table.
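
a sketch of the ECC refinement step, assuming you already have a rough frame-to-model homography (iteration count and epsilon are guesses):

```python
import cv2
import numpy as np

def refine_with_ecc(frame_gray, model, H_frame_to_model, iters=50, eps=1e-6):
    """Refine a frame->model homography with ECC alignment.
    findTransformECC estimates W so that inputImage(W(x)) ~ templateImage(x);
    with the model as template, W maps model coords -> frame coords = inv(H)."""
    warp = np.linalg.inv(H_frame_to_model).astype(np.float32)
    warp /= warp[2, 2]
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, iters, eps)
    try:
        _, warp = cv2.findTransformECC(
            model, frame_gray, warp, cv2.MOTION_HOMOGRAPHY, criteria, None, 5)
    except cv2.error:
        return H_frame_to_model          # keep the unrefined estimate
    return np.linalg.inv(warp)
```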

given the registered image of the table, you can use whatever you like to detect/locate the balls and cue. you could use a DL model for the balls and cue. or something more “pedestrian” that picks circular blobs that contrast against the table. the cue is also a blob contrasting against the table, but it’s long and thin. you could distinguish it from the balls using shape descriptors (“moments”).
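
a sketch of the "pedestrian" blob approach, using elongation from minAreaRect as a cheap stand-in for moment-based shape descriptors (the HSV cloth range is a placeholder for green cloth; this assumes the frame is already restricted to the playfield):

```python
import cv2
import numpy as np

def find_blobs(frame_bgr, cloth_lo=(35, 60, 40), cloth_hi=(85, 255, 255)):
    """Return (ball_candidates, cue_candidates) as lists of contours."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    cloth = cv2.inRange(hsv, np.array(cloth_lo), np.array(cloth_hi))
    not_cloth = cv2.bitwise_not(cloth)
    not_cloth = cv2.morphologyEx(not_cloth, cv2.MORPH_OPEN,
                                 np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(not_cloth, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    balls, cues = [], []
    for c in contours:
        area = cv2.contourArea(c)
        if area < 50:
            continue
        perim = cv2.arcLength(c, True)
        circularity = 4 * np.pi * area / (perim * perim + 1e-9)
        (_, _), (w, h), _ = cv2.minAreaRect(c)
        elongation = max(w, h) / (min(w, h) + 1e-9)
        if circularity > 0.7 and elongation < 1.5:
            balls.append(c)           # round, compact blob
        elif elongation > 5:
            cues.append(c)            # long, thin blob
    return balls, cues
```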

you could detect the balls and cue in every frame, or you could track them, i.e. detect when needed, and then use a tracking algorithm to follow. OpenCV has a bunch of tracking algorithms. some are DL-based, obviously the most powerful. some are dumb (mistake one ball for another) but computationally cheap and extremely accurate (subpixel tracking possible).
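
a minimal detect-then-track sketch with one of opencv's built-in trackers (CSRT here; whether it lives under cv2 or cv2.legacy depends on your opencv version and whether the contrib package is installed):

```python
import cv2

def start_trackers(frame, ball_boxes):
    """Initialize one tracker per detected ball; boxes are (x, y, w, h) in px."""
    trackers = []
    for (x, y, w, h) in ball_boxes:
        t = cv2.TrackerCSRT_create()
        t.init(frame, (int(x), int(y), int(w), int(h)))
        trackers.append(t)
    return trackers

def update_trackers(frame, trackers):
    boxes = []
    for t in trackers:
        ok, box = t.update(frame)
        boxes.append(box if ok else None)   # None = lost, re-detect this ball
    return boxes
```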

if you want to refine the poses of balls, you can also use ECC refinement.

you can probably tell ball types apart (object ball, cue ball) fairly easily. telling object balls apart is probably more involved.

to deal with shadows, play around with color spaces. or see which background segmentation models might be able to handle that for you. or discard blobs that, by their shape, cannot possibly be a ball or cue.
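
for example, MOG2 background subtraction flags shadow pixels separately, so they can be dropped instead of being mistaken for balls or the cue (parameters are placeholders):

```python
import cv2

bg = cv2.createBackgroundSubtractorMOG2(history=300, varThreshold=25,
                                        detectShadows=True)

def foreground_without_shadows(frame_bgr):
    mask = bg.apply(frame_bgr)     # 255 = foreground, 127 = shadow, 0 = background
    _, fg = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)   # drop the shadow label
    return fg
```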

as for accuracy, I believe that around one pixel of accuracy is achievable, whatever that is at the camera focal length and distance to the table, give or take maybe half an order of magnitude. this hinges on an accurate lens distortion model. calibrating cameras is not trivial. see "Calibration Best Practices" on calib.io; even those pieces of advice don't guarantee a good calibration, because someone who's new to this stuff won't necessarily understand the advice or implement it right.
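
for reference, the usual checkerboard calibration looks roughly like this (pattern size, square size and image folder are placeholders; this is not specific to your pipeline):

```python
import cv2
import glob
import numpy as np

PATTERN = (9, 6)        # inner corners of the checkerboard
SQUARE_MM = 25.0

objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_pts, img_pts, size = [], [], None
for path in glob.glob("calib/*.jpg"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_COUNT, 30, 0.001))
        obj_pts.append(objp)
        img_pts.append(corners)

rms, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
print("RMS reprojection error (px):", rms)
# later: undistorted = cv2.undistort(frame, K, dist)
```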

My advisor is more of a computer graphics expert who knows a bit of the theory and basics, and who can connect me with other professors at my faculty.
I have basic computer-vision theory, so I can quickly get basic, ideal circumstances working. We haven't covered any particular techniques, just the basics like the Hough transform, the Canny edge detector, and homographies. I don't think I can do anything with SIFT/SURF descriptors.
It's only the live image that is giving me problems; here are some results from pre-recorded footage. The recorded sample with balls is solved satisfactorily.
Regarding the other points I'll reply tomorrow, since I am contemplating whether I even need pockets for my Quest application. Anyway, thanks for the hints; I'll try something and see how it works on live media.

some kind of matching has to happen, or else you are gonna have trouble calculating a homography.

the only way around feature matching would be to stick markers for Augmented Reality to the table, and then measure their exact position relative to the table and relative to each other.

in your videos, I see QR codes. those are made to carry data, not to be located precisely. Look into ARUCO or APRIL tags. those two are supported by opencv.
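
a sketch of marker-based registration with opencv's aruco module (the marker IDs and their table-plane coordinates are placeholders you'd have to measure yourself; the detector API shown is the one from opencv >= 4.7, older versions use cv2.aruco.detectMarkers()):

```python
import cv2
import numpy as np

DICT = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(DICT, cv2.aruco.DetectorParameters())

# placeholder: table-plane mm coordinates of each marker's 4 corners
# (top-left, top-right, bottom-right, bottom-left), measured on the real table;
# negative values = outside the playfield, on the rail
MARKER_CORNERS_MM = {
    0: np.float32([[-150, -150], [-20, -150], [-20, -20], [-150, -20]]),
    1: np.float32([[2470, -150], [2600, -150], [2600, -20], [2470, -20]]),
    # ... more markers near the remaining corners
}

def homography_from_markers(frame_gray):
    """Image px -> table-plane mm from whichever known markers are visible."""
    corners, ids, _ = detector.detectMarkers(frame_gray)
    if ids is None:
        return None
    src, dst = [], []
    for c, marker_id in zip(corners, ids.flatten()):
        if int(marker_id) in MARKER_CORNERS_MM:
            src.append(c.reshape(4, 2))
            dst.append(MARKER_CORNERS_MM[int(marker_id)])
    if not src:
        return None
    src = np.concatenate(src).astype(np.float32)
    dst = np.concatenate(dst).astype(np.float32)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H
```

with a single small marker the extrapolation across the whole table will be poor, so spread several markers around the table and measure them carefully.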

I also see that detecting the balls appears to perform well. that’s good.

by the way, the homography need not be applied to the camera view directly. you can do all your detections in camera frames, and then transform the bounding boxes, contours, etc afterwards. that has the advantage of all the detections getting to work on a sharp image. applying a homography to an image will make it a little blurry because every pixel has to be resampled. the only issue with doing it like this would be if the camera view looks so much different from a perfect top-down view that detections will fail. that would be the case if your ball detector assumes balls to have a particular pixel size.
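
a sketch of that "transform the detections, not the image" idea, assuming H maps camera pixels to table-plane coordinates:

```python
import cv2
import numpy as np

def to_table_plane(points_px, H_img_to_table):
    """Map detected points (ball centers, pocket candidates, ...) from camera
    pixels into table-plane coordinates without warping the image itself."""
    pts = np.asarray(points_px, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H_img_to_table).reshape(-1, 2)

# example: centers of detector boxes -> table plane
# centers = [((x1 + x2) / 2, (y1 + y2) / 2) for (x1, y1, x2, y2) in boxes]
# table_coords = to_table_plane(centers, H)
```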

Thanks for your replies. Regarding your proposals, here are my thoughts:

  • I agree with your proposal to match the features of the QR/ArUco markers via SIFT/SURF or something else. Regarding the table: from my experience I probably won't have access to an official tournament table that strictly follows the standards, since I already had a hard time finding a "suitable" place at all. My advisor visited this place and commented that even it won't be good enough for a real evaluation :grimacing:, but at least I can sort out development issues and get some application development done. The table is only valid by ratio, which is 2:1, and by the relations of the other "objects" like the diamonds and pockets. The diamonds are detected by the same algorithm as the balls but are labelled as "unknown" and filtered out in the final display. I also measured the pockets, but for those I have a different idea of how to set up the environment inside the Quest view. It would require much more manual work (with saving the built environment for subsequent runs), but I'll accept that just to bring the scope down to a more manageable project. I also think computer vision would be useless there anyway.
  • The camera and/or table are static objects: once fixed, they stay in place and no movement occurs. That is why I am using cached values for now; somewhat old information would be OK as long as it isn't a singular matrix.
  • I have some calibration implemented, but I am not sure how well, since the main camera sensor supposedly doesn't need much additional calibration; even the phone's standard settings only provide lens calibration for the ultra-wide and front cameras. I'll focus on that once I have everything else working, since it is only polishing afterwards.
  • Yes, I have QR codes, since they are supported by Meta Quest 3 and can also be detected there. I was contemplating having one or two codes to add some relative points from 0 to 1 so I can do this more efficiently, but ArUco/AprilTag markers are still under consideration. I am aware of the TakashiYoshinaga/QuestArUcoMarkerTracking project on GitHub, which does single and multi-marker ArUco tracking with OpenCV via the Passthrough Camera API on Quest 3/3S, but at this point I am a little worried about overall performance. I guess I have to generate markers around ~13 cm in size to be visible to the phone at my current height (which unfortunately cannot be changed).

I am still having some issues with the whole pipeline. On the live stream I am unable to get a non-singular homography matrix, which is not a good sign. I need to do a sanity check on this, but I don't know exactly what to do.
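
The sanity check I have in mind would be roughly the following (I am not sure whether these thresholds are reasonable):

```python
import cv2
import numpy as np

def homography_is_sane(H, src_pts=None, dst_pts=None, max_err_px=5.0):
    """Basic validity checks for a 3x3 homography.
    src_pts/dst_pts: optional known correspondences for a reprojection check."""
    if H is None or not np.all(np.isfinite(H)) or abs(H[2, 2]) < 1e-12:
        return False
    H = H / H[2, 2]                            # normalize scale
    if abs(np.linalg.det(H[:2, :2])) < 1e-3:   # collapses the plane
        return False
    if np.linalg.cond(H) > 1e7:                # numerically close to singular
        return False
    if src_pts is not None and dst_pts is not None:
        proj = cv2.perspectiveTransform(
            np.float32(src_pts).reshape(-1, 1, 2), H).reshape(-1, 2)
        err = np.linalg.norm(proj - np.float32(dst_pts), axis=1)
        if np.mean(err) > max_err_px:
            return False
    return True
```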