Position and rotation tracking of defined objects: Template matching, feature detection, YOLO or something different?

Hi,
I am new to image processing and not sure which approach would be best for my problem; maybe someone can point me in the right direction.

What problem am I trying to solve?
I want to track the position and rotation of a known but complex object of known size (no scaling) in a real-time Full HD stream.
An example frame might look something like this:


Note that the orientation is only defined by the single small hole at the top right.

After identifying the position and rotation, I want to overlay an image that matches the contours and marks a spot on the object; for this object, an example overlay would be this:
https://imgur.com/li2kQU3
The resulting image should be this:
https://imgur.com/sWeNa6v
Sorry for the imgur links, but as a new user I only have permission to embed one image.

If the object is moved (and rotated), the overlay should follow it and adjust its position and orientation to match the object again.
The overlay itself is no problem: with some masking I can place it onto a frame quickly, roughly as sketched below. The tracking, especially the rotation, is the tricky part.
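For context, the compositing I use looks roughly like this (illustrative sketch; the function name and the assumption that the overlay is a 4-channel BGRA image with alpha are mine):

```python
import cv2
import numpy as np

def place_overlay(frame, overlay_bgra, center, angle_deg):
    """Rotate the overlay and alpha-blend it onto the frame.
    Sketch only; assumes overlay_bgra is a 4-channel BGRA image."""
    h, w = overlay_bgra.shape[:2]
    # rotate the overlay (including its alpha mask) around its own center ...
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    # ... and shift it so its center lands on the detected object center
    M[0, 2] += center[0] - w / 2
    M[1, 2] += center[1] - h / 2
    warped = cv2.warpAffine(overlay_bgra, M, (frame.shape[1], frame.shape[0]))
    alpha = warped[:, :, 3:4].astype(np.float32) / 255.0
    out = frame.astype(np.float32) * (1.0 - alpha) + warped[:, :, :3].astype(np.float32) * alpha
    return out.astype(np.uint8)
```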

The application will run on Windows 10 systems without a dedicated GPU, only an Intel G4400T (2 cores @ 2.9 GHz). Better systems are possible if needed, but they need to be fanless, so no big dGPU. Different systems like a fanless Jetson Nano are possible, too.
The tracking can have some delay (1 s max would be great).

Since everything is in a defined environment with a known background and a known object, my first guess is that a DNN like YOLO is not needed.
But the normal OpenCV template matching is not up to the task, I think.
I cannot reduce the frame size by more than half, because the small features important for detection are lost when downscaling further.
I’ve tried “brute force” template matching by defining 360 rotated templates, matching all of them and selecting the best match to get the orientation. This works, but it is very slow: matching one half-sized frame against 360 templates takes several seconds, and an angular resolution of only 1° is not optimal.
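For reference, the brute-force loop was roughly like this (a sketch, not my exact code; the TM_CCOEFF_NORMED score and the 1° step are just what I tried):

```python
import cv2
import numpy as np

def brute_force_match(gray_frame, template, step_deg=1):
    """Rotate the template in fixed steps, run matchTemplate for each rotation
    and keep the best score. Works, but far too slow for real time."""
    h, w = template.shape[:2]
    best_score, best_angle, best_loc = -1.0, None, None
    for angle in range(0, 360, step_deg):
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        rotated = cv2.warpAffine(template, M, (w, h))
        result = cv2.matchTemplate(gray_frame, rotated, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)
        if max_val > best_score:
            best_score, best_angle, best_loc = max_val, angle, max_loc
    return best_angle, best_loc, best_score
```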

Is this a task that could be solved better, and especially faster, with YOLO? Would I be able to train a model on a powerful system with a dGPU and then run it on a slow CPU-only system?

Or would I be better off with a different approach like feature detection? I’ve tried a bit with ORB, but didn’t get correct matches and results, probably because I didn’t configure it right.

A different approach could be to detect the location and rotation once and then track the movement (and therefore the position and rotation) in the frequency domain, but I have no clue whether this could work; it is just an idea.
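If I understand it right, OpenCV’s phaseCorrelate would be the frequency-domain tool for the translation part; how to get the rotation the same way I am not sure, maybe via a log-polar remap first (untested sketch):

```python
import cv2
import numpy as np

def frame_shift(prev_gray, curr_gray):
    """Estimate the (dx, dy) translation between two consecutive grayscale
    frames via FFT-based phase correlation. Untested idea, sketch only."""
    prev_f = np.float32(prev_gray)
    curr_f = np.float32(curr_gray)
    (dx, dy), response = cv2.phaseCorrelate(prev_f, curr_f)
    return dx, dy, response

# rotation is not covered by this; one extension would be to remap both frames
# with cv2.warpPolar(..., flags=cv2.WARP_POLAR_LOG) first, so that a rotation
# of the object shows up as a shift along the angle axis
```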

Or should I try to optimize the matching?
It would be faster to detect the object, extract the ROI containing it, and then only match against the ROI instead of the whole frame.
Maybe even faster: threshold the ROI and compare the binary image with numpy against 360 binary templates; the rotated binary template with the least difference to the binary frame ROI gives the detected rotation (rough sketch below). But the margin between a correct and a false match could be quite small, and artefacts might undermine this approach.
Instead of comparing images I could compare contours; maybe that would be even faster?
On a system with more cores I could use multiprocessing to share the workload for faster results.
But I don’t know how complex and error-prone an implementation like this would be.
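What I have in mind for the binary comparison looks roughly like this (sketch only; the function names are made up, and it assumes the ROI is already cropped to the template size):

```python
import cv2
import numpy as np

def precompute_rotations(template_bin, step_deg=1):
    """Rotate the binary template once at startup so the per-frame work
    is only a numpy comparison. Sketch only."""
    h, w = template_bin.shape[:2]
    stack = []
    for angle in range(0, 360, step_deg):
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        stack.append(cv2.warpAffine(template_bin, M, (w, h)))
    return np.stack(stack)  # shape: (360, h, w)

def best_rotation(roi_bin, rotations, step_deg=1):
    """Pick the rotation whose binary template differs least from the ROI."""
    diffs = np.count_nonzero(rotations != roi_bin[None, :, :], axis=(1, 2))
    return int(np.argmin(diffs)) * step_deg  # angle in degrees
```

For the contour idea, as far as I understand cv2.matchShapes compares Hu moments, which are rotation invariant, so it would confirm the shape but not directly give me the angle.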

The example shown is only one possible object; it should be possible to add new objects by adding a new template, not by writing some hard-coded manual detection for each one.

I am sure my problem is solvable; maybe someone with more experience can give me some guidance on what a good solution could look like.
I don’t need or want someone to solve my problem and write a solution for me, but some hints about which direction is the most promising would be greatly appreciated!

sounds like you want “template matching” as it is understood in “machine vision” (packages like mvtec halcon, cognex, …)

OpenCV doesn’t have that, as far as I know. nobody has bothered implementing those proprietary algorithms that are the main selling point of those commercial packages.

this can be solved like so:

  1. thresholding
  2. consider the outermost contour for its center
  3. take a ring of samples (polar transform but more constrained)
  4. analyze the ring profile to find the holes
  5. consider the distances between holes to figure out the orientation

do you have real pictures to work with? or else: can I assume that those pictures will have decent contrast so thresholding will give a decent segmentation?


Thanks, yes, it is a machine vision application I am trying to develop.
Unfortunately I am not allowed to share real pictures, but I’ve made the example as realistic as possible with a mock-up CAD model. Contrast will be good: the background is dark and the objects are bright metal. After converting to HSV and using inRange it’s easy to create a binary image/mask.
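Roughly like this; the inRange bounds are placeholders, the real values depend on the lighting:

```python
import cv2
import numpy as np

def part_mask(frame_bgr):
    """Binary mask of the bright metal part on the dark background.
    The bounds below are placeholders and would need tuning to the lighting."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 0, 120])     # any hue, low saturation, high value
    upper = np.array([180, 60, 255])
    return cv2.inRange(hsv, lower, upper)
```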

I am sure the solution you posted will work for this example, but if I want to add other objects, for example a similar round part whose orientation is defined by a notch, or a different rectangular part, I would need to hard-code a new detection for every new object.

Do you have an opinion on some of my other ideas, especially using DNNs?

I’ve looked at YOLO and DNNs a bit, and I think this might be a possible option:
From what I understand, I can train a PyTorch model on a GPU and then convert it so it can be used by the OpenCV DNN module on the CPU, roughly as sketched below.
I could keep the models small by training only one object per model; the application knows from user input which object will be in a frame and can load the matching model.
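Roughly this kind of pipeline is what I mean (illustrative sketch; the tiny placeholder network, file names and input size are made up, the real model would be the trained detector):

```python
import torch
import torch.nn as nn
import cv2

# --- on the training machine (GPU): export the trained model to ONNX ---
model = nn.Sequential(                      # placeholder for the real detector
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 5, 1)
)
model.eval()
dummy = torch.randn(1, 3, 640, 640)
torch.onnx.export(model, dummy, "detector.onnx", opset_version=11)

# --- on the fanless target (CPU only): run it with the OpenCV DNN module ---
net = cv2.dnn.readNetFromONNX("detector.onnx")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

frame = cv2.imread("frame.png")             # placeholder input frame
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (640, 640), swapRB=True)
net.setInput(blob)
outputs = net.forward()
```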
Different ideas for approaches would be:
Using a YOLO version with rotation:
https://github.com/BossZard/rotation-yolov5

Or get the orientation by training on several distinctive parts of the template, detecting those parts and deriving the orientation from their geometrical relations, as shown in this video:
https://www.youtube.com/watch?v=eFsljRvPHp0

no, misconception. the model won’t get significantly smaller with fewer classes in it.
95% of the weights are in the bottom (conv) layers; fewer classes (in the top layers) won’t change those.


that’s doomed because metal will reflect any which way. unless it’s dulled/matte and truly brighter than the background everywhere.

make the background lit and translucent, and the foreground dark. that’s the only way to get reliable contrasts.

a DNN might help you find the object but that’s a trivial task already. it is unlikely to help you determine the object’s orientation/rotation. you might have luck training it to find a notch, if there are no other notches. it will have great trouble recognizing one particular hole among many identical ones, just from a sequence of holes. that requires a large receptive field, or processing that is spatially unconstrained (i.e. symbolic).

you should go talk to a representative from mvtec or cognex. you seem to want an easy solution. they have easy solutions.