Object detection - Haar cascades vs. template matching vs. FastFeatureDetector vs. ?

Hello everyone!

Since this is my first post I’d like to introduce myself before asking questions :wink:
I’m from Switzerland, 37 I guess (did not check in a while). I’m a test automation engineer who decided to go back to school, so some time ago I started a bachelor’s degree in IT while working part time.

In one of the modules I’m attending, the team is building a small robot that has to analyse a staircase, calculate a path around some obstacles and finally climb it along this calculated path. This calculation has to be done exactly once, on a single picture as shown below.

My question here is regarding the detection of those obstacles which are placed on the stair.

As you can see in the image, two colors of stones are used as obstacles, and one of them almost matches the color of the stair, which makes it hard to work with color at all.

I have a bit of experience with Haar cascades, since we used this method in the same project for another task, so I went for the same approach with the stones. Unfortunately I was not able to get a good result out of it. I could of course take more pictures and build a larger dataset, but this is very time-consuming and did not seem like the right way.

As a second approach I tried a FAST feature detector to get all the corners. But as you can imagine, even after tuning the threshold you get way too many points, which are almost impossible to group and identify as stones, especially because of the background, which might also change in the future.

When we asked the school staff, we were told that template matching could also be a possible solution. I honestly have not tried it yet, since time is getting shorter and shorter, but I’ve read the theory behind it.
In my opinion, template matching is not the best option here since we’d need to have many templates of stones in every possible position in order to get an accurate reading.

As there are surely people here with a lot of experience in different approaches, I’d like to ask which way you would choose to tackle this challenge.
Is there one right thing to do?

Thank you in advance and best regards

Template matching in OpenCV is not invariant to rotation or scale (it only handles translation, by sliding the template over the image). That could be a problem, I think.
Is deep learning forbidden?

Hi, thank you for your thoughts. No, deep learning is not forbidden, but in the end it’s a Raspberry Pi 3 B+ that has to run the model. We cannot spend money on a TPU, which is why we chose a different path.

What framework / model do you have in mind for this task? (I don’t have any experience with DL.)

I have to be blunt. they are lying to you. that’s never ever gonna work. not as a 2D method on a picture of a scene of this complexity. or they’re incompetent. or they call advanced methods (DNN object detection) “template matching”, which they aren’t.

a monocular picture isn’t good enough.

you need 3D data for your planning. that means you need some kind of depth sensing. it could be a stereo camera, or some kinect-type sensor, or a time of flight camera, or lidar, or a structure-from-motion approach (i.e. video).

… or use a DNN that hallucinates depth information into a monocular picture. yes, hallucinates, that’s a technically accurate description.


No offense, this April 1st post seems like an April 1st joke.

I suggest you replace the Raspberry Pi with a ZX81. That should be a real challenge.

I would not take the effort to produce such a detailed post about a problem if it were just for a joke.

So I’m still stuck with the original question and I don’t really know where to go from here.
Structure from motion sounds like the only thing we could do, since we cannot spend more money on equipment. But honestly, I don’t see how we could then calculate the path.

Mapping the /==\ stair to a square |==|, and knowing where the obstacles are and how many steps there are, we could calculate the shortest path with Dijkstra. So we just need to spot and mark those stones.
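A minimal sketch of the Dijkstra step, assuming the stair has already been reduced to a grid (one row per step, a few cells across) with stones marked as blocked. The allowed moves (one step up or one cell sideways, each costing 1) are an assumption; with equal costs this behaves like a breadth-first search, and Dijkstra only pays off once moves get different weights.

```python
import heapq


def shortest_path(grid):
    """Dijkstra over a stair grid.

    grid[r][c] is True where a stone blocks the cell. Row 0 is the bottom
    step, the last row is the top step. Assumed moves: one step up or one
    cell sideways, each with cost 1. Returns a list of (row, col) or None.
    """
    rows, cols = len(grid), len(grid[0])
    dist, prev, heap = {}, {}, []
    # Every free cell on the bottom step is a possible start.
    for c in range(cols):
        if not grid[0][c]:
            dist[(0, c)] = 0
            heapq.heappush(heap, (0, (0, c)))
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue  # stale queue entry
        if r == rows - 1:
            # The first top-step cell popped is the cheapest one reachable.
            path = [(r, c)]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return path[::-1]
        for dr, dc in ((1, 0), (0, -1), (0, 1)):  # up, left, right
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not grid[nr][nc]:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None  # no free route to the top step


# Toy stair: 4 steps x 5 cells, True marks a stone.
stair = [
    [False, False, True, False, False],
    [True, False, False, False, True],
    [False, False, True, True, False],
    [False, True, False, False, False],
]
print(shortest_path(stair))  # [(0, 1), (1, 1), (2, 1), (2, 0), (3, 0)]
```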

DL → TensorFlow Lite with MobileNet? As I wrote, I have no experience with deep learning…

With the Raspberry Pi it’s possible offline: training takes far more time than inference.
You can train the network on a regular PC and use it for inference on the RPi.
If it’s always the same stair, why don’t you use a simple subtraction (using edges)?


I’m sorry for that, my bad.

I’m lacking some background, as the general statement “trace a 3D path from only one picture” sounds like a doctoral thesis. I believe this is an original project nobody has done before. Correct me if I’m wrong.

I understand you won’t get a real, robust solution for all stairs, but a minimal proof of concept working with these bricks on this particular stair. And that you aren’t planning a 3D path, but a 2D path on the stair plane, using discrete stair steps instead of depth.

Structure from motion needs more than one picture, so it’s out of the question. The FAST feature detector looks for its own kind of features (corners), and those bricks are pretty flat, so they are featureless for FAST. I don’t see FAST helping.

There is no obvious way to do it. I will suggest some approaches; each of them needs some effort to test.

Development with deep learning is very time-consuming. In a rush I would train a U-Net for brick segmentation, because it can be trained with very few images (like 50 original images, augmented). You can train in Google Colab without spending money, and then run the trained model on your Raspberry Pi. But to be fair, if you don’t have experience with it, starting with deep learning will take a lot of time and effort, and training a network is not the first step but the last.

Without deep learning, I would try to detect those bricks in two ways:

a) Looking for flat surfaces: the bricks are fairly flat and the background is not

  • apply vertical and horizontal Sobel filters
  • add their absolute values to get edges
  • threshold for low absolute values, to get a non-edge segmentation
  • apply morphological operations to denoise and get rid of flat areas smaller than those you are looking for
  • find contours, filter them by area and shape, and get the position of each brick

So you get a pixel coordinate for each flat side of each brick. You may need to merge two or three areas into one brick.

b) Composing straight lines to get the brick edges

  • use the fast line detector (FastLineDetector in OpenCV’s ximgproc contrib module) to get a bunch of straight lines. It is better and faster than Hough
  • filter by length, by area (discard lines outside the stair) and by any other method you come up with. You will need a lot of filtering.
  • look for the vertical lines first; they can be a distinctive brick feature
  • then try to connect the other lines to form a brick

It isn’t easy: you’ll get a lot of undesirable lines, and many brick lines won’t touch at the vertices.
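The first filtering steps of approach b) might look like this sketch. `filter_segments` and its thresholds are hypothetical; it keeps long, roughly vertical segments whose midpoint lies inside an optional region of interest (e.g. the stair). Connecting the remaining lines into bricks is not covered. `detect_segments` wraps OpenCV’s FastLineDetector, which needs the `opencv-contrib-python` build, so `cv2` is imported only when that function is called.

```python
import math


def filter_segments(segments, min_length=15.0, max_tilt_deg=20.0, roi=None):
    """Keep long, roughly vertical segments (x1, y1, x2, y2).

    roi: optional (x, y, w, h); segments whose midpoint lies outside are dropped.
    All thresholds are placeholders.
    """
    kept = []
    for x1, y1, x2, y2 in segments:
        dx, dy = x2 - x1, y2 - y1
        if math.hypot(dx, dy) < min_length:
            continue  # too short
        # Angle away from the vertical image axis: 0 degrees = perfectly vertical.
        tilt = math.degrees(math.atan2(abs(dx), abs(dy)))
        if tilt > max_tilt_deg:
            continue
        if roi is not None:
            rx, ry, rw, rh = roi
            mx, my = (x1 + x2) / 2, (y1 + y2) / 2
            if not (rx <= mx < rx + rw and ry <= my < ry + rh):
                continue  # e.g. outside the stair
        kept.append((x1, y1, x2, y2))
    return kept


def detect_segments(gray):
    """Run OpenCV's FastLineDetector on an 8-bit grayscale image."""
    import cv2  # cv2.ximgproc is only available in the contrib build
    fld = cv2.ximgproc.createFastLineDetector()
    lines = fld.detect(gray)
    return [] if lines is None else [tuple(float(v) for v in l[0]) for l in lines]
```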

You can apply several imperfect methods and combine them. Segmenting by color will give you a lot of false positives, but you can combine it with the flat-surface binary image and maybe with nearby lines: the more votes, the more likely it is a brick.
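The voting idea can be sketched with plain NumPy, assuming each method produces a binary mask of the same size (nonzero = "brick here"). The `vote` helper and `min_votes` are illustrative; to let "lines nearby" count, the line mask would first have to be dilated, which is not shown here.

```python
import numpy as np


def vote(masks, min_votes=2):
    """Combine imperfect binary masks: keep pixels where >= min_votes agree.

    masks: list of same-shaped arrays, nonzero meaning "brick".
    Returns a uint8 mask with 255 for kept pixels and 0 elsewhere.
    """
    stack = np.stack([np.asarray(m) > 0 for m in masks])
    votes = stack.sum(axis=0)  # number of methods voting for each pixel
    return np.where(votes >= min_votes, 255, 0).astype(np.uint8)
```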

About subtracting: this is only possible if the camera is fixed in the same position. If the camera is on your mobile robot, this is hardly an option.

I hope this helps.


No problem about the April 1st thing :wink:

Thank you very much for your detailed answer. It should not be a doctoral thesis, but it would be an interesting one indeed!

Yes, you are right. We thought that if we map the trapezoid that the stair forms in the picture (because of perspective) to a square and use each step as a “divider”, we end up with something like a chess board where stones mark the fields we cannot step on. → Reducing the 3D problem to a 2D space.
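The trapezoid-to-square mapping is a perspective transform (homography). Below is a sketch that solves it with NumPy from the four stair corners; `cv2.getPerspectiveTransform` computes the same kind of 3x3 matrix. `to_cell` is a hypothetical helper that assumes the stair is mapped to the unit square with the bottom step at v = 1 and that all steps are equally deep after rectification.

```python
import numpy as np


def homography(src, dst):
    """3x3 matrix H with H @ (x, y, 1) ~ (u, v, 1) for four point pairs.

    Linear solve with H[2, 2] fixed to 1.
    """
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u * (h6*x + h7*y + 1) = h0*x + h1*y + h2, rearranged to be linear in h
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = np.linalg.solve(np.array(a, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def to_cell(H, point, n_steps, n_cols):
    """Map an image pixel to a (step, column) cell on the rectified stair.

    Assumes v = 1 is the bottom step, v = 0 the top, and equally deep steps.
    """
    u, v, w = H @ np.array([point[0], point[1], 1.0])
    u, v = u / w, v / w
    col = int(min(max(u, 0.0), 0.999999) * n_cols)
    step = int(min(max(1.0 - v, 0.0), 0.999999) * n_steps)
    return step, col


# Made-up stair corners in the image: bottom-left, bottom-right, top-right, top-left.
src = [(20, 200), (300, 200), (230, 40), (90, 40)]
dst = [(0, 1), (1, 1), (1, 0), (0, 0)]
H = homography(src, dst)
print(to_cell(H, (160, 40), n_steps=4, n_cols=5))  # (3, 2): top step, middle column
```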

I’m especially interested in the deep learning approach, but I understand it might take too much time to reach a level where I understand what I’m doing, which is why I’m keeping it on the shelf for later.

I already played with Hough lines to pick up the short vertical lines of the stones, but without much success. I’m happy to have learned that there is a fast line detector, so this might be the next step for me to try. There are at least 2 and at most 3 parallel lines of interest for each stone. I think this is one of the key features we can hold on to.
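One hypothetical way to use the “2 to 3 parallel lines per stone” observation: sort the already filtered, near-vertical segments from left to right, chain neighbours that are close horizontally and overlap vertically, and keep only chains of 2 or 3 lines as stone candidates. `group_parallel`, `max_gap` and `min_overlap` are illustrative choices, not a tested recipe.

```python
def group_parallel(segments, max_gap=40.0, min_overlap=0.5):
    """Greedy grouping of near-vertical segments (x1, y1, x2, y2) into stones.

    Neighbouring segments (sorted by x) join a group when their horizontal
    distance is <= max_gap and their vertical extents overlap by at least
    min_overlap of the shorter one. Only groups of 2 or 3 lines are returned.
    """
    def span(s):
        return min(s[1], s[3]), max(s[1], s[3])

    def mid_x(s):
        return (s[0] + s[2]) / 2

    groups = []
    for seg in sorted(segments, key=mid_x):
        if groups:
            last = groups[-1][-1]
            (a0, a1), (b0, b1) = span(last), span(seg)
            overlap = min(a1, b1) - max(a0, b0)
            shorter = min(a1 - a0, b1 - b0)
            close = mid_x(seg) - mid_x(last) <= max_gap
            if close and shorter > 0 and overlap >= min_overlap * shorter:
                groups[-1].append(seg)
                continue
        groups.append([seg])
    return [g for g in groups if 2 <= len(g) <= 3]
```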

It might take some time, but I’ll write again about what we actually did and whether it worked out :slight_smile:

Thank you!


After working for some time with different filters to extract the position of those :face_with_symbols_over_mouth: bricks, we gave up…

I came back to the first reply, read up a bit and gave it a try with Roboflow and YOLOv5s.

After just one day I had a model that works well enough for our situation.

The only thing left is to run it on the RPi :slight_smile: Luckily we only have to run it once, on one picture.
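To close the loop with the chess-board idea, the detector’s boxes still have to be turned into blocked cells. A sketch, assuming each detection is `(x1, y1, x2, y2, conf)` in image pixels (with a YOLOv5 model loaded through PyTorch Hub, the first five columns of each row of `results.xyxy[0]` have this layout) and that some `cell_of` function maps an image point to a `(step, col)` cell. Using the bottom-centre of each box as the point where the stone rests is an assumption.

```python
def blocked_grid(detections, cell_of, n_steps, n_cols, min_conf=0.5):
    """Turn detector boxes into the stair grid used for path planning.

    detections: iterable of (x1, y1, x2, y2, conf) in image pixels.
    cell_of: function mapping an image point (x, y) to a (step, col) cell.
    Returns a list of rows; True marks a cell occupied by a stone.
    """
    grid = [[False] * n_cols for _ in range(n_steps)]
    for x1, y1, x2, y2, conf in detections:
        if conf < min_conf:
            continue  # ignore weak detections
        # Bottom-centre of the box: where the stone is assumed to sit.
        step, col = cell_of(((x1 + x2) / 2, y2))
        if 0 <= step < n_steps and 0 <= col < n_cols:
            grid[step][col] = True
    return grid
```

The resulting grid has the same layout (row 0 = bottom step, True = stone) that a grid-based Dijkstra search expects.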

Thanks again for your input


You did it! That’s great!