Difficulty finding the orientation of detected objects using YoloV5

Heya, I’ll keep this post as short as possible. I’m working on a sorting robot that needs to sort Dutch fryable snacks. I’ve trained a YoloV5 model, which works very well for detecting which snacks are where, but in order to actually sort the snacks with a robot arm, I also need each snack’s location and how many degrees it is rotated.

Because a YoloV5 bounding box is always an axis-aligned rectangle, I needed a different way of detecting the orientation of the snacks. For that I used basic OpenCV functions to filter out the background and create contours of the remaining snacks. The problem is that if snacks overlap, or lie directly against each other, two snacks get detected as one big snack.

I’m trying to find a better way to determine the orientation of the snacks, and I was wondering if any of you have tips for solving this.

PS: If this is too vague, I’ve also created a PDF with pictures and more context.

you’ll want instance segmentation. ultralytics probably have a solution for that. might be one of their later-version yolos (I don’t know).

that’ll give you a proper mask, per instance (object). no clumps.
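roughly like this with the ultralytics package (just a sketch — the checkpoint name and image path are placeholders, and you’d fine-tune on your own snack classes first):

```python
from ultralytics import YOLO

# placeholder: a segmentation-capable checkpoint, fine-tuned on your snacks
model = YOLO("yolov8n-seg.pt")

results = model("snacks.jpg")    # placeholder image path
r = results[0]

# one binary mask per detected instance, plus its class and confidence
for mask, box in zip(r.masks.data, r.boxes):
    cls_id = int(box.cls)
    conf = float(box.conf)
    binary = (mask.cpu().numpy() * 255).astype("uint8")  # HxW uint8 mask
```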

assuming the objects are convex, you can then just pick the centroid. if not, you’d have to run a distance transform and pick the maximum-distance point (which is “deepest” into the object’s contour).
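with OpenCV that could look like this (a sketch, assuming `mask` is an 8-bit binary mask of a single instance):

```python
import cv2

# mask: uint8, 255 = object pixels, 0 = background (single instance)
m = cv2.moments(mask, binaryImage=True)
cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # centroid, fine for convex shapes

# non-convex case: pick the point deepest inside the contour
dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
_, _, _, deepest = cv2.minMaxLoc(dist)             # (x, y) of the maximum distance
```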

a proper mask (or contour, equivalent) also lets you get an oriented bounding rectangle.

a mask/contour also lets you calculate the major axis. if it’s a square, that’d be silly of course.
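both come straight out of OpenCV once you have the contour (sketch, same `mask` assumption as above):

```python
import cv2

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnt = max(contours, key=cv2.contourArea)        # largest contour of the instance

# oriented bounding rectangle: ((cx, cy), (w, h), angle in degrees)
(center, (w, h), angle) = cv2.minAreaRect(cnt)

# major axis via an ellipse fit (needs at least 5 contour points)
if len(cnt) >= 5:
    (ex, ey), axes, ellipse_angle = cv2.fitEllipse(cnt)
    major_axis_length = max(axes)
```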

you could extend the DNN for object detection/instance segmentation to also directly infer the orientation (like it does the bounding box).

Heyo, thanks for the quick reply. I’ve just looked at segmentation, but one problem I think I’ll run into if I follow that route is: what if I have several items stacked on top of each other? I don’t think their YoloV8 segmentation can handle cases like that.

I don’t really see how segmentation could see 2 snacks stacked on top of each other as 2 different snacks, instead of 1 big snack. (Pic included)

instance segmentation is a special case of semantic segmentation.

semantic segmentation would just paint both those frikandellen as “frikandel”.

instance segmentation would hopefully recognize two, and paint both “pieces” of the bottom one as being the same instance.

early instance segmentation approaches first ran object detection, yielding bounding boxes, then built a mask for each detection. this is a “dumb” approach but it seems to work okay.

you don’t need instance segmentation, or any kind of segmentation. object detection should have given you two (overlapping) boxes for that picture. you would have trouble figuring out which one is the topmost one. instance segmentation could tell you which one is not occluded, so you can pick that one first.

for this case, I’d recommend training just object detection that doesn’t infer a bounding box, but instead predicts (rough sketch after the list):

  • both endpoints of the shape (requiring 2+2=4 scalar outputs)
    this implies the center point for picking and orientation+length of major axis
  • a useful estimate of the object’s occlusion
    it should tell you whether the object is “topmost” (good to pick) or not.
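in PyTorch terms, the extra head could be as small as this (purely a sketch; `backbone_channels` is a placeholder, and you’d have to wire it into your detector and its loss yourself):

```python
import torch
import torch.nn as nn

class EndpointHead(nn.Module):
    """Per-detection head: 2 endpoints (4 scalars) + 1 occlusion/topmost logit."""
    def __init__(self, backbone_channels: int):
        super().__init__()
        self.fc = nn.Linear(backbone_channels, 5)

    def forward(self, feats):                 # feats: (N, backbone_channels)
        out = self.fc(feats)
        endpoints = out[:, :4]                # (x1, y1, x2, y2), normalized coords
        topmost_logit = out[:, 4]             # sigmoid > 0.5 => "good to pick"
        return endpoints, topmost_logit

# from the two endpoints you get pick point and orientation directly:
# center = (p1 + p2) / 2, angle = atan2(y2 - y1, x2 - x1), length = |p2 - p1|
```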

other pick-and-place applications that are a lot more cluttered also iterate, i.e. they first detect topmost parts, pick all those, then detect again, with optional shuffling/shaking between runs.

of course, since you’re making frikandellen, you could cause the production line to not just throw them on the belt like Mikado sticks, but separate and align them mechanically.


I just noticed that you linked a PDF in your first post.

for several different types, you might want the network to infer an oriented bounding box consisting of

  • center point (2 scalars)
  • orientation of major axis (radians -\pi to +\pi, or \sin and \cos values?)
  • width (minor axis length) at least, so your gripper knows how much to open
  • length optional

(and the encoding of the object classes)

this just requires a change in the last layer and whatever processes the outputs from that layer, and the training data needs adapting to this format.
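decoding such an output vector is only a couple of lines (sketch; assumes the layout from the list above, with the orientation encoded as sin/cos so there’s no wrap-around problem at ±π):

```python
import numpy as np

def decode_obb(out):
    """out: raw head output [cx, cy, sin_t, cos_t, width, length, class logits...]"""
    cx, cy = out[0], out[1]
    angle = np.arctan2(out[2], out[3])    # back to radians in (-pi, pi]
    width, length = out[4], out[5]
    class_id = int(np.argmax(out[6:]))
    return (cx, cy), angle, width, length, class_id
```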

in case objects touch, the gripper’s jaws would have to reach in and exactly hit the gap, so the items aren’t damaged.

perhaps the ability to shake the box would help. shake and look until at least one object is free-standing. that should also help with clearing up occlusion.