Box detection method?

Hello everyone :smiley:,

I’m working on a project where I have to count the number of drink crate boxes like shown in the image below:

Note: The aruco markers in the picture are irrelevant to this project.

Specifically, I would like my program to count the boxes for each type of drink, not only for this image but for varying numbers of boxes and types of drinks.

I have tried some methods like the Circle Hough Transform for detecting the bottle caps, but it runs into a lot of issues when I try it on different images.

Do you know any possible solution to this problem?

Thank you in advance!

magic (AI) solves everything.

good luck with stacks that you look directly down on. can’t see what’s beneath the top bottle crate.

if you needed to solve this without magic, you’d need different camera angles, with a decent view of the stacks’ sides.

if you knew the base of a stack is always in the same position, you could just detect the top of the topmost crate and then calculate with parallax (xy shift)… or with how large that rectangle appears (scaling because of z).
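The "scaling because of z" half of that idea, as a back-of-envelope pinhole sketch (all numbers here are hypothetical, real values would come from calibration): the top face of the topmost crate gets closer to an overhead camera by one crate height per crate, so its apparent size f·W/Z grows with the stack count.

```python
# Pinhole back-of-envelope: apparent width of the top crate face vs. stack
# height. All numbers are hypothetical -- real values come from calibration.
f_px = 800.0        # focal length in pixels
crate_w = 0.40      # crate width in meters
crate_h = 0.30      # crate height in meters
cam_z = 3.0         # camera height above the floor in meters

def apparent_width_px(n_crates: int) -> float:
    """Projected width (pixels) of the top face of a stack of n_crates."""
    z = cam_z - n_crates * crate_h   # distance camera -> top face
    return f_px * crate_w / z

widths = [apparent_width_px(n) for n in (1, 2, 3)]
print(widths)
```

Inverting it, n = (cam_z − f_px·crate_w / width_px) / crate_h recovers the crate count from a measured width, which is why a calibrated camera and a fixed base position matter.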

your camera has severe lens distortion so you’d benefit a LOT from calibrating it. I hear the easiest method today is to use charuco boards, because with those you don’t need to keep the whole board in view like you have to with checkerboards.

oh and those arucos there… if you don’t need the extra ID bits, just go with 4x4. gives about a factor of 6/4 more range for detection.

and maybe tape the business card of the nearest dentist to the sugary stuff there.

Thank you @crackwitz for your reply and your dental care tips :D.

What kind of AI/magic would you apply here? Training a CNN for detecting boxes? Or bottle caps?

Regarding the approach you mentioned with the parallax, could you explain it a bit more
so that I’m sure I understand correctly? Do you mean using a stereo camera looking at the
side of the boxes?

I get why you say that it’s better to have a 4x4 dictionary, but I saw a post
by one of the aruco developers in this stackoverflow thread recommending a certain 6x6 dictionary.

About the camera distortion, is it always that helpful to calibrate the camera in CV applications? Or is it only necessary when you want to detect objects with a certain shape, like markers?

I didn’t mention in the description that just a good approximation of the number of boxes for each type of drink is OK.

yes. all of that, in one network. the network obviously learns all the appearances of bottle crates, which includes empty ones, partially bottled ones, fully bottled ones, with and without liquids in them, with and without caps on the bottles, standing isolated, on top of a stack, in the middle of a stack, at the bottom of a stack, standing beside others, with all amounts of occlusion.

that’ll take some labeling. check the results every so often, and if you notice grave errors, turn those failure cases into more training data.

parallax: no, monocular is enough for that. consider the leftmost stacks in your picture. imagine them being 1/2/3 crates high. the rim of the top crate would “go leftward” the higher the stack is.
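Putting numbers on that monocular parallax (the geometry here is hypothetical, and a fixed base position is assumed): a rim at lateral offset X from the optical axis projects to x = f·X/Z, so as the stack grows and its top rim gets closer to an overhead camera (smaller Z), the rim’s image position slides outward.

```python
# Monocular parallax sketch: image position of a stack's top rim vs. height.
# All geometry is hypothetical; a fixed base position is assumed.
f_px = 800.0      # focal length in pixels
rim_x = -0.50     # rim offset left of the optical axis, in meters
cam_z = 3.0       # camera height above the floor, in meters
crate_h = 0.30    # crate height, in meters

def rim_image_x(n_crates: int) -> float:
    """Horizontal image coordinate (pixels) of the top rim of n crates."""
    z = cam_z - n_crates * crate_h
    return f_px * rim_x / z

xs = [rim_image_x(n) for n in (1, 2, 3)]
print(xs)   # moves further left (more negative) as the stack grows
```

Reading it backwards, a measured rim position x gives Z = f_px·rim_x / x, and n = (cam_z − Z) / crate_h gives the crate count.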

3D data (stereo vision, kinect, time of flight) would transform your problem. now you’d have point clouds. you could literally measure the height of each stack. same as the crates, this is a tall stack of technology, so there are plenty of chances for things to degrade or fail at every level. if you choose to investigate this path, I’d recommend against building stereo vision yourself. just get an RGBD sensor and start messing with the point cloud data.
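A toy sketch of the "literally measure the height of each stack" idea, with a synthetic numpy cloud standing in for RGBD data (the 0.30 m crate height is a made-up number):

```python
import numpy as np

rng = np.random.default_rng(0)
CRATE_H = 0.30   # hypothetical crate height in meters

# Synthetic point cloud: floor at z ~ 0, top of a 3-crate stack at z ~ 0.9,
# both with a little sensor noise. A real cloud would come from an RGBD sensor.
floor = np.column_stack([
    rng.uniform(0, 2, 2000), rng.uniform(0, 2, 2000),
    rng.normal(0.0, 0.005, 2000),
])
stack_top = np.column_stack([
    rng.uniform(0.5, 0.9, 500), rng.uniform(0.5, 0.9, 500),
    rng.normal(3 * CRATE_H, 0.005, 500),
])
cloud = np.vstack([floor, stack_top])

# Height estimate: robust top minus robust floor (percentiles beat max/min
# because stray outlier points are common in real depth data).
floor_z = np.percentile(cloud[:, 2], 5)
top_z = np.percentile(stack_top[:, 2], 50)  # points over the stack's footprint
height = top_z - floor_z
n_crates = round(height / CRATE_H)
print(n_crates)
```

On real data the hard part is segmenting each stack's footprint before taking the top percentile; here the stack points are known by construction.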

aruco… hm well ok, I always hate to bow to authority so I’m disinclined to take his claims at face value (or to precisely analyze what he meant and what he didn’t mean, and what you derived from it), especially since they don’t mesh with my practical experience with machine vision (I did gain more range from exchanging 6x6 for 4x4). when one needs to see arucos at 10-20 meters distance, with an average camera, and the codes have to fit on a sheet of paper, they’ll appear to be a few tens of pixels long on a side, and then Nyquist himself materializes from between the sample points and literally shakes your hand. when the camera can’t even resolve the modules (to borrow a QR code term), no imaginable algorithm can do shit about it. enough of that tangent. you appear to deal with comfortably close distances here.

lens distortion is an issue if you need accurate quantification of lengths. for just object detection, it hardly matters.