yes. all of that, in one network. the network obviously learns all the appearances of bottle crates: empty ones, partially filled ones, full ones, with and without liquid in the bottles, with and without caps on the bottles, standing isolated, at the top of a stack, in the middle of a stack, at the bottom of a stack, standing beside others, with all amounts of occlusion.
that’ll take some labeling. check the results now and then, and if you notice grave errors, turn those cases into more training data.
parallax: no, monocular is enough for that. consider the leftmost stacks in your picture and imagine them being 1/2/3 crates high: the rim of the top crate would “go leftward” the higher the stack is.
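to make that cue concrete, here’s a toy pinhole-camera calculation. all the numbers (camera 3 m up, pitched 30° down, 0.3 m crates, a stack 1 m to the left and 4 m ahead) are made-up assumptions for illustration, not anything from your setup:

```python
import numpy as np

# toy pinhole model for the monocular height cue.
# assumed geometry, purely illustrative.
f_px = 800.0                     # focal length in pixels
cam_h = 3.0                      # camera height above the floor (m)
pitch = np.radians(30.0)         # camera pitched down by 30 degrees
crate_h = 0.30                   # height of one crate (m)
X, D = -1.0, 4.0                 # stack offset: 1 m left, 4 m ahead

def rim_x(n_crates):
    """image x-coordinate (px) of the top rim of a stack of n crates."""
    Yw = n_crates * crate_h                          # rim height above floor
    dy = Yw - cam_h                                  # height relative to camera
    z_cam = D * np.cos(pitch) - dy * np.sin(pitch)   # depth along optical axis
    return f_px * X / z_cam                          # perspective projection

xs = [rim_x(n) for n in (1, 2, 3)]
print(xs)  # x gets more negative: the rim drifts leftward as the stack grows
```

the taller stack’s rim is closer to the camera (smaller depth), so its lateral offset projects further out. that’s the whole cue.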
3D data (stereo vision, kinect, time of flight) would transform your problem. now you’d have point clouds and could literally measure the height of each stack. same as the crates, this is a tall stack of technology, so there is plenty of opportunity for things to degrade or fail at every level. if you choose to investigate this path, I’d recommend against building stereo vision yourself. just get an RGBD sensor and start messing with the point cloud data.
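“literally measure the height” looks roughly like this. synthetic points stand in for a real sensor frame; the footprint coordinates, floor model, and 0.3 m crate height are all assumptions:

```python
import numpy as np

# sketch: stack height straight from an RGBD point cloud (units: metres).
# synthetic cloud: a flat floor plus one 3-crate stack (~0.9 m tall).
rng = np.random.default_rng(0)
floor = rng.uniform([-2, -2, 0.00], [2, 2, 0.02], (5000, 3))
stack = rng.uniform([0.2, 0.2, 0.0], [0.6, 0.6, 0.90], (2000, 3))
cloud = np.vstack([floor, stack])

# crop to the stack's (assumed known) footprint, then take robust heights
roi = cloud[(cloud[:, 0] > 0.2) & (cloud[:, 0] < 0.6) &
            (cloud[:, 1] > 0.2) & (cloud[:, 1] < 0.6)]
floor_z = np.percentile(cloud[:, 2], 1)   # robust floor estimate
top_z = np.percentile(roi[:, 2], 99)      # robust top-of-stack estimate
height = top_z - floor_z
crates = round(height / 0.30)             # known crate height -> count
print(height, crates)
```

percentiles instead of min/max so a few fliers in the cloud don’t wreck the measurement. with a real sensor the floor would come from plane fitting, not a percentile, but the idea is the same.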
aruco… hm, well, ok. I always hate to bow to authority, so I’m disinclined to take his claims at face value (or to precisely analyze what he meant, what he didn’t mean, and what you derived from it), especially since they don’t mesh with my practical experience with machine vision (I did gain more range from exchanging 6x6 for 4x4). when one needs to see arucos at 10-20 meters distance, with an average camera, and the codes have to fit on a sheet of paper, they’ll appear to be a few tens of pixels long on a side, and then Nyquist himself materializes from between the sample points and literally shakes your hand. when the camera can’t even resolve the modules (to borrow a QR code term), no imaginable algorithm can do shit about it. enough of that tangent. you appear to deal with comfortably close distances here.
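the back-of-envelope behind that, with assumed numbers (1920 px sensor width, 60° horizontal FOV, a ~15 cm marker on a sheet of paper, 15 m range):

```python
import math

# how many pixels does one aruco module get at range?
# all numbers are illustrative assumptions.
img_w_px, hfov_deg = 1920, 60.0
marker_m, dist_m = 0.15, 15.0

f_px = (img_w_px / 2) / math.tan(math.radians(hfov_deg / 2))
marker_px = f_px * marker_m / dist_m       # marker side length in pixels

# a 4x4 dictionary is 6 modules across (incl. black border), a 6x6 is 8
px_per_module_4x4 = marker_px / 6
px_per_module_6x6 = marker_px / 8
print(marker_px, px_per_module_4x4, px_per_module_6x6)
```

the whole marker lands under ~20 px wide, so each module gets only two-ish pixels, and the 4x4 dictionary gets noticeably more per module than the 6x6. that’s the mechanism behind the range gain from switching dictionaries, and behind Nyquist’s handshake.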
lens distortion is an issue if you need accurate quantification of lengths. for just object detection, it hardly matters.
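to put a number on “hardly matters”: here is the dominant radial term of the Brown distortion model, with an assumed focal length and a moderate barrel coefficient (both made up):

```python
# rough magnitude of radial distortion (Brown model, k1 term only).
# f_px and k1 are assumed values for illustration.
f_px, k1 = 800.0, -0.2

def shift_px(r_norm):
    """pixel displacement of a point at normalized radius r_norm from center."""
    return f_px * k1 * r_norm ** 3

print(shift_px(0.1), shift_px(0.6))
```

near the center the shift is sub-pixel; near the edge it’s tens of pixels. a detector’s bounding box shrugs that off, but a length measured across the frame would be visibly wrong, hence: calibrate and undistort only if you need to measure.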