I am working on a system on a Raspberry Pi that uses a YOLO instance segmentation model to classify different foods and mask them with OpenCV. I then want to use the detected classes to estimate the weight of each food and add it to a running total. The camera will most likely be mounted top-down, so I am curious what the best way to estimate the depth of each food is. Currently, my code only uses the 2D mask: it takes the top-down mask of each object and estimates the weight from the area of that mask, roughly as in the sketch below. This isn't accurate because depth is missing, and I need the system to be as accurate as possible.
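For reference, this is roughly what the current area-only estimate looks like. It is a simplified sketch assuming the Ultralytics YOLO API; the class names and the grams-per-pixel calibration numbers are placeholders, not real values:

```python
import numpy as np
from ultralytics import YOLO

# Placeholder calibration: grams per mask pixel for each class, which would be
# measured empirically with a kitchen scale. These numbers are made up.
GRAMS_PER_PIXEL = {
    "scrambled_eggs": 0.05,
    "tater_tots": 0.04,
    "french_toast": 0.06,
}

model = YOLO("yolov8n-seg.pt")  # segmentation checkpoint (placeholder path)

def frame_weight_estimate(frame):
    """Estimate total weight in a frame from 2D mask areas only (no depth)."""
    total_grams = 0.0
    result = model(frame)[0]
    if result.masks is None:
        return total_grams
    for mask, cls_idx in zip(result.masks.data, result.boxes.cls):
        class_name = model.names[int(cls_idx)]
        # Each mask is an (H, W) tensor; its pixel count is the 2D area.
        area_px = int(mask.sum().item())
        total_grams += area_px * GRAMS_PER_PIXEL.get(class_name, 0.0)
    return total_grams
```

The part I want to replace is the `area_px * GRAMS_PER_PIXEL` step, since it ignores how tall the food is.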
What are some low-cost but effective ways to estimate the volume of each food rather than just its area? There could be multiple foods on one plate in the frame, each with a different shape and size. They will most likely be breakfast foods: scrambled eggs, tater tots, french toast, etc.
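To make concrete what I'm after: if I had some way to get a per-pixel height map of the plate (the part I'm asking about), the volume step itself seems straightforward. Something like the sketch below is what I have in mind, where the pixel footprint size and the density table are placeholder values:

```python
import numpy as np

# Placeholder calibration values, all made up: the physical footprint of one
# pixel at the plate's distance, and an approximate density per food class.
PIXEL_AREA_CM2 = 0.01            # cm^2 covered by one pixel on the plate
DENSITY_G_PER_CM3 = {
    "scrambled_eggs": 0.55,
    "tater_tots": 0.60,
    "french_toast": 0.45,
}

def mask_weight_from_height(mask, height_map_cm, class_name):
    """Estimate weight by integrating food height over the 2D mask.

    mask:          boolean (H, W) array from the segmentation model.
    height_map_cm: (H, W) array of food height above the plate, in cm,
                   from whatever depth source I end up using.
    """
    # Volume = sum over mask pixels of (pixel footprint area * height).
    volume_cm3 = float(np.sum(height_map_cm[mask]) * PIXEL_AREA_CM2)
    return volume_cm3 * DENSITY_G_PER_CM3.get(class_name, 0.5)
```

So the open question is really how to get that height map (or an equivalent volume estimate) cheaply on a Raspberry Pi.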