Computer vision algorithms on a microcontroller

I’m trying to do the following on an NXP i.MX RT1050:

1. Calculating a depth map
2. Image stitching and a panorama view

The current problem is that I can’t find a ready-made library, so I have to write everything from scratch. I searched for OpenCV-like libraries but couldn’t find any on the internet. What would be the best approach for tackling these constraints and writing the algorithms?

That’s a microcontroller btw

Hi @andreahmed, and welcome to the OpenCV forum!

There is another question about OpenCV on MCUs with an interesting article.

As far as I know, if you can install Linux, you can compile OpenCV. Using Embox you can avoid using the heavy Linux kernel.

While compiling OpenCV, you should pick only the modules you need to reduce the compiled binary size.
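For example, OpenCV's CMake build has a BUILD_LIST option for exactly this. A minimal sketch (the module list is only an example, and cross-compiling for your target would also need a toolchain file):

```
# build a static OpenCV with only the modules you actually use
# (core/imgproc/calib3d are example choices, not a recommendation)
cmake -DCMAKE_BUILD_TYPE=Release \
      -DBUILD_SHARED_LIBS=OFF \
      -DBUILD_LIST=core,imgproc,calib3d \
      -DBUILD_TESTS=OFF -DBUILD_PERF_TESTS=OFF -DBUILD_EXAMPLES=OFF \
      ../opencv
```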

does your device have cameras attached? how are they attached, MIPI CSI?

consult with NXP. what you need is “Digital Signal Processing” basically, acceleration for 2D image manipulation specifically. I see that it has something like that. you’ll want to use those specific resources.

I don’t see that your device contains a hardware video codec. that’s unfortunate. a video codec contains motion estimation, which is block matching, which is the core algorithm in the calculation of disparity maps (stereo vision).

all the tasks you mentioned require enormous amounts of calculation. be prepared for things to crawl.

Thanks so much for your answer, it’s very informative.
Yes, it has MIPI CSI. I’m going to use the NXP i.MX RT1050.
Can you elaborate on how motion estimation relates to block matching and depth measurement?

i.MX 8 Family Applications Processor | Arm Cortex-A53/A72/M4 | NXP Semiconductors
How would the 2D GPU help with the tasks I want to do, like depth and image stitching?

that thing says it has “a powerful vision pipeline”. that’s good.

I see mention of OpenGL and shaders and “compute units”. that’s very promising.

2D operations help because all you have is images, which are 2D. stereo vision is all about warping and resampling images and comparing parts of images.

convolution/correlation is the most (computationally) expensive component in stereo vision. it’s a dumb operation though, so it is ideally suited for a GPU. CPU cores come with complexity that is wasted on such simple number crunching. that device having multiple CPU cores is nice, but the GPU is probably more important for your application.
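to illustrate what I mean by “dumb”, here is the whole operation as naive C++ (just a sketch). every output pixel depends only on its input neighborhood, so all of them can be computed in parallel, which is exactly what a GPU is built for:

```cpp
#include <cstdint>
#include <vector>

// naive 2D correlation: slide a k x k kernel (k odd) over an 8-bit image.
// each output pixel is an independent multiply-accumulate loop.
std::vector<float> correlate(const std::vector<uint8_t>& img, int w, int h,
                             const std::vector<float>& kernel, int k)
{
    std::vector<float> out(w * h, 0.0f);
    const int r = k / 2; // kernel radius
    for (int y = r; y < h - r; ++y)
        for (int x = r; x < w - r; ++x) {
            float acc = 0.0f;
            for (int ky = -r; ky <= r; ++ky)
                for (int kx = -r; kx <= r; ++kx)
                    acc += kernel[(ky + r) * k + (kx + r)]
                         * img[(y + ky) * w + (x + kx)];
            out[y * w + x] = acc; // independent of all other outputs
        }
    return out;
}
```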

I see H.264 encode. good. so that’s in principle powerful enough to estimate motion (optical flow), at least using a “fixed function”.

all commonly used video codecs work by encoding how parts of a picture move from frame to frame. decoding is easy, just move the parts and make the block edges disappear. encoding is hard because for every part of the picture you have to look where it came from in the previous frame. that is block matching. video codecs do that within a limited range, but they have to look in all directions.

stereo vision only requires looking along one axis, so that’s “cheaper”.
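a minimal sketch of that one-axis search in C++, assuming rectified 8-bit grayscale images in row-major layout (block size and search range are arbitrary example values; a real implementation would add cost aggregation, sub-pixel refinement and SIMD):

```cpp
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// SAD block matching along one axis for a rectified stereo pair.
// for each pixel in the left image, search a limited horizontal range
// in the right image and keep the shift (disparity) with the lowest
// sum of absolute differences.
std::vector<int> disparity_map(const std::vector<uint8_t>& left,
                               const std::vector<uint8_t>& right,
                               int w, int h,
                               int block = 7, int max_disp = 64)
{
    std::vector<int> disp(w * h, 0);
    const int r = block / 2;
    for (int y = r; y < h - r; ++y) {
        for (int x = r; x < w - r; ++x) {
            int best_d = 0;
            long best_cost = std::numeric_limits<long>::max();
            // search only leftward along the same row (one axis)
            for (int d = 0; d <= max_disp && x - d >= r; ++d) {
                long cost = 0;
                for (int by = -r; by <= r; ++by)
                    for (int bx = -r; bx <= r; ++bx)
                        cost += std::abs(
                            int(left [(y + by) * w + (x + bx)]) -
                            int(right[(y + by) * w + (x - d + bx)]));
                if (cost < best_cost) { best_cost = cost; best_d = d; }
            }
            disp[y * w + x] = best_d;
        }
    }
    return disp;
}
```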

the disparity map says for every point in one eye, how many pixels to go left/right to find that point in the other eye. knowing that disparity, the distance between eyes, and some camera parameters, you can calculate the distance of that point (triangulation). now that is a depth map because it contains depth expressed in some unit of length.
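in code, that triangulation is a one-liner. a sketch, assuming the focal length (in pixels) and the baseline come from your camera calibration; the names are just illustrative:

```cpp
// depth from disparity for a rectified stereo pair: Z = f * B / d
float depth_from_disparity(float disparity_px, // from the disparity map
                           float focal_px,     // focal length, in pixels
                           float baseline_m)   // distance between cameras
{
    if (disparity_px <= 0.0f) return 0.0f; // no match / point at infinity
    return focal_px * baseline_m / disparity_px; // depth in meters
}
```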

granted, there’s probably no good way to re-purpose a (hardware) video codec’s motion vector output for this. one factor is that the codec doesn’t restrict its search to one axis, even if the input data would probably cause most motion vectors to fall along that axis anyway… eh, it’s an idea.

that was mostly to assess the computational power of the device. it having a GPU is a lot more indicative of whether you can do real-time stereo vision with it.

you should contact NXP. they’ll probably be happy to hear about your project and perhaps help you with specific advice, even what parts to select.

Thanks for your input. Very helpful.
Currently I’m tasked with doing depth measurements and calculating the depth map.
I have written a semi-global matching (SGM) algorithm that runs without OpenCV or other libraries.

The problem is that it takes a lot of computation: around 300 ms even after using SIMD and parallel processing. That’s for a 400×400 picture.

I would like to know: what tricks do people use to calculate a depth image and do image processing that fast on a 1080p image?

you haven’t shown that anyone does stereo vision on those devices. your question simply assumes it.

hardware video codecs use hardware, i.e. silicon. that means they don’t use the CPU. they have fixed functions “implemented” in silicon. it is trivial to implement massively parallel algorithms in silicon. that obviously runs a lot faster than using a general purpose CPU and making it do the same operations using its general structure.

I hope you understand the difference between a video codec and stereo vision. a video codec encodes video data into a data stream.

parts of it might be repurposed (abused) to do parts of stereo vision, if you’re willing to do research.

but there are many VR glasses out there that do stereo vision, and they do some ML and depth measurement, like using it for obstacle detection.

“VR glasses” themselves don’t “do” stereo vision. they show you two pictures. your eyes/brain perform the stereo vision.

some VR/AR headsets use one or more cameras to localize their own position in space. some of those use infrared beacons, which greatly simplifies the problem, because the cameras only see those IR beacons, not a whole scene.

depth sensors have many implementations. “two cameras” is one. “camera + projector” is what a Kinect uses.

ML, meaning Machine Learning, has no place in any of this. what makes you think it would? and why would “VR glasses” need obstacle detection? they don’t?

I can’t give you a course on this or correct any preconceptions you might have. you’ll have to do your own research.