Wow, this is an order of magnitude faster, thanks!
I employed both optimizations you suggested:
- I use the combination of cv::initUndistortRectifyMap (called once) and cv::remap (called each frame) instead of the original cv::undistort. This is the thread that helped me understand how it's supposed to be used: c++ - Right use of initUndistortRectifyMap and remap from OpenCV - Stack Overflow
- I use the T-API (meaning I replaced all Mats with UMats) so that OpenCL acceleration can kick in. (A rough sketch of both changes is below.)
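In case it helps someone else, here's roughly what my loop looks like now. The camera matrix and distortion coefficients below are just placeholder values, not my real calibration, and the capture source is whatever camera index 0 happens to be:

```cpp
#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/videoio.hpp>
#include <opencv2/highgui.hpp>

int main()
{
    cv::ocl::setUseOpenCL(true); // make sure the T-API OpenCL path is enabled

    // Placeholder intrinsics -- substitute the values from your own calibration.
    cv::Mat cameraMatrix = (cv::Mat_<double>(3, 3) <<
        800, 0, 320,
        0, 800, 240,
        0, 0, 1);
    cv::Mat distCoeffs = (cv::Mat_<double>(1, 5) << -0.2, 0.1, 0, 0, 0);

    cv::VideoCapture cap(0);
    if (!cap.isOpened())
        return 1;

    cv::Size frameSize(
        (int)cap.get(cv::CAP_PROP_FRAME_WIDTH),
        (int)cap.get(cv::CAP_PROP_FRAME_HEIGHT));

    // Done once: precompute the undistortion maps. Keeping them as UMat
    // lets cv::remap run through the OpenCL-accelerated path.
    cv::UMat map1, map2;
    cv::initUndistortRectifyMap(
        cameraMatrix, distCoeffs, cv::Mat(),
        cameraMatrix, frameSize, CV_16SC2, map1, map2);

    cv::UMat frame, undistorted;
    for (;;)
    {
        if (!cap.read(frame)) // reading straight into a UMat
            break;

        // Done every frame: just a remap, no per-frame map computation.
        cv::remap(frame, undistorted, map1, map2, cv::INTER_LINEAR);

        cv::imshow("undistorted", undistorted);
        if (cv::waitKey(1) == 27) // Esc to quit
            break;
    }
    return 0;
}
```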
The result is a perfectly smooth video feed. Massive thanks!
As for the university course… Yep, I remember pulling my hair out over why the project still ran so slowly, and not even the teacher was able to tell why. He just said that it should run pretty fast and that he couldn't see where the problem was. My solution was only about 4 times faster than the reference CPU version, while the best guy's was about 50 times faster. What I learned is that you have to build an intuition for that specific platform. All those different CUDA architectures, different cache hierarchies… even the communication speed with the ARM CPU would make a difference. So yeah, I get the point.