cv::undistort GPU acceleration

Wow, this is an order of magnitude faster, thanks!

I employed both optimizations you suggested:

The result is perfectly smooth video feed. Massive thanks!

As for the university course… Yep, I remember pulling my hair out over why the project still ran so slowly, and not even the teacher could tell why. He just said that it should run pretty fast and that he couldn't see where the problem was. My solution was only about 4 times faster than the reference CPU version, while the best guy's was about 50 times faster. What I learned is that you have to build an intuition for the specific platform: all those different CUDA architectures, different cache hierarchies… and the communication speed with the ARM CPU also makes a difference. So yeah, I get the point.