OpenCV Optical Flow CUDA Naive Implementation Slower than CPU

The naive implementation (the naive CUDA implementation without pre-allocation, streams, or other optimizations)

  1. allocates the return arrays on the GPU in every iteration, which is costly, and
  2. calls cudaDeviceSynchronize (a hard sync) on every iteration because you are not passing a CUDA stream. As a result, the timing will be off if you measured it with the code from that notebook, which uses CPU timers rather than GPU timers. Because of the synchronization, the measured time will include the launch latency (the time between calling optFlow.calc and its execution) of every kernel launch.
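
To make the second point concrete, here is a rough back-of-the-envelope model in Python of what a CPU timer would see in each case. It is only a sketch: the function names and the numbers (50 µs launch latency, 2000 µs of GPU work per call, 100 iterations) are made up for illustration and are not measurements. It also treats each optFlow.calc call as a single launch and assumes the host can queue work faster than the GPU executes it.

```python
def synced_total_us(n_iters, launch_latency_us, exec_us):
    """Host blocks after every call (hard sync each iteration),
    so every iteration pays launch latency plus execution time."""
    return n_iters * (launch_latency_us + exec_us)


def streamed_total_us(n_iters, launch_latency_us, exec_us):
    """Host queues all calls on a stream and synchronizes once at the end.
    Assumption: launches are issued faster than the GPU executes them,
    so only the first launch latency is exposed."""
    return launch_latency_us + n_iters * exec_us


# Hypothetical values, chosen only to illustrate the effect.
n_iters, launch_latency_us, exec_us = 100, 50, 2000

synced = synced_total_us(n_iters, launch_latency_us, exec_us)
streamed = streamed_total_us(n_iters, launch_latency_us, exec_us)

print("synced every iteration:", synced, "us")
print("single sync at the end:", streamed, "us")
print("latency included by the CPU timer:", synced - streamed, "us")
```

In this model the extra time grows linearly with the number of iterations, so the shorter the GPU work per call, the larger the share of the measured time that is really launch latency.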

That said, I have no idea whether the code will be faster on your RTX 3070 than on your Ryzen 7 2700.