Slow DFT with CUDA

It looks like NVIDIA changed something in the most recent versions of CUDA, see

and

When I profile the function, the kernels themselves are very quick, but more than 10 ms is spent in calls to cuModuleLoadData, cuModuleUnload, etc.

You could try reverting to CUDA 11.0 to see if that improves your performance.

I just ran the performance test

opencv_perf_cudaarithm.exe --gtest_filter=Sz_Flags_Dft.Dft/0

compiled against CUDA 11.0 and 11.4. The performance on CUDA 11.0 was roughly 22x better (by mean time, lower is better); results below:

CUDA 11.0
[ RUN ] Sz_Flags_Dft.Dft/0, where GetParam() = (1280x720, 0)
[ PERFSTAT ] (samples=13 mean=1.18 median=1.19 min=1.10 stddev=0.03 (2.9%))

CUDA 11.4
[ RUN ] Sz_Flags_Dft.Dft/0, where GetParam() = (1280x720, 0)
[ PERFSTAT ] (samples=25 mean=25.80 median=25.73 min=24.68 stddev=0.66 (2.5%))
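If you want to check the ratio between the two runs yourself, the sketch below parses the two PERFSTAT lines and divides the larger time by the smaller one. The helper names `parse_perfstat` and `slowdown` are made up for this example and are not part of OpenCV; it only assumes the `key=value` layout shown in the output above.

```python
import re

# Matches key=value pairs such as "mean=25.80" in a PERFSTAT line.
_FIELD = re.compile(r"(\w+)=([0-9]*\.?[0-9]+)")

def parse_perfstat(line):
    """Return the numeric fields of a PERFSTAT line as a dict of floats."""
    return {key: float(value) for key, value in _FIELD.findall(line)}

def slowdown(line_a, line_b, field="mean"):
    """Ratio of the larger to the smaller value of `field` across two runs."""
    a = parse_perfstat(line_a)[field]
    b = parse_perfstat(line_b)[field]
    return max(a, b) / min(a, b)

# The two PERFSTAT lines from the runs above, copied verbatim.
run_a = "[ PERFSTAT ] (samples=25 mean=25.80 median=25.73 min=24.68 stddev=0.66 (2.5%))"
run_b = "[ PERFSTAT ] (samples=13 mean=1.18 median=1.19 min=1.10 stddev=0.03 (2.9%))"

print(f"mean ratio:   {slowdown(run_a, run_b, 'mean'):.1f}x")
print(f"median ratio: {slowdown(run_a, run_b, 'median'):.1f}x")
```

The ratio is deliberately order-independent, so it does not matter which run you pass first.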

This may not be enough to make the CUDA variant faster than the CPU one on your system with the sizes you're using, but it should bring them closer together.
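To compare the two variants on your own machine, a wall-clock measurement is probably more telling than kernel timings, since it also captures host-side overhead such as the module load calls mentioned above. Below is a minimal sketch of such a harness. `median_ms` and the two placeholder functions are hypothetical; swap in your actual CPU and CUDA DFT calls, and make sure the CUDA one waits for the GPU to finish (for example by downloading the result) before returning.

```python
import statistics
import time

def median_ms(fn, repeats=20, warmup=1):
    """Median wall-clock time of fn() in milliseconds over `repeats` calls."""
    for _ in range(warmup):
        fn()  # discard warm-up calls (e.g. one-time context creation)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

# Hypothetical stand-ins so the sketch runs anywhere:
# replace these bodies with your real CPU and CUDA DFT calls.
def placeholder_cpu_dft():
    sum(i * i for i in range(20_000))

def placeholder_cuda_dft():
    sum(i * i for i in range(2_000))

cpu_ms = median_ms(placeholder_cpu_dft)
cuda_ms = median_ms(placeholder_cuda_dft)
print(f"CPU {cpu_ms:.3f} ms, CUDA {cuda_ms:.3f} ms, ratio {cpu_ms / cuda_ms:.2f}")
```

Using the median rather than the mean keeps a single slow call from skewing the comparison.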