Slow DFT with CUDA

Non-CUDA code runs much faster than the CUDA code…

		cv::dft(sourceComplexImage, dft, cv::DftFlags::DFT_COMPLEX_INPUT, 0);

vs

		cv::cuda::Stream Stream;
		cv::cuda::GpuMat gpudft;
		cv::cuda::GpuMat gpusourceComplexImage(sourceComplexImage);
		cv::cuda::dft(gpusourceComplexImage, gpudft, size, cv::DftFlags::DFT_COMPLEX_INPUT, Stream);

		Stream.waitForCompletion();
		gpudft.download(dft);

Any thoughts on why? I would expect the CUDA code to run much faster. The DFT image size is 640x360.

GPU is RTX2060.

The slowness is in the DFT call itself, not in the transfer to or from the CPU.

Aaron

Looks like NVIDIA changed something in the most recent versions of CUDA, see

and

When I profile the function, the kernels are super quick; however, it spends >10ms in calls to cuModuleLoadData, cuModuleUnload etc.

You could try reverting to CUDA 11.0 to see if that improves your performance.

I just ran the performance test

opencv_perf_cudaarithm.exe --gtest_filter=Sz_Flags_Dft.Dft/0

compiled against CUDA 11.0 and 11.4. The performance on CUDA 11.0 was roughly 22x better going by the mean times (lower is better), results below:

CUDA 11.0
[ RUN ] Sz_Flags_Dft.Dft/0, where GetParam() = (1280x720, 0)
[ PERFSTAT ] (samples=13 mean=1.18 median=1.19 min=1.10 stddev=0.03 (2.9%))

CUDA 11.4
[ RUN ] Sz_Flags_Dft.Dft/0, where GetParam() = (1280x720, 0)
[ PERFSTAT ] (samples=25 mean=25.80 median=25.73 min=24.68 stddev=0.66 (2.5%))

This may not be enough to make the CUDA variant faster than the CPU one on your system with the sizes you're using, but it should bring them closer together.
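
If you want to compare the two variants on your own data, a small timing harness along these lines may help. This is only a minimal sketch using the C++ standard library; `TimingResult`, `time_once_ms` and `time_cold_and_warm` are hypothetical names made up for illustration, not OpenCV API. It reports the first call separately from the later ones, so you can see whether the overhead is a one-time setup cost or is paid on every call.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical result type (not part of OpenCV).
struct TimingResult {
    double first_call_ms;  // first (cold) call, may include one-time setup
    double median_ms;      // median of the later (warm) calls
};

// Times a single invocation of fn in milliseconds.
template <typename Fn>
double time_once_ms(Fn& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Calls fn once cold, then `runs` more times, and summarises the timings.
template <typename Fn>
TimingResult time_cold_and_warm(Fn fn, int runs) {
    assert(runs > 0);

    TimingResult result{};
    result.first_call_ms = time_once_ms(fn);

    std::vector<double> warm;
    warm.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        warm.push_back(time_once_ms(fn));
    }

    // Middle element after sorting (the upper of the two for an even count).
    std::sort(warm.begin(), warm.end());
    result.median_ms = warm[warm.size() / 2];
    return result;
}
```

For the CUDA path, the callable you pass in should contain both the `cv::cuda::dft(...)` call and the `Stream.waitForCompletion()` from the question, since the work is issued on a stream and the timer could otherwise stop before the GPU has finished. If the warm median stays high as well as the first call, the cost is being paid on every call rather than once.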