compiled against CUDA 11.0 and 11.4. The performance on CUDA 11.4 was roughly 22x better (the PERFSTAT values are times, so lower is better), results below:
CUDA 11.0
[ RUN ] Sz_Flags_Dft.Dft/0, where GetParam() = (1280x720, 0)
[ PERFSTAT ] (samples=25 mean=25.80 median=25.73 min=24.68 stddev=0.66 (2.5%))
CUDA 11.4
[ RUN ] Sz_Flags_Dft.Dft/0, where GetParam() = (1280x720, 0)
[ PERFSTAT ] (samples=13 mean=1.18 median=1.19 min=1.10 stddev=0.03 (2.9%))
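If you want to compare your own runs the same way, the ratio can be read straight off the two PERFSTAT lines. This is only a minimal sketch: it assumes the log lines are pasted in as strings, and the helper name `parse_perfstat` is made up for illustration, not part of OpenCV.

```python
import re

def parse_perfstat(line):
    """Return the key=value fields of a PERFSTAT line as floats.

    Hypothetical helper, not an OpenCV API. The trailing "(2.5%)" has no
    key=value form, so the regex simply skips it.
    """
    return {key: float(value) for key, value in re.findall(r"(\w+)=([\d.]+)", line)}

# The two PERFSTAT lines from the runs above, copied verbatim.
cuda_11_0 = parse_perfstat(
    "[ PERFSTAT ] (samples=25 mean=25.80 median=25.73 min=24.68 stddev=0.66 (2.5%))"
)
cuda_11_4 = parse_perfstat(
    "[ PERFSTAT ] (samples=13 mean=1.18 median=1.19 min=1.10 stddev=0.03 (2.9%))"
)

# Ratio of the mean values reported for the two toolkit versions.
ratio = cuda_11_0["mean"] / cuda_11_4["mean"]
print(f"mean ratio 11.0 / 11.4: {ratio:.1f}x")
```

For the numbers above this comes out at a little under 22x.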
This may not be enough to make the CUDA variant faster than the CPU one on your system with the sizes you're using, but it should bring them closer together.