I wrote up a bug for the NVIDIA NPP…
These are the numbers I saw…
As you can see by my data the CPU version slows minimally based on kernel size but the GPU falls off a cliff in terms of speed after kernel size grows past 32.
At kernels size of 32x32 GPU and CPU are similar at larger CUDA falls apart. This is similar for many filter types.
Performance numbers using openCV CUDA performance test
Using CUDA (Via) NPP
[ RUN ] Sz_Type_KernelSz_Blur.Blur/55, where GetParam() = (1280x1024, 32FC1, 32)
[ PERFSTAT ] (samples=25 mean=2.37 median=2.34 min=2.33 stddev=0.07 (2.8%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/55 (67 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/56, where GetParam() = (1280x1024, 32FC1, 64)
[ PERFSTAT ] (samples=13 mean=8.72 median=8.68 min=8.38 stddev=0.25 (2.9%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/56 (120 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/57, where GetParam() = (1280x1024, 32FC1, 128)
[ PERFSTAT ] (samples=10 mean=33.78 median=33.61 min=33.25 stddev=0.60 (1.8%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/57 (343 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/58, where GetParam() = (1280x1024, 32FC1, 256)
[ PERFSTAT ] (samples=10 mean=132.97 median=132.77 min=131.68 stddev=0.77 (0.6%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/58 (1335 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/59, where GetParam() = (1280x1024, 32FC1, 512)
[ PERFSTAT ] (samples=10 mean=547.32 median=547.15 min=546.28 stddev=1.04 (0.2%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/59 (5479 ms)
Using CPU
[ RUN ] Sz_Type_KernelSz_Blur.Blur/55, where GetParam() = (1280x1024, 32FC1, 32)
[ PERFSTAT ] (samples=20 mean=2.16 median=2.13 min=2.12 stddev=0.06 (3.0%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/55 (47 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/56, where GetParam() = (1280x1024, 32FC1, 64)
[ PERFSTAT ] (samples=19 mean=2.22 median=2.19 min=2.19 stddev=0.07 (2.9%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/56 (45 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/57, where GetParam() = (1280x1024, 32FC1, 128)
[ PERFSTAT ] (samples=18 mean=2.66 median=2.64 min=2.57 stddev=0.08 (3.0%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/57 (51 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/58, where GetParam() = (1280x1024, 32FC1, 256)
[ PERFSTAT ] (samples=13 mean=3.48 median=3.48 min=3.38 stddev=0.06 (1.7%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/58 (48 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/59, where GetParam() = (1280x1024, 32FC1, 512)
[ PERFSTAT ] (samples=13 mean=4.66 median=4.65 min=4.55 stddev=0.08 (1.7%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/59 (64 ms)