Blur is not a member of cv::cuda

Have you checked using the exact same parameters? Sometimes that can make a big difference. It still sounds like the way you measure the timing may be off: hundreds of ms sounds a lot like one of the initialization runs, and/or a bug in the way you are calculating your timing.

Not a bug in the timing calculation. It was a mistake in my understanding of the CUDA implementation vs. the non-CUDA one.

For kernels smaller than 32x32 the CUDA implementation is faster; once you get above 32x32 (the test I had was 500x500) the CUDA implementation is far slower.

I modified the OpenCV performance tests.

500x500 CUDA blur (boxfilter) is…
[ RUN ] Sz_Type_KernelSz_Blur.Blur/23, where GetParam() = (1280x720, 32FC1, 500)
[ PERFSTAT ] (samples=10 mean=347.66 median=347.28 min=346.61 stddev=0.99 (0.3%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/23 (3480 ms)

By comparison blur non-CUDA…

[ RUN ] Size_MatType_BorderType_blur500x500.blur500x500/39, where GetParam() = (1280x720, 32FC1, BORDER_REFLECT101)
[ PERFSTAT ] (samples=10 mean=3.53 median=3.50 min=3.41 stddev=0.10 (2.9%))
[ OK ] Size_MatType_BorderType_blur500x500.blur500x500/39 (38 ms)

This was my confusion: I was using a 500x500 kernel in my code. Now that I am doing an apples-to-apples comparison, I can see the difference between CUDA and non-CUDA.

One other minor note: I have an RTX 2060, not an RTX 2070.

Thanks for everyone’s help. But it looks like I tried to move the wrong bit of code into the GPU. We often have large kernels.

a simple box blur can be implemented to scale the same on a GPU as on a CPU.

a general filter (with interesting values in the kernel) is more complex than a box filter (filter is all ones, scaled).

I don’t know who wrote what code for CUDA there. if you need this to run fast, it’s possible.

I am not sure that this would be the case. Although I can’t see the implementation of boxfilter because it uses NPP under the hood, the normal procedure with CUDA for a filter of size 3 would be to load a neighbourhood of pixels, say 34x34 (a 32x32 tile plus a one-pixel border), into shared memory (a user-managed cache) that can be accessed by, say, 32x32 threads. Each thread can then read the region the filter operates on from the cache instead of from global device memory, which makes the operation very quick. The only downside is that shared memory is limited: its size depends on the device and the block size, but it will always hold significantly fewer than 500*500 pixels. Therefore I think the scaling would drop off as soon as shared memory is exhausted and threads have to load from global device memory.
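To put rough numbers on that limit, here is a small sketch of the tile-size arithmetic. The 48 KiB per-block budget, the 32x32 block and the 4-byte float pixels are assumptions for illustration; the real limit depends on the GPU and launch configuration, and nothing here reflects what NPP actually does.

```python
BYTES_PER_FLOAT = 4
SHARED_BYTES = 48 * 1024  # assumed per-block shared memory budget (device dependent)
BLOCK = 32                # assumed 32x32 thread block


def tile_bytes(k, block=BLOCK):
    """Shared memory needed for a block's pixels plus the border a k x k filter reads."""
    side = block + k - 1
    return side * side * BYTES_PER_FLOAT


def largest_kernel_that_fits(shared=SHARED_BYTES, block=BLOCK):
    """Largest k whose tile still fits in the assumed budget."""
    k = 1
    while tile_bytes(k + 1, block) <= shared:
        k += 1
    return k


for k in (3, 32, 500):
    print(k, tile_bytes(k), tile_bytes(k) <= SHARED_BYTES)
print("largest k that fits:", largest_kernel_that_fits())
```

Under these assumptions a 500x500 kernel would need over a megabyte per block, far beyond the budget, so a tiled shared-memory kernel cannot cover it in one go.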

I would have been happy if it was the same speed or faster, thereby relieving the CPU of the workload. But this was not what I expected.

I am. you forget, we discuss this in terms of the kernel size k, which varies not just around 3, but up to 500 in this instance.

a box blur is a separable filter, and because its filter shape is so trivial, it can even be calculated from an Integral Image. from an integral image, arbitrary box sizes are always just four lookups per pixel, O(4) = O(1).

calculating an integral image is two cumulative sums. source → per-row rightward → per-column downward (unless I’m mistaken), done. that should be trivial to parallelize and cheap to calculate. it’s O(2mn) = O(mn), and with the four lookups at each of the m·n pixels the total is O(2mn + 4mn) = O(mn).
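The two cumulative-sum passes can be sketched in plain Python as below. This is only an illustration on nested lists, not the OpenCV or CUDA implementation; the extra zero row and column are a convention chosen here to avoid special cases at the top and left edges.

```python
def integral_image(img):
    """Return an (m+1) x (n+1) summed-area table for an m x n list of lists.

    ii[y][x] holds the sum of all img[j][i] with j < y and i < x.
    """
    m, n = len(img), len(img[0])
    ii = [[0] * (n + 1) for _ in range(m + 1)]
    # Pass 1: cumulative sum along each row (rightward).
    for y in range(m):
        run = 0
        for x in range(n):
            run += img[y][x]
            ii[y + 1][x + 1] = run
    # Pass 2: cumulative sum down each column.
    for x in range(1, n + 1):
        for y in range(2, m + 1):
            ii[y][x] += ii[y - 1][x]
    return ii


img = [[1, 2, 3],
       [4, 5, 6]]
ii = integral_image(img)
for row in ii:
    print(row)
```

Each pass touches every pixel once, which is where the O(mn) cost comes from; the rows in pass 1 and the columns in pass 2 are independent of each other, so each pass parallelizes naturally.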

if the CUDA implementation doesn’t work with that information, it may just do a naive sum of the neighborhood for every pixel, which is O(k^2) per pixel or O(k^2 m n) total. that is an absolute waste of calculation.

different runtime complexity classes.

I don’t know what the CUDA box filter actually does under the hood. it may be that dumb, or it may be smarter, but I doubt that.

prediction: double the filter size, naive algorithm’s runtime will quadruple. filter based on integral image should stay the same.
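That prediction follows from simple operation counts. The sketch below is only a cost model (it counts additions and lookups, not memory traffic or launch overhead), and the function names are made up for illustration.

```python
def naive_cost(m, n, k):
    """Additions when a k x k neighbourhood is summed from scratch at every pixel."""
    return m * n * k * k


def integral_cost(m, n, k):
    """Two cumulative-sum passes plus four lookups per pixel; k does not appear."""
    return 2 * m * n + 4 * m * n


m, n = 720, 1280
for k in (32, 64, 128):
    print(k, naive_cost(m, n, k), integral_cost(m, n, k))
```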

of course, the runtime of the naive algorithm may be shorter than the one for the integral image for very small kernel sizes. in OpenCV there are a bunch of those situations, and the API call uses one or the other impl. depending on parameters and input sizes.

Apologies, of course I wasn’t considering the type of filter. In that case, as you said, it should scale fairly well: probably really fast for small filters, since there is less transfer from global memory, and then only slightly slower for large filters, as the transfer increases to a maximum of 4 times the thread block size and you have to pre-compute the integral image.

This operation (both with the naive and integral image approach) should be completely memory bound, meaning that small filters which fit in shared memory, requiring less transfer from global memory should be as quick using the naive approach as the integral image.

Looking at the trace for the npp functions I would be 99% sure they are naive implementations, firstly due to the name of the kernels:

ForEachPixelNaive<float, (int)1, FilterBoxReplicateBorder3x3SharedFunctor
ForEachPixelNaive<float, (int)1, FilterBoxReplicateBorder5x5SharedFunctor

for filter sizes 3 and 5 respectively. As the names suggest, both use shared memory and are therefore probably the classic approach, which is faster for small filters. Then, for filters of size 7 and above:

ForEachPixelNaive<float, (int)1, FilterBoxReplicateBorderFloatFunctor

which doesn’t use shared memory and, at a guess from the timings for large filters, uses the naive approach with global memory reads for each operation.

I’ve never looked into writing my own CUDA kernel for use with OpenCV… but at least there’s cv::cuda::integral so that should give a very nice basis for a very trivial kernel function

pseudocode:

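def kernel(x, y, ii, k):
    # box sum around (x, y) from the integral image ii: four lookups, whatever k is
    # (divide by the box area to turn the sum into the blurred value)
    return ii[y+k, x+k] + ii[y-k, x-k] - ii[y+k, x-k] - ii[y-k, x+k]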

with appropriate border handling for the lookups (“replicate”/clip mode is suitable).
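A runnable plain-Python version of that lookup, with the coordinates clipped to the image, might look like this. It is a sketch under assumptions: `r` is a half-width (window of 2r+1 pixels per side), the integral image carries an extra zero row and column, and clipping averages over the part of the window inside the image, which is not the same as OpenCV's BORDER_REPLICATE. The function names are illustrative, not an OpenCV API.

```python
def integral_image(img):
    """(m+1) x (n+1) summed-area table with a zero first row and column."""
    m, n = len(img), len(img[0])
    ii = [[0] * (n + 1) for _ in range(m + 1)]
    for y in range(m):
        run = 0
        for x in range(n):
            run += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + run
    return ii


def box_mean(ii, x, y, r):
    """Mean over the (2r+1) x (2r+1) window centred on (x, y), clipped to the image."""
    m, n = len(ii) - 1, len(ii[0]) - 1
    x0, x1 = max(x - r, 0), min(x + r + 1, n)  # half-open column range
    y0, y1 = max(y - r, 0), min(y + r + 1, m)  # half-open row range
    total = ii[y1][x1] + ii[y0][x0] - ii[y1][x0] - ii[y0][x1]
    return total / ((x1 - x0) * (y1 - y0))


def naive_mean(img, x, y, r):
    """Reference: sum the clipped window directly, O(r^2) per pixel."""
    m, n = len(img), len(img[0])
    vals = [img[j][i]
            for j in range(max(y - r, 0), min(y + r + 1, m))
            for i in range(max(x - r, 0), min(x + r + 1, n))]
    return sum(vals) / len(vals)


img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]
ii = integral_image(img)
print(box_mean(ii, 1, 1, 1), naive_mean(img, 1, 1, 1))
```

In a CUDA kernel each thread would evaluate `box_mean` for one pixel, so the per-pixel work stays at four reads regardless of the window size.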

I wrote up a bug for the NVIDIA NPP…

These are the numbers I saw…

As you can see from my data, the CPU version slows minimally as the kernel size grows, but the GPU falls off a cliff in terms of speed once the kernel size grows past 32.

At a kernel size of 32x32, GPU and CPU are similar; at larger sizes CUDA falls apart. This is similar for many filter types.

Performance numbers using the OpenCV CUDA performance test

Using CUDA (via NPP)

[ RUN ] Sz_Type_KernelSz_Blur.Blur/55, where GetParam() = (1280x1024, 32FC1, 32)
[ PERFSTAT ] (samples=25 mean=2.37 median=2.34 min=2.33 stddev=0.07 (2.8%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/55 (67 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/56, where GetParam() = (1280x1024, 32FC1, 64)
[ PERFSTAT ] (samples=13 mean=8.72 median=8.68 min=8.38 stddev=0.25 (2.9%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/56 (120 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/57, where GetParam() = (1280x1024, 32FC1, 128)
[ PERFSTAT ] (samples=10 mean=33.78 median=33.61 min=33.25 stddev=0.60 (1.8%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/57 (343 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/58, where GetParam() = (1280x1024, 32FC1, 256)
[ PERFSTAT ] (samples=10 mean=132.97 median=132.77 min=131.68 stddev=0.77 (0.6%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/58 (1335 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/59, where GetParam() = (1280x1024, 32FC1, 512)
[ PERFSTAT ] (samples=10 mean=547.32 median=547.15 min=546.28 stddev=1.04 (0.2%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/59 (5479 ms)

Using CPU

[ RUN ] Sz_Type_KernelSz_Blur.Blur/55, where GetParam() = (1280x1024, 32FC1, 32)
[ PERFSTAT ] (samples=20 mean=2.16 median=2.13 min=2.12 stddev=0.06 (3.0%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/55 (47 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/56, where GetParam() = (1280x1024, 32FC1, 64)
[ PERFSTAT ] (samples=19 mean=2.22 median=2.19 min=2.19 stddev=0.07 (2.9%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/56 (45 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/57, where GetParam() = (1280x1024, 32FC1, 128)
[ PERFSTAT ] (samples=18 mean=2.66 median=2.64 min=2.57 stddev=0.08 (3.0%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/57 (51 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/58, where GetParam() = (1280x1024, 32FC1, 256)
[ PERFSTAT ] (samples=13 mean=3.48 median=3.48 min=3.38 stddev=0.06 (1.7%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/58 (48 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/59, where GetParam() = (1280x1024, 32FC1, 512)
[ PERFSTAT ] (samples=13 mean=4.66 median=4.65 min=4.55 stddev=0.08 (1.7%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/59 (64 ms)
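
As a quick check of how these numbers scale, the snippet below takes the PERFSTAT means posted above (copied by hand into two dicts) and computes the slowdown each time the kernel size doubles. A ratio near 4 is what an O(k^2)-per-pixel algorithm would show; a ratio near 1 suggests a cost that is nearly independent of k.

```python
# PERFSTAT mean times in ms for 1280x1024 32FC1, keyed by kernel size,
# copied from the logs above.
gpu_ms = {32: 2.37, 64: 8.72, 128: 33.78, 256: 132.97, 512: 547.32}
cpu_ms = {32: 2.16, 64: 2.22, 128: 2.66, 256: 3.48, 512: 4.66}


def doubling_ratios(times):
    """Ratio of each mean time to the mean time at half the kernel size."""
    ks = sorted(times)
    return [round(times[b] / times[a], 2) for a, b in zip(ks, ks[1:])]


print("GPU:", doubling_ratios(gpu_ms))
print("CPU:", doubling_ratios(cpu_ms))
```

The GPU ratios come out close to 4 at every doubling, while the CPU ratios stay well below 1.5, which is consistent with a naive per-pixel sum on the GPU side.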