Blur is not a member of cv::cuda

are you measuring a single iteration of this?
(kernels need to be compiled, caches warmed, etc)

are there more gpu ops in your pipeline?
(up/downloading between cpu/gpu is expensive)

This link was the original…

In there they used the old name…
cv::gpu::blur(gpuImg0, gpuImage0Blurred, cv::Size(7, 7), cv::Point(-1, -1), stream);

Here are some more details…
The non-CUDA version of cv::blur took ~5 ms.
The CUDA version (blurfilter->apply) took ~450 ms.

The images are 720x1280 CV_32FC1

how exactly do you measure this?

a “simple call” to a CUDA function (that runs on the GPU) is not comparable to a call to code that runs on the CPU.
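for illustration, one way to time only the device work is with CUDA events recorded on the stream the filter runs on (a sketch, with made-up sizes):

    #include <iostream>
    #include <opencv2/core/cuda.hpp>
    #include <opencv2/cudafilters.hpp>

    int main() {
        cv::cuda::GpuMat d_src(720, 1280, CV_32FC1), d_dst(720, 1280, CV_32FC1);
        d_src.setTo(cv::Scalar::all(1));

        cv::Ptr<cv::cuda::Filter> filter =
            cv::cuda::createBoxFilter(CV_32FC1, CV_32FC1, cv::Size(7, 7));
        filter->apply(d_src, d_dst);          // warm-up (one-time initialization)

        cv::cuda::Stream stream;
        cv::cuda::Event start, stop;
        start.record(stream);
        filter->apply(d_src, d_dst, stream);  // asynchronous: returns immediately
        stop.record(stream);
        stop.waitForCompletion();             // a CPU timer stopped before this point
                                              // would only measure the launch overhead
        std::cout << cv::cuda::Event::elapsedTime(start, stop) << " ms" << std::endl;
        return 0;
    }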

The post refers to a GPU which is three generations older than yours and, back of the envelope, nearly 7 times slower. If you read the comments below, someone with a faster GPU (a 770) achieved faster times and proposed that the memory bandwidth was the issue. Whilst I am not convinced that it was just the memory bandwidth in his case, I would suggest it is the GPU performance.

Anyway, to put that post in context using his timings (~12 ms): you can see that, unless something in the codebase has changed for the worse, ~450 ms is way out (even if the image type is different), and that is with a 7 times faster GPU and a possibly 9 times smaller image.

As @crackwitz mentioned, and as the post you linked to shows (first run 1.7 s vs 12 ms), you are timing the first run (a one-time cost) on the GPU, where initialization, including the creation of the CUDA context, happens. This is always orders of magnitude slower than subsequent operations. Additionally, if you pass an empty GpuMat as the destination, that memory will also get allocated during the call, slowing things down even more.
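For illustration, a minimal sketch of a measurement that keeps those one-time costs out of the timed region (sizes and filter parameters are just placeholders):

    #include <iostream>
    #include <opencv2/opencv.hpp>
    #include <opencv2/cudafilters.hpp>

    int main() {
        cv::Mat src(720, 1280, CV_32FC1);
        cv::randu(src, 0.f, 1.f);

        cv::cuda::GpuMat d_src, d_dst;
        d_src.upload(src);                          // transfer once, outside the timed region
        d_dst.create(d_src.size(), d_src.type());   // pre-allocate the destination

        cv::Ptr<cv::cuda::Filter> filter =
            cv::cuda::createBoxFilter(CV_32FC1, CV_32FC1, cv::Size(7, 7));

        filter->apply(d_src, d_dst);                // warm-up: context creation, module load
        cv::cuda::Stream::Null().waitForCompletion();

        cv::TickMeter tm;
        tm.start();
        for (int i = 0; i < 100; i++)
            filter->apply(d_src, d_dst);
        cv::cuda::Stream::Null().waitForCompletion();
        tm.stop();
        std::cout << tm.getTimeMilli() / 100 << " ms per call" << std::endl;
        return 0;
    }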

Thank you everyone for the ideas. All are appreciated.

The timing I reported was not on the first iterations and did not include the GpuMat allocation, only the running of the filter…

I was looking for the closest equivalent to the “simple blur”; clearly the box filter is not it.

The test data is the same data on GPU vs CPU. I wanted an apples-to-apples comparison.

I know the older link I found was wrong, but I was hoping for similar results with 3.4.9.

That is really strange. Have you built opencv with the performance tests?

I just checked with the perf test, which uses a 7x7 box filter on a 1280x1024 32FC1 image:

opencv_perf_cudafilters.exe --gtest_filter=Sz_Type_KernelSz_Blur.Blur/17

and the output was

[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Sz_Type_KernelSz_Blur
[ RUN ] Sz_Type_KernelSz_Blur.Blur/17, where GetParam() = (1280x1024, 32FC1, 7)
[ PERFSTAT ] (samples=100 mean=0.92 median=0.93 min=0.86 stddev=0.04 (4.8%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/17 (185 ms)
[----------] 1 test from Sz_Type_KernelSz_Blur (188 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (190 ms total)
[ PASSED ] 1 test.

This takes 0.92 ms on a GTX 1060. Can you compare the times you get for this to check they are quicker?


So I rebuilt opencv and got the following…

[ RUN ] Sz_Type_KernelSz_Blur.Blur/6, where GetParam() = (1280x720, 32FC1, 3)
[ PERFSTAT ] (samples=38 mean=0.15 median=0.15 min=0.14 stddev=0.00 (2.2%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/6 (35 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/7, where GetParam() = (1280x720, 32FC1, 5)
[ PERFSTAT ] (samples=13 mean=0.17 median=0.17 min=0.16 stddev=0.00 (1.1%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/7 (31 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/8, where GetParam() = (1280x720, 32FC1, 7)
[ PERFSTAT ] (samples=13 mean=0.27 median=0.27 min=0.26 stddev=0.00 (1.0%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/8 (32 ms)

So, it is something about the OpenCV library build.

Cool, so is your code also quicker now as well?

I have to relink against the new library.

I tested with the opencv_perf_cudafilters executable and got really good numbers, well below a millisecond.

But when I link against those same libraries, I get hundreds of ms to do the same blur.

Debugging

Have you checked using the exact same parameters? Sometimes it can make a big difference. It still sounds like the way you measure the timing may be off; hundreds of ms sounds a lot like one of the initialization runs, and/or a bug in the way you are calculating your timing.

Not a bug in the timing calculation. It was a mistake in my understanding of the CUDA implementation vs the non-CUDA one.

For kernels smaller than 32x32 the CUDA implementation is faster; once you get above 32x32 (in my case the test used a 500x500 kernel) the CUDA implementation is way slower.

I modified the opencv performance tests.

500x500 kernel, CUDA blur (box filter)…
[ RUN ] Sz_Type_KernelSz_Blur.Blur/23, where GetParam() = (1280x720, 32FC1, 500)
[ PERFSTAT ] (samples=10 mean=347.66 median=347.28 min=346.61 stddev=0.99 (0.3%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/23 (3480 ms)

By comparison, the non-CUDA blur…

[ RUN ] Size_MatType_BorderType_blur500x500.blur500x500/39, where GetParam() = (1280x720, 32FC1, BORDER_REFLECT101)
[ PERFSTAT ] (samples=10 mean=3.53 median=3.50 min=3.41 stddev=0.10 (2.9%))
[ OK ] Size_MatType_BorderType_blur500x500.blur500x500/39 (38 ms)

This was my confusion: I was using a 500x500 kernel in my code. Now I am doing an apples-to-apples comparison and can see the CUDA vs non-CUDA difference.

The other minor note: I have an RTX 2060, not an RTX 2070.

Thanks for everyone’s help. But it looks like I tried to move the wrong bit of code onto the GPU. We often have large kernels.


a simple box blur can be implemented to scale the same on a GPU as on a CPU.

a general filter (with interesting values in the kernel) is more complex than a box filter (filter is all ones, scaled).

I don’t know who wrote what code for CUDA there. if you need this to run fast, it’s possible.

I am not sure that this would be the case. Although I can’t see the implementation for boxFilter because it uses NPP under the hood, the normal procedure in CUDA for a filter of size 3 would be to load a neighbourhood of pixels, say 33x33, into shared memory (a user-managed cache) which can be accessed by, say, 32x32 threads. Each thread can then access the region the filter operates on from the cache instead of loading from global device memory, making the operations super quick. The only downside of this is that shared memory is limited and depends on the device and block size, but it will always be significantly less than 500x500. Therefore I think the scaling would drop off as soon as the shared memory is exhausted and threads have to load from global device memory.
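For illustration, roughly what that classic pattern looks like as a standalone CUDA kernel (a sketch of the technique, not the NPP implementation; BLOCK, R and box3x3 are made-up names):

    // sketch of a shared-memory (tiled) box filter, radius R = 1 (i.e. 3x3);
    // each 32x32 block stages a 34x34 tile once, then every thread reads its
    // 3x3 neighbourhood from the on-chip tile instead of global memory.
    // launch: box3x3<<<dim3((w+31)/32, (h+31)/32), dim3(32, 32)>>>(d_src, d_dst, w, h);
    #define BLOCK 32
    #define R 1

    __global__ void box3x3(const float* src, float* dst, int w, int h) {
        __shared__ float tile[BLOCK + 2 * R][BLOCK + 2 * R];
        int bx = blockIdx.x * BLOCK, by = blockIdx.y * BLOCK;

        // cooperative load of the tile plus its apron, clamping at the image border
        for (int ty = threadIdx.y; ty < BLOCK + 2 * R; ty += BLOCK)
            for (int tx = threadIdx.x; tx < BLOCK + 2 * R; tx += BLOCK) {
                int sx = min(max(bx + tx - R, 0), w - 1);
                int sy = min(max(by + ty - R, 0), h - 1);
                tile[ty][tx] = src[sy * w + sx];
            }
        __syncthreads();

        int x = bx + threadIdx.x, y = by + threadIdx.y;
        if (x >= w || y >= h) return;

        float sum = 0.f;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx)
                sum += tile[threadIdx.y + R + dy][threadIdx.x + R + dx];
        dst[y * w + x] = sum / ((2 * R + 1) * (2 * R + 1));
    }

The shared-memory budget is what kills this for large filters: with a 500x500 filter (radius ~250), the tile for a 32x32 block would need (32 + 2·250)² floats, over 1 MB per block, versus the tens of KB of shared memory a block can actually use.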

I would have been happy if it had been the same speed or faster, thereby relieving the CPU of the workload. But this was not what I expected.

I am. you forget, we discuss this in terms of the kernel size k, which varies not just around 3, but up to 500 in this instance.

a box blur is a separable filter, and because its filter shape is so trivial, it can even be calculated from an Integral Image. from an integral image, arbitrary box sizes are always just four lookups per pixel, O(4) = O(1).

calculating an integral image is two cumulative sums. source → per-row rightward → per-column downward (unless I’m mistaken), done. that should be trivial to parallelize and cheap to calculate. it’s O(2 \cdot m \cdot n) = O(m n), for a total of O(2 m n + 4 m n) = O(m n).

if the CUDA implementation doesn’t work with that information, it may just do a naive sum of the neighborhood for every pixel, which is O(k^2) per pixel or O(k^2 m n) total. that is an absolute waste of calculation.
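back of the envelope, for a 1280x1024 image and k = 500: the naive approach is about k^2 \cdot m \cdot n \approx 3.3 \cdot 10^{11} additions, while the integral-image route is about 6 \cdot m \cdot n \approx 7.9 \cdot 10^6.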

different runtime complexity classes.

I don’t know what the CUDA box filter actually does under the hood. it may be that dumb, or it may be smarter, but I doubt that.

prediction: double the filter size, naive algorithm’s runtime will quadruple. filter based on integral image should stay the same.

of course, the runtime of the naive algorithm may be shorter than the one for the integral image for very small kernel sizes. in OpenCV there are a bunch of those situations, and the API call uses one or the other impl. depending on parameters and input sizes.


Apologies, of course I wasn’t considering the type of filter. In that case, as you said, it should scale fairly well: probably really fast for small filters, as there is less transfer from global memory, and then only slightly slower for large filters as the transfer increases to a maximum of 4 times the thread-block size and you have to pre-compute the integral image.

This operation (with both the naive and the integral-image approach) should be completely memory bound, meaning that small filters which fit in shared memory, requiring less transfer from global memory, should be as quick with the naive approach as with the integral image.

Looking at the trace for the NPP functions, I would be 99% sure they are naive implementations, firstly due to the names of the kernels:

ForEachPixelNaive<float, (int)1, FilterBoxReplicateBorder3x3SharedFunctor
ForEachPixelNaive<float, (int)1, FilterBoxReplicateBorder5x5SharedFunctor

for filter sizes 3 and 5 respectively, which as the names suggest are both using shared memory, and therefore probably the classic approach which is faster for small filters; and then for filters of size 7 and above

ForEachPixelNaive<float, (int)1, FilterBoxReplicateBorderFloatFunctor

which doesn’t use shared memory and, at a guess from the timings for large filters, uses the naive approach with global memory reads for each operation.

I’ve never looked into writing my own CUDA kernel for use with OpenCV… but at least there’s cv::cuda::integral so that should give a very nice basis for a very trivial kernel function

pseudocode:

# box sum around (x, y) via four integral-image lookups; divide by the area for the mean
def kernel(x, y, ii, k):
    return ii[y+k, x+k] + ii[y-k, x-k] - ii[y+k, x-k] - ii[y-k, x+k]

with appropriate border handling for the lookups (“replicate”/clip mode is suitable).
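for illustration, the same four-lookup idea on the CPU with cv::integral (a sketch; boxMean is a made-up helper, and real code would clamp the lookups at the borders):

    #include <iostream>
    #include <opencv2/opencv.hpp>

    // mean of the (2k+1)x(2k+1) box centred on (x, y), via four lookups in the
    // integral image; assumes (x, y) is at least k away from the image border
    static float boxMean(const cv::Mat& ii, int x, int y, int k) {
        // cv::integral pads the sum image by one row/column, hence the +1 offsets
        double s = ii.at<double>(y + k + 1, x + k + 1)
                 - ii.at<double>(y - k,     x + k + 1)
                 - ii.at<double>(y + k + 1, x - k)
                 + ii.at<double>(y - k,     x - k);
        int side = 2 * k + 1;
        return (float)(s / ((double)side * side));
    }

    int main() {
        cv::Mat img(720, 1280, CV_32FC1);
        cv::randu(img, 0.f, 1.f);
        cv::Mat ii;
        cv::integral(img, ii, CV_64F);                   // one pass, cost independent of k
        std::cout << boxMean(ii, 640, 360, 250) << "\n"; // 501x501 box, still four lookups
        return 0;
    }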

I wrote up a bug report for NVIDIA NPP…

These are the numbers I saw…

As you can see from my data, the CPU version slows down minimally as the kernel size grows, but the GPU falls off a cliff once the kernel size grows past 32.

At a kernel size of 32x32, GPU and CPU are similar; at larger sizes CUDA falls apart. This is similar for many filter types.

Performance numbers using the OpenCV CUDA performance test.

Using CUDA (via NPP)

[ RUN ] Sz_Type_KernelSz_Blur.Blur/55, where GetParam() = (1280x1024, 32FC1, 32)
[ PERFSTAT ] (samples=25 mean=2.37 median=2.34 min=2.33 stddev=0.07 (2.8%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/55 (67 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/56, where GetParam() = (1280x1024, 32FC1, 64)
[ PERFSTAT ] (samples=13 mean=8.72 median=8.68 min=8.38 stddev=0.25 (2.9%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/56 (120 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/57, where GetParam() = (1280x1024, 32FC1, 128)
[ PERFSTAT ] (samples=10 mean=33.78 median=33.61 min=33.25 stddev=0.60 (1.8%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/57 (343 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/58, where GetParam() = (1280x1024, 32FC1, 256)
[ PERFSTAT ] (samples=10 mean=132.97 median=132.77 min=131.68 stddev=0.77 (0.6%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/58 (1335 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/59, where GetParam() = (1280x1024, 32FC1, 512)
[ PERFSTAT ] (samples=10 mean=547.32 median=547.15 min=546.28 stddev=1.04 (0.2%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/59 (5479 ms)

Using CPU

[ RUN ] Sz_Type_KernelSz_Blur.Blur/55, where GetParam() = (1280x1024, 32FC1, 32)
[ PERFSTAT ] (samples=20 mean=2.16 median=2.13 min=2.12 stddev=0.06 (3.0%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/55 (47 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/56, where GetParam() = (1280x1024, 32FC1, 64)
[ PERFSTAT ] (samples=19 mean=2.22 median=2.19 min=2.19 stddev=0.07 (2.9%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/56 (45 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/57, where GetParam() = (1280x1024, 32FC1, 128)
[ PERFSTAT ] (samples=18 mean=2.66 median=2.64 min=2.57 stddev=0.08 (3.0%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/57 (51 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/58, where GetParam() = (1280x1024, 32FC1, 256)
[ PERFSTAT ] (samples=13 mean=3.48 median=3.48 min=3.38 stddev=0.06 (1.7%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/58 (48 ms)
[ RUN ] Sz_Type_KernelSz_Blur.Blur/59, where GetParam() = (1280x1024, 32FC1, 512)
[ PERFSTAT ] (samples=13 mean=4.66 median=4.65 min=4.55 stddev=0.08 (1.7%))
[ OK ] Sz_Type_KernelSz_Blur.Blur/59 (64 ms)