Blur is not a member of cv::cuda

Apologies, of course I wasn't considering the type of filter. In that case, as you said, it should scale fairly well: probably very fast for small filters, since less data needs to be transferred from global memory, and only slightly slower for large filters, where the transfer grows to a maximum of 4 times the thread-block size and you have to pre-compute the integral image.

This operation (with both the naive and the integral-image approach) should be completely memory bound, meaning that for small filters which fit in shared memory, and therefore require fewer reads from global memory, the naive approach should be as quick as the integral image.
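For reference, this is why the integral-image approach stops caring about filter size: once the summed-area table is built, every box sum is four lookups regardless of the kernel. A minimal CPU sketch (my own illustration, not the NPP implementation):

```cpp
#include <vector>

// Build a summed-area table (integral image) with a 1-pixel zero border,
// so sat[(y)*(w+1)+x] holds the sum of all pixels above and left of (x, y).
std::vector<long long> integralImage(const std::vector<int>& img, int w, int h) {
    std::vector<long long> sat((w + 1) * (h + 1), 0);
    for (int y = 1; y <= h; ++y)
        for (int x = 1; x <= w; ++x)
            sat[y * (w + 1) + x] = img[(y - 1) * w + (x - 1)]
                + sat[(y - 1) * (w + 1) + x]
                + sat[y * (w + 1) + x - 1]
                - sat[(y - 1) * (w + 1) + x - 1];
    return sat;
}

// Sum over the box [x0, x1) x [y0, y1): four reads, independent of box size.
long long boxSum(const std::vector<long long>& sat, int w,
                 int x0, int y0, int x1, int y1) {
    int W = w + 1;
    return sat[y1 * W + x1] - sat[y0 * W + x1]
         - sat[y1 * W + x0] + sat[y0 * W + x0];
}
```

The trade-off the post describes is visible here: the table itself costs a full pass over the image, which is pure overhead for small filters but amortises quickly as the kernel grows.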

Looking at the trace for the NPP functions, I would be 99% sure they are naive implementations, firstly because of the kernel names:

ForEachPixelNaive<float, (int)1, FilterBoxReplicateBorder3x3SharedFunctor
ForEachPixelNaive<float, (int)1, FilterBoxReplicateBorder5x5SharedFunctor

for filter sizes 3 and 5 respectively. As the names suggest, both use shared memory, and are therefore probably the classic tiled approach, which is faster for small filters. Then, for filters of size 7 and above, the kernel is

ForEachPixelNaive<float, (int)1, FilterBoxReplicateBorderFloatFunctor

which doesn't use shared memory and, judging from the timings for large filters, presumably uses the naive approach with global-memory reads for each operation.
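A sketch of what such a naive kernel would compute, written as plain CPU C++ for clarity (the function name and structure are my guesses, not NPP's code): every output pixel reads its full (2r+1) x (2r+1) neighbourhood, so the per-pixel cost grows with the square of the filter size, matching the timings described above. The replicate-border clamping mirrors the `ReplicateBorder` in the kernel names.

```cpp
#include <algorithm>
#include <vector>

// Naive box filter: per-pixel cost is O(k^2) for a k x k kernel,
// with replicate-border handling via coordinate clamping.
std::vector<float> boxFilterNaive(const std::vector<float>& img,
                                  int w, int h, int radius) {
    std::vector<float> out(w * h);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            int count = 0;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    // Clamp to the image edge (replicate border).
                    int sx = std::min(std::max(x + dx, 0), w - 1);
                    int sy = std::min(std::max(y + dy, 0), h - 1);
                    sum += img[sy * w + sx];
                    ++count;
                }
            }
            out[y * w + x] = sum / count;
        }
    }
    return out;
}
```

In a real CUDA version each (x, y) iteration would be one thread, and without shared memory each of those k^2 reads goes to global memory (mitigated only by the cache), which is consistent with the scaling seen for the large-filter functor.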