Regarding the speed of Gaussian blur (CPU vs. CUDA)

I tried the CPU cv::GaussianBlur and the CUDA-accelerated Gaussian filter in OpenCV 4.7, with Gaussian kernels of sigma 5 and 50, and timed them with the chrono library.

    // CPU path: time cv::GaussianBlur with the CPU clock
    auto start2 = std::chrono::high_resolution_clock::now();
    cv::Mat gause1(img.size(), CV_32FC1);
    cv::GaussianBlur(img, gause1, cv::Size(31, 31), 5, 5, cv::BORDER_REPLICATE);
    auto end2 = std::chrono::high_resolution_clock::now();
    auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(end2 - start2);
    cout << "GaussianBlur elapsed time: " << duration2.count() << "us\n";


    // GPU path: filter creation is outside the timed region, but the
    // host<->device transfers are inside it
    auto filter = cv::cuda::createGaussianFilter(CV_32FC1, CV_32FC1, cv::Size(31, 31), 5, 5, cv::BORDER_REPLICATE);
    auto start4 = std::chrono::high_resolution_clock::now();
    cv::cuda::GpuMat src, dst;
    src.upload(img);              // host -> device copy (timed)
    filter->apply(src, dst);
    cv::Mat gause1_gpu;
    dst.download(gause1_gpu);     // device -> host copy (timed)
    auto end4 = std::chrono::high_resolution_clock::now();
    auto duration4 = std::chrono::duration_cast<std::chrono::microseconds>(end4 - start4);
    cout << "GaussianBlur_gpu elapsed time: " << duration4.count() << "us\n";

GaussianBlur elapsed time: 9793us
GaussianBlur_gpu elapsed time: 171327us

  1. Why does the CUDA-accelerated version take longer than the CPU one? I understand that CUDA timing should use events, but if my algorithm involves CUDA acceleration and I want to measure the duration of the whole algorithm, should I use the CPU clock?

  2. Why does CUDA's accelerated Gaussian blur limit the kernel size to greater than 0 and at most 32? If sigma is 50, the corresponding kernel size should be on the order of 300–400 (see the sketch after this list). Does this mean CUDA cannot be used to accelerate a Gaussian blur whose sigma exceeds about 5?

  3. Finally, if I want to apply Gaussian blurs with sigma 5 and 70 to the same image, is there any way to reduce the runtime to 1–2 ms? My image is 240×340, CV_32FC1.
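
For reference on the kernel-size question, here is a minimal sketch of the automatic size OpenCV computes when ksize <= 0; the formula is taken from createGaussianKernels in the imgproc module, so treat the exact expression as an assumption for your OpenCV version:

    // Minimal sketch (assumption: mirrors createGaussianKernels in OpenCV 4.x):
    // when ksize <= 0, OpenCV derives the kernel size from sigma as
    //   cvRound(sigma * (depth == CV_8U ? 3 : 4) * 2 + 1) | 1
    int autoGaussianKsize(double sigma, int depth) {
        return cvRound(sigma * (depth == CV_8U ? 3 : 4) * 2 + 1) | 1;
    }
    // autoGaussianKsize(5.0,  CV_32F) ->  41
    // autoGaussianKsize(50.0, CV_32F) -> 401, far above the CUDA limit of 32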

I can think of two possible causes:

  1. You are timing the upload and download of the image to and from the device, not just the filter operation (see the sketch after this list).
  2. You are using a large window; from memory, window sizes up to 7 are processed in shared memory.
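
To isolate cause 1, a minimal sketch (assuming the `img` and `filter` variables from the question) that moves the transfers outside the timed region, so the CPU clock covers apply() alone; calls on the default stream block until the kernel has finished, so the measurement is well defined:

    cv::cuda::GpuMat src, dst;
    src.upload(img);                                   // host -> device, not timed
    auto t0 = std::chrono::high_resolution_clock::now();
    filter->apply(src, dst);                           // timed: filter only
    auto t1 = std::chrono::high_resolution_clock::now();
    cv::Mat result;
    dst.download(result);                              // device -> host, not timed
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0);
    std::cout << "filter only: " << us.count() << "us\n";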

If you use CPU timers you will also be including the latency of the kernel launch in your timing, e.g. if the launch latency is 100 microseconds and your kernel only takes 32 microseconds, the CPU timer will report > 132 microseconds while the event timing will report ~32.
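
For comparison, a sketch of event-based timing with cv::cuda::Event, again assuming the `filter`, `src` and `dst` from above; elapsedTime() returns the GPU-side duration in milliseconds and excludes the launch latency a CPU clock would see:

    cv::cuda::Event evStart, evEnd;
    evStart.record();                 // recorded on the default stream
    filter->apply(src, dst);
    evEnd.record();
    evEnd.waitForCompletion();        // block until the GPU has passed evEnd
    float gpu_ms = cv::cuda::Event::elapsedTime(evStart, evEnd);
    std::cout << "filter only (events): " << gpu_ms * 1000.0f << "us\n";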

CUDA also has a start-up cost when the context is initialized (the first time you call a function from the OpenCV CUDA module) and on the first call to a specific function. To remove this, call the filter once without timing it, or time it twice and compare the two runs.
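
For example, a warm-up along these lines (reusing the variables from the sketches above):

    filter->apply(src, dst);   // warm-up: absorbs the one-time CUDA start-up cost
    filter->apply(src, dst);   // steady state; time this call instead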