Regarding the speed issue of Gaussian Blur

I tried the CPU Gaussian blur and the CUDA-accelerated Gaussian filter functions in OpenCV 4.7, using Gaussian kernels with sigma = 5 and sigma = 50, and timing with the std::chrono library.

    auto start2 = std::chrono::high_resolution_clock::now();
    cv::Mat gause1(img.size(), CV_32FC1);
    cv::GaussianBlur(img, gause1, cv::Size(31, 31), 5, 5, cv::BORDER_REPLICATE);
    auto end2 = std::chrono::high_resolution_clock::now();
    auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(end2 - start2);
    cout << "GaussianBlur elapsed time: " << duration2.count() << "us\n";

    auto filter = cv::cuda::createGaussianFilter(CV_32FC1, CV_32FC1, cv::Size(31, 31), 5, 5, cv::BORDER_REPLICATE);
    auto start4 = std::chrono::high_resolution_clock::now();
    cv::cuda::GpuMat src, dst;
    src.upload(img);
    filter->apply(src, dst);
    cv::Mat gause1_gpu;
    dst.download(gause1_gpu);
    auto end4 = std::chrono::high_resolution_clock::now();
    auto duration4 = std::chrono::duration_cast<std::chrono::microseconds>(end4 - start4);
    cout << "GaussianBlur_gpu elapsed time: " << duration4.count() << "us\n";

GaussianBlur elapsed time: 9793us
GaussianBlur_gpu elapsed time: 171327us

  1. Why does the CUDA-accelerated version take longer to run than the CPU version? I understand that CUDA requires events for accurate timing, but if my algorithm involves CUDA acceleration and I want to measure the duration of the entire algorithm, should I use CPU timers?

  2. Secondly, why does CUDA's accelerated Gaussian blur limit the Gaussian kernel size to greater than 0 and at most 32? For sigma = 50, the kernel size should be around 331. Does this mean that CUDA cannot be used to accelerate a Gaussian blur with sigma exceeding roughly 5?

  3. Finally, if I want to apply Gaussian blurs with sigma 5 and sigma 70 to the same image, is there any way to reduce the runtime to 1-2 ms? My image is 240 x 340, CV_32FC1.

I can think of two possible causes:

  1. You are timing the upload and download of the image to and from the device, not just the filter operation.
  2. You are using a large window; from memory, only window sizes up to 7 are processed in shared memory.

If you use CPU timers then you will be including the latency of the kernel launch in your timing. E.g. if the launch latency is 100 microseconds and your kernel only takes 32 microseconds, the CPU timer will read > 132 microseconds while the event will read ~32 microseconds.

CUDA also has a start-up cost when the context is initialized (the first time you call a function from the OpenCV CUDA SDK) and on the first call to a specific function. To remove this, call the filter once without timing it, or time it twice and compare the results.