Some OpenCV cudafilters functions are slower than CPU code on Jetson Xavier NX

I have tested some cudafilters functions but found them slower than CPU code on Jetson Xavier NX. The OpenCV version is 4.5.0 and the CUDA version is 10.2. The code is as follows:

#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/cudafilters.hpp>

using namespace cv;
using namespace std;

const int iternum = 100;

Mat img = imread("../../LenaGRAY.bmp", 0);
resize(img, img, Size(8000, 4000));
Mat element = getStructuringElement(MORPH_RECT, Size(15, 15));

// CPU benchmark
Mat dst0 = Mat::zeros(img.size(), CV_8UC1);
Mat img0 = img.clone();
double t0_start = getTickCount();
for(int i = 0; i < iternum; i++)
{
    boxFilter(img0, dst0, CV_8UC1, Size(10, 10));
    //GaussianBlur(img0, dst0, Size(5, 5), 0);
    //morphologyEx(img0, dst0, MORPH_DILATE, element);
}
double t0_end = getTickCount();
cout<<"time:"<<(t0_end - t0_start) / getTickFrequency() * 1000 / iternum << "ms"<<endl;

// GPU benchmark
cuda::GpuMat imgGpu;
imgGpu.upload(img);
cuda::GpuMat dstGpu;
dstGpu.create(img.size(), CV_8UC1);
Mat dst1;
double t1_start = getTickCount();
for(int i = 0; i < iternum; i++)
{
    cv::Ptr<cv::cuda::Filter>f1 = cv::cuda::createBoxFilter(CV_8UC1, CV_8UC1, Size(10, 10));
    f1->apply(imgGpu, dstGpu);
    //cv::Ptr<cv::cuda::Filter>f2 = cv::cuda::createGaussianFilter(CV_8UC1, CV_8UC1, Size(5, 5), 0);
    //f2->apply(imgGpu, dstGpu);
    //cv::Ptr<cv::cuda::Filter>f3 = cv::cuda::createMorphologyFilter(MORPH_DILATE,  CV_8UC1, element);
    //f3->apply(imgGpu, dstGpu);
}
double t1_end = getTickCount();
cout<<"time:"<<(t1_end - t1_start) / getTickFrequency() * 1000 / iternum << "ms"<<endl;
dstGpu.download(dst1);

I have tested boxFilter, GaussianBlur and morphologyEx; the CUDA functions are slower than the CPU ones.
I have also tested on a Windows 10 PC. The GPU is an NVIDIA GeForce RTX 3060 and the CUDA version is 11.1. The CUDA filter functions are still slower than the CPU ones.

There are a number of possible reasons:

  1. You are including the filter creation inside the timing loop - move it out of the loop and time again (see the first sketch after this list).
  2. You may be timing the one-off initialization overhead of loading the cudafilters or opencv_world libraries - check the effect of calling your CUDA functions once outside of the timing loop first (the warm-up call in the first sketch).
  3. You are using a filter size which cannot be processed in shared memory - check the effect of reducing it to 5.
  4. You're using CPU timers and the default stream, so the resulting times include the launch latency and some overhead from the stall caused by the internal calls to cudaDeviceSynchronize - time with CUDA events to get the kernel execution time (see the second sketch after this list).
  5. It's simply slower on your hardware.
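
For example, points 1 and 2 applied to the box filter case might look like the following sketch. It reuses imgGpu, dstGpu and iternum from the code in the question and keeps your getTickCount-based timing; the names boxGpu, t2_start and t2_end are just illustrative, and only the box filter path is shown:

// Create the filter once, outside the timing loop (point 1)
cv::Ptr<cv::cuda::Filter> boxGpu = cv::cuda::createBoxFilter(CV_8UC1, CV_8UC1, Size(10, 10));

// One untimed warm-up call so one-off CUDA/library initialization is not measured (point 2)
boxGpu->apply(imgGpu, dstGpu);

double t2_start = getTickCount();
for(int i = 0; i < iternum; i++)
{
    boxGpu->apply(imgGpu, dstGpu);   // blocking call on the default stream
}
double t2_end = getTickCount();
cout<<"time:"<<(t2_end - t2_start) / getTickFrequency() * 1000 / iternum << "ms"<<endl;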
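
If you also want to exclude the launch latency and host-side overhead (point 4), the same loop can be timed on the device with cv::cuda::Event on a non-default stream. This continues the sketch above and assumes the boxGpu filter created there:

cv::cuda::Stream stream;
cv::cuda::Event evStart, evStop;

evStart.record(stream);
for(int i = 0; i < iternum; i++)
{
    boxGpu->apply(imgGpu, dstGpu, stream);   // asynchronous on this stream
}
evStop.record(stream);
evStop.waitForCompletion();   // wait until the last apply has finished

cout<<"time:"<<cv::cuda::Event::elapsedTime(evStart, evStop) / iternum << "ms"<<endl;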