Why is the execution time of OpenCV CUDA functions inconsistent?

Basically, I ran my OpenCV CUDA project to compare execution times between the CPU and the GPU. But when I read the log file, it shows that the GPU execution times develop a delay after six runs. I know that the very first CUDA kernel takes longer because of GPU initialization, so the very first run taking the longest is expected. My question is: why do the execution times change so much after six runs? From run 6 onwards, the processing starts to take much longer. The functions I run with CUDA are color conversion, blurring, dilation, and a morphological operation.

// My OpenCV CUDA code
LOG_INFO << "----------------------Process Program----------------------" << ENDLOG;
LOG_INFO << "Read Image - Start" << ENDLOG;
_imageCvMat = cv::imread(image_path);
LOG_INFO << "Read Image - END" << ENDLOG;

LOG_INFO << "Upload to GPU - Start" << ENDLOG;
_imageCvCudaMat.upload(_imageCvMat);
    
LOG_INFO << "isContinuous " << _imageCvCudaMat.isContinuous() << ENDLOG;
LOG_INFO << "Upload to GPU - END" << ENDLOG;

LOG_INFO << "Convert Image Cuda - Start" << ENDLOG;
cv::cuda::cvtColor(_imageCvCudaMat, _grayCvCudaMat, cv::COLOR_BGR2GRAY);
LOG_INFO << "Convert Image Cuda - END " << ENDLOG;

LOG_INFO << "GPU to CPU gray image - Start" << ENDLOG;
_grayCvCudaMat.download(_grayCvMat);
LOG_INFO << "GPU to CPU gray image- END" << ENDLOG;

LOG_INFO << "Blurring Image Cuda - Start " << ENDLOG;
cv::Ptr<cv::cuda::Filter> gaussian_filter = cv::cuda::createGaussianFilter(_grayCvCudaMat.type(), _blurredCvCudaMat.type(), cv::Size(5, 5), 0);
gaussian_filter->apply(_grayCvCudaMat, _blurredCvCudaMat);
LOG_INFO << "Blurring Image Cuda - END " << ENDLOG;

LOG_INFO << "Threshold Image - Start " << ENDLOG;
cv::threshold(_grayCvMat, _threshCvmat, 0, 255, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);
LOG_INFO << "Threshold Image - END " << ENDLOG;

LOG_INFO << "Distance Transform Image - Start " << ENDLOG;
cv::distanceTransform(_threshCvmat, _distCvMat, cv::DIST_L2, 5);
LOG_INFO << "Distance Transform Image - END " << ENDLOG;
    
// Normalize the distance transform such that the distances lie in
// the range [0, 1] and then convert the distance transform back to
// an unsigned 8-bit integer in the range [0, 255]
LOG_INFO << "Normalize Image - Start " << ENDLOG;
cv::normalize(_distCvMat, _distCvMat, 0, 12.0, cv::NORM_MINMAX, CV_32FC1);
_distCvMat = _distCvMat * 255;
_distCvMat.convertTo(_distCvMat, CV_8U);
LOG_INFO << "Normalize Image - END " << ENDLOG;

// Threshold the distance transform using Otsu's method
LOG_INFO << "Threshold Image - Start " << ENDLOG;
cv::threshold(_distCvMat, _distCvMat, 0, 255, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);
LOG_INFO << "Threshold Image - END " << ENDLOG;

LOG_INFO << "Threshold CPU to GPU - Start " << ENDLOG;
_distCvCudaMat.upload(_distCvMat);
LOG_INFO << "Threshold CPU to GPU - END " << ENDLOG;

LOG_INFO << "Dilate Image - Start " << ENDLOG;
m_kernelT1CvMat = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(7,7));
cv::Ptr<cv::cuda::Filter> dilate_filter = cv::cuda::createMorphologyFilter(cv::MORPH_DILATE, _distCvCudaMat.type(), m_kernelT1CvMat, cv::Point(-1, -1), 1);
dilate_filter->apply(_distCvCudaMat, _dilateCvCudaMat);
LOG_INFO << "Dilate Image - END " << ENDLOG;

LOG_INFO << "Opening morphological Image - Start " << ENDLOG;
m_kernelT2CvMat = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(7, 7));
cv::Ptr<cv::cuda::Filter> open_filter = cv::cuda::createMorphologyFilter(cv::MORPH_OPEN, _dilateCvCudaMat.type(), m_kernelT2CvMat, cv::Point(-1, -1), 3);
open_filter->apply(_dilateCvCudaMat, _openningCvCudaMat);

crosspost:

You're timing the launch latency as well. Essentially, when you call a CUDA function there is a delay before it is launched on the GPU. To time the kernel execution, use CUDA events.

First, time your code with CUDA events; then, if there is still a discrepancy, we can talk about possible causes, e.g. GPU boosting on mobile chips, context switching if you have a display connected, etc.

Note: When performing work on the GPU you want to launch as many jobs as possible in a row and overlap host and device computation so that you hide the launch latency.

I did not initialize any kernel inside my code; I just called the OpenCV CUDA function directly and timed only that function. Are there any documents or posts I can use for reference?

I am going to update the OpenCV CUDA docs when I have time, as I haven't found any easy-to-follow OpenCV-specific documentation. That said, the example below should hopefully give you some insight into what is going on.

// time context creation
cv::TickMeter tm;
tm.start();
cuda::setDevice(0);
tm.stop();
cout << "Context creation time on host: " << std::setprecision(2) << tm.getTimeMilli() << " ms" << endl;

Size sz(1920, 1080);
// initialize both GpuMat's to avoid timing creation of gray in the first call
GpuMat bgr(sz, CV_8UC3), gray(sz, CV_8UC1);
cv::cuda::Event startEvent, endEvent;
cv::cuda::Stream stream; // use streams to avoid global synchronization
// time loading of cudaimgproc and color conversion kernel on host and device
tm.reset(); tm.start();
startEvent.record(stream); // record events on the same stream the kernel is issued to
cv::cuda::cvtColor(bgr, gray, cv::COLOR_BGR2GRAY, 0, stream);
endEvent.record(stream);
endEvent.waitForCompletion();
tm.stop();
const double elTimeAll = cuda::Event::elapsedTime(startEvent, endEvent);
cout << "Loading of cudaimgproc + cvtColor time - host: " << tm.getTimeMilli() << " ms, device: " << elTimeAll << " ms" << endl;

// time a few iterations to check assumptions
for (int i = 0; i < 10; i++) {
    tm.reset(); tm.start();
    startEvent.record(stream);
    cv::cuda::cvtColor(bgr, gray, cv::COLOR_BGR2GRAY, 0, stream);
    endEvent.record(stream);
    endEvent.waitForCompletion();
    tm.stop();
    // stream.waitForCompletion();
    const double elTimeCvt = cuda::Event::elapsedTime(startEvent, endEvent);
    cout << endl << i << ") cvtColor time - host: " << tm.getTimeMilli() << " ms, device: " << elTimeCvt << " ms" << endl;
    cout << "CUDA kernel launch latency: " << (tm.getTimeMilli() - elTimeCvt) * 1000 << " us" << endl;
}

The output from the above on my system is

Context creation time on host: 1.6e+03 ms
Loading of cudaimgproc + cvtColor time - host: 31 ms, device: 28 ms

0) cvtColor time - host: 0.087 ms, device: 0.03 ms
CUDA kernel launch latency: 57 us

1) cvtColor time - host: 0.13 ms, device: 0.029 ms
CUDA kernel launch latency: 1e+02 us

2) cvtColor time - host: 0.13 ms, device: 0.029 ms
CUDA kernel launch latency: 1e+02 us

3) cvtColor time - host: 0.1 ms, device: 0.029 ms
CUDA kernel launch latency: 73 us

4) cvtColor time - host: 0.11 ms, device: 0.03 ms
CUDA kernel launch latency: 76 us

5) cvtColor time - host: 0.28 ms, device: 0.029 ms
CUDA kernel launch latency: 2.5e+02 us

6) cvtColor time - host: 0.26 ms, device: 0.031 ms
CUDA kernel launch latency: 2.3e+02 us

7) cvtColor time - host: 0.17 ms, device: 0.029 ms
CUDA kernel launch latency: 1.4e+02 us

8) cvtColor time - host: 0.2 ms, device: 0.028 ms
CUDA kernel launch latency: 1.7e+02 us

9) cvtColor time - host: 0.2 ms, device: 0.031 ms
CUDA kernel launch latency: 1.7e+02 us

As you mentioned, there are a number of initialization delays when using the CUDA modules: CUDA context creation, and the loading of the device code onto the GPU (CUDA < 12.2).

Context creation time on host: 1.6e+03 ms
Loading of cudaimgproc + cvtColor time - host: 31 ms, device: 28 ms

Once that’s out of the way, I have timed the CUDA function, which calls CUDA device code (a CUDA kernel), on both the host and the device. The host timer measures everything from when you call cvtColor to when the CUDA kernel has finished, whereas the device timer only measures the kernel execution time.

As you can see from the results, the kernel execution time is fairly consistent (~0.03 ms), whereas the launch latency (the time between instructing the device to execute your CUDA kernel and the time it actually starts executing) can vary significantly. For completeness, I should mention that all these timings were performed on Windows (which uses the WDDM driver), which is why the launch latency varies so much. If you uncomment

    // stream.waitForCompletion();

on Windows, you should get a much more consistent launch latency. That said, I think this nicely demonstrates the point that you need to understand what you are timing.

I hope this helps. For even more information you can check out something I wrote years ago which includes a lot more detail.

Thanks! Let me try it and feed back the results.

Hi cudawarped, I have tried the code. It turns out my program is actually consistent as well. The inconsistency happens because of a UI button: each run is triggered by a click, so the measured execution time depends on how fast the button is clicked.

Are you saying your inconsistent timings are due to a ui button?


Yes, if my click timing is consistent, then the measured time is consistent too. If I click slowly, the time changes as well.
