OpenCV CUDA extremely slow

I ran some tests comparing OpenCV performance on a few basic operations, with and without CUDA.

I just threw in a few simple operators: greyscale conversion, thresholding, morphological operators, resizing.

To my surprise, the CUDA code was 50-60 times slower than the CPU! I tested on my laptop (Core i7 vs GeForce MX130) and on an NVIDIA Jetson Nano (ARM CPU) with similar results. The CUDA code took 0.6 s on my laptop, which is really a lot for a 5 MP image.
CUDA 10.1/10.2 was used, and OpenCV 4.5.2 was compiled locally in both cases.
C++ and Python code gave similar performances in both cases.

Do you have any idea what I'm doing wrong?

Here is my code for testing:

import time
import cv2

im = cv2.imread(filename)

# ****  CPU implementation  ****
start_t = time.time()
gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
retval,thr = cv2.threshold(gray,128,255,cv2.THRESH_BINARY)

morph_kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(7,7))
morph = cv2.dilate(thr,morph_kernel)
morph = cv2.resize(morph,(640,480))
end_t = time.time()
print("Processing time : {}".format(end_t-start_t))

# ****  GPU implementation  ****
start_t = time.time()
gpu_frame = cv2.cuda_GpuMat()
gpu_frame.upload(im)
gpu_gray = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY)
retval,gpu_thr = cv2.cuda.threshold(gpu_gray,128,255,cv2.THRESH_BINARY)
morph_kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(7,7))
morph_filter = cv2.cuda.createMorphologyFilter(cv2.MORPH_DILATE,cv2.CV_8U,morph_kernel)
gpu_morph = morph_filter.apply(gpu_thr)
gpu_morph = cv2.cuda.resize(gpu_morph,(640,480))
res = gpu_morph.download()
end_t = time.time()
print("Processing time : {}".format(end_t-start_t))

I would say your speed problem is a combination of the following in order of importance:

  1. Your GPU is very, very low spec. You have 913.2 GF/s of compute and a memory bandwidth of 40.1 GB/s. To put that into perspective, an RTX 3090 has a boost throughput of 35581 GF/s and a bandwidth of 936 GB/s.

  2. You are including the upload and download of the frame, plus CPU-side code, in your timing, so you are not timing just the device. The upload and download speed depends on the motherboard and its connection to your device, not on the device itself, and given the GPU spec I would expect it to be extremely slow. You can overcome these delays to some extent by using streams.

  3. You need to do several runs; with CUDA there is usually a start-up cost as the device initializes, with the first run being orders of magnitude slower than the subsequent ones.

  4. As you are using Python, you are allocating a new cv2.cuda_GpuMat() for every function call, which has an overhead. That is, you are doing

    gpu_gray = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY)

    instead of pre-allocating

    gpu_gray = cv2.cuda_GpuMat(im.shape[:-1][::-1], cv2.CV_8UC1)

    at the beginning (before any timing) and then calling the function with the return argument, as in

    cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY, gpu_gray)

    As your GPU only has a bandwidth of 40.1 GB/s, this effect will be far more pronounced on your system than on, say, an RTX 2060.

  5. You are processing everything in the default stream, so every function call will stall the device on completion. If you move to streams, you also need to move the host calls

    morph_kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(7,7))

    and

    morph_filter = cv2.cuda.createMorphologyFilter(cv2.MORPH_DILATE,cv2.CV_8U,morph_kernel)

    above your GPU computation and timing.

  6. You are using CPU timers instead of CUDA events, although this will probably have minimal impact.

On a side note, when the Python CUDA bindings were created I tested them out to see if streaming etc. was still possible, and found the results on the GPU to be a significant improvement over the CPU.


I ran a quick check of your code to test my assumptions. On my hardware at least, the speedup from a desktop i7-8700 to a laptop RTX 2080 (roughly comparable tiers of CPU and GPU) was ~3.5x. That was with a large array; with smaller arrays the speed increase will be lower, as the GPU will not be saturated unless multiple pipelines are streamed at the same time.

https://github.com/cudawarped/opencv-experiments/blob/master/nbs/cuda_optimization_test.ipynb


Thanks for the tips, @cudawarped !

The main problem with my code was the slow start-up time of the GPU. If I launch the code on a set of images, the first one takes a long time, but the rest are much (like ~30 times) faster.

In my case the other optimizations had little impact on performance. That said, they are good practices in CUDA programming (like pre-allocating GPU memory for images) and can have an impact on larger workloads…
