OpenCV CUDA extremely slow

I ran some tests comparing OpenCV performance on a few basic operations, with and without CUDA.

I just threw in a few simple operators: greyscale conversion, thresholding, morphological operators, resizing.

To my surprise, the CUDA code was 50-60 times slower than the CPU code. I tested on my laptop (Core i7 vs GeForce MX130) and on an NVIDIA Jetson Nano (ARM CPU) with similar results. The CUDA code took 0.6 s on my laptop, which is a lot for a 5 MP image.
CUDA 10.1/10.2 was used, and OpenCV 4.5.2 was compiled locally in both cases.
C++ and Python code gave similar performance in both cases.

Do you have any idea what I am doing wrong?

Here is my code for testing:


    import time
    import cv2

    im = cv2.imread("image.jpg")  # placeholder path: any BGR test image

    # ****  CPU implementation  ****
    start_t = time.time()
    gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
    retval, thr = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY)

    morph_kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    morph = cv2.dilate(thr, morph_kernel)
    morph = cv2.resize(morph, (640, 480))
    end_t = time.time()
    print("Processing time : {}".format(end_t - start_t))

    # ****  GPU implementation  ****
    start_t = time.time()
    gpu_frame = cv2.cuda_GpuMat()
    gpu_frame.upload(im)  # host -> device copy, inside the timed region
    gpu_gray = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY)
    retval, gpu_thr = cv2.cuda.threshold(gpu_gray, 128, 255, cv2.THRESH_BINARY)
    morph_kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    morph_filter = cv2.cuda.createMorphologyFilter(cv2.MORPH_DILATE, cv2.CV_8U, morph_kernel)
    gpu_morph = morph_filter.apply(gpu_thr)
    gpu_morph = cv2.cuda.resize(gpu_morph, (640, 480))
    res = gpu_morph.download()  # device -> host copy, inside the timed region
    end_t = time.time()
    print("Processing time : {}".format(end_t - start_t))

I would say your speed problem is a combination of the following in order of importance:

  1. Your GPU is very low spec: it has 913.2 GFLOP/s of compute and a bandwidth of 40.1 GB/s. To put that into perspective, an RTX 3090 has a boost of 35581 GFLOP/s and a bandwidth of 936 GB/s.

  2. Your timing includes the upload of the frame to the device, the download of the result back to the host, and host-side (CPU) code, so you are not timing the device alone. Upload and download speed depends on the motherboard and its connection to the device, not on the device itself. I would infer from the GPU spec that this will be extremely slow. You can hide these delays to some extent by using streams.

  3. You need to do several runs. With CUDA there is usually a start-up cost as the device initializes, so the first run is orders of magnitude slower than the subsequent ones.

  4. As you are using Python, you are allocating a new cv2.cuda_GpuMat() on every function call, which has an overhead. This is because you are doing

    gpu_gray = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY)

    instead of pre-allocating

    gpu_gray = cv2.cuda_GpuMat(im.shape[:-1][::-1],cv2.CV_8UC1)

    at the beginning (before any timing) and then passing the pre-allocated matrix to your functions as the destination argument:

    cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY,gpu_gray)

    As your GPU only has a bandwidth of 40.1 GB/s, this effect will be far more pronounced on your system than on, say, an RTX 2060.

  5. You are processing everything in the default stream, so every function call will stall the device on completion. If you move to streams, you also need to move your host calls

    morph_kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(7,7))

    morph_filter = cv2.cuda.createMorphologyFilter(cv2.MORPH_DILATE,cv2.CV_8U,morph_kernel)

    above your GPU computation and timing.

  6. You are using CPU timers instead of CUDA events, although this will probably have minimal impact.
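
Points 2 to 5 can be combined into one rough sketch, shown below. It assumes an OpenCV build with CUDA support and that the Python bindings accept the destination and `stream` arguments as described in point 4. The helper names `frame_size` and `time_gpu_pipeline` are hypothetical, invented for this illustration. Buffers, the morphology filter and the upload all sit outside the timed region, one warm-up pass is discarded, and the clock is read only after the stream has been synchronized.

```python
import time


def frame_size(shape):
    """Return (width, height) for a NumPy image shape (rows, cols[, channels])."""
    return (shape[1], shape[0])


def time_gpu_pipeline(im, n_runs=20, out_size=(640, 480)):
    """Return (mean seconds per run, downloaded result) for the GPU pipeline.

    Hypothetical helper. cv2 is imported here so that this module still
    loads on a machine without a CUDA-enabled OpenCV build.
    """
    import cv2

    size = frame_size(im.shape)
    stream = cv2.cuda_Stream()

    # Pre-allocate every device buffer once, before any timing (point 4).
    gpu_frame = cv2.cuda_GpuMat(size, cv2.CV_8UC3)
    gpu_gray = cv2.cuda_GpuMat(size, cv2.CV_8UC1)
    gpu_thr = cv2.cuda_GpuMat(size, cv2.CV_8UC1)
    gpu_morph = cv2.cuda_GpuMat(size, cv2.CV_8UC1)
    gpu_small = cv2.cuda_GpuMat(out_size, cv2.CV_8UC1)

    # Host-side setup also stays out of the timed region (point 5).
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    morph_filter = cv2.cuda.createMorphologyFilter(
        cv2.MORPH_DILATE, cv2.CV_8UC1, kernel)

    def process():
        # Each call writes into its pre-allocated destination and is queued
        # on the same non-default stream.
        cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY, gpu_gray, stream=stream)
        cv2.cuda.threshold(gpu_gray, 128, 255, cv2.THRESH_BINARY, gpu_thr,
                           stream=stream)
        morph_filter.apply(gpu_thr, gpu_morph, stream)
        cv2.cuda.resize(gpu_morph, out_size, gpu_small, stream=stream)

    # Upload once, then run one untimed warm-up pass (points 2 and 3).
    gpu_frame.upload(im, stream)
    process()
    stream.waitForCompletion()

    start = time.perf_counter()
    for _ in range(n_runs):
        process()
    stream.waitForCompletion()  # wait for the queued work before reading the clock
    elapsed = time.perf_counter() - start

    return elapsed / n_runs, gpu_small.download()
```

This still uses a CPU timer (point 6), which should be acceptable here because the stream is synchronized before the clock is read. A call such as `time_gpu_pipeline(cv2.imread("image.jpg"))` (placeholder file name) would return the mean per-run time and the resized result.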

On a side note, when the Python CUDA bindings were created I tested them to see whether streaming etc. was still possible, and found the results on the GPU to be a significant improvement over the CPU.


I ran a quick check of your code to test my assumptions. On my hardware at least, the speedup from a desktop i7-8700 to a laptop RTX 2080, which are in roughly comparable ranges of CPU and GPU, was ~3.5x. That was with a large array; with smaller arrays the speed increase will be lower, as the GPU will not be saturated unless multiple pipelines are streamed at the same time.


Thanks for the tips, @cudawarped !

The main problem with my code was the slow start-up time of the GPU. If I run the code on a set of images, the first one takes a long time and the rest are much (~30 times) faster.
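
A minimal sketch of how the first-run cost can be separated from the steady-state time, using only the standard library. The helper name `time_calls` and the stand-in workload are made up for this illustration; any callable that runs the pipeline once can be passed in.

```python
import time


def time_calls(fn, n_runs=10):
    """Call fn n_runs times; return (first_call_s, mean_of_remaining_s)."""
    if n_runs < 2:
        raise ValueError("n_runs must be at least 2")
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    first, rest = durations[0], durations[1:]
    return first, sum(rest) / len(rest)


if __name__ == "__main__":
    # Stand-in workload with an artificial one-time start-up cost,
    # used here in place of a real GPU pipeline.
    state = {"ready": False}

    def fake_pipeline():
        if not state["ready"]:
            time.sleep(0.05)  # simulated device initialization
            state["ready"] = True
        time.sleep(0.001)     # simulated per-image work

    first, steady = time_calls(fake_pipeline)
    print("first call: {:.4f} s, steady state: {:.4f} s".format(first, steady))
```

When `fn` wraps GPU code, it should finish with a blocking step (for example a download, or waiting on the stream) so that asynchronous work is included in each measurement.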

In my case the other optimizations had little impact on performance. That said, they are good CUDA programming practices (like pre-allocating GPU memory for images) and they can have an impact on larger operations…