Python CUDA GpuMat upload() function, strange warm-up required?

Hi! I was trying to compare the execution times of some cv2 operations (e.g. Canny edge detection, HoughLines) between the standard CPU version and the CUDA-enabled version of the OpenCV library. During the process, two findings about cv2.cuda_GpuMat.upload() confused me, and I hope to get some help understanding them better.

(The test script is attached below in Appendix A, and the environment information, e.g. GPU device, OS, Python/CUDA versions, is in Appendix B.)

Finding 1

In the first 2 experiments, GpuMat.upload() seemed to involve some heavy overhead.
I ran the comparison over 10 randomly generated images, once with data size (480, 640) and once with (2, 2). The GpuMat.upload() calls took about the same total time (roughly 3 seconds) regardless of image size.

Here are the related experiment results (1.1 and 1.2), followed by questions.

Experiment 1.1 (realistic image size)

# python3 
Creating [10] random test images of size: (480, 640)

======================================== CPU ========================================
         20 function calls in 0.142 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    0.142    0.014    0.142    0.014 {Canny}
       10    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

======================================== GPU (CUDA) ========================================
         40 function calls in 3.197 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    0.160    0.016    0.160    0.016 {method 'detect' of 'cv2.cuda_CannyEdgeDetector' objects}
       10    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       10    0.000    0.000    0.000    0.000 {method 'download' of 'cv2.cuda_GpuMat' objects}
       10    3.037    0.304    3.037    0.304 {method 'upload' of 'cv2.cuda_GpuMat' objects}

Experiment 1.2 (very small size)

Creating [10] random test images of size: (2, 2)

======================================== CPU ========================================
         20 function calls in 0.001 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    0.001    0.000    0.001    0.000 {Canny}
       10    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

======================================== GPU (CUDA) ========================================
         40 function calls in 3.141 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    0.010    0.001    0.010    0.001 {method 'detect' of 'cv2.cuda_CannyEdgeDetector' objects}
       10    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       10    0.000    0.000    0.000    0.000 {method 'download' of 'cv2.cuda_GpuMat' objects}
       10    3.131    0.313    3.131    0.313 {method 'upload' of 'cv2.cuda_GpuMat' objects}

Questions to Finding 1

After seeing the above results, I checked the source code for GpuMat for any operation that could add an overhead regardless of data shape.

I suspected that the create(... call in the upload() function caused the overhead by releasing and allocating memory. But the create() function does try to avoid any work if the data shape and type are identical (line 160), which is true in my case.

So the questions are:

  1. Is there anything wrong with the test script?
  2. Otherwise, where does this overhead come from?
    (Please do read the next finding, too. It overrules this finding of overheads.)

Finding 2

Making a global call to any cv2.cuda_GpuMat object’s upload() function before the tests start solves the overhead problem. In particular, this line (near the top of the test script below) was uncommented:

# cv2.cuda_GpuMat().upload(np.random.randint(0, 256, G_DATA_SHAPE, dtype=np.uint8))

Here’s the related experiment result followed by questions.

Experiment 2

# python3 
Creating [10] random test images of size: (480, 640)

======================================== CPU ========================================
         20 function calls in 0.142 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    0.142    0.014    0.142    0.014 {Canny}
       10    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

======================================== GPU (CUDA) ========================================
         40 function calls in 0.240 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    0.229    0.023    0.229    0.023 {method 'detect' of 'cv2.cuda_CannyEdgeDetector' objects}
       10    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       10    0.000    0.000    0.000    0.000 {method 'download' of 'cv2.cuda_GpuMat' objects}
       10    0.011    0.001    0.011    0.001 {method 'upload' of 'cv2.cuda_GpuMat' objects}

Questions to Finding 2

  1. The object on the uncommented line, created with cv2.cuda_GpuMat() was not even used in subsequent script lines. How does this affect the later use of d_src.upload(img)? Is there some kind of singleton behind all GpuMats?
  2. Without that line, d_src.upload(img) was also called multiple times throughout the for-loop. Why doesn’t that have the same effect?
  3. The script’s workflow of cv2.cuda... seems common, based on some research online. Supposing the experiments were conducted correctly and can be reproduced, what’s the proper way of “warming up” the GpuMat’s upload() function?
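
One way to read the numbers above, offered as a rough sketch rather than a measurement: cProfile's percall column is just tottime divided by ncalls, so a single slow first call would be spread evenly over all 10 calls. Assuming the extra time is a one-time cost paid only by the first upload(), the totals from Experiment 1.1 and Experiment 2 give an estimate of that cost. The variable names are made up for illustration.

```python
# Totals for the 10 upload() calls, copied from the profiles above.
cold_total = 3.037  # Experiment 1.1: warm-up line commented out
warm_total = 0.011  # Experiment 2: warm-up line active
n_calls = 10

# Assumption: the extra time is a one-time cost paid by the first call only,
# so the difference between the two totals approximates that cost.
one_time_cost = cold_total - warm_total

# Under the same assumption, the warm run shows the steady-state cost per call.
steady_per_call = warm_total / n_calls

print(f"estimated one-time cost: {one_time_cost:.3f} s")
print(f"estimated steady-state upload: {steady_per_call * 1000:.1f} ms per call")
```

If this reading is right, the 0.304 s percall value in Experiment 1.1 is an average that no individual call actually took.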


I’d appreciate any comments, hints or directions on these bizarre findings. It would be amazing to understand what is going on and to find a reasonable way to warm up the upload part. Thank you in advance!

Appendix A: Profiling script

"""
Simple script to compare cpu and cuda versions of CannyEdgeDetector in opencv
What happens: some random images (of the same size) are created, and both versions of the detector run on each image. The times of the function calls are profiled.
* Change the randomly created image size with variable G_DATA_SHAPE;
* Change the number of randomly created images with variable NUM_TEST_IMAGES;
"""
import numpy as np
import cv2
import cProfile, pstats

G_DATA_SHAPE = (480, 640)
NUM_TEST_IMAGES = 10

# Uncomment to make one throwaway upload before profiling (see Finding 2)
# cv2.cuda_GpuMat().upload(np.random.randint(0, 256, G_DATA_SHAPE, dtype=np.uint8))

prof_cpu = cProfile.Profile()
prof_gpu = cProfile.Profile()


def create_data(n_images: int):
    print(f"Creating [{n_images}] random test images of size: {G_DATA_SHAPE}\n")
    data = []
    for _ in range(n_images):
        # Create synthetic grayscale image data
        image_data = np.random.randint(0, 256, G_DATA_SHAPE, dtype=np.uint8)
        data.append(image_data)
    return data


# Initialize CUDA Canny detector
cuda_canny = cv2.cuda.createCannyEdgeDetector(100, 200)
# containers for cuda data
d_src = cv2.cuda_GpuMat()
d_dst = cv2.cuda_GpuMat()


def fn_cpu_canny(img):
    # profile only the cv2 call itself
    prof_cpu.enable()
    edges = cv2.Canny(img, 100, 200)
    prof_cpu.disable()


def fn_cuda_canny(img):
    prof_gpu.enable()
    # upload the image from host memory to the device
    d_src.upload(img)
    # detect and obtain results as np array
    cuda_canny.detect(d_src, d_dst)
    result = d_dst.download()
    prof_gpu.disable()


def test_random_and_gather_stats(n_images: int):
    """
    1. create n random test images with shape G_DATA_SHAPE
    2. run the cpu and cuda canny detector (with upload/download)
    3. print the stats of function calls
    """
    data = create_data(n_images)
    for idx in range(n_images):
        img = data[idx]
        fn_cpu_canny(img)
        fn_cuda_canny(img)
    # print stats
    print("=" * 40, "CPU", "=" * 40)
    pstats.Stats(prof_cpu).sort_stats("stdname").print_stats()
    print("=" * 40, "GPU (CUDA)", "=" * 40)
    pstats.Stats(prof_gpu).sort_stats("stdname").print_stats()


if __name__ == "__main__":
    test_random_and_gather_stats(NUM_TEST_IMAGES)
    print("=" * 40, "Finished", "=" * 40)

Appendix B: Environment information

Board and OS info:

I have an NVIDIA Jetson Nano 4GB board with JetPack version 4.4.1. The OS is:

$ lsb_release -a                                                                                                                                                                                                                                                     
No LSB modules are available.                                                                                                                                                                                                                 
Distributor ID: Ubuntu                                                                                                                                                                                                                        
Description:    Ubuntu 20.04.6 LTS                                                                                                                                                                                                            
Release:        20.04                                                                                                                                                                                                                         
Codename:       focal

Python and cv2 versions:

Python 3.8.10
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv2
>>> print(cv2.__version__)

And here’s how the CUDA-enabled OpenCV 4.5.0 was built on the Jetson: with a script from the GitHub repo Qengineering/Install-OpenCV-Jetson-Nano.

CUDA device info

>>> print(cv2.cuda.getCudaEnabledDeviceCount())

>>> cv2.cuda.printCudaDeviceInfo(cv2.cuda.getDevice())
*** CUDA Device Query (Runtime API) version (CUDART static linking) *** 

Device count: 1

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.20 / 10.20
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3964 MBytes (4156813312 bytes)
  GPU Clock Speed:                               0.92 GHz
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           0 / 0
  Compute Mode:
      Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version  = 10.20, CUDA Runtime Version = 10.20, NumDevs = 1

Firstly, fantastic MRE. Unfortunately, in this case it was probably unnecessary, but it’s still great.

OK, when you first call an OpenCV CUDA function (e.g. d_src.upload(img)), the CUDA context is initialized, which incurs a significant delay. A “standard” convention for initializing the CUDA context in OpenCV is to call cuda::setDevice() during the initialization of your program; however, because OpenCV uses the CUDA runtime API, calling any CUDA function will have the same effect.
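
In Python, that convention might look like the sketch below. The helper name warm_up_cuda is made up for illustration. The throwaway 2x2 upload after setDevice() is an extra safety step I am assuming is harmless: it guarantees that at least one real CUDA operation has run before any timed code. The import guard only exists so the snippet can still be loaded on a machine without OpenCV.

```python
try:
    import numpy as np
    import cv2
except ImportError:  # lets the sketch load on machines without OpenCV
    np = cv2 = None


def warm_up_cuda(device_id=0):
    """Pay the one-time CUDA start-up cost now.

    Returns True if a CUDA device was found and touched, False otherwise.
    """
    if cv2 is None or cv2.cuda.getCudaEnabledDeviceCount() <= 0:
        return False
    # Select the device, as suggested above.
    cv2.cuda.setDevice(device_id)
    # Tiny throwaway upload so that a real CUDA operation has definitely run.
    cv2.cuda_GpuMat().upload(np.zeros((2, 2), dtype=np.uint8))
    return True


if __name__ == "__main__":
    print("CUDA warmed up:", warm_up_cuda())
```

Calling warm_up_cuda() once near the top of the profiling script, before the profilers are enabled, should keep the start-up cost out of the upload() numbers.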

Additionally, if you encounter a further initialization delay the first time you call a CUDA function which launches a CUDA kernel, this will most likely be due to the driver loading that code onto the device. You can check for this by timing the same function again directly afterwards.
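
A minimal sketch of that check, using only the standard library. The helper first_vs_second and the LazyResource class are hypothetical. LazyResource merely simulates a one-time set-up cost with a sleep so that the snippet runs anywhere; it is not an OpenCV class.

```python
import time


def first_vs_second(fn, *args, **kwargs):
    """Call fn twice back to back and return (first_seconds, second_seconds)."""
    t0 = time.perf_counter()
    fn(*args, **kwargs)
    t1 = time.perf_counter()
    fn(*args, **kwargs)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1


class LazyResource:
    """Hypothetical stand-in for something with a one-time set-up cost."""

    def __init__(self):
        self._ready = False

    def use(self):
        if not self._ready:
            time.sleep(0.05)  # simulated one-time initialization
            self._ready = True


first, second = first_vs_second(LazyResource().use)
print(f"first call: {first * 1000:.1f} ms, second call: {second * 1000:.1f} ms")
```

With the script from the question, the same helper could be applied as first_vs_second(cuda_canny.detect, d_src, d_dst) after an image has been uploaded to d_src. A first call that is much slower than the second would point to a one-time load rather than the real cost of the kernel.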

Although the link to the build script is not working, so I can’t be sure, I do not suspect this is a PTX compilation delay, because the delay is small and consistent over multiple runs (your JIT cache, see CUDA_CACHE_MAXSIZE, should be big enough not to have to evict JIT-compiled PTX code between runs).
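
If you want to look at the JIT cache yourself, something like the following could help. It is a sketch under assumptions: the helper name jit_cache_info is made up, and the default location ~/.nv/ComputeCache is what I would expect on Linux when CUDA_CACHE_PATH is not set, so please verify it on your board. CUDA_CACHE_PATH, CUDA_CACHE_DISABLE and CUDA_CACHE_MAXSIZE are the environment variables that control the cache.

```python
import os
from pathlib import Path


def jit_cache_info(env=None):
    """Report JIT cache settings and how many bytes the cache uses on disk."""
    env = os.environ if env is None else env
    # Assumed Linux default location when CUDA_CACHE_PATH is not set.
    default_dir = Path.home() / ".nv" / "ComputeCache"
    cache_dir = Path(env.get("CUDA_CACHE_PATH", str(default_dir)))

    bytes_on_disk = 0
    if cache_dir.is_dir():
        for entry in cache_dir.rglob("*"):
            if entry.is_file():
                bytes_on_disk += entry.stat().st_size

    return {
        "path": str(cache_dir),
        "disabled": env.get("CUDA_CACHE_DISABLE") == "1",
        "max_size": env.get("CUDA_CACHE_MAXSIZE"),  # None: driver default applies
        "bytes_on_disk": bytes_on_disk,
    }


if __name__ == "__main__":
    for key, value in jit_cache_info().items():
        print(f"{key}: {value}")
```

A cache that stays empty across runs, or one that is disabled, would make a PTX compilation delay more plausible than I assumed above.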
