Hi! I was trying to compare the execution times of some cv2 operations (e.g. Canny edge detection, HoughLines) between the standard CPU version and the CUDA-enabled version of the OpenCV library. During the process, two findings about cv2.cuda_GpuMat.upload() confused me, and I hope to get some help understanding them better.
(The test script is attached below in Appendix A, and the environment information, e.g. GPU device, OS, Python/CUDA versions, is in Appendix B.)
Finding 1
In the first two experiments, GpuMat.upload() seemed to involve heavy overhead. I ran the comparison over the same 10 randomly generated images, once with data size (480, 640) and once with (2, 2), and the GpuMat.upload() calls took about the same total time regardless of image size. Here are the related experiment results (1.1 and 1.2), followed by questions.
Experiment 1.1 (realistic image size)
# python3 benchmark.py
Creating [10] random test images of size: (480, 640)
======================================== CPU ========================================
20 function calls in 0.142 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10 0.142 0.014 0.142 0.014 {Canny}
10 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
======================================== GPU (CUDA) ========================================
40 function calls in 3.197 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10 0.160 0.016 0.160 0.016 {method 'detect' of 'cv2.cuda_CannyEdgeDetector' objects}
10 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
10 0.000 0.000 0.000 0.000 {method 'download' of 'cv2.cuda_GpuMat' objects}
10 3.037 0.304 3.037 0.304 {method 'upload' of 'cv2.cuda_GpuMat' objects}
Experiment 1.2 (very small size)
# python3 benchmark.py
Creating [10] random test images of size: (2, 2)
======================================== CPU ========================================
20 function calls in 0.001 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10 0.001 0.000 0.001 0.000 {Canny}
10 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
======================================== GPU (CUDA) ========================================
40 function calls in 3.141 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10 0.010 0.001 0.010 0.001 {method 'detect' of 'cv2.cuda_CannyEdgeDetector' objects}
10 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
10 0.000 0.000 0.000 0.000 {method 'download' of 'cv2.cuda_GpuMat' objects}
10 3.131 0.313 3.131 0.313 {method 'upload' of 'cv2.cuda_GpuMat' objects}
Questions about Finding 1
After seeing the above results, I checked the GpuMat source code for any operation that could add overhead regardless of data shape. I suspected the create(...) call inside upload() caused the overhead by releasing and reallocating memory, but create() avoids doing any work if the data shape and type are identical (line 160), which is true in my case.
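For what it's worth, one way to take (re)allocation out of the picture entirely would be to pre-allocate d_src with a matching shape and type before the loop, so the create() call inside upload() is guaranteed to be a no-op. This is just a sketch of the idea; I'm assuming the cv2.cuda_GpuMat(rows, cols, type) constructor allocates device memory up front:

import cv2
import numpy as np

# Pre-allocate device memory with the exact shape and type of the test
# images, so the create() call inside upload() has nothing to do.
d_src = cv2.cuda_GpuMat(480, 640, cv2.CV_8UC1)

img = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
d_src.upload(img)  # matching size/type, so no release/alloc expected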
So the questions are:
- Is there anything wrong with the test script?
- If not, where does this overhead come from? (A per-call timing sketch that could narrow this down follows this list.)
(Please do read the next finding, too; it largely overturns this one.)
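Here is the per-call timing sketch mentioned above. It is separate from benchmark.py and uses time.perf_counter instead of cProfile; the idea is to see whether the ~3 s is spread evenly across all ten upload() calls or concentrated in the very first one:

import time

import cv2
import numpy as np

G_DATA_SHAPE = (480, 640)  # same shape as in Appendix A

d_src = cv2.cuda_GpuMat()
for i in range(10):
    img = np.random.randint(0, 256, G_DATA_SHAPE, dtype=np.uint8)
    t0 = time.perf_counter()
    d_src.upload(img)
    t1 = time.perf_counter()
    print(f"upload #{i}: {(t1 - t0) * 1e3:.3f} ms")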
Finding 2
Making a global call to any cv2.cuda_GpuMat object's upload() function before the tests start solves the overhead problem. In particular, this line (near the top of the test script below) was uncommented:
# cv2.cuda_GpuMat().upload(np.random.randint(0, 256, G_DATA_SHAPE, dtype=np.uint8))
Here’s the related experiment result, followed by questions.
Experiment 2
# python3 benchmark.py
Creating [10] random test images of size: (480, 640)
======================================== CPU ========================================
20 function calls in 0.142 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10 0.142 0.014 0.142 0.014 {Canny}
10 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
======================================== GPU (CUDA) ========================================
40 function calls in 0.240 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10 0.229 0.023 0.229 0.023 {method 'detect' of 'cv2.cuda_CannyEdgeDetector' objects}
10 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
10 0.000 0.000 0.000 0.000 {method 'download' of 'cv2.cuda_GpuMat' objects}
10 0.011 0.001 0.011 0.001 {method 'upload' of 'cv2.cuda_GpuMat' objects}
Questions about Finding 2
- The object on the uncommented line, created with cv2.cuda_GpuMat(), is not even used in any subsequent script line. How does it affect the later use of d_src.upload(img)? Is there some kind of singleton behind all GpuMats?
- Without that line, d_src.upload(img) is also called multiple times throughout the for-loop. Why doesn't that have the same effect?
- The script's cv2.cuda... workflow seems common, judging from some research online. Supposing the experiments were conducted correctly and can be reproduced, what is the proper way of "warming up" GpuMat's upload() function? (A sketch of what I have in mind follows this list.)
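For completeness, here is the only "warm-up" I have in mind so far: essentially the throwaway upload from Finding 2, wrapped in a small function that runs once before any profiled code (the shape used here is arbitrary). Is there a more idiomatic alternative in the cv2.cuda API?

import cv2
import numpy as np

def warm_up_cuda(shape=(480, 640)):
    """Do one throwaway upload/download so that later GpuMat.upload()
    calls are not charged the one-time start-up cost."""
    dummy = cv2.cuda_GpuMat()
    dummy.upload(np.zeros(shape, dtype=np.uint8))
    dummy.download()

warm_up_cuda()
# ... profiled benchmark code would run after this point ...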
Lastly
I'd appreciate any comments, hints, or directions on these puzzling findings. It would be great to understand what is going on and to find a reasonable way to warm up the upload step. Thank you in advance!
Appendix A: Profiling script
"""
Simple script to compare cpu and cuda versions of CannyEdgeDetector in opencv
What happens: some random images (of the same size) are created, and both versions of the detector run on each image. The times of the function calls are profiled.
Configure:
* Change the randomly created image size with variable G_DATA_SHAPE;
* Change the number of randomly created images with variable NUM_TEST_IMAGES;
python3 benchmark.py
"""
import numpy as np
import cv2
import cProfile, pstats
G_DATA_SHAPE = (480, 640)
NUM_TEST_IMAGES = 10
# cv2.cuda_GpuMat().upload(np.random.randint(0, 256, G_DATA_SHAPE, dtype=np.uint8))
prof_cpu = cProfile.Profile()
prof_gpu = cProfile.Profile()
def create_data(n_images: int):
print(f"Creating [{n_images}] random test images of size: {G_DATA_SHAPE}\n")
data = []
for _ in range(n_images):
# Create synthetic grayscale image data
image_data = np.random.randint(0, 256, G_DATA_SHAPE, dtype=np.uint8)
data.append(image_data)
return data
# Initialize CUDA Canny detector
cuda_canny = cv2.cuda.createCannyEdgeDetector(100, 200)
# containers for cuda data
d_src = cv2.cuda_GpuMat()
d_dst = cv2.cuda_GpuMat()
def fn_cpu_canny(img):
prof_cpu.enable()
edges = cv2.Canny(img, 100, 200)
prof_cpu.disable()
def fn_cuda_canny(img):
prof_gpu.enable()
d_src.upload(img)
# detect and obtain results as np array
cuda_canny.detect(d_src, d_dst)
result = d_dst.download()
prof_gpu.disable()
def test_random_and_gather_stats(n_images: int):
"""
1. create n random test images with shape G_DATA_SHAPE
2. run the cpu and cuda canny detector (with upload/download)
3. print the stats of function calls
"""
data = create_data(n_images)
for idx in range(n_images):
img = data[idx]
fn_cpu_canny(img)
fn_cuda_canny(img)
# print stats
print("=" * 40, "CPU", "=" * 40)
prof_cpu.print_stats()
print("=" * 40, "GPU (CUDA)", "=" * 40)
prof_gpu.print_stats()
if __name__ == "__main__":
test_random_and_gather_stats(n_images=NUM_TEST_IMAGES)
print("=" * 40, "Finished", "=" * 40)
Appendix B: Environment information
Board and OS info:
I have an NVIDIA Jetson Nano 4GB board with JetPack version 4.4.1, and the OS is:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal
Python and cv2 versions:
Python 3.8.10
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv2
>>> print(cv2.__version__)
4.5.0
The CUDA-enabled OpenCV 4.5.0 was built on the Jetson using the scripts from the GitHub repo Qengineering/Install-OpenCV-Jetson-Nano; the particular script used is OpenCV-4-5-0.sh.
CUDA device info
>>> print(cv2.cuda.getCudaEnabledDeviceCount())
1
>>> cv2.cuda.printCudaDeviceInfo(cv2.cuda.getDevice())
*** CUDA Device Query (Runtime API) version (CUDART static linking) ***
Device count: 1
Device 0: "NVIDIA Tegra X1"
CUDA Driver Version / Runtime Version 10.20 / 10.20
CUDA Capability Major/Minor version number: 5.3
Total amount of global memory: 3964 MBytes (4156813312 bytes)
GPU Clock Speed: 0.92 GHz
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.20, CUDA Runtime Version = 10.20, NumDevs = 1