OpenCV DNN module is slower on GPU than on CPU (Jetson Nano)


I have been experimenting with the OpenCV dnn module on both CPU and GPU on a Jetson Nano. I made a very similar post on the Nvidia forum (Poor performance of CUDA GPU, using OpenCV DNN module - Jetson Nano - NVIDIA Developer Forums), but I think the topic is more related to OpenCV than to CUDA.

I ran some tests using different super-resolution models. The results are as follows:

  1. EDSR x2: CPU: timeout, GPU: timeout
  2. ESPCN x4: CPU: 0.175 s, GPU: 10.170 s
  3. FSRCNN x4: CPU: 0.128 s, GPU: 5.250 s
  4. LapSRN x4: CPU: 8.098 s, GPU: 6.411 s

One can see that for ESPCN and FSRCNN the execution time on the CPU is much lower than on the GPU, which is counterintuitive. I checked the resource load during execution of both versions: with the CPU version the CPU load was 100%, but with the GPU version the GPU load was only about 20%. It looks like the dnn module doesn’t use the full power of the GPU.

Why is the performance on the GPU so poor compared to the CPU?
Is there a way to decrease the execution time of the GPU version?

Initial GPU setup and moving data between system memory and the GPU take time. In many cases GPU acceleration only shows up when processing large batches of data (e.g. the frames of a video), because the one-time setup cost is then spread over many runs. Modify your test to process the same image 100 times and compare performance.
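A minimal sketch of such a test, in case it helps. It times the first call separately from the following runs, since the first call is where the one-time setup cost lands. The names `benchmark` and `make_upscaler`, the model file name, and the image path are placeholders for illustration, and the GPU path assumes an opencv-contrib build with CUDA support.

```python
import time


def benchmark(fn, runs=100):
    """Return (first_call_seconds, mean_seconds_per_run) for a callable."""
    start = time.perf_counter()
    fn()  # the first call includes one-time setup (context creation, allocations)
    first = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(runs):
        fn()
    mean = (time.perf_counter() - start) / runs
    return first, mean


def make_upscaler(model_path, name, scale, use_cuda):
    """Build an upsample function from the dnn_superres module (hypothetical helper)."""
    import cv2  # imported lazily; needs opencv-contrib (with CUDA for use_cuda=True)

    sr = cv2.dnn_superres.DnnSuperResImpl_create()
    sr.readModel(model_path)
    sr.setModel(name, scale)
    if use_cuda:
        sr.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
        sr.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
    return sr.upsample


# Example usage (paths are placeholders):
#   import cv2
#   img = cv2.imread("input.png")
#   upsample = make_upscaler("ESPCN_x4.pb", "espcn", 4, use_cuda=True)
#   first, mean = benchmark(lambda: upsample(img), runs=100)
#   print(f"first call: {first:.3f} s, mean of later calls: {mean:.3f} s")
```

If the first call is far slower than the mean of the later calls, the single-shot numbers are mostly measuring setup rather than inference.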

You are right. I executed the “upscaling” part of the code in a loop (100 times), and I got much better results on the GPU than on the CPU (about 3x faster).

My goal is to upscale a few images (e.g. 6) with high resolution (e.g. 4000 px x 5000 px) in the shortest possible time (e.g. 1 s per image). How can one achieve such results?
Are there any known methods? For example, splitting one image into several sub-images, upscaling them, and then merging them back into one?
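For reference, the split/upscale/merge idea could be sketched roughly as below. This is only an illustration under assumptions: `plan_tiles` and `upscale_tiled` are made-up names, `upscale` is any function that returns an image exactly `scale` times larger than its input (such as the `upsample` method of a dnn_superres object), and the tile size and padding are arbitrary. Each tile is upscaled with a small border of surrounding pixels and the border is cropped away afterwards, to reduce visible seams. Note that tiling does not reduce the total amount of computation; it mainly keeps the memory needed per call bounded.

```python
def plan_tiles(height, width, tile, pad):
    """Return a list of (core, padded) boxes, each box being (y0, y1, x0, x1).

    Core boxes partition the image; padded boxes add up to `pad` pixels of
    context on each side, clamped to the image borders.
    """
    plans = []
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            y1, x1 = min(y0 + tile, height), min(x0 + tile, width)
            padded = (max(y0 - pad, 0), min(y1 + pad, height),
                      max(x0 - pad, 0), min(x1 + pad, width))
            plans.append(((y0, y1, x0, x1), padded))
    return plans


def upscale_tiled(img, upscale, scale, tile=512, pad=16):
    """Upscale a NumPy image tile by tile and merge the results."""
    import numpy as np  # imported lazily; img is expected to be a NumPy array

    h, w = img.shape[:2]
    out = np.empty((h * scale, w * scale) + img.shape[2:], dtype=img.dtype)
    for (y0, y1, x0, x1), (py0, py1, px0, px1) in plan_tiles(h, w, tile, pad):
        up = upscale(img[py0:py1, px0:px1])
        # Offset of the core tile inside the upscaled padded tile.
        oy, ox = (y0 - py0) * scale, (x0 - px0) * scale
        out[y0 * scale:y1 * scale, x0 * scale:x1 * scale] = \
            up[oy:oy + (y1 - y0) * scale, ox:ox + (x1 - x0) * scale]
    return out


# Example usage (assumes `sr` is a configured dnn_superres object, x4 model):
#   result = upscale_tiled(img, sr.upsample, scale=4, tile=512, pad=16)
```

Whether this gets anywhere near 1 s per image on a Jetson Nano would have to be measured; the sketch only shows the mechanics of splitting and merging.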