Does grab() with OpenCV's CUDA VideoReader not contribute to any efficiency improvement?

I am using the GPU version of OpenCV to decode video; the code is below. The subsequent neural-network recognition step has not been added yet. Since I do not need frame-by-frame recognition, the method has an FPS parameter that specifies how many frames per second of the video should be passed to recognition. If it is not specified, it defaults to the video's frame rate. Typically I set it to 1, meaning that recognizing one image per second is sufficient.

While decoding is running, GPU utilization is only about 3%. I am not sure whether multithreading would be reliable in this situation, so I am wondering whether there is a way to make fuller use of the GPU. That said, even at the current speed, if the grab function could be used effectively it would meet my requirements.

The video lasts a total of 300 seconds = 7500 frames.

When I do not set the FPS, the video is decoded frame by frame and the efficiency is ideal (GPU memory usage is 194 MB, GPU utilization is 3%, and CPU usage is under one core; when run on the CPU, decoding uses almost all 32 cores):

count1= 7499
GPU execution time: 9.850459575653076
count1= 7500
CPU execution time: 14.664050340652466

However, when I set the FPS to 5, the CPU efficiency improves significantly because only one frame out of every five is decoded, but the GPU efficiency barely changes.

count1= 1499
GPU execution time: 9.862346649169922
count1= 1500
CPU execution time: 4.066887140274048

When the FPS is set to 1 (i.e., only one of the video's 25 frames per second is decoded), the CPU decoding efficiency improves further, while the GPU time remains unchanged.

count1= 300
GPU execution time: 9.844478368759155
count1= 300
CPU execution time: 2.5899405479431152

Here is the code snippet:

import cv2
import numpy as np
import time

# Read video using CPU
def read_video_cpu(video_path, fps=None):
    cap = cv2.VideoCapture(video_path)
    cap_fps = cap.get(cv2.CAP_PROP_FPS)
    if fps is None:
        fps = cap_fps
    count1 = 0
    ret = cap.grab()
    was_read, img = cap.retrieve()
    while True:
        # grab (decode without retrieving) frames until the next sampling point
        for i in range(int((count1 + 1) * cap_fps / fps) - int(count1 * cap_fps / fps)):
            if cap.get(cv2.CAP_PROP_POS_MSEC) >= (count1 + 1) * 1000 - 18:
                break
            cap.grab()
        ret, img = cap.retrieve()
        count1 += 1
        if not ret or img is None:
            break
    print("count1=", count1)
    cap.release()

# Read video using GPU
def read_video_gpu(video_path, fps=None):
    if not cv2.cuda.getCudaEnabledDeviceCount():
        print("CUDA is not available. Please make sure CUDA drivers are installed.")
        return
    cap = cv2.cudacodec.createVideoReader(video_path)
    cap_fps = cap.get(cv2.CAP_PROP_FPS)[1]
    if fps is None:
        fps = cap_fps
    count1 = 0
    ret = cap.grab()
    was_read, img = cap.retrieve()
    while True:
        # same skip logic with the GPU reader: grab() until the next sampling point
        for i in range(int((count1 + 1) * cap_fps / fps) - int(count1 * cap_fps / fps)):
            if cap.get(cv2.CAP_PROP_POS_MSEC)[1] >= (count1 + 1) * 1000 - 18:
                break
            ret = cap.grab()
        if not ret:
            break
        ret, img = cap.retrieve()
        count1 += 1
        if not ret or img is None:
            break
    print("count1=", count1)
    # cap.release()

video_path = "/root/1.mp4"
fps = None
time1 = time.time()
read_video_gpu(video_path, fps)
print('GPU execution time:', time.time() - time1)
time1 = time.time()
read_video_cpu(video_path, fps)
print('CPU execution time:', time.time() - time1)

I suspect you are observing the CUDA (compute) utilization, not the decode utilization.


VideoReader.grab() does not work the same way on the GPU. It is exactly the same as nextFrame() except that it doesn't return the frame, so the processing utilization is the same. Therefore you should avoid any logic like the below and use nextFrame() instead.

for i in range(int((count1 + 1) * cap_fps / fps) - int(count1 * cap_fps / fps)):
    if cap.get(cv2.CAP_PROP_POS_MSEC)[1] >= (count1 + 1) * 1000 - 18:
        break
    ret = cap.grab()
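
For the one-frame-per-second case, something like the following rough sketch (assuming a constant source frame rate, read the same way as in your code) gives the same sampling with nextFrame() alone. Note that the decoder still decodes every frame; the sampling only decides which frames are passed on to recognition:

import cv2

# Sketch: decode with nextFrame() and keep one frame per second. The decode
# work is the same either way; skipping only avoids the downstream processing.
def read_video_gpu_sampled(video_path, fps=1):
    reader = cv2.cudacodec.createVideoReader(video_path)
    cap_fps = reader.get(cv2.CAP_PROP_FPS)[1]
    step = max(int(round(cap_fps / fps)), 1)   # e.g. 25 for a 25 fps source and fps=1
    idx = kept = 0
    while True:
        ret, gpu_frame = reader.nextFrame()    # frame stays on the GPU as a GpuMat
        if not ret:
            break
        if idx % step == 0:
            kept += 1                          # recognition would run on gpu_frame here
        idx += 1
    print("frames kept:", kept)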

I'm not sure what hardware you are using, but to fully saturate the decoder you may need to increase the number of decode surfaces. Try:

params = cv2.cudacodec.VideoReaderInitParams()
params.minNumDecodeSurfaces=10
cap = cv2.cudacodec.createVideoReader(video_path, params=params)

The ideal number of surfaces depends on the workload and the GPU, and more surfaces will require more GPU memory; see the excerpt below, taken from the Nvidia Video Codec SDK docs (a quick timing sweep, sketched after the quote, is one way to find the right balance):

The following steps should be followed for optimizing video memory usage:

  1. Make CUVIDDECODECREATEINFO::ulNumDecodeSurfaces = CUVIDEOFORMAT::min_num_decode_surfaces. This will ensure that the underlying driver allocates minimum number of decode surfaces to correctly decode the sequence. In case there is reduction in decoder performance, clients can slightly increase CUVIDDECODECREATEINFO::ulNumDecodeSurfaces. It is therefore recommended to choose the optimal value of CUVIDDECODECREATEINFO::ulNumDecodeSurfaces to ensure right balance between decoder throughput and memory consumption.
  2. CUVIDDECODECREATEINFO::ulNumOutputSurfaces should be decided optimally after due experimentation for balancing decoder throughput and memory consumption.
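
As a rough way to do the "due experimentation" the docs mention, you could time a full nextFrame() pass for a few candidate surface counts, e.g.:

import time
import cv2

# Sketch: time a complete decode pass for several minNumDecodeSurfaces values
# to find the balance between decoder throughput and GPU memory use.
def time_decode(video_path, num_surfaces):
    params = cv2.cudacodec.VideoReaderInitParams()
    params.minNumDecodeSurfaces = num_surfaces
    reader = cv2.cudacodec.createVideoReader(video_path, params=params)
    start = time.time()
    while reader.nextFrame()[0]:   # decode until the stream ends
        pass
    return time.time() - start

for n in (4, 10, 20, 50):
    print(n, "surfaces:", time_decode("/root/1.mp4", n))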

I am using RTX A4000 hardware with compute capability 8.6. After I made the code changes, the time remained unchanged; with params.minNumDecodeSurfaces = 10, GPU execution time: 9.818438

My ultimate goal is to detect objects in each second of video using YOLOv8. After deploying the model with TensorRT, single-image prediction is much faster than CPU decoding with OpenCV, so I hope to improve decoding efficiency by using OpenCV on the GPU. This is just the first step; later I will also need to feed the OpenCV GPU frames directly into PyCUDA, and my TensorRT model can also use larger batch sizes to improve prediction performance.
If grab() on the GPU cannot achieve the same effect as on the CPU, does that mean I am wasting a lot of decoding performance?

It means that you are not doing anything different from simply calling nextFrame(), so the grab()/retrieve() combination is unnecessary and redundant.

It's possible, for the frame size and CPU you are using, that CPU decoding is faster. I would check what the GPU decoder utilization is. If it's less than 100%, try increasing minNumDecodeSurfaces further, or just see if

params.minNumDecodeSurfaces=100

makes a difference.

You should also be able to improve the CPU timings by passing the existing img into retrieve() inside the loop to avoid a reallocation on every iteration, i.e.

...
            cap.grab()
        ret, _ = cap.retrieve(img)
...
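
Filled in, a minimal self-contained version of that CPU loop with a reused buffer might look like:

import cv2

# Sketch: reuse one preallocated buffer for every retrieve() call instead of
# letting OpenCV allocate a new array per frame.
cap = cv2.VideoCapture("/root/1.mp4")
ret, img = cap.read()            # first read() allocates the buffer that is reused below
count = 1
while True:
    if not cap.grab():           # decode the next frame without copying it out
        break
    ret, img = cap.retrieve(img) # write the decoded frame into the existing buffer
    if not ret:
        break
    count += 1
print("frames read:", count)
cap.release()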