Reading Video Signal with CPU vs GPU

I am still new to OpenCV and CV in general. I have some beginner questions regarding the VideoIO with CPU and GPU:

  1. Is it possible to read a video signal directly via the GPU? Does it make sense to do so or is the CPU involved one way or another?

  2. Is my understanding correct that the cv::VideoCapture uses the CPU and the cv::cudacodec::VideoReader uses the GPU?

  3. What are the advantages and disadvantages? Does someone have experience with performance differences?

  4. Is the cudacodec::VideoReader example from the OpenCV GitHub working for you?

Thank you very much for your answers!

Yes, using cv::cudacodec::VideoReader. Whether it makes sense depends on your use case: for non-production code, while you are prototyping, it may make more sense to keep everything on the host rather than the device for debugging and investigation purposes.

The CPU is involved in parsing the video file with FFmpeg and launching the device jobs, but not in the decoding itself, which I guess is what you're asking.

Yes and no. cv::VideoCapture now has hardware GPU acceleration built in, if available on the backend. The downside is that, in my experience, this is much slower than cv::cudacodec::VideoReader, and the decoded frame resides on the host, not on the GPU. Again, depending on your use case, these may not be downsides.
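For reference, this is roughly how you request the built-in hardware acceleration from cv::VideoCapture (a sketch assuming OpenCV >= 4.5.2 with an FFmpeg backend that supports hardware decoding; the file name is a placeholder):

```cpp
#include <opencv2/videoio.hpp>

int main() {
    // Request hardware-accelerated decoding from the FFmpeg backend.
    // VIDEO_ACCELERATION_ANY lets OpenCV pick any available HW decoder;
    // "video.mp4" is a placeholder path.
    cv::VideoCapture cap("video.mp4", cv::CAP_FFMPEG,
                         {cv::CAP_PROP_HW_ACCELERATION,
                          static_cast<int>(cv::VIDEO_ACCELERATION_ANY)});
    cv::Mat frame;  // note: decoded frames still land in host memory
    while (cap.read(frame)) {
        // ... process frame on the CPU ...
    }
    return 0;
}
```

Even with acceleration enabled, the frame ends up in a host-side cv::Mat, which is the key difference from cudacodec::VideoReader.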

As above this very much depends on what you want to do, but in my opinion you have:

  • Advantages

    1. You can decode more streams more quickly.
    2. The decoded frame is on the device; if you are doing further processing in CUDA this saves the upload penalty.
    3. cudacodec::VideoReader also has functionality which is not available in cv::VideoCapture, for example:
    • You can choose the color format of the returned frame (on every call to read if you want) to be either BGR, BGRA or GRAY with very little penalty; the GRAY conversion is free. On the host the frame is always BGR, so if you wanted BGRA, for example, you would have to suffer the conversion overhead of cvtColor after decoding.
    • You can automatically drop frames when ingesting from an RTSP source more slowly than the source's fps, ideal for prototyping when streaming from an IP camera.
      cv::cudacodec::VideoReaderInitParams params;
      params.allowFrameDrop = true;
      cv::Ptr<cv::cudacodec::VideoReader> reader = cv::cudacodec::createVideoReader(inputFile, {}, params);
    • You can read both the raw bitstream and the decoded frame at the same time, ideal if you want to store the exact video which you processed for future investigation, rather than a re-encode which may lead to different results.
      cv::cudacodec::VideoReaderInitParams params;
      params.rawMode = true;
      cv::Ptr<cv::cudacodec::VideoReader> reader = cv::cudacodec::createVideoReader(inputFile, {}, params);
  • Disadvantages

    1. The decoded frame resides on the device; if all your processing is on the host, you may lose the performance benefit of decoding on the device when you download the frame back to the host.
    2. You have to build OpenCV yourself with CUDA and install the Nvidia Video Codec SDK, meaning you can't use the pre-compiled binaries from OpenCV or the pre-built python modules.
    3. It doesn’t handle as many codecs.
    4. You have to use the FFmpeg VideoCapture backend, unless you provide your own parser by inheriting from cudacodec::RawVideoSource and using
      Ptr<VideoReader> createVideoReader(const Ptr<RawVideoSource>& source, const VideoReaderInitParams params = VideoReaderInitParams());
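Putting the snippets above together, a minimal decode loop might look like this (a sketch assuming an OpenCV build with CUDA and the Nvidia Video Codec SDK; the file name is a placeholder):

```cpp
#include <opencv2/cudacodec.hpp>

int main() {
    cv::cudacodec::VideoReaderInitParams params;
    params.allowFrameDrop = true;  // useful when reading from a live RTSP source
    cv::Ptr<cv::cudacodec::VideoReader> reader =
        cv::cudacodec::createVideoReader("in.mp4", {}, params);  // placeholder file

    // Ask for single-channel output; the GRAY conversion is free.
    reader->set(cv::cudacodec::ColorFormat::GRAY);

    cv::cuda::GpuMat frame;  // frames stay in device memory
    while (reader->nextFrame(frame)) {
        // ... further CUDA processing on frame, no host upload needed ...
    }
    return 0;
}
```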

Yes. I need to re-write this example due to some changes in the way cudacodec::VideoReader works, but it gives you an idea of the decoding speed in the table at the bottom.

It's worth noting that the decoding performance of VideoCapture is completely dependent on the model of CPU: some low-end CPUs may only be able to decode four 1080p@25fps h265 streams at the same time before reaching 100% utilization, and even some older-generation high-end models may struggle to decode ten streams simultaneously. That said, all modern Nvidia cards have roughly the same decode performance, because the decoding is performed on a dedicated hardware chip which sees only minor improvements each generation. Therefore a GTX 1050 should have the same performance as a 1080 Ti and almost the same performance as an RTX 3090.

An example of the performance difference from my own application, where I decode a number of 1080p@25fps h265 streams with either VideoCapture and the FFmpeg backend or cudacodec::VideoReader, on a laptop with an i7-12700H CPU and an RTX 3070 Ti: using VideoCapture I can decode 18 streams before reaching 100% CPU utilization, by which time my application doesn't function because all the resources are used up on decoding. With cudacodec::VideoReader, however, I can easily decode 60 streams without reaching 100% decoding capacity, using only 11% of the CUDA compute capacity (for color conversion) and 25% of the CPU. So as well as decoding more streams, cudacodec::VideoReader also leaves both CUDA and CPU resources available for the rest of my application.

I have personally never tried it; that said, I can confirm the cudacodec C++ and python tests all work and demonstrate most of its capabilities.