Reading and Writing Videos: Python on GPU with CUDA - VideoCapture and VideoWriter

Hi, what processing are you doing on the GPU, and what resolutionion is your video? I ask because it doesn’t sound like decoding is your bottleneck. If it is, and you can do the processing on the CPU, have an Intel CPU, and your file is in a raw format (e.g. .h264, .h265, …), then I would also recommend the Intel Quick Sync HW decoder, as suggested by ChaoticActivity, i.e.

cap = cv2.VideoCapture(VID_PATH, cv2.CAP_INTEL_MFX)

On the latest Intel CPUs this should be as quick as the Nvidia HW decoder, and it may also be much quicker at writing than FFmpeg etc.

If you have to use the GPU I would recommend pre-allocating your GpuMat arrays and passing them to your functions as the dst argument where possible. The same idea applies to downloading: pre-allocate the host array once and pass it to download(), i.e. use

frame = np.empty((n, m, 4), np.uint8)  # host buffer, allocated once
gpu_frame.download(frame)

instead of

frame = gpu_frame.download()

because the latter forces memory to be allocated for frame on every invocation, which slows things down significantly.

To “patch” cudacodec to work with Nvidia Video Codec SDK 11.0 you need to add the AV1 codec, see this.

I don’t have an example of using threads with Python to hand, but you would need to split your video into separate files, one per thread, so you would need to check whether this approach is suitable for the processing you are doing.

To use pinned memory and streams in Python see here. Streams would give you the ability to decode frame 0 on the GPU, perform your processing and start the download in stream 0; then move to decoding frame 1, processing and downloading it in stream 1; then start writing frame 0. In this scenario the downloading and writing of frame 0 can happen at the same time as the decoding or processing of frame 1 on the GPU. Essentially, streams let you perform processing on the CPU and GPU at the same time and hide the overhead of downloading from the device to the host, if you optimize your code in the right way. Unfortunately this will very much depend on exactly what you are doing, and the above may not be optimal for you; for example, you may have the resources available to perform both the decode and the encode on the CPU because the time spent processing on the GPU is so large.