Upper bound Frames Per Second read in VideoCapture()?

Stephan_D · April 9, 2024, 12:30am

Hi, I am seeking to do some performance benchmarking. I am working on a host that has four Nvidia T4 (Tesla) GPUs, on which I have successfully built FFmpeg with CUDA enabled. I also have a control host with no GPUs, running the out-of-the-box ffmpeg of Ubuntu 20.04.

In both cases I read “pro-res” video (MPEG-2 formatted, 25 FPS, 1920x1080 resolution) through a loop like this:

import cv2
import datetime

cap = cv2.VideoCapture('my_input_video.m2v', cv2.CAP_FFMPEG)
fn = 0
start_time = datetime.datetime.now()
first_start = start_time  # start of the current 1000-frame window
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:  # end of stream or read failure
        break
    fn += 1
    if fn == 10000:
        break
    if fn % 1000 == 0:
        curr_time = datetime.datetime.now()
        # total_seconds() keeps sub-second precision; .seconds is whole seconds only
        lapsed_seconds = (curr_time - first_start).total_seconds()
        first_start = curr_time
        fps = 1000.0 / lapsed_seconds
        print(f"processed {fn} frames at a rate of {fps} frames per second")
cap.release()
end_time = datetime.datetime.now()
total_lapsed_seconds = (end_time - start_time).total_seconds()
print(f"processed {fn} frames at an average of {fn / total_lapsed_seconds} frames per second")

In both the GPU and CPU cases, I typically get a read rate of 500 FPS. I can inspect individual frames and they are exactly as expected, so I know the reading is working. But I wonder whether there is a buffer, either in OpenCV or in the libffmpeg bindings, that rate-limits reads to a maximum of 500 FPS. Does anyone know if such a buffer exists? And if it does, where in the source code could I modify it? I suspect I am not seeing the true maximum number of frames that can be ingested per second.
Your thoughts and input are greatly appreciated.
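One thing worth ruling out before hunting for a buffer: `datetime.timedelta.seconds` is an integer field (whole seconds only), so if a timing loop divides by `.seconds`, every 1000-frame window that takes anywhere from 2.0 s to just under 3.0 s is reported as exactly 1000 / 2 = 500 FPS. The sketch below (the helper name `window_fps` is hypothetical) compares the truncated and the full-precision rate for a few window lengths.

```python
import datetime


def window_fps(frames: int, elapsed: datetime.timedelta) -> tuple[float, float]:
    """Return (fps from .seconds, fps from .total_seconds()) for one timing window."""
    truncated = frames / elapsed.seconds        # .seconds drops the fractional part
    precise = frames / elapsed.total_seconds()  # .total_seconds() keeps microseconds
    return truncated, precise


if __name__ == "__main__":
    for ms in (2100, 2500, 2900):
        t, p = window_fps(1000, datetime.timedelta(milliseconds=ms))
        print(f"{ms} ms window -> .seconds: {t:.0f} FPS, total_seconds(): {p:.0f} FPS")
```

All three windows print 500 FPS for `.seconds`, while the precise rate ranges from about 476 down to about 345 FPS. A window shorter than one second would also make `.seconds` zero and raise `ZeroDivisionError`.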

Stephan_D · April 9, 2024, 3:52am

I have a possible bead on this…
in opencv_contrib-4.9.0/modules/cudacodec, the source file src/video_reader.cpp (see the blame view of opencv_contrib/modules/cudacodec/src/video_reader.cpp at 4.9.0 on GitHub) invokes Thread::sleep(1). If I read the code correctly, this forces a sleep of at least 1 ms (system overhead/context switching might push it past 1 ms). I wanted to file an issue at the git repo, but their guidance says to ask questions here and file reproducible issues on GitHub. I can do the reproducible part by yanking out the Thread::sleep(1) call and seeing what happens (does the CPU/GPU run hot due to a super tight loop?), or by faking it with Thread::sleep(0), which still forces the context switch of Thread::sleep but essentially asks for resumption as quickly as possible. So the question I hope @cudawarped will see is: is this Thread::sleep() call necessary? Could the frame queue emit events instead?

crackwitz · April 9, 2024, 8:08am

that is called at the end of the loop, only when the loop body was unable to get data.

as long as there’s something in the queue, it returns that immediately. that means it could run faster than 500 or 1000 fps.

you can experiment by removing that line and rebuilding. do you see a difference?

your 500 fps hypothesis can be tested with different hardware.

computers don’t “run hot” like a car might. nothing can break. there is no reason to avoid using the resources fully. that is a beginner’s worry and should quickly be discarded.

as for whether the sleep is necessary: not strictly, but without it the loop wastes resources, just spinning on a queue that simply doesn’t have anything to give yet.

this is polling.

ideally, there should be a blocking mechanism. maybe there is and whoever wrote this didn’t use it, or it wasn’t yet available, or this is the best that can be done.
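The difference between polling and a blocking mechanism can be sketched with Python's standard-library `queue` module. This is only an analogy for the pattern, not the cudacodec code: the hypothetical `poll_get` mimics a loop that checks the queue and sleeps 1 ms when it is empty, while the hypothetical `blocking_get` waits inside `queue.Queue.get()`, which sleeps on a condition variable and is woken when a producer calls `put()`.

```python
import queue
import threading
import time


def poll_get(q: "queue.Queue[int]", sleep_s: float = 0.001) -> int:
    """Polling: try a non-blocking get, sleep briefly whenever the queue is empty."""
    while True:
        try:
            return q.get_nowait()
        except queue.Empty:
            time.sleep(sleep_s)  # analogous to Thread::sleep(1) in the C++ loop


def blocking_get(q: "queue.Queue[int]") -> int:
    """Blocking: wait inside get() until a producer puts an item."""
    return q.get()


def run(consume, n: int = 5) -> list[int]:
    """Feed n items from a producer thread and collect them with `consume`."""
    q: "queue.Queue[int]" = queue.Queue(maxsize=2)

    def producer() -> None:
        for i in range(n):
            time.sleep(0.002)  # simulate a frame becoming available later
            q.put(i)           # blocks while the queue is full

    t = threading.Thread(target=producer)
    t.start()
    items = [consume(q) for _ in range(n)]
    t.join()
    return items


if __name__ == "__main__":
    print(run(poll_get), run(blocking_get))
```

Both consumers return the same items in the same order; the polling version can add up to roughly one sleep interval of latency each time it finds the queue empty, while the blocking version does no work at all until an item arrives.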

Stephan_D · April 9, 2024, 4:35pm

Thank you @crackwitz for the commentary. I did run two experiments:
a) I set Thread::sleep(0), which in theory forces the context switch
b) I commented out the Thread::sleep(1) line (2 occurrences in the code).
In both cases, I am still capped at 500 FPS.
The input video is encoded at 50kbps, 25 FPS, 1920x1080 resolution, and has ~68,000 frames.
If I transcode that video to mp4 or hevc (i.e., fewer total bytes to read), I still get the upper bound of 500 FPS.
I will go dig to see if I can find the source of the current max rate.
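One way to narrow down where a ceiling like this comes from is to time `cap.grab()` on its own and compare it with `cap.read()`, using `time.perf_counter()` instead of wall-clock `datetime`. This is a starting-point sketch, not a definitive benchmark: the helper `measure_fps` is hypothetical, and it assumes (as I understand the FFmpeg backend) that `grab()` demuxes and decodes while `retrieve()`, called by `read()`, does the conversion to a BGR array.

```python
import sys
import time
from typing import Callable


def measure_fps(read_one: Callable[[], bool], max_frames: int = 10_000) -> tuple[int, float]:
    """Call read_one() until it returns False or max_frames is reached.

    Returns (frames_read, frames_per_second) measured with a monotonic clock.
    """
    frames = 0
    start = time.perf_counter()
    while frames < max_frames and read_one():
        frames += 1
    elapsed = time.perf_counter() - start
    fps = frames / elapsed if elapsed > 0 else float("inf")
    return frames, fps


if __name__ == "__main__" and len(sys.argv) > 1:
    import cv2  # only needed when benchmarking a real file

    path = sys.argv[1]

    # grab(): demux + decode, no BGR numpy array produced (assumed)
    cap = cv2.VideoCapture(path, cv2.CAP_FFMPEG)
    print("grab():", measure_fps(cap.grab))
    cap.release()

    # read(): grab() followed by retrieve(), i.e. decode + BGR conversion
    cap = cv2.VideoCapture(path, cv2.CAP_FFMPEG)
    print("read():", measure_fps(lambda: cap.read()[0]))
    cap.release()
```

Run it with the video path as the only argument. If `grab()` alone is much faster than `read()`, the per-frame conversion and copy would be a more likely suspect than a hidden rate-limiting buffer.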

cudawarped · April 9, 2024, 6:29pm

Stephan_D:

I have a possible bead on this…
in opencv_contrib-4.9.0/modules/cudacodec the source file […] video_reader.cpp […] invokes Thread::sleep(1). […] is this Thread::sleep() call necessary? Could the frame queue emit events instead?

If you are using cv::VideoCapture then this will not have any impact. The module you are referring to (cv::cudacodec::VideoReader) is part of the contrib repo and has “nothing” to do with cv::VideoCapture from the main repo (it only uses it to demux the video).

Regarding cv::cudacodec::VideoReader, if you want maximum performance you should increase the number of decode surfaces in use. This has the side effect that the sleep is unlikely to be called, because there will always be a surface available unless you are consuming the decoded frames too slowly.
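In Python, the decode-surface count is set through `cv2.cudacodec.VideoReaderInitParams`, via its `minNumDecodeSurfaces` field as I understand the 4.9 bindings. A minimal sketch, assuming an OpenCV build that includes the cudacodec module and a `nextFrame()` that returns `(ok, GpuMat)`. The value 10 is an arbitrary starting point to tune by experiment, and the helper names `open_cuda_reader` and `count_frames` are hypothetical. The counting loop only relies on `nextFrame()`, so it can be checked without a GPU.

```python
import sys
import time


def open_cuda_reader(path: str, min_surfaces: int = 10):
    """Create a cudacodec VideoReader with extra decode surfaces (needs a CUDA build)."""
    import cv2  # imported lazily so count_frames() works without OpenCV installed

    params = cv2.cudacodec.VideoReaderInitParams()
    # NVDEC chooses its own minimum; more surfaces make it less likely that the
    # reader has to wait (and sleep) for a free one.
    params.minNumDecodeSurfaces = min_surfaces
    return cv2.cudacodec.createVideoReader(path, params=params)


def count_frames(reader, max_frames: int = 10_000) -> tuple[int, float]:
    """Pull frames via reader.nextFrame() until it fails; return (frames, fps)."""
    frames = 0
    start = time.perf_counter()
    while frames < max_frames:
        ok, _frame = reader.nextFrame()
        if not ok:
            break
        frames += 1
    elapsed = time.perf_counter() - start
    return frames, (frames / elapsed if elapsed > 0 else float("inf"))


if __name__ == "__main__" and len(sys.argv) > 1:
    print(count_frames(open_cuda_reader(sys.argv[1])))
```

Decoded frames stay on the GPU as `GpuMat`s here. Downloading each one to host memory would add its own cost to the measurement.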

Just out of interest why do you need such a high frame rate?

Stephan_D · April 9, 2024, 9:12pm

Thanks @cudawarped. I cannot go into great detail about why we need such a high frame ingest rate. Suffice it to say that we rent AWS hardware by the hour/minute and have some large-scale computer vision tasks at hand. The quicker we get it done, the less we pay.

Ingest Frames → Do Computer Vision tasks → store data is the nature of the pipeline. Each bit needs to work as quickly as possible.
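For a pipeline shaped like this, one common way to keep every stage busy is to run each stage in its own thread with small bounded queues between them, so reading frame n+1 overlaps the processing of frame n. A minimal standard-library sketch: `run_pipeline` is hypothetical, `analyze` and `store` are placeholders for the real stages, and there is no error handling (an exception in a stage would leave the other threads waiting). In CPython this only pays off when the stages release the GIL, which OpenCV calls generally do.

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(frames: Iterable, analyze: Callable, store: Callable, depth: int = 8) -> None:
    """Ingest -> analyze -> store, with ingest and analyze in their own threads."""
    to_cv: queue.Queue = queue.Queue(maxsize=depth)
    to_store: queue.Queue = queue.Queue(maxsize=depth)

    def ingest() -> None:
        for f in frames:   # e.g. frames pulled from a video reader
            to_cv.put(f)   # blocks when the CV stage falls behind
        to_cv.put(_DONE)

    def cv_stage() -> None:
        while (f := to_cv.get()) is not _DONE:
            to_store.put(analyze(f))
        to_store.put(_DONE)

    threads = [threading.Thread(target=ingest), threading.Thread(target=cv_stage)]
    for t in threads:
        t.start()
    while (r := to_store.get()) is not _DONE:  # the store stage runs in the caller
        store(r)
    for t in threads:
        t.join()


if __name__ == "__main__":
    results: List[int] = []
    run_pipeline(range(10), analyze=lambda x: x * x, store=results.append)
    print(results)
```

With one thread per stage and FIFO queues, results come out in input order. The `depth` bound keeps a fast reader from filling memory when the CV stage is the bottleneck.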

Separately, yes, I figured out this morning that I was barking up the wrong tree re cv2.VideoCapture. I am now wrestling with

import cv2
import datetime
import os
reader = cv2.cudacodec.createVideoReader('my_mpeg2_video.m2v')

which is MPEG2 encoded. From the CLI, with my CUDA-enabled ffmpeg, I have no trouble getting fast (but not yet at desired speeds) hardware-accelerated video ingest/frame extraction. But for the snippet above, I am getting the following error:

[ WARN:0@3608.496] global ffmpeg_video_source.cpp:148 FourccToChromaFormat ChromaFormat not recognized: 0x42323459 (Y42B).
Assuming I420
[ERROR:2@3608.508] global video_parser.cpp:86 parseVideoData OpenCV(4.9.0)
.../opencv_contrib-4.9.0/modules/cudacodec/src/video_decoder.cpp:127:
error: (-210:Unsupported format or combination of formats) Video source is not
supported by hardware video decoder refer to Nvidias
GPU Support Matrix to confirm your GPU supports hardware decoding of the video
sources codec. in function 'create'

My ffmpeg shows:

me@myhost:/databricks/python3/lib/python3.11/site-packages/cv2/cudacodec# ffmpeg -codecs | grep mpeg2
ffmpeg version n6.1.1-34-g5e45c27ba9-0ubuntu0.22.04.1 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.4.0-1ubuntu1~22.04)
  configuration: --prefix=/usr/local --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=x86_64 --enable-gpl --enable-libvpl --enable-stripping --enable-gnutls --enable-shared --enable-nonfree --enable-cuda-nvcc --enable-libnpp --enable-cuvid --enable-nvdec --enable-nvenc --enable-cuda-llvm --enable-ffnvcodec --nvccflags='-gencode arch=compute_75,code=sm_75 -O2' --extra-cflags='-I/usr/local/cuda/include -I/usr/local/include -I/usr/include' --extra-ldflags='-L/usr/local/cuda/lib64 -L/usr/local/lib -L/usr/lib/linux-x86_64 -L/usr/lib64 -L/usr/lib' --disable-static
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    60. 16.100 / 60. 16.100
  libavdevice    60.  3.100 / 60.  3.100
  libavfilter     9. 12.100 /  9. 12.100
  libswscale      7.  5.100 /  7.  5.100
  libswresample   4. 12.100 /  4. 12.100
  libpostproc    57.  3.100 / 57.  3.100
 DEV.L. mpeg2video           MPEG-2 video (decoders: mpeg2video mpegvideo mpeg2_v4l2m2m mpeg2_qsv mpeg2_cuvid) (encoders: mpeg2video mpeg2_qsv)

And the NVIDIA Video Encode and Decode GPU Support Matrix (under the Data Center tab) clearly states that the T4 supports hardware-accelerated MPEG2 decoding.

How would I pass in the property to tell it to assume MPEG2?

Thanks in advance!!!
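The hex value in the warning above is a FOURCC code: four ASCII characters packed into a 32-bit integer, first character in the least-significant byte. Decoding it shows what the FFmpeg video source saw: 0x42323459 spells Y42B, a planar YUV 4:2:2 layout, which it did not map to a chroma format and so treated as I420 (4:2:0). A small sketch of the conversion in both directions; both helper names are hypothetical.

```python
def fourcc_to_str(code: int) -> str:
    """Unpack a FOURCC integer (first character in the low byte) into 4 characters."""
    return "".join(chr((code >> (8 * i)) & 0xFF) for i in range(4))


def str_to_fourcc(text: str) -> int:
    """Pack 4 ASCII characters into a FOURCC integer, first character in the low byte."""
    if len(text) != 4:
        raise ValueError("FOURCC must be exactly 4 characters")
    return sum(ord(c) << (8 * i) for i, c in enumerate(text))


if __name__ == "__main__":
    print(fourcc_to_str(0x42323459))   # value from the warning -> Y42B
    print(hex(str_to_fourcc("I420")))  # format it fell back to -> 0x30323449
```

OpenCV's `cv2.VideoWriter_fourcc(*"I420")` performs the same packing when you need a code for `VideoWriter`.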

cudawarped · April 10, 2024, 4:30am

Can you share a sample of the video?

Stephan_D · April 10, 2024, 6:19am

Alas no. I work for a media company that would happily ask me to work elsewhere if I did.

I was looking at the cudacodec source code for the FFmpeg video source; I need to work out which params and init params I can pass. I think if I can initialize the VideoReader constructor with the codec set to whatever translates to MPEG2, then it would not default to the I420 format.

I am away from my work computer at the moment, but I can certainly share the output of ffprobe on the video.

Thank you for your ongoing help.

cudawarped · April 10, 2024, 8:42am

I understand. If you have any issues in the future I would appreciate a sample with the same encoding, e.g.

ffmpeg -i big_buck_bunny.h264 -pix_fmt yuv422p big_buck_bunny.m2v

Anyway, I’ve had a look, and even though I can’t find it documented anywhere, it seems like the hardware decoder does not support MPEG-2 with 4:2:2 chroma subsampling; search the Nvidia developer forum for mpeg2 422 for the relevant threads.

The error you were getting (video_decoder.cpp:127, “Video source is not supported by hardware video decoder refer to Nvidias GPU Support Matrix…”) is a result of querying the decoding support offered on your GPU. I included the support matrix in that error message because I “assumed” it would contain all the details needed, but it appears not.

Anyway, it looks like you need a 4:2:0 source, and even then I would guess that the decoding may be performed using CUDA rather than the hardware decoding unit, so it may be slow in addition to eating up your CUDA resources.
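Given that, a quick pre-check can route each file before opening it: ask ffprobe for the codec and pixel format of the first video stream, and only hand 4:2:0 sources to cudacodec. A minimal sketch; the function names and the set of pixel formats treated as 4:2:0 are my assumptions, while the ffprobe options (`-select_streams`, `-show_entries`, `-of json`) are standard.

```python
import json
import subprocess

# Pixel formats treated as 4:2:0 here (assumption: 8-bit formats only).
YUV420_FORMATS = {"yuv420p", "yuvj420p", "nv12"}


def parse_probe(output: str) -> tuple[str, str]:
    """Extract (codec_name, pix_fmt) of the first video stream from ffprobe JSON."""
    stream = json.loads(output)["streams"][0]
    return stream["codec_name"], stream["pix_fmt"]


def probe(path: str) -> tuple[str, str]:
    """Run ffprobe on the first video stream of `path`."""
    cmd = [
        "ffprobe", "-v", "error", "-select_streams", "v:0",
        "-show_entries", "stream=codec_name,pix_fmt", "-of", "json", path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_probe(result.stdout)


def use_cudacodec(pix_fmt: str) -> bool:
    """Route only 4:2:0 sources to the hardware decoder; e.g. yuv422p falls back."""
    return pix_fmt in YUV420_FORMATS
```

For example, `codec, pix_fmt = probe('my_mpeg2_video.m2v')` followed by `cv2.cudacodec.createVideoReader` when `use_cudacodec(pix_fmt)` is true, and `cv2.VideoCapture` otherwise. As noted above, even a 4:2:0 MPEG-2 source may end up decoded with CUDA rather than on the hardware decoding unit, so it is worth benchmarking both paths.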
