Hi, I am seeking to do some performance benchmarking. I am working on a host that has four Nvidia T4 (Tesla) GPUs. I have successfully built FFmpeg with CUDA enabled, and I have a control host with no GPUs, running the out-of-the-box ffmpeg of Ubuntu 20.04.
In both cases, when reading “pro-res” video (MPEG-2 formatted, 25 FPS, 1920x1080 resolution) through a loop of
import datetime

import cv2

cap = cv2.VideoCapture('my_input_video.m2v', cv2.CAP_FFMPEG)
fn = 1
start_time = datetime.datetime.now()
first_start = start_time  # start of the current 1000-frame interval
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:  # stop at end of stream
        break
    fn += 1
    if fn == 10000:
        break
    if (fn % 1000) == 0:
        curr_time = datetime.datetime.now()
        # total_seconds() instead of .seconds, which truncates to whole seconds
        lapsed_seconds = (curr_time - first_start).total_seconds()
        first_start = curr_time
        fps = 1000.0 / lapsed_seconds
        print(f"processed {fn} frames at a rate of {fps} frames per second")
cap.release()
end_time = datetime.datetime.now()
total_lapsed_seconds = (end_time - start_time).total_seconds()
print(f"processed {fn} frames overall at a rate of {fn / total_lapsed_seconds} frames per second")
In both the GPU and CPU cases, I typically get a 500 FPS read rate. I can inspect individual frames and they are exactly as expected, so I know the reading is working. But I wonder if there is a buffer, either in OpenCV or in the libffmpeg bindings, that rate-limits to a max of 500 FPS. Does anyone know if such a buffer exists? And if it does, where in the source code could I tweak it? I suspect I am not seeing the true maximum number of frames that can be ingested per second.
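One way to narrow this down (a minimal sketch, same test file as above): time cap.grab() alone, which decodes but skips the retrieve/copy step, and see whether the ceiling moves.

import time

import cv2

# Sketch: time cap.grab() alone (decode, no retrieve/copy) to see whether
# the ~500 FPS ceiling is in the decode path or in the frame hand-off.
cap = cv2.VideoCapture('my_input_video.m2v', cv2.CAP_FFMPEG)
n = 0
t0 = time.perf_counter()
while n < 10000 and cap.grab():
    n += 1
elapsed = time.perf_counter() - t0
cap.release()
print(f"grab-only: {n / elapsed:.1f} frames per second")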
Your thoughts and input are greatly appreciated.
I have a possible bead on this…
in opencv_contrib-4.9.0/modules/cudacodec, the source file opencv_contrib/modules/cudacodec/src/video_reader.cpp (at tag 4.9.0 on GitHub) invokes Thread::sleep(1). If I read the code correctly, this forces a sleep of at least 1 ms (system overhead/context switching might net out to more than 1 ms). I wanted to file an issue at the git repo, but their guidance says to ask questions here and file reproducible issues on GitHub. I can do the reproducible part by yanking out the Thread::sleep(1) call and seeing what happens (CPU/GPU runs hot due to a super-tight loop?), or by faking it with a call to Thread::sleep(0), which still forces the context switch of Thread::sleep but essentially asks for resumption as quickly as possible. So the question I ask here, which I hope @cudawarped will see, is: is this Thread::sleep() call necessary? Could the frame queue emit events instead?
that call happens last in a loop iteration, when the loop body was unable to get data.
as long as there’s something in the queue, it returns that immediately. that means it could run faster than 500 or 1000 fps.
you can experiment by removing that line and rebuilding. do you see a difference?
your 500 fps hypothesis can be tested with different hardware.
computers don’t “run hot” like a car might. nothing can break. there is no reason to avoid using the resources fully. that is a beginner’s worry and should quickly be discarded.
not strictly but it wastes resources otherwise, just spinning on a queue that simply doesn’t have anything to give yet.
this is polling.
ideally, there should be a blocking mechanism. maybe there is and whoever wrote this didn’t use it, or it wasn’t yet available, or this is the best that can be done.
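for illustration only, the difference in python terms (the queue and names here are made up, not the actual cudacodec internals):

import queue
import time

frames = queue.Queue()  # stand-in for the decoder's frame queue

# polling: spin and sleep a fixed amount on every miss.
# each miss costs at least ~1 ms, like Thread::sleep(1) does.
def poll_next():
    while True:
        try:
            return frames.get_nowait()
        except queue.Empty:
            time.sleep(0.001)

# blocking: the queue wakes the consumer the moment data arrives,
# with no fixed sleep quantum in the way.
def block_next():
    return frames.get(block=True)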
Thank you @crackwitz for the commentary. I did run two experiments:
a) I set Thread::sleep(0), which in theory still forces the context switch
b) I commented out the Thread::sleep(1) line (2 occurrences in the code).
In both cases, I am still capped at 500 FPS.
The input video is encoded at 50 kbps, 25 FPS, 1920x1080 resolution, and has ~68,000 frames.
If I transcode that video to mp4 or hevc (i.e., fewer total bytes to read), I still hit the same upper bound of 500 FPS.
I will go dig to see if I can find the source of the current max rate.
If you are using cv::VideoCapture then this will not have any impact. The module you are referring to (cv::cudacodec::VideoReader) is part of the contrib repo and has “nothing” to do with cv::VideoCapture from the main repo (it only uses it to demux the video).
Regarding cv::cudacodec::VideoReader, if you want maximum performance you should increase the number of decode surfaces in use. This will have the side effect that the sleep is unlikely to be called, because there will always be a surface available unless you are consuming the decoded frames too slowly.
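From Python that would look something like this (an untested sketch; minNumDecodeSurfaces is the field exposed by the 4.x bindings):

import cv2

# Untested sketch: ask for more decode surfaces so the internal frame queue
# is less likely to run dry. Larger values cost more GPU memory.
params = cv2.cudacodec.VideoReaderInitParams()
params.minNumDecodeSurfaces = 10

reader = cv2.cudacodec.createVideoReader('my_input_video.m2v', params=params)
ok, gpu_frame = reader.nextFrame()  # gpu_frame is a cv2.cuda.GpuMat
while ok:
    ok, gpu_frame = reader.nextFrame()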
Just out of interest why do you need such a high frame rate?
Thanks @cudawarped. I cannot go into great detail about why I need such a high frame-ingest rate; suffice it to say that we rent AWS hardware by the hour/minute and have some large-scale computer-vision tasks at hand. The quicker we get it done, the less we pay.
Ingest Frames → Do Computer Vision tasks → Store Data is the nature of the pipeline. Each stage needs to work as quickly as possible.
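To keep the stages overlapped, the rough shape is something like this (a sketch only; the CV step is a placeholder):

import queue
import threading

import cv2

def ingest(path, frame_queue, stop):
    # Decode frames as fast as possible and hand them to the CV stage.
    cap = cv2.VideoCapture(path, cv2.CAP_FFMPEG)
    while not stop.is_set():
        ret, frame = cap.read()
        if not ret:
            break
        frame_queue.put(frame)
    cap.release()
    frame_queue.put(None)  # sentinel: end of stream

frame_queue = queue.Queue(maxsize=64)  # bounded so decode can't outrun RAM
stop = threading.Event()
worker = threading.Thread(target=ingest,
                          args=('my_input_video.m2v', frame_queue, stop))
worker.start()
while (frame := frame_queue.get()) is not None:
    pass  # placeholder for the computer-vision step, then the store step
worker.join()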
Separately, yes, I figured out this morning that I was barking up the wrong tree re cv2.VideoCapture. I am wrestling with
import cv2
import datetime
import os
reader = cv2.cudacodec.createVideoReader('my_mpeg2_video.m2v')
which is MPEG-2 encoded. From the CLI, with my CUDA-enabled ffmpeg, I have no trouble getting fast (though not yet at the desired speed) HW-accelerated video ingest/frame extraction. But for the snippet above, I am getting the following error:
[ WARN:0@3608.496] global ffmpeg_video_source.cpp:148 FourccToChromaFormat ChromaFormat not recognized: 0x42323459 (Y42B).
Assuming I420
[ERROR:2@3608.508] global video_parser.cpp:86 parseVideoData OpenCV(4.9.0)
.../opencv_contrib-4.9.0/modules/cudacodec/src/video_decoder.cpp:127:
error: (-210:Unsupported format or combination of formats) Video source is not
supported by hardware video decoder refer to Nvidias
GPU Support Matrix to confirm your GPU supports hardware decoding of the video
sources codec. in function 'create'
Alas no. I work for a media company that would happily ask me to work elsewhere if I did.
I was looking at the cudacodec source code for the FFmpeg video source; I need to work out what params and init params I can pass. I think if I can initialize the VideoReader constructor with the codec set to something that will translate to MPEG-2, it would not default to the I420 format.
I am away from my work computer at the moment, but I can certainly share the output of ffprobe on the video.
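When I am back at it, something like this should pull just the relevant fields (a sketch; same input file as above):

import subprocess

# Sketch: query just the codec and pixel format of the first video stream.
# A 4:2:2 source reports something like pix_fmt=yuv422p.
out = subprocess.run([
    'ffprobe', '-v', 'error', '-select_streams', 'v:0',
    '-show_entries', 'stream=codec_name,pix_fmt',
    '-of', 'default=noprint_wrappers=1',
    'my_mpeg2_video.m2v',
], capture_output=True, text=True, check=True)
print(out.stdout)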
Anyway, I’ve had a look and, even though I can’t find it documented anywhere, it seems like hardware MPEG-2 decoding does not support 4:2:2; search the Nvidia developer forum for mpeg2 422.
The error you were getting
[ERROR:2@3608.508] global video_parser.cpp:86 parseVideoData OpenCV(4.9.0)
.../opencv_contrib-4.9.0/modules/cudacodec/src/video_decoder.cpp:127:
error: (-210:Unsupported format or combination of formats) Video source is not
supported by hardware video decoder refer to Nvidias
GPU Support Matrix to confirm your GPU supports hardware decoding of the video
sources codec. in function 'create'
is a result of querying the decoding support offered on your GPU. I included the error message because I “assumed” the support matrix would contain all the details needed but it appears not.
Anyway, it looks like you need a 4:2:0 source, and even then I would guess that the decoding may be performed using CUDA rather than the dedicated hardware decoding unit, so it may be slow in addition to eating into your CUDA resources.
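If re-encoding is an option, something along these lines would get you a 4:2:0 stream the T4’s decoder accepts (filenames hypothetical; h264_nvenc assumes your ffmpeg build has NVENC enabled, otherwise substitute libx264):

import subprocess

# Sketch: transcode the 4:2:2 MPEG-2 source to 4:2:0 H.264, which the T4's
# NVDEC unit can decode. Filenames are hypothetical.
subprocess.run([
    'ffmpeg', '-i', 'my_mpeg2_video.m2v',
    '-pix_fmt', 'yuv420p',
    '-c:v', 'h264_nvenc',
    'my_input_420.mp4',
], check=True)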