More CPU usage when using VideoCapture with hardware acceleration


We want to reduce the CPU load of one of our services running on an Intel NUC, which grabs images from an RTSP stream and processes them. We tried enabling hardware acceleration, but the CPU usage rises instead (from about 25% without hardware acceleration to about 40%). The stream is H.264 from an IP camera; we are using OpenCV 4.5.4 and FFmpeg 4.4.

Are we doing something wrong? When using ffmpeg on the command line, the h264_qsv decoder reduces CPU usage compared to the h264 decoder.

Here is the Dockerfile we are using (not cleaned up yet), the vainfo output, and the debug output with hardware acceleration enabled.

ARG BASEIMAGE=ubuntu:20.04
FROM ${BASEIMAGE} as builder
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
        build-essential \
        software-properties-common \
        cmake \
        git \
        libdc1394-22-dev \
        libeigen3-dev  \
        libjpeg8-dev \
        libpng-dev \
        libtbb2 \
        libtbb-dev \
        libtiff5-dev \
        libswscale-dev \
        vim \
        nano \
        pkg-config \
        python3-dev \
        python3-numpy \
        python3-pip \
        python3-scipy \
        x264 \
        autoconf \
        automake \
        libass-dev \
        libfreetype6-dev \
        libgnutls28-dev \
        libmp3lame-dev \
        libtool \
        libvorbis-dev \
        meson \
        ninja-build \
        texinfo \
        wget \
        yasm \
        zlib1g-dev \
        libunistring-dev \
        nasm \
        libx264-dev \
        libx265-dev \
        libnuma-dev \
        libvpx-dev \
        libmfx-dev \
        intel-media-va-driver-non-free \
        libmfx1 \
        libva-drm2 \
RUN add-apt-repository ppa:savoury1/ffmpeg4 && add-apt-repository ppa:savoury1/graphics && add-apt-repository ppa:savoury1/multimedia
RUN apt-get update && apt-get install -y libavcodec-dev libavc1394-0 libavdevice-dev libavfilter-dev libavformat-dev libavutil-dev ffmpeg libva-dev
ENV OPENCV_ROOT=/tmp/opencv
RUN git clone -b 4.5.4 https://github.com/opencv/opencv.git $OPENCV_ROOT/opencv && \
    mkdir $OPENCV_ROOT/opencv/build && cd $OPENCV_ROOT/opencv/build && \
    cmake -D CMAKE_INSTALL_PREFIX=/opt/opencv \
          -D WITH_TBB=ON \
          -D WITH_V4L=ON \
          -D WITH_QT=OFF \
          -D WITH_OPENGL=OFF \
          -D WITH_FFMPEG=ON \
          -D WITH_GTK=OFF \
          -D WITH_CUDA=OFF \
          -D WITH_MFX=OFF \
          -D WITH_GTK_2_X=OFF \
          -D WITH_OPENEXR=OFF \
          -D BUILD_opencv_dnn=OFF \
          -D BUILD_opencv_highgui=OFF \
          -D BUILD_opencv_ml=OFF \
          -D BUILD_opencv_photo=OFF \
          -D BUILD_opencv_gapi=OFF \
          -D BUILD_opencv_calib3d=OFF \
          -D BUILD_TESTS=OFF \
          -D BUILD_EXAMPLES=OFF .. && \
    make -j"$(nproc)" && \
    make install && \
    echo "/opt/opencv/lib" >> /etc/ && ldconfig && \
    rm -rf $OPENCV_ROOT
RUN ln -s "/opt/opencv/python/cv2/python-3.8/" /usr/local/lib/python3.8/dist-packages/
RUN cp -r /opt/opencv/lib/* /usr/local/lib/
RUN cp -r /opt/opencv/include/opencv4/* /usr/local/include/opencv4/
RUN cp -r /opt/opencv/lib/python3.8/dist-packages/cv2/* /usr/local/lib/python3.8/dist-packages/cv2

vainfo output
root@senfth-NUC8i3BEH:/tmp/opencv# vainfo
error: XDG_RUNTIME_DIR not set in the environment.
error: can't connect to X server!
libva info: VA-API version 1.13.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/
libva info: Found init function __vaDriverInit_1_7
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.13 (libva 2.6.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 20.1.1 ()
vainfo: Supported profile and entrypoints
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileNone                   : VAEntrypointStats
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Simple            : VAEntrypointEncSlice
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointFEI
      VAProfileH264Main               : VAEntrypointEncSliceLP
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointFEI
      VAProfileH264High               : VAEntrypointEncSliceLP
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline: VAEntrypointFEI
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileVP8Version0_3          : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointFEI
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileHEVCMain10             : VAEntrypointEncSlice
      VAProfileVP9Profile0            : VAEntrypointVLD
      VAProfileVP9Profile2            : VAEntrypointVLD
Debug Output
In [20]: cap = cv2.VideoCapture(stream_url, cv2.CAP_FFMPEG, [cv2
    ...: .CAP_PROP_HW_ACCELERATION, cv2.VIDEO_ACCELERATION_ANY])                                                     
[ WARN:0] global /tmp/opencv/opencv/modules/videoio/src/cap.cpp (130) open VIDEOIO(FFMPEG): trying capture filename='rtsp://administrator:@' ...
[DEBUG:0] global /tmp/opencv/opencv/modules/videoio/src/cap_ffmpeg_impl.hpp (1039) open FFMPEG: stream[0] is video stream with codecID=27 width=1280 height=720
[DEBUG:0] global /tmp/opencv/opencv/modules/videoio/src/cap_ffmpeg_hw.hpp (929) HWAccelIterator FFMPEG: allowed acceleration types (any): 'vaapi.iHD,'
[DEBUG:0] global /tmp/opencv/opencv/modules/videoio/src/cap_ffmpeg_hw.hpp (947) HWAccelIterator FFMPEG: disabled codecs: 'av1.vaapi,av1_qsv,vp8.vaapi,vp8_qsv'
[DEBUG:0] global /tmp/opencv/opencv/modules/videoio/src/cap_ffmpeg_impl.hpp (1071) open FFMPEG: trying to configure H/W acceleration: 'vaapi.iHD'
[ INFO:0] global /tmp/opencv/opencv/modules/videoio/src/cap_ffmpeg_hw.hpp (272) hw_check_device FFMPEG: Using vaapi video acceleration on device: Intel iHD driver for Intel(R) Gen Graphics - 20.1.1 ()
[ INFO:0] global /tmp/opencv/opencv/modules/videoio/src/cap_ffmpeg_hw.hpp (562) hw_create_device FFMPEG: Created video acceleration context (av_hwdevice_ctx_create) for vaapi on device 'default'
[ WARN:0] global /tmp/opencv/opencv/modules/videoio/src/cap.cpp (142) open VIDEOIO(FFMPEG): created, isOpened=1


It is hard to say what the difference is; I'd recommend trying the following things:

Thanks for your suggestions.
Here is the output of `time` for the suggested tests (1000 frames at 25 fps, H.264 stream):

real    0m41.013s
user    0m10.956s
sys 0m0.800s
CAP_FFMPEG, no hw acceleration
real    0m40.995s
user    0m7.614s
sys 0m0.596s
CAP_FFMPEG, OPENCV_FFMPEG_CAPTURE_OPTIONS="hwaccel;qsv|video_codec;h264_qsv|vsync;0"
real    0m41.180s
user    0m12.030s
sys 0m1.021s
error: XDG_RUNTIME_DIR not set in the environment.
[ WARN:0] global /tmp/opencv/opencv/modules/videoio/src/cap_gstreamer.cpp (1063) open OpenCV | GStreamer warning: unable to query duration of stream
[ WARN:0] global /tmp/opencv/opencv/modules/videoio/src/cap_gstreamer.cpp (1100) open OpenCV | GStreamer warning: Cannot query video position: status=1, value=0, duration=-1

real	0m40.866s
user	0m10.402s
sys	0m0.957s
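For reference, the `OPENCV_FFMPEG_CAPTURE_OPTIONS` variable from the second test can also be set from Python before the capture is created. A minimal sketch, where the stream URL is a placeholder and the cv2 lines are commented out so the snippet stands alone:

```python
import os

# Build the FFmpeg capture options: "key;value" pairs separated by "|".
opts = "|".join(["hwaccel;qsv", "video_codec;h264_qsv", "vsync;0"])

# Must be set before OpenCV opens the stream.
os.environ["OPENCV_FFMPEG_CAPTURE_OPTIONS"] = opts

# stream_url = "rtsp://user:password@camera-ip/stream"  # placeholder
# import cv2
# cap = cv2.VideoCapture(stream_url, cv2.CAP_FFMPEG)
print(opts)  # hwaccel;qsv|video_codec;h264_qsv|vsync;0
```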

According to intel_gpu_top, the hardware acceleration works.

Usually decoding on the CPU uses all cores, so your processing is probably limited by the incoming stream FPS and/or by rendering a window (40 s = 1000 frames / 25 fps), and the differences in CPU load are minor with this minimal example. So you won't be able to tell which method is better for the final application until you add the actual image processing to the pipeline (and omit rendering).

And if you just want to benchmark the device in different modes to find out what it is capable of, you should use a video file (and disable rendering).
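A tiny harness along those lines can time any backend with rendering disabled; this is a sketch, decoder-agnostic, where the OpenCV lines are commented out and the file name is a placeholder:

```python
import time

def benchmark(read_frame, n_frames=1000):
    """Time up to n_frames reads; read_frame() must return (ok, frame)."""
    start = time.perf_counter()
    grabbed = 0
    while grabbed < n_frames:
        ok, _ = read_frame()
        if not ok:
            break
        grabbed += 1
    return grabbed, time.perf_counter() - start

# Usage with OpenCV against a local video file (placeholder name), no rendering:
# import cv2
# cap = cv2.VideoCapture("test.mp4", cv2.CAP_FFMPEG,
#                        [cv2.CAP_PROP_HW_ACCELERATION, cv2.VIDEO_ACCELERATION_ANY])
# print(benchmark(cap.read))
```

Running the same loop once per backend (FFmpeg with and without acceleration, GStreamer) gives comparable wall-clock and CPU numbers without the stream-FPS cap.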

We have now tested the hardware acceleration within our full software service (running inside a Docker container). This service grabs the frames from the camera stream, forwards them to our neural-network service via shared memory, then collects the results and sends them to our other services. There is no rendering involved, and besides the stream decoding not much processing power is needed.

Both testing within Docker containers and running on the host system show about 40% CPU usage with hardware acceleration and about 30% without, while with ffmpeg, hardware acceleration reduces CPU usage from about 40% to less than 20% (first command vs. second command):

ffmpeg -c:v h264 -i rtsp://administrator:@\=u -pix_fmt yuv420p /tmp/output.yuv
ffmpeg -hwaccel qsv -c:v h264_qsv -i rtsp://administrator:@\=u -vf hwdownload,format=nv12 -pix_fmt yuv420p /tmp/output2.yuv

Running a profiler on the application showed that most of the processing time was spent in sws_scale, and a lot in av_hw_frame_transfer_data as well.

I have little experience with FFmpeg, so I'm not sure why this is the case.


Well, that does not change much compared to the previous measurements, since you are doing the heavy processing on another server. It is still limited by the stream FPS, and it seems that in this case you do not benefit from HW acceleration. It can be partially related to additional processing done on the CPU (color conversion to BGR: sws_scale; frame copying from video memory: av_hw_frame_transfer_data).

Your FFmpeg numbers are probably lower because you do not convert to BGR (which OpenCV always does), and because the nv12->yuv420p conversion can be performed on the GPU while nv12->bgr cannot (not sure about this).

Thanks for your help. It is indeed the conversion to BGR: with bgr24 output, ffmpeg shows the same behavior (more CPU load when using hardware acceleration).

So I guess the best way to reduce CPU load is to adapt OpenCV to convert to grayscale, as we only process grayscale images. This reduces the CPU usage from about 30% to 20%.

There is a flag, CAP_PROP_CONVERT_RGB, which is supposed to turn off the BGR conversion, but I think it is not implemented for the FFmpeg backend. I've found this old PR - Support CV_CAP_MODE_GRAY in FFMPEG backend by cristiklein · Pull Request #9123 · opencv/opencv · GitHub - theoretically you should be able to port it to your OpenCV build.

You can try the GStreamer backend, as it allows writing a custom pipeline with a custom output format (it should be supported: opencv/cap_gstreamer.cpp at 4b6047e746f06ce3e595c886cf6c0266498c6a67 · opencv/opencv · GitHub).
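For instance, a GStreamer pipeline that decodes via VA-API and hands OpenCV GRAY8 frames directly would skip the BGR conversion entirely. A sketch under some assumptions: the URL is a placeholder, the element name vaapih264dec requires the gstreamer-vaapi plugins, and the cv2 line is commented out:

```python
# Placeholder RTSP URL; replace with the camera's address.
url = "rtsp://user:password@camera-ip/stream"

# Decode H.264 on the GPU, convert to GRAY8, and hand frames to OpenCV
# through appsink (drop=true avoids backpressure on a live stream).
pipeline = (
    f"rtspsrc location={url} latency=0 ! rtph264depay ! h264parse ! "
    "vaapih264dec ! videoconvert ! video/x-raw,format=GRAY8 ! "
    "appsink drop=true"
)

# import cv2
# cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
print(pipeline)
```

This requires an OpenCV build with the GStreamer backend enabled (WITH_GSTREAMER=ON), which the Dockerfile above does not currently set.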

We have already implemented it ourselves (it's quite similar to the pull request you found): GitHub - accessio-gmbh/opencv at own_4.5.4
However, this is just a quick fix for us and would need some more work for a pull request.
