OpenCV built with CUDA and Video Codec SDK support: acceleration not working

Hi, I’m building OpenCV with CUDA and the Video Codec SDK for hardware-accelerated video encoding and decoding. The build completes without problems, but the acceleration does not seem to be working, and I don’t know where the problem could be. Please help me with this.
Here is my build information.

General configuration for OpenCV 4.8.0 =====================================
  Version control:               unknown

  Extra modules:
    Location (extra):            D:/gongwu_env/build_cv/opencv_contrib-4.8.0/modules
    Version control (extra):     unknown

  Platform:
    Timestamp:                   2024-09-12T09:07:49Z
    Host:                        Windows 10.0.19045 AMD64
    CMake:                       3.26.4
    CMake generator:             Visual Studio 16 2019
    CMake build tool:            C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/MSBuild/Current/Bin/MSBuild.exe
    MSVC:                        1929
    Configuration:               Release

  CPU/HW features:
    Baseline:                    SSE SSE2 SSE3
      requested:                 SSE3
    Dispatched code generation:  SSE4_1 SSE4_2 FP16 AVX AVX2 AVX512_SKX
      requested:                 SSE4_1 SSE4_2 AVX FP16 AVX2 AVX512_SKX
      SSE4_1 (18 files):         + SSSE3 SSE4_1
      SSE4_2 (2 files):          + SSSE3 SSE4_1 POPCNT SSE4_2
      FP16 (1 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
      AVX (8 files):             + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
      AVX2 (37 files):           + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
      AVX512_SKX (8 files):      + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2 AVX_512F AVX512_COMMON AVX512_SKX

  C/C++:
    Built as dynamic libs?:      YES
    C++ standard:                11
    C++ Compiler:                C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.29.30133/bin/Hostx64/x64/cl.exe  (ver 19.29.30154.0)
    C++ flags (Release):         /DWIN32 /D_WINDOWS /W4 /GR  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:fast     /EHa /wd4127 /wd4251 /wd4324 /wd4275 /wd4512 /wd4589 /wd4819 /MP  /MD /O2 /Ob2 /DNDEBUG 
    C++ flags (Debug):           /DWIN32 /D_WINDOWS /W4 /GR  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:fast     /EHa /wd4127 /wd4251 /wd4324 /wd4275 /wd4512 /wd4589 /wd4819 /MP  /MDd /Zi /Ob0 /Od /RTC1 
    C Compiler:                  C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.29.30133/bin/Hostx64/x64/cl.exe
    C flags (Release):           /DWIN32 /D_WINDOWS /W3  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:fast     /MP   /MD /O2 /Ob2 /DNDEBUG 
    C flags (Debug):             /DWIN32 /D_WINDOWS /W3  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:fast     /MP /MDd /Zi /Ob0 /Od /RTC1 
    Linker flags (Release):      /machine:x64  /INCREMENTAL:NO 
    Linker flags (Debug):        /machine:x64  /debug /INCREMENTAL 
    ccache:                      NO
    Precompiled headers:         NO
    Extra dependencies:          cudart_static.lib nppc.lib nppial.lib nppicc.lib nppidei.lib nppif.lib nppig.lib nppim.lib nppist.lib nppisu.lib nppitc.lib npps.lib cublas.lib cudnn.lib cufft.lib -LIBPATH:C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2/lib/x64
    3rdparty dependencies:

  OpenCV modules:
    To be built:                 alphamat aruco bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dnn_superres dpm face features2d flann fuzzy gapi hfs highgui img_hash imgcodecs imgproc intensity_transform line_descriptor mcc ml objdetect optflow phase_unwrapping photo plot python3 quality rapid reg rgbd saliency shape stereo stitching structured_light superres surface_matching text tracking ts video videoio videostab wechat_qrcode world xfeatures2d ximgproc xobjdetect xphoto
    Disabled:                    -
    Disabled by dependency:      -
    Unavailable:                 cvv freetype hdf java julia matlab ovis python2 python2 sfm viz
    Applications:                tests perf_tests apps
    Documentation:               NO
    Non-free algorithms:         NO

  Windows RT support:            NO

  GUI: 
    Win32 UI:                    YES
    VTK support:                 NO

  Media I/O: 
    ZLib:                        build (ver 1.2.13)
    JPEG:                        build-libjpeg-turbo (ver 2.1.3-62)
      SIMD Support Request:      YES
      SIMD Support:              NO
    WEBP:                        build (ver encoder: 0x020f)
    PNG:                         build (ver 1.6.37)
    TIFF:                        build (ver 42 - 4.2.0)
    JPEG 2000:                   build (ver 2.5.0)
    OpenEXR:                     build (ver 2.3.0)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

  Video I/O:
    DC1394:                      NO
    FFMPEG:                      YES (prebuilt binaries)
      avcodec:                   YES (58.134.100)
      avformat:                  YES (58.76.100)
      avutil:                    YES (56.70.100)
      swscale:                   YES (5.9.100)
      avresample:                YES (4.0.0)
    GStreamer:                   NO
    DirectShow:                  YES
    Media Foundation:            YES
      DXVA:                      YES

  Parallel framework:            Concurrency

  Trace:                         YES (with Intel ITT)

  Other third-party libraries:
    Intel IPP:                   2021.8 [2021.8.0]
           at:                   D:/gongwu_env/build_cv/build/3rdparty/ippicv/ippicv_win/icv
    Intel IPP IW:                sources (2021.8.0)
              at:                D:/gongwu_env/build_cv/build/3rdparty/ippicv/ippicv_win/iw
    Lapack:                      YES (C:/openblas/lib/openblas.lib)
    Eigen:                       YES (ver 3.4.0)
    Custom HAL:                  NO
    Protobuf:                    build (3.19.1)
    Flatbuffers:                 builtin/3rdparty (23.5.9)

  NVIDIA CUDA:                   YES (ver 11.2, CUFFT CUBLAS NVCUVID NVCUVENC FAST_MATH)
    NVIDIA GPU arch:             61
    NVIDIA PTX archs:

  cuDNN:                         YES (ver 8.6.0)

  OpenCL:                        YES (NVD3D11)
    Include path:                D:/gongwu_env/build_cv/opencv-4.8.0/3rdparty/include/opencl/1.2
    Link libraries:              Dynamic load

  Python 3:
    Interpreter:                 D:/Anaconda3/envs/RailwayAna/python.exe (ver 3.8.8)
    Libraries:                   D:/Anaconda3/libs/python38.lib (ver 3.8.8)
    numpy:                       D:/Anaconda3/envs/RailwayAna/Lib/site-packages/numpy/core/include (ver 1.24.4)
    install path:                D:/Anaconda3/envs/RailwayAna/Lib/site-packages/cv2/python-3.8

  Python (for build):            D:/Anaconda3/envs/RailwayAna/python.exe

  Java:                          
    ant:                         NO
    Java:                        YES (ver 1.8.0.101)
    JNI:                         C:/web/jdk1.8.0_101/include C:/web/jdk1.8.0_101/include/win32 C:/web/jdk1.8.0_101/include
    Java wrappers:               NO
    Java tests:                  NO

  Install to:                    D:/gongwu_env/build_cv/install
-----------------------------------------------------------------

And here is my code:

import os

import cv2


def test():
    cv2.cuda.setDevice(0)
    video_path = r'D:\gw_test\1129\K335+400.mp4'

    params = cv2.cudacodec.VideoReaderInitParams()
    params.targetSz = (1920, 1080)
    video_gpu = cv2.cudacodec.createVideoReader(video_path, params=params)
    video_gpu.set(cv2.cudacodec.COLOR_FORMAT_BGR)
    format_gpu = video_gpu.format()
    # try:
    # fps = int(format_gpu.fps)

    fps = int(format_gpu.fps)
    print(fps)

    encoder_params_in = cv2.cudacodec.EncoderParams()
    stream = cv2.cuda.Stream()
    os.makedirs(r'D:\test\ceshihanzi', exist_ok=True)
    out = cv2.cudacodec.createVideoWriter(r'D:\test\ceshihanzi\output_.h264', (1920, 1080), cv2.cudacodec.H264, fps=25,
                                          colorFormat=cv2.cudacodec.COLOR_FORMAT_BGR,
                                          params=encoder_params_in,
                                          stream=stream)
    while True:
        ret, frame = video_gpu.nextFrame()
        # print(frame)
        if not ret:
            break

        out.write(frame)
    out.release()

It takes 175 seconds to run, but the same code takes only 17 seconds on another computer with a working build. I don’t know what’s wrong: both were built the same way, except that one machine’s GPU is a 1080 Ti and the other’s is a 3080 Ti.

I’m using OpenCV 4.8, CUDA 11.2, cuDNN 8.6 and Video_Codec_SDK_11.1.5.

Please help me with this, I’d really appreciate it.

You are transcoding; can you determine whether the issue is in the decoding or the encoding part?

How much of the video is processed in both cases?

Do you receive any errors to indicate what the issue is?

Can you share the video?
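One way to separate the two is to time the read and the write calls individually. The following is only a rough sketch: `time_transcode` is a hypothetical helper, not part of OpenCV, and because GPU work can be asynchronous the per-call split is an approximation rather than an exact measurement.

```python
import time


def time_transcode(next_frame, write_frame=None):
    """Accumulate decode and encode wall-clock time over a whole stream.

    next_frame:  callable returning (ok, frame), e.g. video_gpu.nextFrame
    write_frame: optional callable taking a frame, e.g. out.write
    """
    decode_s = 0.0
    encode_s = 0.0
    frames = 0
    while True:
        t0 = time.perf_counter()
        ok, frame = next_frame()
        t1 = time.perf_counter()
        decode_s += t1 - t0  # includes the final, failed read
        if not ok:
            break
        frames += 1
        if write_frame is not None:
            write_frame(frame)
            encode_s += time.perf_counter() - t1
    return {"frames": frames, "decode_s": decode_s, "encode_s": encode_s}
```

With the code from the first post this would be called as `time_transcode(video_gpu.nextFrame, out.write)`, or with only the first argument to time decoding alone. Comparing the frame counts on both machines also shows whether the whole video was processed.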

Both. I apologize for my mistake: 179 seconds is actually the time when I comment out cv2.cudacodec.createVideoWriter; with it, the run takes 300+ seconds. I don’t receive any errors, it just runs very slowly. The same video and code run well on the other computer, so I’m sure the video is not the problem. Do you have any idea what could cause this issue? Other hardware, or something else?

I am still not sure what the issue is; you need to provide more detail in your answers. Let’s be specific.
From your description it is unclear whether the issue is that one machine is slower than the other, or that it works on one machine but fails after 17 seconds on the other.

I don’t understand. Do you mean it takes 179 seconds on the one machine and 17 seconds on the other if you only read from the video and you don’t write? If so does it read all the frames on both machines?

Sorry for the way I expressed it; I’m not a native speaker, so I’ll try to be clearer. One machine works fine: using the code I posted in this thread, with both reading and writing, it takes 17 seconds. The other machine, using the same code, takes 300+ seconds; with cv2.cudacodec.createVideoWriter commented out (reading only), it takes 170+ seconds. Both machines read all frames and neither reports an error; one is just much slower than the other, and it should not be that slow. I don’t know why. Is it a problem with my build, or a hardware problem? How can I locate it?
I also tried changing the GPU to a 3080 Ti and rebuilding OpenCV, but it is still very slow.

So to confirm:

  1. You have two machines and you have built OpenCV on each of them separately? i.e. You are not using the exact same library on both machines.
  2. You have tried the same 3080 Ti on both machines and you are getting 17 seconds on one and > 300 on another?
  3. The transcoding is working?
  4. You are only timing the read and write loop not the initialization?

What other differences are there between the two machines? Are they running the same Nvidia display drivers, the same versions of Windows and Python, both reading the video from SSDs, etc.?

  1. Yes, I built them separately. The 17-second machine was built first, and the software installers etc. were copied from it to the slow one.
  2. Yes, I tried the same 3080 Ti on both machines and one is much slower.
  3. Sorry, but how can I know whether it’s working?
  4. I timed this function:
def test():
    cv2.cuda.setDevice(0)
    video_path = r'D:\gw_test\1129\K335+400.mp4'

    params = cv2.cudacodec.VideoReaderInitParams()
    params.targetSz = (1920, 1080)
    video_gpu = cv2.cudacodec.createVideoReader(video_path, params=params)
    video_gpu.set(cv2.cudacodec.COLOR_FORMAT_BGR)
    format_gpu = video_gpu.format()
    # try:
    # fps = int(format_gpu.fps)

    fps = int(format_gpu.fps)
    print(fps)

    encoder_params_in = cv2.cudacodec.EncoderParams()
    stream = cv2.cuda.Stream()
    os.makedirs(r'D:\test\ceshihanzi', exist_ok=True)
    out = cv2.cudacodec.createVideoWriter(r'D:\test\ceshihanzi\output_.h264', (1920, 1080), cv2.cudacodec.H264, fps=25,
                                          colorFormat=cv2.cudacodec.COLOR_FORMAT_BGR,
                                          params=encoder_params_in,
                                          stream=stream)
    while True:
        ret, frame = video_gpu.nextFrame()
        # print(frame)
        if not ret:
            break

        out.write(frame)
    out.release()

I also timed every frame read.

The Nvidia drivers are different: the fast machine has 536.25, and the slower one had 536.23 at first, which I updated to 560.81. I use a conda Python environment, and it is the same on both. The Windows versions are different. On the slower machine I tried reading the video from both an SSD and an HDD, but it made no difference. I have tried everything I can think of and still can’t locate the problem.

The transcoding is working if the video file you have written plays correctly.

One possibility is the driver may be choosing a lower number of default decode surfaces on the one machine.

Try setting

params.minNumDecodeSurfaces = 10

If that doesn’t work, send me the video file and the exact code you are using for timing and I’ll take a look.
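In context, the setting goes on the init params before the reader is created. A minimal sketch, assuming an OpenCV build with cudacodec enabled; `open_reader` and its `cudacodec` argument are hypothetical names, the latter only there so the function can be exercised without a GPU:

```python
def open_reader(video_path, min_surfaces=10, cudacodec=None):
    """Create a cudacodec VideoReader that requests at least min_surfaces decode surfaces."""
    if cudacodec is None:
        import cv2  # imported lazily; requires a CUDA-enabled OpenCV build

        cudacodec = cv2.cudacodec
    params = cudacodec.VideoReaderInitParams()
    # Lower bound on the number of decode surfaces allocated for the stream
    params.minNumDecodeSurfaces = min_surfaces
    return cudacodec.createVideoReader(video_path, params=params)
```

Usage would be something like `video_gpu = open_reader(video_path, 10)`, followed by the same read loop as before.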

params.minNumDecodeSurfaces = 10

This works, thank you very much! Can I ask why? Why doesn’t the first machine need this parameter set, and what might cause this issue? Am I doing something wrong or missing something while building it?

That’s great!

I’m not sure, but I suspect that one machine is using fewer decode surfaces by default than the other. You can check this by examining the

format_gpu.ulNumDecodeSurfaces

output on both machines when you don’t first set params.minNumDecodeSurfaces.

If they are both the same for the video you are using, then I would guess it’s a hardware difference which is overcome by using more decode surfaces, but I would need access to both machines to really know what’s going on.

For more info you can read the Nvidia Video Codec SDK docs, specifically the sections on CUVIDDECODECREATEINFO::ulNumDecodeSurfaces:

The following steps should be followed for optimizing video memory usage:

  1. Make CUVIDDECODECREATEINFO::ulNumDecodeSurfaces = CUVIDEOFORMAT::min_num_decode_surfaces. This will ensure that the underlying driver allocates minimum number of decode surfaces to correctly decode the sequence. In case there is reduction in decoder performance, clients can slightly increase CUVIDDECODECREATEINFO::ulNumDecodeSurfaces. It is therefore recommended to choose the optimal value of CUVIDDECODECREATEINFO::ulNumDecodeSurfaces to ensure right balance between decoder throughput and memory consumption.
  2. CUVIDDECODECREATEINFO::ulNumOutputSurfaces should be decided optimally after due experimentation for balancing decoder throughput and memory consumption.
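The "due experimentation" could be as simple as timing the decode loop for a few surface counts and keeping the smallest one that is close to the fastest. This is just one possible selection rule, not something the SDK docs prescribe; `pick_surface_count`, `measure_seconds` and the 10% tolerance are all assumptions for illustration.

```python
def pick_surface_count(candidates, measure_seconds, tolerance=0.10):
    """Return (count, timings) for the smallest count within tolerance of the best time.

    candidates:      surface counts to try, e.g. [5, 10, 15, 20, 25]
    measure_seconds: callable(count) -> decode time in seconds for that count
    tolerance:       accepted slowdown relative to the fastest run (0.10 = 10%)
    """
    timings = {n: measure_seconds(n) for n in sorted(candidates)}
    best = min(timings.values())
    # Keys are in ascending order, so the first hit needs the fewest surfaces
    for n, seconds in timings.items():
        if seconds <= best * (1 + tolerance):
            return n, timings
```

Here `measure_seconds` would be your own function that creates a reader with `params.minNumDecodeSurfaces = n`, reads every frame, and returns the elapsed time.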

If you want to see how much memory more decode surfaces use, then this might help as well.
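For a rough feel of the numbers, the cost can also be estimated by hand. The sketch below assumes 8-bit 4:2:0 (NV12-style) decode surfaces and ignores any alignment, padding or extra buffers the driver may add, so treat the result as a lower-bound estimate rather than a measurement. Both function names are hypothetical.

```python
def nv12_surface_bytes(width, height, bytes_per_sample=1):
    """Approximate size of one 4:2:0 surface: full-size luma plus half-size chroma."""
    return width * height * 3 // 2 * bytes_per_sample


def estimate_surface_memory_mib(width, height, num_surfaces):
    """Approximate total memory for num_surfaces decode surfaces, in MiB."""
    return nv12_surface_bytes(width, height) * num_surfaces / (1024 * 1024)


# Under these assumptions, one 1920x1080 surface is about 3 MiB
print(round(estimate_surface_memory_mib(1920, 1080, 1), 2))
```

By this estimate each additional 1080p surface costs roughly 3 MiB, so raising the count by a handful is cheap compared with the memory on a typical GPU.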

Thank you again! I checked the format_gpu.ulNumDecodeSurfaces values and they are different: one is 25 and the other is only 5. I’ll read these documents later. Thank you again for your help, I appreciate it a lot.
