Status and usage of cudacodec::VideoWriter

Hi,
I am interested in efficiently encoding and writing video files from images stored on the GPU, from Python. It seems like the cv::cudacodec::VideoWriter class has recently been revived, but there isn't much documentation about it.

I saw the C++ sample here: opencv/samples/gpu/video_writer.cpp at 4.x · opencv/opencv · GitHub
However, I was keen to get it working from Python. Below is what I tried, but the resulting mp4 file does not play in VLC. Any hint on what I am doing wrong?

import numpy as np
import matplotlib.pyplot as plt
import imageio.v3 as iio
import cv2

# Fetch a test image
im = iio.imread('https://upload.wikimedia.org/wikipedia/commons/b/b6/PM5644-1920x1080.gif')
img = im.squeeze()

plt.imshow(img)

# Convert test image to GpuMat
img_gpu = cv2.cuda_GpuMat(img)
print("img_gpu:", img_gpu, img_gpu.size())

# Create video writer
cf=cv2.cudacodec.COLOR_FORMAT_RGB
videowriter = cv2.cudacodec.createVideoWriter(
    'output.mp4', 
    frameSize=img_gpu.size(), 
    codec=cv2.cudacodec.HEVC, 
    fps=30, 
    colorFormat=cf)

# Encode the same image many times
for i in range(100):
  videowriter.write(img_gpu)

# Clean up
videowriter.release()

If relevant, I am using the wheel from github:cudawarped/opencv-python-cuda-wheels/releases/tag/4.9.0.80

You’re not doing anything wrong: your output.mp4 is not an mp4 file, it’s an incorrectly named h265 file. If you rename it to output.h265, VLC will play it.

The update on Windows to allow writing to container formats (e.g. mp4) was not included until after the wheel you are using was built.

If I get a chance later I will build an updated wheel which includes this feature.

Ok, thanks. I am looking forward to the new wheel to be able to easily test writing to an mp4 container.

On a different but related note (sorry for hijacking the thread), are there any plans to support 10-bit encoding through cv::cudacodec? I am ultimately interested in this feature (still with GPU data) but have yet to find an option. For example, this is also lacking in torchaudio:

and there is no container support in VPF:

Currently no, because OpenCV doesn’t natively support this format. That said, it may be fairly straightforward to implement. How would you be passing the data (10-bit RGB) and do you have a sample?

Nice to hear it could be fairly straightforward to implement! The data would be stored as 16 bit on the OpenCV side, with the expectation that the actual maximum value in the GpuMat does not exceed 1023. Alternatively, the 10-bit encoding could use only the most significant bits of the 16-bit input, but I find this a bit more counter-intuitive.

Should I file a feature request?

Here is a simple function I often use to generate a 10 bit grayscale image:

import numpy as np

def make_test_im():
  # Create a simple image with a gradient from 0 to (2^bitdepth - 1),
  # stored as uint16 (only the lowest `bitdepth` bits are used)
  bitdepth = 10
  hbd = bitdepth // 2
  im = np.zeros((1 << hbd, 1 << hbd), dtype=np.uint16)
  im[:] = np.arange(0, 1 << bitdepth).reshape(im.shape)

  # Tile it to be at least 64 pixels in each dimension as the ffmpeg
  # encoders may only accept frames of size 64 and up
  numreps = 5
  im = np.tile(im, (numreps, numreps))
  print('im', np.min(im), np.max(im), im.shape, im.dtype)
  return im
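
If only the most significant bits were to be used instead (the alternative convention I mentioned above), the same image could simply be shifted up, e.g.:

im10_lsb = make_test_im()         # 10 significant bits in the low bits
im10_msb = im10_lsb << (16 - 10)  # shift them into the high bits instead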

NVENC uses a packed 10-bit format as input. This is specified in nvEncodeAPI.h, excerpt below:

NV_ENC_BUFFER_FORMAT_ARGB10                          = 0x02000000,  /**< 10 bit Packed A2R10G10B10. This is a word-ordered format
                                                                             where a pixel is represented by a 32-bit word with B
                                                                             in the lowest 10 bits, G in the next 10 bits, R in the
                                                                             10 bits after that and A in the highest 2 bits. */

This would require an additional CUDA kernel to convert the 16-bit input to the correct format, making it less straightforward. If the input was in the format specified above then I “think” the required modifications to accommodate this would be small.

You have nothing to lose; a community member may implement this feature, but I wouldn’t hold your breath. If on the other hand the video input was in the required format (NV_ENC_BUFFER_FORMAT_ARGB10) I may take a look when I have time.

Thanks. Having support for inputs that need to be in packed 10-bit format would already be great. Finding a way to do the bit packing independently shouldn’t be too hard to achieve. It could maybe even be done in Python directly with, say, CuPy (rough sketch below):
https://docs.cupy.dev/en/latest/reference/generated/cupy.unpackbits.html
https://docs.cupy.dev/en/latest/reference/generated/cupy.packbits.html
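
A rough sketch of the packing step, assuming three 10-bit channels stored as uint16 CuPy arrays (values at most 1023); the helper name is just illustrative and the bit layout follows the A2R10G10B10 description quoted above:

import cupy as cp

def pack_a2r10g10b10(r, g, b):
    # r, g, b: uint16 arrays with values in [0, 1023]
    r32 = r.astype(cp.uint32)
    g32 = g.astype(cp.uint32)
    b32 = b.astype(cp.uint32)
    alpha = cp.full(r.shape, 3, dtype=cp.uint32)  # 2-bit alpha, fully opaque
    # B in the lowest 10 bits, G in the next 10, R after that, A in the top 2 bits
    return (alpha << 30) | (r32 << 20) | (g32 << 10) | b32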

Would the most convenient input then be presented as uint32?

Without giving it a lot of thought I would think float32 from numpy should be the most convenient because a numpy array of this type can be uploaded directly to a GpuMat.
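
Something along these lines (untested, just to illustrate the idea; the zero array is only a placeholder for real packed data):

import numpy as np
import cv2 as cv

packed = np.zeros((1080, 1920), dtype=np.uint32)  # packed A2R10G10B10 words
carrier = packed.view(np.float32)                 # reinterpret the same bits as float32
gpu_frame = cv.cuda_GpuMat()
gpu_frame.upload(carrier)                         # uploads as a CV_32FC1 GpuMat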

Will you be processing the data on the GPU beforehand or uploading everything from the CPU? If it’s the latter, are you sure cudacodec::VideoWriter is faster than cv::VideoWriter for your workflow?

If you had a single 10 bit packed frame representing an image then I could take a look.

I’ve uploaded a new wheel which should allow you to write to .mp4. Let me know if you have any issues.

On second thought, I think I should have mentioned that 10-bit YUV is probably better than 10-bit RGB in my case. While it seemed somewhat insignificant initially, it could actually make things much simpler, as NVENC expects 10-bit YUV to be provided as 16-bit data per component and uses only the most significant bits:

    NV_ENC_BUFFER_FORMAT_YUV420_10BIT                    = 0x00010000,  /**< 10 bit Semi-Planar YUV [Y plane followed by interleaved UV plane]. Each pixel of size 2 bytes. Most Significant 10 bits contain pixel data. */
    NV_ENC_BUFFER_FORMAT_YUV444_10BIT                    = 0x00100000,  /**< 10 bit Planar YUV444 [Y plane followed by U and V planes]. Each pixel of size 2 bytes. Most Significant 10 bits contain pixel data.  */

Would supporting NV_ENC_BUFFER_FORMAT_YUV420_10BIT indeed be easier?

Many thanks! I don’t have access to a CUDA environment right now but I will test as soon as I do.

Yes indeed, the data will already be on the GPU, most likely in PyTorch. I was planning to make use of the feature you implemented here to help with interoperability:
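
Roughly along these lines (untested; I am quoting the helper name from memory, so the exact binding name and argument order should be double-checked):

import cv2 as cv
import torch

frame = torch.zeros((1080, 1920, 3), dtype=torch.uint8, device='cuda').contiguous()
# Zero-copy wrap of the tensor's device pointer; 'frame' must stay alive while 'gpu_mat' is in use
gpu_mat = cv.cuda.createGpuMatFromCudaMemory(
    frame.shape[0], frame.shape[1], cv.CV_8UC3, frame.data_ptr())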

Again I haven’t given this a lot of thought, but both should be equivalent from the perspective of the GpuMat and cudacodec::VideoWriter classes. NV_ENC_BUFFER_FORMAT_ARGB10 would map to CV_32FC1 and NV_ENC_BUFFER_FORMAT_YUV420_10BIT to CV_16SC3.

If on the other hand the comparison is between NV_ENC_BUFFER_FORMAT_YUV420_10BIT and 16-bit RGB, then the former should be much easier to deal with because there would be no internal conversion. Additionally, if 16-bit RGB is not a standard format, it doesn’t make sense to write a specific conversion routine for it inside OpenCV.

Thanks. I filed a feature request here:

On the Python wheel side, I just realised I could not use the one you kindly provided as it is for Windows rather than Linux.

It seems like a small change. If you are using Windows and you let me know the compute capability of your GPU, I can build you a wheel to test?

I am performing these tests on Google Colab for now, which runs on Ubuntu rather than Windows. My actual target runs a similar environment.

> nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
7.5
> nvidia-smi
Fri Apr 26 15:11:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
> !python -m torch.utils.collect_env
[...]
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.27.9
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.58+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.104.05
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             2
On-line CPU(s) list:                0,1
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) CPU @ 2.00GHz
[...]

If you are on Linux I don’t understand why you were getting the original error. Have you confirmed this is fixed?

I used the latest Linux wheel I could see in your release assets:

Unless I am missing something obvious (which is very much possible), it looks like this was built without FFmpeg support:

!pip install --upgrade --force-reinstall https://github.com/cudawarped/opencv-python-cuda-wheels/releases/download/4.9.0.80/opencv_contrib_python-4.9.0.80-cp37-abi3-linux_x86_64.whl
import cv2
print(cv2.getBuildInformation())
Collecting opencv-contrib-python==4.9.0.80
  Downloading https://github.com/cudawarped/opencv-python-cuda-wheels/releases/download/4.9.0.80/opencv_contrib_python-4.9.0.80-cp37-abi3-linux_x86_64.whl (314.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 314.1/314.1 MB 4.5 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.21.2 in /usr/local/lib/python3.10/dist-packages (from opencv-contrib-python==4.9.0.80) (1.26.4)

General configuration for OpenCV 4.9.0 =====================================
  Version control:               4.9.0

  Extra modules:
    Location (extra):            /home/b/repos/opencv/opencv-python/opencv_contrib/modules
    Version control (extra):     4.9.0

  Platform:
    Timestamp:                   2024-01-08T14:49:26Z
    Host:                        Linux 5.15.133.1-microsoft-standard-WSL2 x86_64
    CMake:                       3.22.1
    CMake generator:             Ninja
    CMake build tool:            /usr/bin/ninja
    Configuration:               Release

  CPU/HW features:
    Baseline:                    SSE SSE2 SSE3
      requested:                 SSE3
    Dispatched code generation:  SSE4_1 SSE4_2 FP16 AVX AVX2 AVX512_SKX
      requested:                 SSE4_1 SSE4_2 AVX FP16 AVX2 AVX512_SKX
      SSE4_1 (16 files):         + SSSE3 SSE4_1
      SSE4_2 (1 files):          + SSSE3 SSE4_1 POPCNT SSE4_2
      FP16 (0 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
      AVX (8 files):             + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
      AVX2 (36 files):           + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
      AVX512_SKX (5 files):      + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2 AVX_512F AVX512_COMMON AVX512_SKX

  C/C++:
    Built as dynamic libs?:      NO
    C++ standard:                11
    C++ Compiler:                /usr/bin/c++  (ver 11.4.0)
    C++ flags (Release):         -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG  -DNDEBUG
    C++ flags (Debug):           -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -g  -O0 -DDEBUG -D_DEBUG
    C Compiler:                  /usr/bin/cc
    C flags (Release):           -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -O3 -DNDEBUG  -DNDEBUG
    C flags (Debug):             -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -g  -O0 -DDEBUG -D_DEBUG
    Linker flags (Release):      -Wl,--exclude-libs,libippicv.a -Wl,--exclude-libs,libippiw.a   -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined  
    Linker flags (Debug):        -Wl,--exclude-libs,libippicv.a -Wl,--exclude-libs,libippiw.a   -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined  
    ccache:                      NO
    Precompiled headers:         NO
    Extra dependencies:          /usr/lib/x86_64-linux-gnu/libz.so /usr/lib/wsl/lib/libcuda.so /usr/lib/wsl/lib/libnvcuvid.so /usr/lib/wsl/lib/libnvidia-encode.so Iconv::Iconv m pthread cudart_static dl rt nppc nppial nppicc nppidei nppif nppig nppim nppist nppisu nppitc npps cublas cudnn cufft -L/usr/local/cuda/lib64 -L/usr/lib/x86_64-linux-gnu
    3rdparty dependencies:       libprotobuf ade ittnotify libjpeg-turbo libwebp libpng libtiff libopenjp2 IlmImf ippiw ippicv

  OpenCV modules:
    To be built:                 aruco bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dnn_superres dpm face features2d flann fuzzy gapi hfs highgui img_hash imgcodecs imgproc intensity_transform line_descriptor mcc ml objdetect optflow phase_unwrapping photo plot python3 quality rapid reg rgbd saliency shape stereo stitching structured_light superres surface_matching text tracking video videoio videostab wechat_qrcode xfeatures2d ximgproc xobjdetect xphoto
    Disabled:                    world
    Disabled by dependency:      -
    Unavailable:                 alphamat cannops cvv freetype hdf java julia matlab ovis python2 sfm ts viz
    Applications:                -
    Documentation:               NO
    Non-free algorithms:         NO

  GUI:                           NONE
    GTK+:                        NO
    VTK support:                 NO

  Media I/O: 
    ZLib:                        /usr/lib/x86_64-linux-gnu/libz.so (ver 1.2.11)
    JPEG:                        libjpeg-turbo (ver 2.1.3-62)
    WEBP:                        build (ver encoder: 0x020f)
    PNG:                         build (ver 1.6.37)
    TIFF:                        build (ver 42 - 4.2.0)
    JPEG 2000:                   build (ver 2.5.0)
    OpenEXR:                     build (ver 2.3.0)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

  Video I/O:
    DC1394:                      NO
    FFMPEG:                      NO
      avcodec:                   NO
      avformat:                  NO
      avutil:                    NO
      swscale:                   NO
      avresample:                NO
    GStreamer:                   NO
    v4l/v4l2:                    YES (linux/videodev2.h)

  Parallel framework:            pthreads

  Trace:                         YES (with Intel ITT)

  Other third-party libraries:
    Intel IPP:                   2021.10.0 [2021.10.0]
           at:                   /home/b/repos/opencv/opencv-python/_skbuild/linux-x86_64-3.7/cmake-build/3rdparty/ippicv/ippicv_lnx/icv
    Intel IPP IW:                sources (2021.10.0)
              at:                /home/b/repos/opencv/opencv-python/_skbuild/linux-x86_64-3.7/cmake-build/3rdparty/ippicv/ippicv_lnx/iw
    VA:                          NO
    Lapack:                      NO
    Eigen:                       NO
    Custom HAL:                  NO
    Protobuf:                    build (3.19.1)
    Flatbuffers:                 builtin/3rdparty (23.5.9)

  NVIDIA CUDA:                   YES (ver 12.3, CUFFT CUBLAS NVCUVID NVCUVENC)
    NVIDIA GPU arch:             50 52 53 60 61 62 70 72 75 80 86 87 89 90
    NVIDIA PTX archs:            90

  cuDNN:                         YES (ver 8.9.7)

  OpenCL:                        YES (no extra features)
    Include path:                /home/b/repos/opencv/opencv-python/opencv/3rdparty/include/opencl/1.2
    Link libraries:              Dynamic load

  Python 3:
    Interpreter:                 /home/b/miniforge3/envs/py37/bin/python (ver 3.7.12)
    Libraries:                   /home/b/miniforge3/envs/py37/lib/libpython3.7m.so (ver 3.7.12)
    numpy:                       /home/b/miniforge3/envs/py37/lib/python3.7/site-packages/numpy/core/include (ver 1.21.6)
    install path:                python/cv2/python-3

  Python (for build):            /home/b/miniforge3/envs/py37/bin/python

  Java:                          
    ant:                         NO
    Java:                        NO
    JNI:                         NO
    Java wrappers:               NO
    Java tests:                  NO

  Install to:                    /home/b/repos/opencv/opencv-python/_skbuild/linux-x86_64-3.7/cmake-install
-----------------------------------------------------------------

You’re correct, I neglected to install FFmpeg first. I guess you were running it in a notebook and didn’t get the following error:

terminating with uncaught exception of type cv::Exception: OpenCV(4.9.0) /home/b/repos/opencv/opencv-python/opencv_contrib/modules/cudacodec/src/video_writer.cpp:83: error: (-213:The function/feature is not implemented) FFmpeg backend not found in function ‘FFmpegVideoWriter’

I will upload some new ubuntu wheels when I have chance.

@tvercaut If you have time can you test

The test below demonstrates how to use it:

import cv2 as cv
import numpy as np
src = 'big_buck_bunny.mp4'
reader = cv.cudacodec.createVideoReader(src)
reader.set(cv.cudacodec.ColorFormat_NV_NV12)
fmt = reader.format()
stream = cv.cuda.Stream()
writer = cv.cudacodec.createVideoWriter('output.mp4', fmt.targetSz, codec=cv.cudacodec.HEVC, fps=30,
                                        colorFormat=cv.cudacodec.ColorFormat_NV_YUV410_10BIT, stream=stream)
ret, frame_nv12 = reader.nextFrame()
frame_nv12_16bit = cv.cuda_GpuMat(frame_nv12.size(), cv.CV_16U)
frame_yuv410_10bit = cv.cuda_GpuMat(frame_nv12.size(),cv.CV_16U)
for i in range(100):
    frame_nv12.convertTo(cv.CV_16U, stream, dst=frame_nv12_16bit)
    cv.cuda.lshift(frame_nv12_16bit, 8, dst=frame_yuv410_10bit, stream=stream)
    writer.write(frame_yuv410_10bit)
    reader.nextFrame(frame_nv12, stream=stream)
writer.release()

Thanks, that looks great already! Your snippet mostly works for me although I found a few issues.

  1. A small typo I assume but ColorFormat_NV_YUV410_10BIT should be ColorFormat_NV_YUV420_10BIT

  2. The resulting video is readable in VLC but not in QuickTime on macOS. This is probably related to the choice of fourcc, as QuickTime is a bit peculiar about this and requires hvc1 instead of hev1:
    opencv_contrib/modules/cudacodec/src/video_writer.cpp at 3c2bcbfe8374edaf3eb756b560374244538d57b6 · opencv/opencv_contrib · GitHub
    Could a way of changing the fourcc be exposed to the user?

  3. The corresponding reading of 10-bit YUV420 with cudacodec doesn’t seem to work. If I specify reader.set(cv.cudacodec.ColorFormat_NV_YUV410_10BIT) instead of reader.set(cv.cudacodec.ColorFormat_NV_NV12), the frames returned by the reader have 4 uint8 channels and the size of the video, rather than what you currently expect for the writer (1 channel, frame height of 1.5x the video height, uint16 element type). See the snippet after the ffmpeg command below.

For testing purposes, here is how to generate a test sample:

ffmpeg -hide_banner -loglevel error -stats -f lavfi -i testsrc=duration=10:size=1280x720:rate=30 -y -pix_fmt yuv420p10le -c:v libx265 -x265-params log-level=warning -tag:v hvc1 testsrc-hevc-yuv420p10le.mp4
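
For reference, a minimal read-back check for point 3 would look roughly like this (using the corrected enum name from point 1):

import cv2 as cv

reader = cv.cudacodec.createVideoReader('testsrc-hevc-yuv420p10le.mp4')
reader.set(cv.cudacodec.ColorFormat_NV_YUV420_10BIT)
ret, frame = reader.nextFrame()
# Currently this reports 4 channels of uint8 at the video size rather than the
# 1-channel uint16 layout (1.5x the video height) expected by the writer
print(frame.size(), frame.channels(), frame.depth())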

I haven’t implemented that change yet. I am still not sure if outputting that format from the video reader is relevant to OpenCV, as no routines (that I know of) can process it. What is your use case in OpenCV? I assumed you were processing footage from a sensor and wanted to archive the footage at a higher precision, not that you needed to read it back into OpenCV?