OpenCV 4.6 cross-compilation with CUDA for Jetson Xavier NX

Hello,

I’m trying to cross-compile OpenCV 4.6 with CUDA support for the Jetson Xavier NX.

My build environment is WSL2, with CUDA compilation tools release 12.3, V12.3.52.

On my Jetson Xavier NX I have JetPack 5.1.3 / L4T 35.5.0 with CUDA support (CUDA 11.4).

The build appears to complete correctly, without any errors.
For this I use:

cmake \
-D CUDA_VERBOSE_BUILD=ON \
-D CUDA_ARCH_BIN=7.0,7.2 \
-D CUDA_ARCH_PTX="" \
-D CUDA_NVCC_FLAGS="-D_FORCE_INLINES" \
-D CUDA_BUILD_EMULATION=OFF \
-D WITH_CUDA=ON \
-D BUILD_EXAMPLES=OFF \
-D BUILD_opencv_apps=OFF \
-D INSTALL_C_EXAMPLES=OFF \
-D INSTALL_TESTS=OFF \
-D CMAKE_BUILD_TYPE=Release \
-D CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT=true \
-D CMAKE_TOOLCHAIN_FILE=../.../jetson.toolchain.cmake \
-D OPENCV_EXTRA_MODULES_PATH=../opencv_contrib/modules \
-D CMAKE_CROSS=../../jetson.toolchain.cmake \
-D CMAKE_CROSSCOMPILING=true \
-D CUDA_TOOLKIT_TARGET_DIR=../.../Linux_for_Tegra/rootfs/usr/local/cuda-11.4/ \
-D CUDA_HOST_COMPILER=../../../../aarch64--glibc--stable-final/bin/aarch64-buildroot-linux-gnu-g++ \
../opencv

This produces the following configuration summary:

-- General configuration for OpenCV 4.6.0 =====================================
--   Version control:               4.6.0-dirty
-- 
--   Extra modules:
--     Location (extra):            /home/afr/opencv/opencv_contrib/modules
--     Version control (extra):     4.6.0
-- 
--   Platform:
--     Timestamp:                   2024-06-18T15:04:50Z
--     Host:                        Linux 5.15.74.2-microsoft-standard-WSL2+ x86_64
--     Target:                      Linux aarch64
--     CMake:                       3.16.3
--     CMake generator:             Unix Makefiles
--     CMake build tool:            /usr/bin/make
--     Configuration:               Release
-- 
--   CPU/HW features:
--     Baseline:                    NEON FP16
--       required:                  NEON
--       disabled:                  VFPV3
-- 
--   C/C++:
--     Built as dynamic libs?:      YES
--     C++ standard:                11
--     C++ Compiler:                /home/afr/aarch64--glibc--stable-final/bin/aarch64-buildroot-linux-gnu-g++  (ver 9.3.0)
--     C++ flags (Release):         -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG  -DNDEBUG
--     C++ flags (Debug):           -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -fvisibility-inlines-hidden -g  -O0 -DDEBUG -D_DEBUG
--     C Compiler:                  /home/afr/aarch64--glibc--stable-final/bin/aarch64-buildroot-linux-gnu-gcc
--     C flags (Release):           -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -O3 -DNDEBUG  -DNDEBUG
--     C flags (Debug):             -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -g  -O0 -DDEBUG -D_DEBUG
--     Linker flags (Release):      -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined  
--     Linker flags (Debug):        -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined  
--     ccache:                      NO
--     Precompiled headers:         NO
--     Extra dependencies:          dl m pthread rt cudart nppc nppial nppicc nppidei nppif nppig nppim nppist nppisu nppitc npps cublas cufft -L/home/afr/Linux_for_Tegra/rootfs/usr/local/cuda-11.4/lib64
--     3rdparty dependencies:
-- 
--   OpenCV modules:
--     To be built:                 aruco barcode bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dnn_superres dpm face features2d flann fuzzy gapi hfs highgui img_hash imgcodecs imgproc intensity_transform line_descriptor mcc ml objdetect optflow phase_unwrapping photo plot quality rapid reg rgbd saliency shape stereo stitching structured_light superres surface_matching text tracking ts video videoio videostab wechat_qrcode xfeatures2d ximgproc xobjdetect xphoto
--     Disabled:                    world
--     Disabled by dependency:      -
--     Unavailable:                 alphamat cvv freetype hdf java julia matlab ovis python2 python3 sfm viz
--     Applications:                tests perf_tests
--     Documentation:               NO
--     Non-free algorithms:         NO
-- 
--   GUI:                           NONE
--     GTK+:                        NO
-- 
--   Media I/O: 
--     ZLib:                        zlib (ver 1.2.12)
--     JPEG:                        libjpeg-turbo (ver 2.1.2-62)
--     WEBP:                        build (ver encoder: 0x020f)
--     PNG:                         build (ver 1.6.37)
--     TIFF:                        build (ver 42 - 4.2.0)
--     JPEG 2000:                   build (ver 2.4.0)
--     HDR:                         YES
--     SUNRASTER:                   YES
--     PXM:                         YES
--     PFM:                         YES
-- 
--   Video I/O:
--     DC1394:                      NO
--     FFMPEG:                      NO
--       avcodec:                   NO
--       avformat:                  NO
--       avutil:                    NO
--       swscale:                   NO
--       avresample:                NO
--     GStreamer:                   NO
--     v4l/v4l2:                    YES (linux/videodev2.h)
-- 
--   Parallel framework:            pthreads
-- 
--   Trace:                         YES (with Intel ITT)
-- 
--   Other third-party libraries:
--     Lapack:                      NO
--     Custom HAL:                  YES (carotene (ver 0.0.1))
--     Protobuf:                    build (3.19.1)
-- 
--   NVIDIA CUDA:                   YES (ver 12.3, CUFFT CUBLAS)
--     NVIDIA GPU arch:             70
--     NVIDIA PTX archs:
-- 
--   cuDNN:                         NO
-- 
--   OpenCL:                        YES (no extra features)
--     Include path:                /home/afr/opencv/opencv/3rdparty/include/opencl/1.2
--     Link libraries:              Dynamic load
-- 
--   Python (for build):            /usr/bin/python3
-- 
--   Install to:                    /home/afr/opencv/build/install
-- -----------------------------------------------------------------
-- 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/afr/opencv/build

The CPU-only part of OpenCV seems to have built correctly: a simple example that uses only cv::Mat works as expected.

A standalone test that uses plain C++ and the CUDA runtime, without OpenCV, also works fine.

The problem appears as soon as I use any OpenCV functionality that relies on CUDA, for example cv::cuda::GpuMat. In that case I immediately get the error “…device kernel image is invalid in function…”.

The GPU of the Xavier NX uses the Volta architecture, which supports 7.0 and 7.2, so I don’t think the error comes from the architecture version.
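One way to test the architecture assumption directly is to list which native GPU architectures (SASS) are actually embedded in one of the CUDA-enabled OpenCV libraries. A minimal sketch, assuming `cuobjdump` from a CUDA toolkit is available and that its `--list-elf` output names each embedded cubin with its `sm_XX` target; the function name `cuda_sass_archs` is hypothetical:

```shell
# Hypothetical helper: read `cuobjdump --list-elf <library>` output on stdin
# and print each distinct sm_XX architecture that has native device code.
cuda_sass_archs() {
    # Extract every "sm_" followed by at least one digit, then de-duplicate.
    grep -o 'sm_[0-9][0-9]*' | sort -u
}
```

Usage: `cuobjdump --list-elf /path/to/libopencv_cudaarithm.so | cuda_sass_archs` (the path is a placeholder for your install location). Because the build uses `CUDA_ARCH_PTX=""`, no PTX is embedded for the driver to JIT-compile, so if `sm_72` is missing from this list the Xavier NX has no usable kernel image. `cuobjdump --list-ptx` shows any embedded PTX.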

Could someone point me toward the cause of this error?

The architecture is Volta, but the compute capability is 7.2.

Thank you for your response!

Indeed, the Jetson Xavier NX is compatible with 7.2. But even if I specify only 7.2 when building, the problem persists.

Does anyone have an idea what could be causing the error?
Is there something required when compiling the library, but not when compiling standalone applications, that I might have missed?

Thanks!

That’s strange, as the error you are getting indicates that no suitable (cc 7.2) device code is available.

Can you check that the output of getBuildInformation() confirms that OpenCV includes device code for cc 7.2?
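This check can be scripted against the text that getBuildInformation() prints, or against CMake’s configuration summary, which contains the same “NVIDIA GPU arch:” line. A minimal sketch, assuming that text has been saved to a file; `opencv_has_gpu_arch` and `build_info.txt` are hypothetical names:

```shell
# Hypothetical helper: succeed if the "NVIDIA GPU arch:" line of OpenCV's
# build information (read on stdin) lists the requested arch, e.g. 72.
opencv_has_gpu_arch() {
    wanted="$1"
    # Keep only the values after "NVIDIA GPU arch:", put one value per line,
    # then look for an exact whole-line match.
    sed -n 's/.*NVIDIA GPU arch:[[:space:]]*//p' |
        tr -s ' \t' '\n\n' |
        grep -qx "$wanted"
}
```

Usage: `opencv_has_gpu_arch 72 < build_info.txt && echo "cc 7.2 device code present"`. Applied to the two summaries in this thread, the first one (`NVIDIA GPU arch: 70`) fails the check and the second one (`72`) passes.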

Below is the output of getBuildInformation() executed on my Jetson Xavier NX target.
As you can see, the GPU architecture is 7.2:

General configuration for OpenCV 4.6.0 =====================================
  Version control:               4.6.0-dirty

  Extra modules:
    Location (extra):            /home/afr/opencv/opencv_contrib/modules
    Version control (extra):     4.6.0

  Platform:
    Timestamp:                   2024-06-21T09:51:24Z
    Host:                        Linux 5.15.74.2-microsoft-standard-WSL2+ x86_64
    Target:                      Linux aarch64
    CMake:                       3.16.3
    CMake generator:             Unix Makefiles
    CMake build tool:            /usr/bin/make
    Configuration:               Release

  CPU/HW features:
    Baseline:                    NEON FP16
      required:                  NEON
      disabled:                  VFPV3

  C/C++:
    Built as dynamic libs?:      YES
    C++ standard:                11
    C++ Compiler:                /home/afr/aarch64--glibc--stable-final/bin/aarch64-buildroot-linux-gnu-g++  (ver 9.3.0)
    C++ flags (Release):         -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG  -DNDEBUG
    C++ flags (Debug):           -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -fvisibility-inlines-hidden -g  -O0 -DDEBUG -D_DEBUG
    C Compiler:                  /home/afr/aarch64--glibc--stable-final/bin/aarch64-buildroot-linux-gnu-gcc
    C flags (Release):           -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -O3 -DNDEBUG  -DNDEBUG
    C flags (Debug):             -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -g  -O0 -DDEBUG -D_DEBUG
    Linker flags (Release):      -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined
    Linker flags (Debug):        -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined
    ccache:                      NO
    Precompiled headers:         NO
    Extra dependencies:          dl m pthread rt cudart nppc nppial nppicc nppidei nppif nppig nppim nppist nppisu nppitc npps cublas cufft -L/home/afr/rootfs/usr/local/cuda-11.4/lib64
    3rdparty dependencies:

  OpenCV modules:
    To be built:                 aruco barcode bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dnn_superres dpm face features2d flann fuzzy gapi hfs highgui img_hash imgcodecs imgproc intensity_transform line_descriptor mcc ml objdetect optflow phase_unwrapping photo plot quality rapid reg rgbd saliency shape stereo stitching structured_light superres surface_matching text tracking ts video videoio videostab wechat_qrcode xfeatures2d ximgproc xobjdetect xphoto
    Disabled:                    world
    Disabled by dependency:      -
    Unavailable:                 alphamat cvv freetype hdf java julia matlab ovis python2 python3 sfm viz
    Applications:                tests perf_tests
    Documentation:               NO
    Non-free algorithms:         NO

  GUI:                           NONE
    GTK+:                        NO

  Media I/O:
    ZLib:                        zlib (ver 1.2.12)
    JPEG:                        libjpeg-turbo (ver 2.1.2-62)
    WEBP:                        build (ver encoder: 0x020f)
    PNG:                         build (ver 1.6.37)
    TIFF:                        build (ver 42 - 4.2.0)
    JPEG 2000:                   build (ver 2.4.0)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

  Video I/O:
    DC1394:                      NO
    FFMPEG:                      NO
      avcodec:                   NO
      avformat:                  NO
      avutil:                    NO
      swscale:                   NO
      avresample:                NO
    GStreamer:                   NO
    v4l/v4l2:                    YES (linux/videodev2.h)

  Parallel framework:            pthreads

  Trace:                         YES (with Intel ITT)

  Other third-party libraries:
    Lapack:                      NO
    Custom HAL:                  YES (carotene (ver 0.0.1))
    Protobuf:                    build (3.19.1)

  NVIDIA CUDA:                   YES (ver 12.3, CUFFT CUBLAS)
    NVIDIA GPU arch:             72
    NVIDIA PTX archs:

  cuDNN:                         NO

  OpenCL:                        YES (no extra features)
    Include path:                /home/afr/opencv/opencv/3rdparty/include/opencl/1.2
    Link libraries:              Dynamic load

  Python (for build):            /usr/bin/python3

  Install to:                    /home/afr/opencv/build/install
-----------------------------------------------------------------

Thanks!

Is it the same error? Does your driver support CUDA runtime >= 12.0 (>=525.60.13)?
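Checking a driver version against the 525.60.13 minimum is a version-string comparison. A minimal sketch, assuming GNU coreutils `sort` (for `-V` and `-C`); `version_ge` is a hypothetical helper name:

```shell
# Hypothetical helper: version_ge A B succeeds if version A >= version B.
# It relies on GNU sort: -V orders version strings, -C exits 0 only if
# the input is already sorted, i.e. if B <= A.
version_ge() {
    printf '%s\n%s\n' "$2" "$1" | sort -V -C
}
```

Usage, on a machine where `nvidia-smi` is available: `version_ge "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" 525.60.13 && echo "driver OK for CUDA 12.x"`. The check only matters on the machine that runs the CUDA code, so the host driver version does not say anything about the Jetson, and `nvidia-smi` may not be available on a JetPack 5 target.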

Hi,
I don’t think the driver is the problem.
On the target I have the default driver that ships with NVIDIA Jetson Linux 35.5.0.
On my host, nvidia-smi reports NVIDIA-SMI 525.104, Driver Version 528.79 and CUDA Version 12.0.

I don’t think the problem is host/target compatibility, because I can cross-compile a simple example that uses the CUDA runtime (several calls to cudaMalloc/cudaMemcpy/cudaFree) and it runs correctly.

Isn’t that the driver for CUDA 11.4? According to the docs, it is not compatible with CUDA 12.0.

That said, if you can call cross-compiled CUDA >= 12.0 runtime functions on your Jetson, that might not be the issue.

Try building one of NVIDIA’s sample kernels natively on your Jetson (it should be quick) to see if it works. Then try cross-compiling it and see if that works too.
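For this comparison it helps to pin the target architecture explicitly, so that both the native and the cross-compiled build embed native code for cc 7.2. A minimal sketch that prints the corresponding nvcc `-gencode` flags; `gencode_flags` is a hypothetical helper:

```shell
# Hypothetical helper: print nvcc -gencode flags that embed native code
# (SASS) for each compute capability given as "major.minor", e.g. 7.2.
gencode_flags() {
    flags=""
    for cc in "$@"; do
        sm=$(printf '%s' "$cc" | tr -d '.')   # "7.2" -> "72"
        flags="$flags -gencode arch=compute_$sm,code=sm_$sm"
    done
    printf '%s\n' "${flags# }"                # drop the leading space
}
```

Usage on the Jetson: `nvcc $(gencode_flags 7.2) -o vectorAdd vectorAdd.cu`, where `vectorAdd.cu` stands for any sample source, such as the vectorAdd sample from NVIDIA’s cuda-samples. For the cross-compiled build, add `-ccbin` pointing at the aarch64 g++ used above; this assumes an nvcc installation set up for aarch64 cross-compilation. `$(gencode_flags …)` is left unquoted on purpose so the shell splits it into separate arguments.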

OpenCV is not picking up the CUDA 11.4 toolkit you point to with CUDA_TOOLKIT_TARGET_DIR, as can be seen from

--   NVIDIA CUDA:                   YES (ver 12.3, CUFFT CUBLAS)

I suspect you want to use this version of the CUDA toolkit (i.e. not 12.3 from your host machine). Does it pick up the correct location if you remove the extra “.” (i.e. ../../L…), or if you specify the location using:

-DCUDA_TOOLKIT_ROOT_DIR=../../Linux_for_Tegra/rootfs/usr/local/cuda-11.4/
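To confirm which toolkit CMake actually picked up after changing the path, the version can be read back from the “NVIDIA CUDA:” line of the configuration summary. A minimal sketch; `opencv_cuda_version` and `cmake_output.txt` are hypothetical names:

```shell
# Hypothetical helper: print the CUDA version OpenCV was configured with,
# taken from a line such as "NVIDIA CUDA:   YES (ver 12.3, CUFFT CUBLAS)".
# Reads CMake's summary or getBuildInformation() text on stdin.
opencv_cuda_version() {
    sed -n 's/.*NVIDIA CUDA:[[:space:]]*YES (ver \([0-9][0-9.]*\).*/\1/p'
}
```

Usage: capture the configure step with `cmake … ../opencv 2>&1 | tee cmake_output.txt`, then run `opencv_cuda_version < cmake_output.txt`. For the summaries posted in this thread it prints `12.3`; `11.4` would indicate that the Jetson’s toolkit is being used.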