Contributing using CUDA - versions and libraries

TumoiYorozu · December 24, 2023, 9:21am

Hello all. I am considering contributing an update to the median filter in OpenCV (4.x) using CUDA.

At SIGGRAPH Asia 2022, a method was presented that, according to the paper, achieves more than a tenfold speedup over the current histogram-based method. In addition, OpenCV currently supports windows only up to 5x5 for 16-bit or float data, whereas the new method also supports larger windows for HDR data.

Paper’s project page: Constant Time Median Filter using 2D Wavelet Matrix | Interactive Graphics & Engineering Lab
My implementation (as author): GitHub - TumoiYorozu/WMatrixMedian: [SIGGRAPH Asia 2022 Technical Papers' Best Paper Award]

I am thinking about offering this implementation to OpenCV, but I have questions about which CUDA version and libraries can be used. My current implementation uses the CUDA library CUB. As far as I can tell from examining the OpenCV project, no code there currently uses CUB. Since CUB has been included in the standard toolkit since CUDA 11.0, it should be possible to port the implementation as is, provided OpenCV is fine with requiring CUDA 11 or higher.

What version of CUDA should I use for OpenCV 4.x? Is CUB usable? Would it be necessary to branch the code depending on whether CUB is available for older environments?

Also, NVIDIA has released many GPUs over the years. From which generation on should they be supported? Kepler (3.5)? Maxwell (5.x)?

cudawarped · December 25, 2023, 6:29am

If you have time that would be great!

As far as I know you are correct that nothing in OpenCV currently uses CUB, but there are existing implementations of warp- and block-level reductions/scans, e.g.

https://github.com/opencv/opencv/blob/4884083019c3378a84b8eafd16a4aacbcec081e9/modules/core/include/opencv2/core/cuda/scan.hpp#L218

          #endif
          }
          
          template <typename T>
          __device__ __forceinline__ T warpScanExclusive(T idata, volatile T* s_Data, unsigned int tid)
          {
              return warpScanInclusive(idata, s_Data, tid) - idata;
          }
          
          template <int tiNumScanThreads, typename T>
          __device__ T blockScanInclusive(T idata, volatile T* s_Data, unsigned int tid)
          {
              if (tiNumScanThreads > OPENCV_CUDA_WARP_SIZE)
              {
                  //Bottom-level inclusive warp scan
                  T warpResult = warpScanInclusive(idata, s_Data, tid);
          
                  //Save top elements of each warp for exclusive warp scan
                  //sync to wait for warp scans to complete (because s_Data is being overwritten)
                  __syncthreads();
                  if ((tid & (OPENCV_CUDA_WARP_SIZE - 1)) == (OPENCV_CUDA_WARP_SIZE - 1))

So you could use those and, if you want, update them to use CUB when the CUDA Toolkit version is 11 or higher.

If your improvement to an existing algorithm relies on features that are only available on newer hardware, you can compile those modifications in selectively, only when building for that architecture, by checking __CUDA_ARCH__, e.g.

https://github.com/opencv/opencv/blob/4884083019c3378a84b8eafd16a4aacbcec081e9/modules/dnn/src/cuda/atomics.hpp#L14

          #ifndef OPENCV_DNN_SRC_CUDA_ATOMICS_HPP
          #define OPENCV_DNN_SRC_CUDA_ATOMICS_HPP
          
          #include <cuda_runtime.h>
          #include <cuda_fp16.h>
          
          // The 16-bit __half floating-point version of atomicAdd() is only supported by devices of compute capability 7.x and higher.
          // This function was introduced in CUDA 10.
          // https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd
          #if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 700 && CUDART_VERSION >= 10000)
          // And half-precision floating-point operations are not supported by devices of compute capability strictly lower than 5.3
          // https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications
          #elif __CUDA_ARCH__ < 530
          #else
          inline __device__ void atomicAdd(__half* address, __half val) {
              unsigned int* address_as_ui = (unsigned int *)((char *)address - ((size_t)address & 2));
              unsigned int old = *address_as_ui;
              unsigned int assumed;
          
              do {
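
Reduced to its core, the same pattern might look like the sketch below. `SKETCH_HD` and `selectedCodePath` are hypothetical names used only for illustration, and the 700 threshold simply mirrors the excerpt above. `__CUDA_ARCH__` is defined only during device compilation passes, so the host pass (or a plain C++ compiler) takes the first branch.

```cpp
// Sketch only: SKETCH_HD and selectedCodePath are hypothetical names.
#if defined(__CUDACC__)
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD               // plain C++ compiler: no CUDA qualifiers
#endif

// Returns an id for the code path compiled into the current pass:
// 0 = host, 1 = device below compute capability 7.0, 2 = device 7.0 or newer.
SKETCH_HD inline int selectedCodePath()
{
#if !defined(__CUDA_ARCH__)
    return 0;   // host pass: __CUDA_ARCH__ is undefined
#elif __CUDA_ARCH__ >= 700
    return 2;   // newer hardware: the optimized variant would go here
#else
    return 1;   // older hardware: portable fallback
#endif
}
```

Only the function body varies with the architecture; the signature stays the same in every pass, which keeps host and device compilation consistent.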

crackwitz · December 25, 2023, 11:13am

It’s fine to discuss this on the forum, but you’ll probably want to open an issue and discuss it on GitHub. That is where all the library work happens, and where all the paid core maintainers are to be found.

TumoiYorozu · December 25, 2023, 1:03pm

Thanks cudawarped for the quick reply. I consider all my current technical questions answered.
I never knew that there are primitives within OpenCV that also provide scan, etc. for CUDA.
I will use these and do some refactoring to create a PR.
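
For anyone doing a similar refactor, a host-side reference can help check that a swapped-in scan still produces the same results. This is only a sketch: `inclusiveScanRef` and `exclusiveScanRef` are hypothetical helper names, and the exclusive scan is derived the same way `warpScanExclusive` does it in the excerpt above, by subtracting each element's own input from its inclusive result.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side reference: inclusive prefix sum, out[i] = in[0] + ... + in[i].
template <typename T>
std::vector<T> inclusiveScanRef(const std::vector<T>& in)
{
    std::vector<T> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Exclusive prefix sum, out[i] = in[0] + ... + in[i-1], obtained by
// subtracting each element's own input from its inclusive result.
template <typename T>
std::vector<T> exclusiveScanRef(const std::vector<T>& in)
{
    std::vector<T> out = inclusiveScanRef(in);
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] -= in[i];
    return out;
}
```

For the input {3, 1, 4, 1, 5}, the inclusive result is {3, 4, 8, 9, 14} and the exclusive result is {0, 3, 4, 8, 9}.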

TumoiYorozu · December 25, 2023, 1:06pm

Thank you for pointing this out, crackwitz.
I also checked the recent issues on GitHub, but most of them are bug reports. When I opened the New Issue page, I hesitated over “Feature request”, but I thought my question belonged under “Questions”, so I decided to go with the forum.
I will use GitHub issues next time.
