I have another problem using cudaCV: it's crashing in circumstances that I don't understand.
My code works with a CPU Mat, but when I try to do the same with GpuMat, it crashes. My plan is to perform a per-element absdiff on two large multidimensional GpuMats. While trying to debug my code I found out that it's not only absdiff but also add and subtract that crash in the same way.
So why does it work with 1000x3000x1 and 1000x1000x3 but not with 1000000x3x1? My current matrices are 1000000x16x8. I've also noticed the limited functionality of cuda::transpose, and a completely missing repeat that would be very useful.
The short answer is that the CUDA routines were designed for computer vision, and they did not account for a one-million-row image with 3 columns.
The slightly longer reason is that, due to the computer vision constraint above, binary operations between two arrays are processed by blocks of 32x8 threads. This works well for any reasonably shaped image; however, when you have 1 million rows and only 3 columns, 125,000 blocks of threads are launched in the y dimension, which exceeds the maximum number of blocks allowed there (65,535). If instead you have 1 million columns it should work, as the maximum number of blocks in the x dimension is much larger.
Now, the routines could easily be re-written to accommodate the shape you are using by processing more elements in each thread, but as the CUDA modules are in the contrib repository, this is unlikely to happen unless you do it yourself.
It does: if I do inp = GpuMat(reshaped.t()); and upload the transposed matrix, it doesn't crash. Now I have to rework the whole code, because cuda::transpose doesn't really work with n-channel arrays.
But it is computer vision. I've tried to build a performant color reduction using KMeans (universal for n dimensions), processing the whole distance matrix between the samples and the prototypes with cudaCV. A CPU version of KMeans is part of OpenCV's core, but I don't know how many dimensions it can handle, and it doesn't exist for the GPU.
Yes, like I would know about GPU programming, CUDA, NPP or what blocks are. All I want is to use the OpenCV library that I know a bit together with my new CUDA-capable GPU, and to see how much faster it can get. I don't even know how to implement a cuda::repeat() myself in a way that isn't incredibly slow using src.row(a).copyTo(dst.row(b)) or copyTo(dst(ROI)), and it gets even slower when I try to use OpenMP.
my issue with transpose is more like the multidimensionality:
Mat c = Mat(5, 7, CV_32FC(8));
GpuMat g = GpuMat(c);
cv::transpose(c, c);    // works
cuda::transpose(g, g);  // crash
but using CV_32FC(9) will also cause cv::transpose to crash — presumably because the 36-byte element size (9 channels x 4 bytes) exceeds what the implementation supports; cv::transpose appears to handle element sizes only up to 32 bytes, and cuda::transpose is even more restrictive.