Copy cv::cuda::GpuMat in Cuda Kernel

Aidan_Vickars · October 5, 2022, 3:14pm

Thanks so much for the response!

So with regard to the threads, the block dim in the z dimension is set to three. This is done initially here:

That’s how “out_in_z” will be 0, 1, or 2, and as a result there will be 3 threads per BGR pixel.

One thing to note: I don’t believe it is really my “code” per se, because when I do an identical copy in the kernel, except that I copy the contents of the GpuMat to another GpuMat instead of a raw 1D array, it works just fine.

So that’s why I’m guessing I am misunderstanding the structure of a GpuMat in memory. Specifically, I think the issue is likely right around here inside the kernel:

Aidan_Vickars:

    // Computing the index of the thread for the flattened original frame
    int in_pos = (in_y * input.step) + (in_x * 3) + out_in_z;

    // Computing the index of the thread for the flattened new frame
    int out_pos = (out_y * out_width) + (out_x * 3) + out_in_z;

    output[out_pos] = input[in_pos];

This is why I’ve posted my question here instead of on, say, NVIDIA’s forum: the issue is likely a misunderstanding of how a GpuMat is stored in memory, and as a result I am not copying it properly.
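A minimal host-side sketch of the copy being attempted, assuming an 8-bit BGR image, identical input and output coordinates, and an output width given in pixels. The function name `copy_pitched_to_packed` is hypothetical, and the three nested loops stand in for the CUDA thread indices. The point it illustrates is that the two buffers have different row strides: the input advances by `step` bytes per row, while a tightly packed output advances by `width * 3` bytes per row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an OpenCV API): copies a pitched 8-bit BGR image
// into a tightly packed 1D buffer. Each (y, x, c) iteration plays the role
// of one CUDA thread. in_step is the input row pitch in BYTES.
std::vector<std::uint8_t> copy_pitched_to_packed(const std::uint8_t* input,
                                                 std::size_t in_step,
                                                 int width, int height) {
    const int channels = 3;
    const std::size_t out_step = static_cast<std::size_t>(width) * channels;
    assert(in_step >= out_step);  // a row can be padded, never truncated

    std::vector<std::uint8_t> output(out_step * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            for (int c = 0; c < channels; ++c) {
                // Input rows are in_step bytes apart (padding included).
                const std::size_t in_pos =
                    static_cast<std::size_t>(y) * in_step + x * channels + c;
                // Output rows are width * 3 bytes apart (no padding).
                const std::size_t out_pos =
                    static_cast<std::size_t>(y) * out_step + x * channels + c;
                output[out_pos] = input[in_pos];
            }
        }
    }
    return output;
}
```

If `out_width` in the kernel above is a pixel count rather than a byte count, the output row term would need the same factor of 3 that the column term already has.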

Another interesting thing I’ve noticed is that when I force the source GpuMat to be continuous, the results are better. For context, I’ve attached 2 pictures: one of the result I am getting when the source GpuMat is not continuous (the badly jumbled picture), and the other of the result I am getting when the source GpuMat is continuous (not jumbled, but incomplete).
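A small sketch of why continuity could matter here, assuming 8-bit data and hypothetical helper names. A buffer is continuous when its step equals the payload of one row, and only in that case does a flat offset computed from the width agree with the pitched offset computed from the step.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for an 8-bit image; none of these are OpenCV calls.

// True when rows are stored back to back, with no padding bytes.
bool is_continuous(std::size_t step, int cols, int channels) {
    return step == static_cast<std::size_t>(cols) * channels;
}

// Byte offset of channel c of pixel (y, x) using the real row pitch.
std::size_t pitched_offset(std::size_t step, int y, int x, int c, int channels) {
    assert(y >= 0 && x >= 0 && c >= 0 && c < channels);
    return static_cast<std::size_t>(y) * step + x * channels + c;
}

// Byte offset of the same element if the rows are assumed to be unpadded.
std::size_t flat_offset(int cols, int y, int x, int c, int channels) {
    assert(y >= 0 && x >= 0 && c >= 0 && c < channels);
    return (static_cast<std::size_t>(y) * cols + x) * channels + c;
}
```

For a 640-pixel-wide BGR image, the two offsets agree on every row when the step is 1920 bytes, and drift apart from row 1 onward for any larger step.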

Anyway, I’ll try on the host side as you suggest and post an update.

So to confirm that I am understanding correctly how the underlying data is stored in a GpuMat, my understanding is that it is stored as follows:

One row has the following structure:

[B G R B G R … B G R unusedSpace] where each B/G/R occupies a single uchar, and the unused space is there so that the row is a contiguous block of memory whose size matches the step size (i.e. Mat.step). Further, in a single row we have a B, G and R for every pixel in that row. For example, if the image has width 640, then the row will be of size: sizeof(uchar) * 640 * 3 + some unused space.
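The arithmetic for one row can be sketched as below. The helper names are hypothetical, and the step value used in the usage note is only an illustrative assumption, since the actual pitch is chosen by the allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for an 8-bit image (one uchar per channel).

// Bytes of real pixel data in one row: sizeof(uchar) * cols * channels.
std::size_t row_payload_bytes(int cols, int channels) {
    return sizeof(std::uint8_t) * static_cast<std::size_t>(cols) * channels;
}

// Unused bytes at the end of each row, given the row pitch in bytes.
std::size_t row_padding_bytes(std::size_t step, int cols, int channels) {
    const std::size_t payload = row_payload_bytes(cols, channels);
    assert(step >= payload);  // the pitch can never be smaller than the data
    return step - payload;
}
```

With a width of 640 and 3 channels, the payload is 1920 bytes. If the step happened to be 2048 bytes (an assumed value), each row would end with 128 unused bytes.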

Putting the rows together we have the following structure:

[B G R B G R … B G R unusedSpace]
[B G R B G R … B G R unusedSpace]
… (for every row in the image)
[B G R B G R … B G R unusedSpace]

such that a pointer to the 1st element in a row can be obtained using the following: Mat.data + (row * Mat.step)
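Written as code, under the assumption of 8-bit BGR data, that row-pointer expression and the per-pixel access built on it look like this. `row_ptr` and `channel_at` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: pointer to the first byte of a row.
// Equivalent to data + (row * step), with step measured in bytes.
const std::uint8_t* row_ptr(const std::uint8_t* data, std::size_t step, int row) {
    assert(row >= 0);
    return data + static_cast<std::size_t>(row) * step;
}

// Hypothetical helper: value of channel c (0 = B, 1 = G, 2 = R) at (row, col).
std::uint8_t channel_at(const std::uint8_t* data, std::size_t step,
                        int row, int col, int c) {
    assert(col >= 0 && c >= 0 && c < 3);
    // Step to the row first, then index within the row in units of uchar.
    return row_ptr(data, step, row)[col * 3 + c];
}
```

Taking the row pointer first and indexing within the row afterwards keeps the byte-based step separate from the element-based column index.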

Is this all correct? If something doesn’t make sense please let me know.
