Copy cv::cuda::GpuMat in Cuda Kernel

Thanks so much for the response!

So, regarding the threads: the block dimension in z is set to three in the launch configuration. That’s how “out_in_z” will be 0, 1, or 2, and as a result there will be 3 threads per BGR pixel.

One thing to note: I don’t believe it is really my “code” per se, because when I do an identical copy in the kernel, with the exception that I copy the contents of the GpuMat to another GpuMat instead of a raw 1D array, it works just fine.

So that’s why I’m guessing I am misunderstanding the structure of a GpuMat in memory. Specifically, I think the issue is likely right around here inside the kernel.

This is why I’ve posted my question here instead of on, say, NVIDIA’s forum: the issue is likely a misunderstanding of how a GpuMat is stored in memory, and as a result I am not copying it properly.

Another interesting thing I’ve noticed is that when I force the source GpuMat to be continuous, the results are better. For context I’ve attached 2 pictures: one of the result I am getting when the source GpuMat is not continuous (the really badly jumbled picture), and the other of the result I am getting when the source GpuMat is continuous (the picture that is not jumbled but is incomplete).
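To make the continuity observation concrete, here is a minimal host-side sketch (plain C++, no OpenCV) of the two offset calculations involved. The function names are hypothetical and it assumes 8-bit channels; `step` stands for the row pitch in bytes, as reported by GpuMat::step. When step equals width * channels (the continuous case) the two offsets agree, otherwise they diverge from the second row on, which might be one reason a non-continuous source comes out jumbled.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of channel c of pixel (x, y) in a pitched (row-padded) image.
// `step` is the row pitch in bytes. Assumes 1 byte per channel (uchar).
inline std::size_t pitched_offset(std::size_t x, std::size_t y, std::size_t c,
                                  std::size_t channels, std::size_t step) {
    assert(c < channels);
    return y * step + x * channels + c;
}

// Byte offset of the same element in a tightly packed 1D array
// (no padding at the end of each row).
inline std::size_t packed_offset(std::size_t x, std::size_t y, std::size_t c,
                                 std::size_t width, std::size_t channels) {
    assert(c < channels);
    return (y * width + x) * channels + c;
}
```

In a kernel, the read from the GpuMat would use the first formula and the write to the raw 1D array the second, with c playing the role of “out_in_z”.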

Anyway, I’ll try on the host side as you suggest and post an update.

So to confirm that I am understanding correctly how the underlying data is stored in a GpuMat, my understanding is that it is stored as follows:

One row has the following structure:

[B G R B G R … B G R unusedSpace] where each B/G/R occupies a single uchar, and the unused space pads the row so that the memory it occupies matches the step size (i.e. Mat.step). Within a single row there is a B, G and R for every pixel in that row. I.e. if the image has width 640, then the row will be of size: sizeof(uchar) * 640 * 3 + some unused space.

Putting the rows together we have the following structure:

[B G R B G R … B G R unusedSpace]
[B G R B G R … B G R unusedSpace]
… (for every row in the image)
[B G R B G R … B G R unusedSpace]

such that a pointer to the 1st element in a row can be obtained using the following: Mat.data + (row * Mat.step)
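If that layout is right, copying into a raw 1D array has to skip the unused space at the end of every row. Here is a rough host-side sketch of that idea in plain C++ (no OpenCV or CUDA). The function name copy_pitched_to_packed and its parameters are made up for illustration, and it assumes uchar channels with `step` given in bytes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy a pitched image (rows padded out to `step` bytes) into a tightly
// packed 1D buffer. Hypothetical helper, assuming 1 byte per channel.
std::vector<unsigned char> copy_pitched_to_packed(const unsigned char* data,
                                                  std::size_t rows,
                                                  std::size_t cols,
                                                  std::size_t channels,
                                                  std::size_t step) {
    const std::size_t row_bytes = cols * channels;  // useful bytes per row
    assert(step >= row_bytes);                      // pitch includes any padding

    std::vector<unsigned char> packed(rows * row_bytes);
    for (std::size_t r = 0; r < rows; ++r) {
        // Start of row r in the source: data + (row * step)
        const unsigned char* src_row = data + r * step;
        // Copy only the pixel bytes and leave the padding behind
        std::memcpy(packed.data() + r * row_bytes, src_row, row_bytes);
    }
    return packed;
}
```

On the device side the same per-row arithmetic applies: each thread would read from row * step + col * 3 + channel and write to (row * cols + col) * 3 + channel.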

Is this all correct? If something doesn’t make sense please let me know.