I have another problem using cudaCV: it's crashing in circumstances that I don't understand.
My code works with a CPU Mat, but when I try to do the same with GpuMat, it crashes. My plan is to perform a per-element absdiff on two large multidimensional GpuMats. While trying to debug my code I found out that it's not only absdiff but also add and subtract that crash in the same way.
So why does it work with 1000x3000x1 and 1000x1000x3 but not with 1000000x3x1? My current matrices are 1000000x16x8. I've also noticed the limited functionality of cuda::transpose, and a completely missing repeat that would be very useful.
The short answer is that the CUDA routines were designed for computer vision, and they did not account for a one-million-row image with 3 columns.
The slightly longer reason is that, due to the computer vision constraint above, binary operations between two arrays are processed by blocks of 32x8 threads. This works well for any reasonably shaped image; however, when you have 1 million rows and only 3 columns, 125,000 blocks of threads are launched in the y dimension, which exceeds the maximum number of blocks allowed there (65,535). If instead you have 1 million columns it should work, as the maximum number of blocks in the x dimension is much larger.
Now, the routines could easily be re-written to accommodate the shape you are using by processing more elements in each thread, but as the CUDA modules are in the contrib repository, this is unlikely to happen unless you do it yourself.
It does: if I do inp = GpuMat(reshaped.t()); and upload the transposed matrix, it doesn't crash. Now I have to rework the whole code, because cuda::transpose doesn't really work with n-channel arrays.
But it is computer vision. I've tried to build a performant color reduction using KMeans (universal for n dimensions), processing the whole distance matrix between the samples and the prototypes with cudaCV. A CPU version of KMeans is part of OpenCV's core, but I don't know how many dimensions it can handle, and it doesn't exist for the GPU.
Yes, like I would know about GPU programming, CUDA, NPP or what blocks are. All I want is to use the OpenCV library that I know a bit together with my new CUDA-capable GPU, and to see how much faster it can get. I don't even know how to implement a cuda::repeat() myself in a way that isn't incredibly slow using src.row(a).copyTo(dst.row(b)) or copyTo(dst(ROI)), and it gets even slower when I try to use OpenMP.
my issue with transpose is more like the multidimensionality:
Mat c = Mat(5, 7, CV_32FC(8));
GpuMat g = GpuMat(c);
cv::transpose(c, c);    // works
cuda::transpose(g, g);  // crash
but using CV_32FC(9) will also cause cv::transpose to crash — presumably because the 36-byte element size (9 channels x 4 bytes) exceeds what the implementation supports; cv::transpose appears to handle element sizes only up to 32 bytes, and cuda::transpose is even more restrictive.