What can the function in cv::cuda do?

I want to calculate the Discrete Fourier Transform (DFT) of several images and then calculate the conjugate product of each pair of corresponding matrices. There are many images (more than 10000 per group), but each image is very small (a 100×100 matrix), so I need to use the OpenCV CUDA module for efficiency. I want to process each image matrix in its own thread. However, cv::cuda::dft is a host function and cannot be called from device code.

I wonder whether cv::cuda::dft is simply a function that operates on a GpuMat the way cv::dft operates on a cv::Mat on the CPU, or whether the function itself runs in parallel on the GPU.

Since each matrix is small, parallelizing the computation of a single matrix gains little. How can I run the tasks in parallel on the GPU? Do I need to write the DFT function in device code myself?

Do you mean a device thread? If so forget it, that’s not how CUDA works.

You can’t, the GPU is data parallel.

I would start by timing the execution on a single small matrix, then see how it scales as you increase the number of matrices you process one after the other; you may get some kernel overlap. If your matrices are on the CPU, you would need to copy them to the GPU efficiently, either beforehand or while processing, so that there is no delay from the memory transfer.
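A minimal sketch of the scaling measurement suggested above, in plain C++. The names time_ms and per_matrix_ms are hypothetical helpers, not part of OpenCV or CUDA, and the callable work(n) stands in for "process n matrices". For GPU code the callable has to block until the device has finished (for example with cv::cuda::Stream::waitForCompletion()), otherwise only the launch cost is measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Wall-clock time in milliseconds for one call of work(n).
// For GPU code, work(n) must wait for the device before returning,
// because kernel launches are asynchronous.
double time_ms(const std::function<void(std::size_t)>& work, std::size_t n)
{
    const auto start = std::chrono::steady_clock::now();
    work(n);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Average time per matrix for each batch size in `sizes`.
// A flat per-matrix time means queuing more work does not help;
// a falling per-matrix time suggests overlap or amortised overhead.
std::vector<double> per_matrix_ms(const std::function<void(std::size_t)>& work,
                                  const std::vector<std::size_t>& sizes)
{
    std::vector<double> result;
    result.reserve(sizes.size());
    for (std::size_t n : sizes) {
        assert(n > 0);
        work(1);  // warm-up call so one-time setup cost is not counted
        result.push_back(time_ms(work, n) / static_cast<double>(n));
    }
    return result;
}
```

Calling per_matrix_ms with sizes such as 1, 10, 100 and 1000 and comparing the returned values should show fairly quickly whether processing the matrices one after the other scales at all.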

Additionally, I would investigate whether there are any batch-based DFT functions in the CUDA libraries.

Yeah, definitely not a task for OpenCV but for bare CUDA or its associated libraries. DFT is a basic building block for a lot of algorithms. They will have something you can use.
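To make the batch idea concrete, here is a CPU-only reference sketch in C++, not GPU code. It assumes all images are stored back to back in one buffer, which is the kind of layout a batched plan such as cuFFT's cufftPlanMany describes through its batch count and the distance between consecutive inputs. The function names are hypothetical, the DFT is a deliberately naive row-column implementation on full complex data (no packed half-spectrum), and it is meant only for checking GPU results on small inputs.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// In-place naive 1D DFT of `len` elements spaced `stride` apart.
void dft_1d(cplx* data, std::size_t len, std::size_t stride)
{
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(len);
    for (std::size_t k = 0; k < len; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t t = 0; t < len; ++t) {
            const double angle = -two_pi * static_cast<double>((k * t) % len)
                                 / static_cast<double>(len);
            sum += data[t * stride] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    for (std::size_t k = 0; k < len; ++k) data[k * stride] = out[k];
}

// Forward 2D DFT of `batch` images stored back to back in one buffer.
// Image b starts at offset b * rows * cols.
void dft_2d_batched(std::vector<cplx>& buf, std::size_t rows,
                    std::size_t cols, std::size_t batch)
{
    assert(buf.size() == rows * cols * batch);
    for (std::size_t b = 0; b < batch; ++b) {
        cplx* img = buf.data() + b * rows * cols;
        for (std::size_t r = 0; r < rows; ++r) dft_1d(img + r * cols, cols, 1);  // each row
        for (std::size_t c = 0; c < cols; ++c) dft_1d(img + c, rows, cols);     // each column
    }
}

// Element-wise a * conj(b), i.e. the conjugate product of two spectra.
std::vector<cplx> conj_product(const std::vector<cplx>& a, const std::vector<cplx>& b)
{
    assert(a.size() == b.size());
    std::vector<cplx> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] * std::conj(b[i]);
    return out;
}
```

Because every image lives in the same buffer, the conjugate product of two whole groups is a single element-wise pass over two buffers, which is the data-parallel shape the GPU is good at.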

Thanks for your helpful reply.
I found a batch-based DFT function named cufftPlanMany in the CUDA libraries. I have not used it, because I thought that assigning the tasks to separate streams would let those streams run in parallel. However, the time cost is larger than expected. Any suggestions on how to find the reason or improve the strategy?

// This is the main part of the code
void fun_single_stream(cv::cuda::GpuMat& img_mat1, cv::cuda::GpuMat& img_mat2, cv::cuda::Stream stream_local)
{
    // Forward DFT of both inputs, queued on this task's stream
    cv::cuda::dft(img_mat1, tmp_gpu_mat1, regionSize, cv::DFT_SCALE, stream_local);
    cv::cuda::dft(img_mat2, tmp_gpu_mat2, regionSize, cv::DFT_SCALE, stream_local);
    // Conjugate product of the two spectra (conjB = true)
    cv::cuda::mulSpectrums(tmp_gpu_mat1, tmp_gpu_mat2, tmp_gpu_mat3, cv::DFT_COMPLEX_OUTPUT, true, stream_local);
    // Inverse DFT with real output
    cv::cuda::dft(tmp_gpu_mat1, tmp_gpu_mat2, regionSize, cv::DFT_REAL_OUTPUT | cv::DFT_INVERSE | cv::DFT_COMPLEX_INPUT, stream_local);
    ...
}

// One call per stream
for (size_t i = 0; i < cudaStreams.size(); ++i)
{
    fun_single_stream(mat1_ls[i], mat2_ls[i], cudaStreams[i]);
}