Why do some Mats not need to be uploaded when using the GPU?

Using your example:

cv::cuda::warpPerspective(InputArray src, OutputArray dst, InputArray M, Size dsize, int flags = INTER_LINEAR,
    int borderMode = BORDER_CONSTANT, Scalar borderValue = Scalar(), Stream& stream = Stream::Null())

In a nutshell, src and dst are the data you are going to work on, whereas M is just a collection of function arguments/parameters, in the same way that dsize, flags, borderMode and borderValue are.
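To make the distinction concrete, here is a minimal sketch (assuming OpenCV built with the CUDA modules; the image size and the identity transform are placeholders, not from your example): src and dst live in device memory as GpuMats, while M stays an ordinary host-side cv::Mat and is never uploaded.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/cudawarping.hpp>

int main() {
    // src and dst are the data: they must live in device memory
    cv::Mat hostSrc = cv::Mat::zeros(480, 640, CV_8UC1);
    cv::cuda::GpuMat src, dst;
    src.upload(hostSrc);  // explicit host -> device copy for the image data

    // M is just a parameter: a 3x3 host-side cv::Mat (identity here as a
    // placeholder), passed directly without any upload() call
    cv::Mat M = cv::Mat::eye(3, 3, CV_64F);

    cv::cuda::warpPerspective(src, dst, M, src.size());
    return 0;
}
```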

Now, I can't find anything in the CUDA programming guide that officially justifies passing arguments in host rather than device memory, so take the next paragraph with a pinch of salt.

It seems like the consensus is that there is a fixed latency, say N ms, in launching a kernel, because a launch requires communication between the host and the device. Since the size of this communication is small, it does not saturate the available bandwidth between host and device, meaning there is room for a small amount of extra data to be sent at the same time without increasing N. Therefore a few extra parameters (function arguments riding along with the kernel-launch overhead) can be sent without any penalty, and there would be no advantage to first copying your function arguments from the host to the device before launching the kernel. Conversely, if you could pass the image data itself as a host argument, this would saturate the available bandwidth between host and device and significantly increase N, which is what you can experience when using managed memory.
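One concrete mechanism behind this: CUDA kernel arguments are passed by value with the launch itself (the driver copies them into device constant memory, with a size limit of a few KB), so a 3x3 double matrix of 72 bytes rides along essentially for free, while the image buffers are far too large for that path. A rough sketch in plain CUDA, not OpenCV's actual kernel (the names warpKernel and Mat3x3 are illustrative):

```cpp
// 72 bytes: easily fits within the kernel parameter size limit
struct Mat3x3 { double m[9]; };

__global__ void warpKernel(const unsigned char* src, unsigned char* dst,
                           int w, int h, Mat3x3 M) {
    // M arrives by value with the launch; the driver places kernel
    // parameters in device constant memory, so no explicit upload is needed
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // (real warping logic omitted; identity copy as a stand-in)
    dst[y * w + x] = src[y * w + x];
}

// src and dst are large buffers: they must be allocated on the device and
// copied there explicitly (cudaMalloc/cudaMemcpy) before the launch.
// M is tiny, so it simply rides along with the launch:
//   warpKernel<<<grid, block>>>(d_src, d_dst, w, h, hostM);
```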
