Where is ptr pointing for a cuda::GpuMat?

I have not been able to find much information on cv::cuda::PtrStepSzf beyond a couple of articles (one in Japanese) that I am reading. They basically say what you wrote.

My question is: in terms of speed (performance), is there any penalty for using the simpler dOutput(iRow, iCol) rather than a GpuMat with ptr and step?

I am trying to get as much speed as I can, and although I would prefer to use PtrStepSzf, I cannot risk losing speed.

Edit: It turns out I cannot use GpuMat's ptr inside a kernel; I have to pass data and step directly. Just found out :frowning:
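
To make the comparison concrete, here is a minimal host-side sketch of the two access styles as I currently understand them. `PitchedViewF`, `atManual`, and `stylesAgree` are hypothetical names of my own, not OpenCV types or functions; the struct only imitates the "base pointer plus row step in bytes" layout that I assume PtrStepSzf wraps. It is plain C++ so it can be compiled without CUDA, and it says nothing about actual kernel timings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a pitched 2-D view (NOT OpenCV's PtrStepSzf):
// a base pointer plus a row stride measured in bytes.
struct PitchedViewF {
    float* data;       // first element of row 0
    std::size_t step;  // bytes between the starts of consecutive rows
    int rows;
    int cols;

    // Row pointer: advance by y whole rows in bytes, then view as float*.
    float* ptr(int y) const {
        return reinterpret_cast<float*>(
            reinterpret_cast<unsigned char*>(data) + y * step);
    }

    // Element access in the dOutput(iRow, iCol) style.
    float& operator()(int y, int x) const { return ptr(y)[x]; }
};

// Manual style: the arithmetic I would write when passing data and step directly.
inline float& atManual(float* data, std::size_t step, int y, int x) {
    assert(step % sizeof(float) == 0);  // each row must start float-aligned
    unsigned char* rowStart = reinterpret_cast<unsigned char*>(data) + y * step;
    return reinterpret_cast<float*>(rowStart)[x];
}

// Self-check: both styles should resolve to the same address for every element.
inline bool stylesAgree() {
    const int rows = 3, cols = 5, paddedCols = 8;  // 3 padding floats per row
    std::vector<float> buf(rows * paddedCols, 0.0f);
    const std::size_t step = paddedCols * sizeof(float);
    PitchedViewF view{buf.data(), step, rows, cols};
    for (int y = 0; y < rows; ++y)
        for (int x = 0; x < cols; ++x)
            if (&view(y, x) != &atManual(buf.data(), step, y, x)) return false;
    return true;
}
```

If the real operator() is an inline wrapper around this same arithmetic, I would expect both forms to compile to much the same code, but I have not measured that in an actual kernel, so confirmation would be welcome.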