Fast feature detection in CUDA

I would like some help understanding how the CUDA FastFeatureDetector works.

After instantiating an object of this class, I can detect keypoints with the detect function, which takes (image, vector) as arguments (I suppose the image is a GpuMat). However, I have been told here that this operation downloads the keypoints and processes them on the CPU.

I am interested in doing one more operation on the GPU (I am starting to code the kernel) that needs the keypoints. I suppose I could transform the vector (not really sure how to do that, but that is not the point of this question), upload it to the device, and work with that, but all this host <-> device transfer seems unnecessary.
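To make concrete what I mean by transforming and uploading the vector by hand, here is a rough sketch, assuming my kernel only needs the (x, y) locations (the function name and layout are just my own choices, not anything from the OpenCV API):

```cpp
// Sketch: manually moving detected keypoints to the device.
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <vector>

cv::cuda::GpuMat uploadKeypointLocations(const std::vector<cv::KeyPoint>& kps)
{
    // Pack the locations into a 1 x N CV_32FC2 host matrix ...
    cv::Mat locations(1, static_cast<int>(kps.size()), CV_32FC2);
    for (size_t i = 0; i < kps.size(); ++i)
        locations.at<cv::Vec2f>(0, static_cast<int>(i)) =
            cv::Vec2f(kps[i].pt.x, kps[i].pt.y);

    // ... and copy it host -> device. This is exactly the extra
    // transfer I would like to avoid.
    cv::cuda::GpuMat d_locations;
    d_locations.upload(locations);
    return d_locations;
}
```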

So is there a way to perform the FAST detection and keep the keypoints in device memory?

I was recommended to use detectAsync(). How does this work? And is it really "async"? (That might complicate things.)

detectAsync takes (InputArray, OutputArray) as arguments, so I guess the second one is a GpuMat of keypoints that stays in device memory?
Perhaps I can use that, but why is it called Async?

My idea would be something like

cuda::GpuMat image, keypoints;
Cudadetector->detectAsync(image, keypoints);
// CUDA kernels return void, so any result would have to come back through device memory
my_own_process<<<N, 1>>>(image, keypoints);

As you can see, after the detection I plan to call a CUDA kernel with the keypoints, and I assume that my_own_process will run after detectAsync. The "Async" part alarms me a bit.
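For reference, here is a rough sketch of the kernel interface I have in mind. It assumes the keypoints GpuMat stores the (x, y) locations as short2 values in its first row (FastFeatureDetector::LOCATION_ROW), with one column per keypoint; my_own_process, runOnDetectedKeypoints, and d_result are my own illustrative names:

```cpp
// Sketch of a custom kernel consuming keypoints that are still on the device.
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudafeatures2d.hpp>

__global__ void my_own_process(cv::cuda::PtrStepSzb image,
                               const short2* locations,
                               int count,
                               int* d_result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count)
        return;
    short2 p = locations[i];
    // ... per-keypoint work on image(p.y, p.x) goes here,
    // writing any result into d_result (device memory) ...
}

void runOnDetectedKeypoints(cv::cuda::GpuMat& image,
                            cv::cuda::GpuMat& keypoints,
                            int* d_result)
{
    const int count = keypoints.cols;  // one column per keypoint
    const short2* locations =
        keypoints.ptr<short2>(cv::cuda::FastFeatureDetector::LOCATION_ROW);
    const int threads = 256;
    const int blocks = (count + threads - 1) / threads;
    my_own_process<<<blocks, threads>>>(image, locations, count, d_result);
}
```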

Is there another way to do this?

I am not sure exactly why they called it Async, but I guess they needed a name that distinguished it from the original. The "Async" suffix implies that all the internal operations are asynchronous with respect to the host; however, from a quick inspection of the internals this does not seem to be the case, especially when nonmaxSuppression is used.
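If the ordering worries you, you can make it explicit with a cv::cuda::Stream; a minimal sketch (detectThenProcess and the kernel launch parameters are illustrative names, not part of the API):

```cpp
// Sketch: forcing the custom kernel to run only after detection finishes.
#include <opencv2/core/cuda.hpp>
#include <opencv2/core/cuda_stream_accessor.hpp>
#include <opencv2/cudafeatures2d.hpp>

void detectThenProcess(cv::Ptr<cv::cuda::FastFeatureDetector> detector,
                       cv::cuda::GpuMat& image,
                       cv::cuda::GpuMat& keypoints)
{
    cv::cuda::Stream stream;
    detector->detectAsync(image, keypoints, cv::noArray(), stream);

    // Option 1: block the host until detection is done, then launch
    // your kernel on the default stream.
    stream.waitForCompletion();

    // Option 2: enqueue your own kernel on the same stream; work on a
    // single stream executes in order, so no host-side wait is needed:
    //   my_own_process<<<blocks, threads, 0,
    //       cv::cuda::StreamAccessor::getStream(stream)>>>(/* ... */);
}
```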

Anyway, in summary: you need to use detectAsync() if you want to leave the keypoints on the device.

If I were you, I would time the detector for your images (size and content) and parameters (threshold and nonmaxSuppression = true|false) before implementing your own kernel. I understand you may already have performed this analysis, but from the timings in my previous post, downloading the keypoints to the host did not look like the biggest contributor to the execution time; nonmaxSuppression and the threshold appeared to have the most impact.
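The timing experiment could be sketched as below, comparing detect (which includes the device-to-host download) against detectAsync (keypoints left on the device); timeDetector is an illustrative name:

```cpp
// Sketch: timing detect vs detectAsync for a given image and parameters.
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudafeatures2d.hpp>
#include <iostream>
#include <vector>

void timeDetector(const cv::cuda::GpuMat& image, int threshold, bool nonmaxSuppression)
{
    cv::Ptr<cv::cuda::FastFeatureDetector> detector =
        cv::cuda::FastFeatureDetector::create(threshold, nonmaxSuppression);
    cv::TickMeter tm;

    std::vector<cv::KeyPoint> kpsHost;
    tm.start();
    detector->detect(image, kpsHost);          // includes the device -> host download
    tm.stop();
    std::cout << "detect:      " << tm.getTimeMilli() << " ms\n";

    cv::cuda::GpuMat kpsDevice;
    tm.reset(); tm.start();
    detector->detectAsync(image, kpsDevice);   // keypoints stay on the device
    cv::cuda::Stream::Null().waitForCompletion();
    tm.stop();
    std::cout << "detectAsync: " << tm.getTimeMilli() << " ms\n";
}
```

Running this across your real image sizes and threshold/nonmaxSuppression settings should tell you whether writing your own kernel is worth it.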