Thank you for sending the code over. I had a quick look on my machine to see what timings I get. I made two slight alterations to your code:
- I averaged over 100 iterations, because the CPU high-resolution timers were slightly pessimistic, which I confirmed using event timers on the GPU (see the timing sketch after this list).
- I removed std::cout from the timed section, as this CPU-side operation should not be included.
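For reference, this is roughly the timing harness I ended up with. Treat it as a sketch only: the image path, the threshold of 40 and the variable names are placeholders rather than values taken from your code.

```cpp
#include <iostream>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/cudafeatures2d.hpp>

int main()
{
    // Placeholder input and threshold: substitute the values from your code.
    cv::Mat img = cv::imread("checkerboard.png", cv::IMREAD_GRAYSCALE);
    const int threshold = 40;

    cv::cuda::GpuMat d_img(img); // upload once, outside the timed loop

    cv::Ptr<cv::cuda::FastFeatureDetector> gpuFast =
        cv::cuda::FastFeatureDetector::create(threshold, /*nonmaxSuppression=*/true);

    const int nIters = 100;
    cv::cuda::Event evStart, evStop;        // GPU event timers as a cross-check
    std::vector<cv::KeyPoint> keypoints;

    double hostMs = 0.0, eventMs = 0.0;
    for (int i = 0; i < nIters; ++i)
    {
        int64 t0 = cv::getTickCount();
        evStart.record();                   // recorded on the default stream
        gpuFast->detect(d_img, keypoints);  // synchronous call, as in the original code
        evStop.record();
        evStop.waitForCompletion();
        int64 t1 = cv::getTickCount();

        hostMs  += 1000.0 * (t1 - t0) / cv::getTickFrequency();
        eventMs += cv::cuda::Event::elapsedTime(evStart, evStop);
    }

    // Printing stays outside the timed section.
    std::cout << "host timer:  " << hostMs  / nIters << " ms\n";
    std::cout << "event timer: " << eventMs / nIters << " ms\n";
    return 0;
}
```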
With the parameters you are using, on a 1500 x 1500 checkerboard image, the execution times for the CPU and GPU were almost the same, with the CPU slightly faster (i7-8700 vs. mobile RTX 2080):
CPU: 0.84 ms
GPU: 1.00 ms
If I use detectAsync() instead of detect() (which also downloads the keypoints and processes them on the CPU), the GPU time is reduced:
GPU: 0.90 ms
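To show what I mean, here is a sketch reusing the gpuFast detector and d_img from the harness above; the keypoints stay in device memory, and only the convert() call at the end brings them back to the host:

```cpp
#include <vector>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudafeatures2d.hpp>

// Reuses the gpuFast detector and d_img GpuMat from the timing sketch above.
void detectGpuOnly(cv::Ptr<cv::cuda::FastFeatureDetector> gpuFast,
                   const cv::cuda::GpuMat& d_img)
{
    cv::cuda::Stream stream;
    cv::cuda::GpuMat d_keypoints;

    // detectAsync() leaves the keypoints in a GpuMat: no download and no
    // host-side post-processing inside the timed section.
    gpuFast->detectAsync(d_img, d_keypoints, cv::noArray(), stream);
    stream.waitForCompletion();

    // Download and convert only when the keypoints are actually needed on the host.
    std::vector<cv::KeyPoint> keypoints;
    gpuFast->convert(d_keypoints, keypoints);
}
```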
I then looked at the kernel execution time in the NVIDIA Visual Profiler and noticed that it is only ~0.1 ms. The rest of the 0.9 ms is overhead caused by:
- The routine allocates memory on the device and copies the number of keypoints from the device to the host, and this is included in what you are timing. It requires synchronizing the device with the host, and the device-to-host copy stalls everything, taking longer than the kernel itself.
- Non-max suppression. This requires an extra kernel launch, with extra synchronization and memory overhead.
This is a good example demonstrating that the implementation of the algorithm is as important as the raw performance of the GPU.
If I disable non-max suppression on both the CPU and GPU (one way to do this is sketched below), the GPU time falls significantly:
CPU: 0.71 ms
GPU: 0.17 ms
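One way to switch non-max suppression off on both sides is simply to pass the flag when the detectors are created; this snippet assumes the headers from the sketch above plus <opencv2/features2d.hpp> for the CPU detector, and the threshold is again a placeholder:

```cpp
const int threshold = 40;  // placeholder: use the same value as before

// Recreate both detectors with non-max suppression turned off at creation time.
cv::Ptr<cv::FastFeatureDetector> cpuFast =
    cv::FastFeatureDetector::create(threshold, /*nonmaxSuppression=*/false);

cv::Ptr<cv::cuda::FastFeatureDetector> gpuFast =
    cv::cuda::FastFeatureDetector::create(threshold, /*nonmaxSuppression=*/false);
```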
If I use a lower threshold, say 20 (which is the value chosen in the built-in performance test), the GPU time does not appear to be affected, but the CPU time doubles:
CPU: 1.42 ms
GPU: 0.14 ms
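That change is just the threshold argument in the snippet above, i.e. something along these lines:

```cpp
// Lower the threshold to 20 (the value used in the built-in performance test),
// keeping non-max suppression disabled. The extra candidate corners mostly
// cost the CPU, not the GPU.
cpuFast = cv::FastFeatureDetector::create(20, /*nonmaxSuppression=*/false);
gpuFast = cv::cuda::FastFeatureDetector::create(20, /*nonmaxSuppression=*/false);
```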
I know this probably doesn't help that much, but it confirms for me the results from the performance test, where the GPU was several times faster than the CPU.