CUDA Fast detector much slower than normal FAST

I am using OpenCV FAST Feature detector in both CPU and CUDA versions as follows:

For the normal CPU version

cv::FAST(img, kps, threshold, true);

For the CUDA version

ptrDetector = cv::cuda::FastFeatureDetector::create(threshold, true);

and then later

ptrDetector->setThreshold(threshold);
ptrDetector->detect(img, kps);

I measure how long it takes to process, and on my Ubuntu PC it takes:

CPU version: 0.424199 ms

CUDA version: 0.871586 ms

I wonder why this is. Is CUDA not well suited to FAST detection?

EDIT:
I just want to point out that there is a similar question to this one, but there the answer was basically "it takes time to upload and download data", which in my case is not the answer, because I am timing only the processing time (excluding the upload and download time).
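
Roughly, the measurement is set up like this (just a sketch of the idea; "some_image.png" and the variable names are placeholders, not my actual code):

// Sketch: the upload is done before the timer starts, so only detection is measured.
cv::Mat img = cv::imread("some_image.png", cv::IMREAD_GRAYSCALE);   // hypothetical input
cv::cuda::GpuMat d_img;
d_img.upload(img);                                 // upload: NOT timed

std::vector<cv::KeyPoint> kps;
cv::TickMeter tm;
tm.start();
ptrDetector->detect(d_img, kps);                   // timed: only the detection call
tm.stop();
std::cout << "CUDA FAST: " << tm.getTimeMilli() << " ms" << std::endl;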

Take a look at this question too: https://forum.opencv.org/t/opencv-cuda-extremely-slow

Basically, the GPU startup takes some time. For me, the processing of the first image took 20-30x longer than the next images.

I suggest testing several images in a loop and measuring the processing time for each image, to see if there's an improvement.
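
Something like this (sketch; d_images is a placeholder for your images already uploaded to the GPU):

// Time each image separately; the first one includes the GPU startup cost.
for (size_t i = 0; i < d_images.size(); ++i)
{
    std::vector<cv::KeyPoint> kps;
    cv::TickMeter tm;
    tm.start();
    ptrDetector->detect(d_images[i], kps);
    tm.stop();
    std::cout << "image " << i << ": " << tm.getTimeMilli() << " ms" << std::endl;
}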

Thanks for the reply. I am aware of the time that initialization takes (it actually surprised me a lot).
My question, though, is about how long FAST takes over many images. In my current program I run FAST on 1000 images, and CUDA remains slower.

I also tried on a Jetson AGX Xavier and got:

CPU version: 0.89 ms
CUDA version: 4.51 ms (+0.39 ms for upload)

In that post I mentioned that it could be the hardware which is limiting you. The spreadsheet I linked to there shows that the performance of FAST on lower-end GPUs is not that much faster than on the CPU, although I would still expect a speed increase. It also looks to be image-dependent.

Just running the numbers, a Jetson AGX Xavier is only marginally faster than a GeForce GTX 650 Ti (NVIDIA Jetson AGX Xavier 16 GB Specs | TechPowerUp GPU Database), a mid-range graphics card released in 2012. I can't comment on the performance of the ARM processor, but my guess is that it is not unthinkable that it outperforms the GPU on this algorithm.

Additionally, did you read the post @kbarni linked to, where we discussed other factors which could be limiting performance?

I guess if everything is fully optimized in your set-up, then the answer would be no on your hardware but yes on faster GPUs. For example, a state-of-the-art GPU could have ~25x the performance, if the size of the images and/or the implementation were sufficient to saturate the GPU.

Thanks. I am in the process of reading the post. Some of the suggestions (not measuring upload and download, for example) I was already following, so they don't help much, but there is one (pre-allocating) that I want to try. Oh, and I am using C++.
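
What I understand by pre-allocating, and what I plan to try, is reusing the same device buffers across frames so nothing is re-allocated inside the loop. Roughly (height, width and frames are placeholders):

// Allocate the device image once, before the loop over the images,
// and reuse it so upload() doesn't trigger a new allocation every frame.
cv::cuda::GpuMat d_img(height, width, CV_8UC1);   // sized like the input frames

for (const cv::Mat& frame : frames)
{
    d_img.upload(frame);                          // reuses the existing device memory
    std::vector<cv::KeyPoint> kps;
    ptrDetector->detect(d_img, kps);
}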

The phenomenon of CUDA being slower happens both on the Xavier and on my Ubuntu host, which has a GeForce RTX 2080 Mobile.

Can you send me a code snippet? I have the same GPU.

I will simplify the code a bit so that it fits in a single cpp file and send it tomorrow. Thanks!

Thank you for sending the code over. I have had a quick look on my machine to see what times I get. I made two slight alterations to your code:

  1. I averaged 100 iterations because the CPU high-resolution timers were slightly pessimistic, which I confirmed using event timers on the GPU (see the sketch after this list).
  2. I removed std::cout from the timing, as this CPU operation should not be included.
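
For reference, GPU event timing in OpenCV looks roughly like this (a sketch, not the exact benchmark code):

// Timing the detect() call with CUDA events instead of CPU timers.
cv::cuda::Event start, stop;
std::vector<cv::KeyPoint> kps;

start.record();                       // recorded on the default stream
ptrDetector->detect(d_img, kps);
stop.record();
stop.waitForCompletion();             // make sure the stop event has completed

float gpuMs = cv::cuda::Event::elapsedTime(start, stop);
std::cout << "GPU time: " << gpuMs << " ms" << std::endl;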

With the parameters you are using, on a 1500 x 1500 image of a checkerboard, the execution time for the CPU and the GPU was almost the same, with the CPU slightly faster (i7-8700 vs RTX 2080 Mobile):
CPU: 0.84 ms
GPU: 1.00 ms

If I use detectAsync() instead of detect() (which also downloads the keypoints and processes them on the CPU), then the GPU time is reduced:
GPU: 0.90 ms
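
That is, roughly this pattern (sketch; d_img and d_kps are placeholder names):

// detectAsync() leaves the keypoints in a GpuMat; the download/convert step
// can then be done separately and excluded from the timed section.
cv::cuda::GpuMat d_kps;
ptrDetector->detectAsync(d_img, d_kps);

// later, when the keypoints are actually needed on the host:
std::vector<cv::KeyPoint> kps;
ptrDetector->convert(d_kps, kps);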

I then had a look at the execution time in the Nvidia Visual Profiler and noticed that the kernel execution time is only ~0.1 ms. The rest of the 0.9 ms is overhead caused by:

  1. The routine allocates memory on the device and copies the number of keypoints from the device to the host, which is included in what you are timing. This requires synchronizing the device with the host and copying from device to host memory, which stalls everything and takes longer than the kernel itself.
  2. Including non-max suppression. This requires an extra kernel call, with extra synchronization and memory overhead.

This is a good example demonstrating that the implementation of the algorithm is as important as the raw performance of the GPU.

If I disable non-max suppression on both the CPU and GPU, the GPU time falls significantly:
CPU: 0.71 ms
GPU: 0.17 ms
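
For reference, switching non-max suppression off is just the boolean flag in both APIs (sketch, reusing the variable names from the original post):

// Non-max suppression disabled in both versions:
cv::FAST(img, kps, threshold, false);                                       // CPU, NMS off
auto gpuFast = cv::cuda::FastFeatureDetector::create(threshold, false);     // GPU, NMS off
gpuFast->detect(d_img, kps);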

If I use a lower threshold, say 20 (which is the number chosen in the built-in performance test), then the GPU time does not appear to be affected; however, the CPU time doubles:
CPU: 1.42 ms
GPU: 0.14 ms

I know this probably doesn't help that much, but it confirms to me the results from the performance test, where the GPU was several times faster than the CPU.

Thank you very much for your analysis. Today I will go and try to reproduce your experiments. I haven't used the Nvidia Visual Profiler yet, but I will google it. If you have a recommended resource on its use, that would also be very welcome.

Just one question: when you said the profiler reported 0.1 ms, does that mean the feature extraction itself only took that time and the rest (0.8 ms) was allocation, copies, and synchronization?

Perhaps tomorrow I will have some results to reflect on. Thanks again

The Visual Profiler for CUDA 11.3 can be found at

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\libnvvp\nvvp.exe

You need to install the JRE for it to work, and you may have to add

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\extras\CUPTI\lib64

to your system/user path.
It shows a breakdown of all the CUDA kernel launches, memory operations, synchronization, etc.
I would look at that breakdown with reference to

and

where you can see calls to cudaMalloc(), cudaFree(), and several calls to both cudaMemsetAsync() and cudaStreamSynchronize().
