SURF_CUDA performance

The problem with measuring and comparing this is that detectAndCompute() seems to use parallelism internally, apparently depending on how many instances of cv::xfeatures2d::SURF are created. I get the following times:

Total 1 is the time for the entire function.
Total 2 is the time only to compute descriptors.

CPU (1 thread):
Total 1: 7.77 s
Total 2: 6.74 s

CPU (2 threads):
Total 1: 15.2 s
Total 2: 28.9 s (single thread)

CPU (32 threads):
Total 1: 4.31 s
Total 2: 121 s (single thread)

GPU (1 thread):
Total 1: 3.079 s
Total 2: 1.924 s

"If you do that you will probably find that the first call to surf() is taking most of your measured time."

That doesn’t really match what I’m seeing. Here are the times to compute only descriptors for each image using 1 thread for both CPU and GPU:


On the first graph, blue is CPU and orange is GPU. The unit is seconds. On the second one, blue is GPU time as a fraction of CPU time. So the GPU is about 3 times faster, if we ignore load times.