SURF_CUDA performance

Sorry, in your code the first call to img_gpu.upload(img); will initialize the CUDA context, so that call also pays the one-time initialization cost.
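To keep that one-time cost out of a measurement, the usual pattern is to run the work once untimed and only then start the clock. Below is a minimal sketch of that pattern using only the standard library. The helper name mean_ms_after_warmup and its parameters are hypothetical, not part of OpenCV. It also assumes the callable blocks until the work is finished, as a plain img_gpu.upload(img) without a stream does.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not an OpenCV API): runs `work` `warmup` times without
// timing it, so one-time costs such as context initialization are paid first,
// then returns the mean wall-clock time per call in milliseconds over `iters`
// timed runs.
// Assumption: `work` is synchronous, i.e. it returns only when it is done.
template <typename F>
double mean_ms_after_warmup(F&& work, std::size_t warmup, std::size_t iters) {
    assert(iters > 0 && "need at least one timed iteration");

    // Untimed warm-up calls.
    for (std::size_t i = 0; i < warmup; ++i) {
        work();
    }

    // Timed calls, measured with a monotonic clock.
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iters);
}
```

In your case the callable would wrap the upload and the SURF call, for example mean_ms_after_warmup([&] { img_gpu.upload(img); /* detect here */ }, 1, 10), and the same helper can time the CPU version for a like-for-like comparison.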

So it looks like cuda::SURF is ~3x faster than the CPU version. Can you try the OpenCV test image (opencv_extra/testdata/gpu/features2d/aloe.png) and check whether your timings are of the same order as mine, in case I am missing something?