Sorry, in your code the first call to img_gpu.upload(img); will initialize the context.
So it looks like cuda::SURF is ~3x faster than the CPU version. Can you try the OpenCV test image (opencv_extra/testdata/gpu/features2d/aloe.png) and see if your timings are of the same order as mine, in case I am missing something?
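
To make our numbers easier to compare, here is a rough sketch of how I would keep the one-off context initialization out of the measurement: call the operation once untimed, then average over several timed calls. The helper name mean_ms_after_warmup is made up for this post and is not part of OpenCV; it uses only the standard library and assumes the callable blocks until its work is finished.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not an OpenCV API): runs `op` once without timing it,
// so that any one-off setup cost (e.g. CUDA context creation on the first
// GPU call) is already paid, then returns the mean wall-clock time in
// milliseconds over `runs` timed calls.
// Assumption: `op` is synchronous, i.e. it returns only when its work is done.
template <typename Op>
double mean_ms_after_warmup(Op&& op, std::size_t runs) {
    assert(runs > 0);

    op();  // warm-up call, deliberately excluded from the measurement

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < runs; ++i) {
        op();
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(runs);
}
```

In your code the callable would be a lambda wrapping the GPU detection call (or the CPU one for the baseline), with the same image and parameters on both sides.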