SURF_CUDA performance

I’m extracting features with SURF_CUDA, and I’m not getting the speed improvements I was hoping for, compared to just using the CPU. I’m using a 16-core/32-thread 3950X and an RTX 3070. The CPU can crunch ~60 Mpx/s, while the GPU gets up to ~100 Mpx/s.
Here’s my code:

//Note: the data is being loaded from a ramdisk.

std::tuple<std::vector<std::vector<Descriptor<SURF>>>, std::uint64_t> compute_surf(const std::vector<std::string> &paths){
	std::vector<std::vector<Descriptor<SURF>>> ret;
	ret.reserve(paths.size());
	std::vector<std::thread> threads(std::thread::hardware_concurrency());
	std::atomic<size_t> i = 0;
	std::uint64_t pixels = 0;
	std::mutex mutex;
	for (auto &t : threads)
		t = std::thread([&](){
			auto surf = SURF::create();
			surf->setUpright(true);
			while (true){
				auto j = i++;
				if (j >= paths.size())
					break;
				Mat src = imread(paths[j], IMREAD_GRAYSCALE);
				std::vector<KeyPoint> keypoints;
				std::vector<float> descriptors;
				surf->detectAndCompute(src, Mat(), keypoints, descriptors);
				auto final_descriptors = shuffle_descriptors<SURF>(keypoints, descriptors);
				std::lock_guard<std::mutex> lg(mutex);
				pixels += src.cols * src.rows;
				ret.emplace_back(std::move(final_descriptors));
			}
		});

	for (auto &t : threads)
		t.join();
	
	return { ret, pixels };
}

std::tuple<std::vector<std::vector<Descriptor<SURF>>>, std::uint64_t> compute_surf_cuda(const std::vector<std::string> &paths){
	std::vector<std::vector<Descriptor<SURF>>> ret;
	ret.reserve(paths.size());
	std::vector<std::thread> threads(2);
	std::atomic<size_t> i = 0;
	std::uint64_t pixels = 0;
	std::mutex mutex;
	for (auto &t : threads)
		t = std::thread([&](){
			SURF_CUDA surf;
			surf.upright = true;
			surf.extended = false;
			GpuMat img_gpu;
			GpuMat keypoints_gpu;
			GpuMat descriptors_gpu;
			while (true){
				auto j = i++;
				if (j >= paths.size())
					break;
				auto img = imread(paths[j], IMREAD_GRAYSCALE);
				img_gpu.upload(img);
				surf(img_gpu, GpuMat(), keypoints_gpu, descriptors_gpu);
				std::vector<KeyPoint> keypoints;
				std::vector<float> descriptors;
				surf.downloadKeypoints(keypoints_gpu, keypoints);
				surf.downloadDescriptors(descriptors_gpu, descriptors);
				auto final_descriptors = shuffle_descriptors<SURF>(keypoints, descriptors);
				std::lock_guard<std::mutex> lg(mutex);
				pixels += img.cols * img.rows;
				ret.emplace_back(std::move(final_descriptors));
			}
		});

	for (auto &t : threads)
		t.join();
	
	return { ret, pixels };
}

Is there anything I can do to improve this? Note that all I need is the descriptors, to store them in a database. I have nothing else to do on the GPU once I get them, so there’s no opportunity to amortize the transfer cost.

EDIT: shuffle_descriptors() only moves some data around. Removing the call makes compute_surf_cuda() only about 2% faster.

How are you calculating this?

I ran a quick test (Mobile RTX 3070 Ti vs i7-12700H) with 20 identical images (opencv_extra/testdata/gpu/features2d/aloe.png), timing only the execution of SURF, and found the GPU was significantly faster (~5 ms vs ~500 ms per image).

Thanks for your reply.
I’m calculating the speed like this:

volatile auto t0 = clock();
auto [results, total_pixels] = f(paths); //f is a function pointer to either of the two functions
volatile auto t1 = clock();

auto elapsed = double(t1 - t0) / CLOCKS_PER_SEC;

auto pixels_per_sec = total_pixels / elapsed;

elapsed is 34.9 for compute_surf() and 21.2 for compute_surf_cuda(). Both functions process the same set of images.

I would have to try measuring only the time spent getting descriptors. That’s a bit trickier to measure. If doing just that really is so much faster, that’s rather a shame, since it means there’s nothing I can do at the software level to improve on these times; it’s a bandwidth limitation.
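A rough sketch of what I have in mind for timing only the SURF call, summed across the worker threads (the timed_surf_call helper and the atomic counter below are my own additions; I’d also use std::chrono::steady_clock rather than clock() here, since clock() can report CPU time instead of wall time on some platforms):

#include <atomic>
#include <chrono>
#include <cstdint>

std::atomic<std::int64_t> surf_ns{0}; //nanoseconds spent inside SURF only, across all threads

template <typename F>
void timed_surf_call(F &&call){
	auto t0 = std::chrono::steady_clock::now();
	call(); //e.g. surf->detectAndCompute(src, Mat(), keypoints, descriptors);
	auto t1 = std::chrono::steady_clock::now();
	surf_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}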

I would always try to simplify down as much as possible to start with, removing the threading, the multi-image processing, etc. If you do that you will probably find that the first call to

surf(img_gpu, GpuMat(), keypoints_gpu, descriptors_gpu);

is taking most of your measured time because the first call will be creating the CUDA context on the device.

To test this without simplifying your code, try placing

GpuMat initCtx(1, 1, CV_8UC1);

before

volatile auto t0 = clock();
auto [results, total_pixels] = f(paths); //f is a function pointer to either of the two functions
volatile auto t1 = clock();

auto elapsed = double(t1 - t0) / CLOCKS_PER_SEC;

auto pixels_per_sec = total_pixels / elapsed;

to see if it alters the timings.
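If you want to see the one-off context-creation cost in isolation, you could also time the first device allocation separately from a second one, something like this quick sketch of mine:

#include <opencv2/core/cuda.hpp>
#include <chrono>
#include <iostream>

int main(){
	using clk = std::chrono::steady_clock;
	auto t0 = clk::now();
	cv::cuda::GpuMat initCtx(1, 1, CV_8UC1); //first device allocation, creates the CUDA context
	auto t1 = clk::now();
	cv::cuda::GpuMat second(1, 1, CV_8UC1);  //context already exists, should be much cheaper
	auto t2 = clk::now();
	auto ms = [](auto d){ return std::chrono::duration<double, std::milli>(d).count(); };
	std::cout << "first alloc:  " << ms(t1 - t0) << " ms\n";
	std::cout << "second alloc: " << ms(t2 - t1) << " ms\n";
	return 0;
}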

If that works you can probably improve matters further by using CUDA streams, but only for the upload of your images, because the SURF routines don’t seem to support CUDA streams. If you had a video instead of a series of images you could use cudacodec::VideoReader to decode directly to the GPU, avoiding decoding the image on the CPU and uploading it to the GPU.
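Roughly what I mean for the streamed upload, as an untested sketch (the page-locked HostMem staging buffer is an assumption on my part; pinned memory is what makes the asynchronous copy worthwhile):

#include <opencv2/core/cuda.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/xfeatures2d/cuda.hpp>
#include <string>

using namespace cv;
using namespace cv::cuda;

void surf_with_streamed_upload(const std::string &path){
	Stream stream;
	Mat img = imread(path, IMREAD_GRAYSCALE);
	HostMem pinned(img.rows, img.cols, CV_8UC1, HostMem::PAGE_LOCKED);
	Mat staging = pinned.createMatHeader(); //Mat header over the pinned buffer
	img.copyTo(staging);                    //stage the frame in page-locked host memory
	GpuMat img_gpu;
	img_gpu.upload(pinned, stream);         //asynchronous host-to-device copy on the stream
	//...decode the next image on the CPU here while the copy is in flight...
	stream.waitForCompletion();             //the data must be on the device before SURF runs
	SURF_CUDA surf;
	surf.upright = true;
	surf.extended = false;
	GpuMat keypoints_gpu, descriptors_gpu;
	surf(img_gpu, GpuMat(), keypoints_gpu, descriptors_gpu); //no stream overload; runs on the default stream
}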

Well, the problem with measuring and comparing this is that detectAndCompute() appears to use parallelism internally, and the degree of parallelism seems to depend on how many instances of cv::xfeatures2d::SURF have been created. I get the following times:

Total 1 is the time for the entire function.
Total 2 is the time only to compute descriptors.

CPU (1 thread):
Total 1: 7.77 s
Total 2: 6.74 s

CPU (2 threads):
Total 1: 15.2 s
Total 2: 28.9 s (single thread)

CPU (32 threads):
Total 1: 4.31 s
Total 2: 121 s (single thread)

GPU (1 thread):
Total 1: 3.079 s
Total 2: 1.924 s

If you do that you will probably find that the first call to surf() is taking most of your measured time

That doesn’t really match what I’m seeing. Here are the times to compute only descriptors for each image using 1 thread for both CPU and GPU:


[two graphs: per-image descriptor times and GPU/CPU time ratio]

On the first graph, blue is CPU and orange is GPU. Unit is seconds. On the second one, blue is GPU time as a fraction of CPU time. So the GPU is about 3 times faster, if we ignore load times.

(Sorry about the double post. It wouldn’t let me add both graphs in the same post.)

And here are the same graphs but using 32 threads for the CPU and still only 1 for the GPU:

[two graphs, same format as above]


If detectAndCompute() has internal parallelism, which it does seem to have, using so many threads completely disables it.
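One thing I still want to check is whether OpenCV’s own thread pool is fighting with my 32 worker threads (oversubscription). A small sketch of what I mean, pinning the internal pool to one thread with cv::setNumThreads() before spawning my own workers; whether this actually explains the numbers above is just a guess:

#include <opencv2/core/utility.hpp>
#include <iostream>

int main(){
	std::cout << "OpenCV internal threads: " << cv::getNumThreads() << std::endl;
	cv::setNumThreads(1); //disable OpenCV's internal parallelism for subsequent calls
	//...now run compute_surf() with the explicit std::thread pool and compare the timings...
	return 0;
}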

Sorry, in your code the first call to img_gpu.upload(img) will initialize the context.

So it looks like cuda::SURF is ~3x faster than the CPU version. Can you try the OpenCV test image (opencv_extra/testdata/gpu/features2d/aloe.png) and see if your timings are of the same order as mine, in case I am missing something?

Using the following code:

const int N = 1000;
//Assumed definition of hrc (used below): converts high_resolution_clock ticks to milliseconds.
const double hrc = 1000.0 * std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den;

void test_surf_performance1(){
	auto surf = SURF::create();
	surf->setUpright(true);
	Mat src = imread("aloe.png", IMREAD_GRAYSCALE);
	std::uint64_t sum = 0;
	volatile auto t0 = std::chrono::high_resolution_clock::now().time_since_epoch().count();
	for (int i = N; i--;){
		std::vector<KeyPoint> keypoints;
		std::vector<float> descriptors;
		surf->detectAndCompute(src, Mat(), keypoints, descriptors);
		sum += descriptors.size();
	}
	volatile auto t1 = std::chrono::high_resolution_clock::now().time_since_epoch().count();
	std::cout << sum << std::endl;
	std::cout << (t1 - t0) * hrc / N << std::endl;
}

void test_surf_performance2(){
	SURF_CUDA surf;
	surf.upright = true;
	surf.extended = false;
	GpuMat img_gpu;
	GpuMat keypoints_gpu;
	GpuMat descriptors_gpu;
	auto img = imread("aloe.png", IMREAD_GRAYSCALE);
	img_gpu.upload(img);
	std::uint64_t sum = 0;
	volatile auto t0 = std::chrono::high_resolution_clock::now().time_since_epoch().count();
	for (int i = N; i--;){
		surf(img_gpu, GpuMat(), keypoints_gpu, descriptors_gpu);
		std::vector<float> descriptors;
		surf.downloadDescriptors(descriptors_gpu, descriptors);
		sum += descriptors.size();
	}
	volatile auto t1 = std::chrono::high_resolution_clock::now().time_since_epoch().count();
	std::cout << sum << std::endl;
	std::cout << (t1 - t0) * hrc / N << std::endl;
}

the measured time is 4.34 ms per call for the CPU and 2.13 ms per call for the GPU. Commenting out downloadDescriptors() makes a difference of 0.1 ms.

I get roughly the same times on my setup, making the GPU version ~2x quicker on that image. Previously I was getting poor results on the CPU because the first call to

surf->detectAndCompute(src, Mat(), keypoints, descriptors);

takes over a second and that was included in the timing. If you make a dummy call to that before starting your timing you should get similar timing results without so many iterations.
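That is, something like this just before the timed loop in test_surf_performance1() (a sketch; the warm-up variables are throwaways):

	//Untimed warm-up call so the expensive first detectAndCompute() is excluded from the measurement.
	{
		std::vector<KeyPoint> warmup_keypoints;
		std::vector<float> warmup_descriptors;
		surf->detectAndCompute(src, Mat(), warmup_keypoints, warmup_descriptors);
	}
	volatile auto t0 = std::chrono::high_resolution_clock::now().time_since_epoch().count();
	//...timed loop as before...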

If I make the test image 4x larger (1280x1180), which may not be a realistic test feature-wise, I get a ~5x speedup (72.4917 ms vs 14.811 ms).
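For reference, a sketch of one way to build such a scaled-up test input (the doubling factor and interpolation mode here are my own choices, not necessarily how the figures above were produced):

#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>

cv::Mat make_large_test_image(){
	cv::Mat src = cv::imread("aloe.png", cv::IMREAD_GRAYSCALE);
	cv::Mat big;
	cv::resize(src, big, cv::Size(), 2.0, 2.0, cv::INTER_LINEAR); //2x per axis, roughly 4x the pixels
	return big;
}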