CUDA: SIFT or SURF, disappointed by execution timings

I’m trying to read a video frame by frame with Python and extract features from every frame. SIFT and SURF on the CPU are too slow, and we have a high-end GPU, so I’ve been experimenting with OpenCV’s cv::cuda::SURF_CUDA (see the OpenCV cv::cuda::SURF_CUDA class reference). The speed really isn’t much better than on the CPU (something like 40 ms per frame, I think), when I expected it to be 12x faster or more.

I’ve tried PopSift and CudaSift as well, and neither consistently ran under 10 ms per frame. I’m using a GTX 1660 GPU for testing.

    for (;;)
    {
        cap >> frame;
        if (frame.empty()) break;
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);

        // OpenCV SURF
        start = std::clock();
        img1.upload(gray);
        surf->detect(img1, keypoints);
        img1.download(desc);
        std::cout << "Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
    }

You are clocking the upload and download too. Why?

In a real application you would stream the data, so the upload of the next frame and the download of the previous result can happen WHILE the calculation on the current frame takes place.
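To make the overlap idea concrete, here is a minimal sketch using only the standard library. It is not OpenCV code: `run_pipelined`, `load` and `process` are hypothetical names, with `load` standing in for "decode and upload frame i" and `process` for "extract features from the current frame". Frames are plain `int`s purely for illustration.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Sketch of a two-stage pipeline: while frame i is being processed,
// frame i+1 is already being fetched on another thread.
// `load` and `process` are hypothetical stand-ins (see the text above).
template <typename Load, typename Process>
std::vector<int> run_pipelined(int frame_count, Load load, Process process) {
    assert(frame_count >= 0);
    std::vector<int> results;
    if (frame_count == 0) return results;

    // Start fetching frame 0 before entering the loop.
    std::future<int> next = std::async(std::launch::async, load, 0);
    for (int i = 0; i < frame_count; ++i) {
        const int current = next.get();        // wait until frame i is available
        if (i + 1 < frame_count) {
            // Kick off the fetch of frame i+1 ...
            next = std::async(std::launch::async, load, i + 1);
        }
        results.push_back(process(current));   // ... while frame i is processed
    }
    return results;
}
```

With OpenCV's CUDA module the analogous mechanism would presumably be `cv::cuda::Stream` together with the stream-taking overloads of `GpuMat::upload` and `GpuMat::download`; whether a given algorithm actually overlaps with the transfers depends on its implementation.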

Did you benchmark JUST the upload and download, without the calculation?
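One way to answer that question is to time each stage into its own bucket instead of one lump sum. The helper below is a sketch under that assumption; `StageTimer` is a hypothetical name, not part of OpenCV, and it measures wall-clock time with `std::chrono::steady_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <string>

// Accumulates wall-clock time per named stage so that upload, compute and
// download can be compared separately.
class StageTimer {
public:
    // Runs `stage` once and adds its duration to the bucket called `name`.
    void run(const std::string& name, const std::function<void()>& stage) {
        assert(stage);  // an empty std::function would throw when called
        const auto t0 = std::chrono::steady_clock::now();
        stage();
        const auto t1 = std::chrono::steady_clock::now();
        total_ms_[name] += std::chrono::duration<double, std::milli>(t1 - t0).count();
        ++calls_[name];
    }

    // Number of times the stage `name` has been run.
    int calls(const std::string& name) const {
        const auto it = calls_.find(name);
        return it == calls_.end() ? 0 : it->second;
    }

    // Mean milliseconds per call for `name`, or 0 if the stage never ran.
    double mean_ms(const std::string& name) const {
        const int n = calls(name);
        return n == 0 ? 0.0 : total_ms_.at(name) / n;
    }

private:
    std::map<std::string, double> total_ms_;
    std::map<std::string, int> calls_;
};
```

In the loop above this could be used as `timer.run("upload", [&] { img1.upload(gray); });`, with similar calls for the detection and the download, printing `mean_ms` for each stage after the loop. The numbers are only meaningful if each timed call has finished its GPU work before returning.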

I just included it because the upload and download only took a couple of ms anyway.
If I remove the upload and download from the timed region, like this, it still takes about 50 ms per frame on average on a 1080p video using an NVIDIA GTX 1660.

#include <opencv2/opencv.hpp>
#include <opencv2/opencv_modules.hpp>
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/cudafeatures2d.hpp>
#include <opencv2/cudaarithm.hpp>
#include <opencv2/xfeatures2d/cuda.hpp>

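#include <stdio.h>
#include <ctime>
#include <string>
#include <vector>
#include <iostream>

#include <opencv2/features2d.hpp>

// string, vector, Mat, VideoCapture and KeyPoint are used unqualified below.
using namespace cv;
using namespace std;

int main()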
{
    cv::cuda::printShortCudaDeviceInfo(cv::cuda::getDevice());
    cv::cuda::SURF_CUDA surf;
    std::clock_t start;
    string filename = "1080.mp4";
    VideoCapture cap(filename);

    Mat frame;
    if (!cap.isOpened())
    {
        std::cerr << "Couldn't open capture." << std::endl;
        return -1;
    }

    cv::Mat bgr_frame, gray, canny;

    cv::cuda::GpuMat img1;
    vector<KeyPoint> keypoints;
    cv::Mat desc;

    for (;;)
    {
        cap >> frame;
        if (frame.empty()) break;
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        unsigned char* dataMat = gray.data;
        // OpenCV SURF: upload outside the timed region, then time detection only
        img1.upload(gray);
        start = std::clock();
        surf(img1, cv::cuda::GpuMat(), keypoints);
        std::cout << "Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
        // Note: this copies the grayscale image back to the host, not descriptors
        img1.download(desc);

        char c = cv::waitKey(10);
        if (c == 27) break;
    }
    cap.release();
    return 0;
}
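One detail worth checking in the listing above: it times with `std::clock`, which the C++ standard defines as processor time used by the process, not elapsed wall-clock time (some implementations deviate from this). If the CPU thread sleeps while it waits for the GPU, the two can differ. The sketch below measures the same callable with both clocks; `measure_both` and `idle_wait_50ms` are hypothetical helper names, and the sleep only stands in for a thread blocked on another device.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

struct Elapsed {
    double cpu_ms;   // processor time reported by std::clock
    double wall_ms;  // elapsed time reported by std::chrono::steady_clock
};

// Runs `fn` once and reports its duration according to both clocks.
template <typename Fn>
Elapsed measure_both(Fn&& fn) {
    const std::clock_t c0 = std::clock();
    assert(c0 != static_cast<std::clock_t>(-1));  // -1 means processor time is unavailable
    const auto w0 = std::chrono::steady_clock::now();
    fn();
    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();
    return { 1000.0 * (c1 - c0) / CLOCKS_PER_SEC,
             std::chrono::duration<double, std::milli>(w1 - w0).count() };
}

// A wait that uses almost no CPU, standing in for a blocked host thread.
inline void idle_wait_50ms() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```

Calling `measure_both(idle_wait_50ms)` on a typical Linux system reports a wall-clock time of at least 50 ms and a much smaller CPU time. Whether a wait on the GPU burns CPU time or sleeps depends on how the driver synchronizes, so for frame timings `std::chrono::steady_clock` or `cv::getTickCount` is probably the safer choice.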

Here’s my device information from OpenCV too:

Device information:
    Name: NVIDIA GeForce GTX 1660
    Compute Capability:    7.5
    Total device mem:      6441992192 B 6291008 kB 6143 MB
    Per-block shared mem:  49152
    Warp size:             32
    Max threads per block: 1024
    Max threads per SM(X): 1024
    Max block sizes:       {1024,1024,64}
    Max grid sizes:        {2147483647,65535,65535}
    Number of SM(x)s:      22
    Concurrent kernels:    yes
    Mapping host memory:   yes
    Unified addressing:    yes

Looking back at SURF_CUDA performance, the GPU performance of SURF is not that great and is heavily dependent on the image and its size. In the end I was seeing ~14.8 ms per 1280x1180 frame on an RTX 3070 Ti.

Now that texture references have been removed (opencv/opencv_contrib pull request #3378 by cudawarped, "Fix CUDA texture bugs and replace all instances of CUDA texture references with texture objects"), it may be possible to improve the performance of SURF by taking advantage of CUDA streams, but that would probably not be an easy fix.

So if CUDA-based SIFT and SURF aren’t much faster with OpenCV, does that leave only PopSift (GitHub alicevision/popsift, an implementation of the SIFT algorithm in CUDA) and CudaSift (GitHub Celebrandil/CudaSift, a CUDA implementation of SIFT for NVidia GPUs, advertised as 1.2 ms on a GTX 1060) as fast SIFT alternatives?

Because in my testing PopSift ran very slowly: 200 ms per frame on default settings, with the code below, on the same GPU and 1080p video:

// main.cpp
#include <opencv2/opencv.hpp>
#include <opencv2/opencv_modules.hpp>
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/cudafeatures2d.hpp>
#include <opencv2/cudaarithm.hpp>
#include <opencv2/xfeatures2d/cuda.hpp>

#include <popsift/popsift.h>
#include <popsift/features.h>
#include <stdio.h>
#include <ctime>
#include <string>
#include <iostream>

#include <opencv2/features2d.hpp>

using namespace cv;
using namespace std;

int main()
{

    cudaDeviceReset();
    
    std::clock_t start;
    popsift::Config config;
    config.setDownsampling(0);
    config.setFilterMaxExtrema(false);
    config.setVerbose(true);
    
    PopSift PopSift(
        config, 
        popsift::Config::ExtractingMode,
        PopSift::ByteImages
    );
    

    popsift::cuda::device_prop_t deviceInfo;
    deviceInfo.print();
    
    string filename = "1080.mp4";
    VideoCapture cap(filename); 
        
    Mat frame;
    if (!cap.isOpened())
	{
		std::cerr << "Couldn't open capture." << std::endl;
		return -1;
	}
	
	cv::Mat bgr_frame, gray;

    cv::cuda::GpuMat img1, img2;
    cv::cuda::GpuMat keypoints1GPU;
    cv::cuda::GpuMat descriptors1GPU;

    for (;;)
    {
        cap >> frame;
        if (frame.empty()) break;
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        unsigned char* dataMat = gray.data;

        // PopSift: enqueue() returns immediately, get() blocks until the result is ready
        SiftJob* job = PopSift.enqueue(frame.cols, frame.rows, dataMat);
        start = std::clock();
        popsift::Features* feature_list = job->get();
        std::cout << "Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
        cerr << "Number of feature points: " << feature_list->getFeatureCount()
             << " number of feature descriptors: " << feature_list->getDescriptorCount()
             << endl;

        // Release the per-frame results, assuming the caller owns them
        // (as in PopSift's bundled example application)
        delete feature_list;
        delete job;

        char c = cv::waitKey(10);
        if (c == 27) break;
    }

    cap.release();
    return 0;
}

And with CudaSift I get around 10 ms per frame, but some frames take as long as 40 ms:

#include <iostream>  
#include <cmath>
#include <iomanip>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>

#include "cudaImage.h"
#include "cudaSift.h"

using namespace cv;
using namespace std;
using std::string;

int main(int argc, char** argv){
  /* Reserve memory space for a whole bunch of SIFT features. */
  SiftData siftData;
  InitSiftData(siftData, 185000, true, true);
  CudaImage img;

  int numOctaves = 2;    /* Number of octaves in Gaussian pyramid */
  float initBlur = 1.0f; /* Amount of initial Gaussian blurring in standard deviations */
  float thresh = 3.5f;   /* Threshold on difference of Gaussians for feature pruning */
  float minScale = 0.0f; /* Minimum acceptable scale to remove fine-scale features */
  bool upScale = false;  /* Whether to upscale image before extraction */


  std::clock_t start;
  cout << "Is mainSift running?";
  string filename = "1080.mp4";
  VideoCapture cap(filename); 
  Mat frame, tmp;
  if (!cap.isOpened())
	{
		std::cerr << "Couldn't open capture." << std::endl;
		return -1;
	}

  for (;;)
  {
    cap.read(frame);
    if (frame.empty()) break;

    cv::imshow("frame", frame);

    /* CudaSift expects a single-channel float image, so convert to grayscale first. */
    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    gray.convertTo(tmp, CV_32FC1);

    int64 t1 = cv::getTickCount();

    /* Allocate a device image matching the frame size; the pitch is the width   */
    /* rounded up to a multiple of 128 floats, as in CudaSift's mainSift.cpp.    */
    /* Memory on host side already allocated by OpenCV is reused.                */
    img.Allocate(gray.cols, gray.rows, iAlignUp(gray.cols, 128), false, NULL, (float*) tmp.data);
    img.Download();

    /* Extract SIFT features */
    ExtractSift(siftData, img, numOctaves, initBlur, thresh, minScale, upScale);
    int64 t2 = cv::getTickCount();

    double time_elapsed = (t2 - t1) / cv::getTickFrequency() * 1000.0;
    std::cout << "Time elapsed: " << time_elapsed << " ms" << std::endl;

    char c = cv::waitKey(10);
    if (c == 27) break;
  }
  cap.release();
  /* Free space allocated for SIFT features */
  FreeSiftData(siftData);
  return 0;
}

I’m not very familiar with C++. Is there something wrong with the loop?
I don’t understand how these repos claim timings on the order of ~1-4 ms on a weaker GPU.

I can’t comment on the other methods, but I would definitely compare all three methods on the same image, running a couple of iterations before measuring the timing (to ensure you are not timing the context initialization), as described in the thread I linked to, and see which one is fastest. Additionally, to avoid the upload overhead I would use cv::cudacodec::VideoReader (created with cv::cudacodec::createVideoReader()) to read directly into a cv::cuda::GpuMat.
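The warm-up advice can be sketched as a small harness. This is a standard-library-only illustration, not something taken from OpenCV or the linked thread: `median_ms` is a hypothetical name, and the warm-up and iteration counts are arbitrary choices.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Calls `fn` `warmup` times without timing (to absorb one-off costs such as
// context initialization), then `iters` times with timing, and returns the
// median wall-clock time per call in milliseconds.
template <typename Fn>
double median_ms(Fn&& fn, int warmup, int iters) {
    assert(warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup; ++i) fn();

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iters));
    for (int i = 0; i < iters; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 ? samples[mid]
                              : 0.5 * (samples[mid - 1] + samples[mid]);
}
```

Each of the three extractors could then be wrapped in a lambda that processes the same image, for example `median_ms([&] { surf(img1, cv::cuda::GpuMat(), keypoints); }, 5, 50)`. The median is used here rather than the mean so that occasional slow frames do not dominate the result.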

Those two repos apparently don’t make use of GpuMat anyway, and I’ve tested on the same 1080p video and a secondary video. The only difference I can think of is that for PopSift they read a folder of images with DevIL and enqueued the extraction, while I’m running a blocking video reader loop. So unless the video reader loop is the issue, I can’t see what the problem is. In any case, PopSift takes a pointer to a char array as input, so I can’t see why it would make any difference that the data is stored in an OpenCV frame.