Asynchronous feature detection not running asynchronously

I have been experimenting with the asynchronous feature detection calls, but when I time them it seems they still block the CPU for a significant amount of time.

I tried to detect ORB features with code similar to the following:

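#include <opencv2/core/cuda.hpp>
#include <opencv2/cudafeatures2d.hpp>

void detectFeatures(cv::cuda::GpuMat& greyscaleImage)
{
    // Args: nfeatures, scaleFactor, nlevels, edgeThreshold, firstLevel, WTA_K, scoreType, patchSize, fastThreshold, blurForDescriptor
    cv::Ptr<cv::cuda::ORB> orb = cv::cuda::ORB::create(300, 1.2f, 8, 31, 0, 2, 0, 31, 20, true);
    cv::cuda::Stream stream;
    cv::cuda::GpuMat keypoints, descriptors;
    // Queue detection and description on the stream (no mask, useProvidedKeypoints = false).
    orb->detectAndComputeAsync(greyscaleImage, cv::noArray(), keypoints, descriptors, false, stream);
    // Block the host until everything queued on the stream has finished.
    stream.waitForCompletion();
}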

I would expect the asynchronous call to return quickly and most of the host time to be spent in stream.waitForCompletion(). However, only around 2 ms is spent on that line, while detectAndComputeAsync itself still takes around 12 ms.

I tried separating the calls into detectAsync and computeAsync, and it looks like the blocking time is mostly spent in the detection part. I also tried FAST feature detection and found a similar issue, with the majority of the time being spent on the CPU. Turning off non-maximum suppression helped reduce the blocking time, but there doesn’t seem to be an option to change this in the ORB detector, and doing so might reduce the quality of the features.

I’ve tried various things such as running multiple times, changing various options and preallocating the memory for the keypoints and descriptors, however nothing seems to help.

Is there anything else I can look at in my setup, or is this expected and these functions are not really asynchronous?

I think calling the function detectAndComputeAsync is a little misleading. I would guess the name exists either because the CUDA feature detectors inherit from Feature2DAsync, which has these method names, or because at some point Async in the name meant that the function took CUDA streams. Either way, the function requires calls to cudaStreamSynchronize throughout, which is what causes the host-side delay you are seeing.
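To illustrate why the host time ends up in the call itself rather than in the wait, here is a CPU-only stand-in. It is not OpenCV or CUDA code: `enqueueWork` and `detectWithInternalSync` are hypothetical names, a `std::future` plays the role of the stream, and the durations are made up. The sketch assumes only what is described above, namely that the detection stage synchronizes before the call returns.

```cpp
#include <chrono>
#include <future>
#include <thread>

using Ms = std::chrono::milliseconds;

// Stand-in for work queued on a stream: it runs on another thread, and the
// returned future is what the caller would later wait on.
std::future<void> enqueueWork(Ms duration)
{
    return std::async(std::launch::async,
                      [duration] { std::this_thread::sleep_for(duration); });
}

// Mimics an "Async" call that synchronizes internally.
std::future<void> detectWithInternalSync(Ms detectTime, Ms computeTime)
{
    // The host blocks here until the detection stage is done
    // (the role played by cudaStreamSynchronize in the explanation above).
    enqueueWork(detectTime).get();
    // Only the remaining stage is still pending when the call returns.
    return enqueueWork(computeTime);
}
```

With `detectWithInternalSync(Ms(12), Ms(2))`, the call itself takes at least 12 ms on the host and only the last stage is left for the caller's wait, which is the same shape as the timings in the question.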


Thanks for the explanation, good to know that this is expected.

related: c++ - OpenCV asynchronous ORB feature detection function is blocking on the CPU - Stack Overflow