I have been experimenting with the asynchronous feature detection calls, but when I time them it seems they still block the CPU for a significant amount of time.
I tried to detect ORB features with code similar to the following:
#include <opencv2/cudafeatures2d.hpp>

void detectFeatures(cv::cuda::GpuMat& greyscaleImage)
{
    // nfeatures = 300, otherwise default pyramid/FAST settings, blurForDescriptor = true
    cv::Ptr<cv::cuda::ORB> orb = cv::cuda::ORB::create(300, 1.2f, 8, 31, 0, 2, 0, 31, 20, true);
    cv::cuda::Stream stream;
    cv::cuda::GpuMat keypoints, descriptors;

    // Enqueue detection + description on the stream, then wait for the GPU to finish.
    orb->detectAndComputeAsync(greyscaleImage, cv::noArray(), keypoints, descriptors, false, stream);
    stream.waitForCompletion();
}
I would expect the asynchronous call to defer most of the CPU time to stream.waitForCompletion(); however, only around 2 ms is spent on that line, while detectAndComputeAsync itself still takes around 12 ms.
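For reference, the timing split comes from wrapping the two calls roughly like this (a minimal sketch; the ORB parameters are trimmed to just nfeatures here):

#include <chrono>
#include <iostream>
#include <opencv2/cudafeatures2d.hpp>

// Minimal timing sketch: how long does the "async" enqueue itself block the
// CPU, versus how long do we actually wait on the stream afterwards?
void timeDetection(const cv::cuda::GpuMat& greyscaleImage)
{
    cv::Ptr<cv::cuda::ORB> orb = cv::cuda::ORB::create(300);
    cv::cuda::Stream stream;
    cv::cuda::GpuMat keypoints, descriptors;

    const auto t0 = std::chrono::steady_clock::now();
    orb->detectAndComputeAsync(greyscaleImage, cv::noArray(), keypoints, descriptors, false, stream);
    const auto t1 = std::chrono::steady_clock::now();
    stream.waitForCompletion();
    const auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "detectAndComputeAsync: " << ms(t1 - t0).count() << " ms, "
              << "waitForCompletion: " << ms(t2 - t1).count() << " ms" << std::endl;
}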
I tried separating the calls into detectAsync and computeAsync, and it looks like the blocking time is mostly spent in the detection part. I also tried FAST feature detection and found a similar issue, with the majority of the time spent on the CPU. Turning off non-max suppression reduced the blocking time, but there doesn’t seem to be an option for this on the ORB detector, and disabling it would presumably reduce the quality of the features.
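The FAST variant was along these lines (a rough sketch; the threshold of 20 is just an example, and the second argument to create() is the non-max suppression switch):

#include <opencv2/cudafeatures2d.hpp>

// FAST comparison: same asynchronous call pattern as above, with non-max
// suppression toggled via the second argument to create().
void detectFast(const cv::cuda::GpuMat& greyscaleImage, bool nonmaxSuppression)
{
    cv::Ptr<cv::cuda::FastFeatureDetector> fast =
        cv::cuda::FastFeatureDetector::create(20, nonmaxSuppression);

    cv::cuda::Stream stream;
    cv::cuda::GpuMat keypoints;

    // FAST produces no descriptors, so only the detection call is issued here.
    fast->detectAsync(greyscaleImage, keypoints, cv::noArray(), stream);
    stream.waitForCompletion();
}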
I’ve also tried running the detection multiple times (so first-call initialisation is excluded), changing the detector options, and preallocating the GpuMat memory for the keypoints and descriptors, but nothing seems to help.
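The preallocation attempt was roughly this pattern, keeping the detector, stream and output GpuMats alive between frames so their buffers can be reused (a sketch; the class and member names are just illustrative):

#include <opencv2/cudafeatures2d.hpp>

// Preallocation/reuse attempt: keep the detector, stream and output GpuMats
// alive between frames instead of recreating them on every call.
class OrbDetector
{
public:
    OrbDetector()
        : m_orb(cv::cuda::ORB::create(300, 1.2f, 8, 31, 0, 2, 0, 31, 20, true))
    {
    }

    void detect(const cv::cuda::GpuMat& greyscaleImage)
    {
        // m_keypoints / m_descriptors are reused across calls; OpenCV only
        // reallocates them if the required size or type changes.
        m_orb->detectAndComputeAsync(greyscaleImage, cv::noArray(),
                                     m_keypoints, m_descriptors, false, m_stream);
        m_stream.waitForCompletion();
    }

private:
    cv::Ptr<cv::cuda::ORB> m_orb;
    cv::cuda::Stream m_stream;
    cv::cuda::GpuMat m_keypoints, m_descriptors;
};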
Is there anything else I can look at in my setup, or is this expected behaviour and these functions are not really asynchronous?