OpenCV’s T-API (UMats) is asynchronous by design: tasks run in the background until
- either the result is requested via UMat::getMat(),
- or synchronization is invoked manually via cv::ocl::finish().
Judging from the source code, I am under the impression that, at least in some cases, memory deallocation is also asynchronous. This leads to allocation failures when memory-expensive calls are performed in a loop, even if cv::ocl::finish() is invoked explicitly. For example:
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>

int main()
{
    auto const nLoops = 5;
    auto const imageWidth = 46340; // Image size ~ 2 GiB.
    for (int iLoop = 0; iLoop < nLoops; ++iLoop)
    {
        std::cout << "Loop " << iLoop << " begins.\n";
        {
            // bigImage will be destroyed as soon as it is out of scope.
            auto const bigImage = cv::UMat::zeros(imageWidth, imageWidth, CV_8UC1);
        }
        cv::ocl::finish();
        std::cout << "\n";
    }
    std::cout << "Success!";
}
Here the expensive function is cv::UMat::zeros(); I observed analogous behavior with cv::fastNlMeansDenoising().
Running the above example with
- OpenCV 4.1.1,
- an NVIDIA GeForce GTX 1050 with 4 GiB of RAM (~3.3 GiB available for OpenCL),
- NVIDIA driver version 446.14,
- Visual Studio 2019 16.6.5,
- Windows 10 64 bit,
I get
Loop 0 begins.
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (888) cv::ocl::haveOpenCL Initialize OpenCL runtime...
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (430) cv::ocl::OpenCLBinaryCacheConfigurator::OpenCLBinaryCacheConfigurator Successfully initialized OpenCL cache directory: C:\Users\ANGELO~1.PER\AppData\Local\Temp\opencv\4.1\opencl_cache\
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (454) cv::ocl::OpenCLBinaryCacheConfigurator::prepareCacheDirectoryForContext Preparing OpenCL cache configuration for context: NVIDIA_Corporation--GeForce_GTX_1050--446_14
Loop 1 begins.
OpenCL error CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) during call: clEnqueueNDRangeKernel('set', dims=2, globalsize=11776x46344x1, localsize=NULL) sync=false
OpenCV(4.1.1) Error: Unknown error code -220 (OpenCL error CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) during call: clEnqueueReadBuffer(q, handle=000001C44EA8A970, CL_TRUE, 0, sz=2147395600, data=000001C491D93080, 0, 0, 0)) in cv::ocl::OpenCLAllocator::map, file C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp, line 5089
OpenCV: terminate handler is called! The last OpenCV error is:
OpenCV(4.1.1) Error: Unknown error code -220 (OpenCL error CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) during call: clEnqueueReadBuffer(q, handle=000001C44EA8A970, CL_TRUE, 0, sz=2147395600, data=000001C491D93080, 0, 0, 0)) in cv::ocl::OpenCLAllocator::map, file C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp, line 5089
A hacky workaround is to allocate a small UMat after the call to cv::ocl::finish(): if I understand correctly, the cleanup queue is flushed during UMat allocation.
    […]
        }
        cv::ocl::finish();
        // It seems that an allocation flushes the cleanup queue!
        auto const cleanupQueueFlusher = cv::UMat::zeros(1, 1, CV_8UC1);
        std::cout << "\n";
    }
    […]
Output:
Loop 0 begins.
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (888) cv::ocl::haveOpenCL Initialize OpenCL runtime...
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (430) cv::ocl::OpenCLBinaryCacheConfigurator::OpenCLBinaryCacheConfigurator Successfully initialized OpenCL cache directory: C:\Users\ANGELO~1.PER\AppData\Local\Temp\opencv\4.1\opencl_cache\
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (454) cv::ocl::OpenCLBinaryCacheConfigurator::prepareCacheDirectoryForContext Preparing OpenCL cache configuration for context: NVIDIA_Corporation--GeForce_GTX_1050--446_14
Loop 1 begins.
Loop 2 begins.
Loop 3 begins.
Loop 4 begins.
Success!
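For reference, the workaround can be wrapped in a small helper. This is only a sketch of my hack, not an official API: the function name is my own invention, and it relies on the undocumented, version-specific observation that a UMat allocation appears to flush the OpenCL allocator's deferred-deallocation queue.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>

// Hypothetical helper (my naming): synchronize the OpenCL queue, then
// allocate a tiny UMat, which -- as observed above with OpenCV 4.1.1 --
// seems to flush the allocator's pending-deallocation queue as a side
// effect of the allocation path.
void finishAndFlushOclCleanupQueue()
{
    cv::ocl::finish(); // Wait for queued OpenCL work to complete.
    // The 1x1 allocation is the hack: its only purpose is to trigger
    // the cleanup-queue flush inside the OpenCL allocator.
    auto const cleanupQueueFlusher = cv::UMat::zeros(1, 1, CV_8UC1);
}
```

Each loop iteration would then call finishAndFlushOclCleanupQueue() instead of cv::ocl::finish() alone, but this remains a workaround built on an implementation detail, not on any guarantee.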
What is the proper / official / API-provided way to make sure that unused memory is actually deallocated after a T-API function call?
Relevant GitHub issue.