OpenCV’s T-API (UMats) is asynchronous by design: tasks run in the background until
- either the result is requested via UMat::getMat(),
- or synchronization is invoked manually via cv::ocl::finish().
Judging from the source code, I am under the impression that, at least in some cases, memory deallocation is also asynchronous. This leads to allocation failures when memory-expensive calls are performed in a loop, even if cv::ocl::finish() is invoked explicitly. For example:
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>

int main()
{
    auto const nLoops = 5;
    auto const imageWidth = 46340; // Image size ~ 2 GiB.
    for (int iLoop = 0; iLoop < nLoops; ++iLoop)
    {
        std::cout << "Loop " << iLoop << " begins.\n";
        {
            // bigImage will be destroyed as soon as it is out of scope.
            auto const bigImage = cv::UMat::zeros(imageWidth, imageWidth, CV_8UC1);
        }
        cv::ocl::finish();
        std::cout << "\n";
    }
    std::cout << "Success!";
}
Here the expensive function is cv::UMat::zeros(); I observed analogous behavior with cv::fastNlMeansDenoising().
Running the above example with
- OpenCV 4.1.1,
- an NVIDIA GeForce GTX 1050 with 4 GiB of RAM (~3.3 GiB available for OpenCL),
- NVIDIA driver version 446.14,
- Visual Studio 2019 16.6.5,
- Windows 10 64 bit,
I get
Loop 0 begins.
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (888) cv::ocl::haveOpenCL Initialize OpenCL runtime...
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (430) cv::ocl::OpenCLBinaryCacheConfigurator::OpenCLBinaryCacheConfigurator Successfully initialized OpenCL cache directory: C:\Users\ANGELO~1.PER\AppData\Local\Temp\opencv\4.1\opencl_cache\
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (454) cv::ocl::OpenCLBinaryCacheConfigurator::prepareCacheDirectoryForContext Preparing OpenCL cache configuration for context: NVIDIA_Corporation--GeForce_GTX_1050--446_14
Loop 1 begins.
OpenCL error CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) during call: clEnqueueNDRangeKernel('set', dims=2, globalsize=11776x46344x1, localsize=NULL) sync=false
OpenCV(4.1.1) Error: Unknown error code -220 (OpenCL error CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) during call: clEnqueueReadBuffer(q, handle=000001C44EA8A970, CL_TRUE, 0, sz=2147395600, data=000001C491D93080, 0, 0, 0)) in cv::ocl::OpenCLAllocator::map, file C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp, line 5089
OpenCV: terminate handler is called! The last OpenCV error is:
OpenCV(4.1.1) Error: Unknown error code -220 (OpenCL error CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) during call: clEnqueueReadBuffer(q, handle=000001C44EA8A970, CL_TRUE, 0, sz=2147395600, data=000001C491D93080, 0, 0, 0)) in cv::ocl::OpenCLAllocator::map, file C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp, line 5089
A hacky workaround is to allocate a small UMat after the call to cv::ocl::finish(): if I understand correctly, the cleanup queue is flushed during UMat allocation.
    […]
        }
        cv::ocl::finish();
        // It seems that an allocation flushes the cleanup queue!
        auto const cleanupQueueFlusher = cv::UMat::zeros(1, 1, CV_8UC1);
        std::cout << "\n";
    }
    […]
Output:
Loop 0 begins.
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (888) cv::ocl::haveOpenCL Initialize OpenCL runtime...
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (430) cv::ocl::OpenCLBinaryCacheConfigurator::OpenCLBinaryCacheConfigurator Successfully initialized OpenCL cache directory: C:\Users\ANGELO~1.PER\AppData\Local\Temp\opencv\4.1\opencl_cache\
[ INFO:0] global C:\tools\vcpkg\buildtrees\opencv4\src\4.1.1-fb9e10326a.clean\modules\core\src\ocl.cpp (454) cv::ocl::OpenCLBinaryCacheConfigurator::prepareCacheDirectoryForContext Preparing OpenCL cache configuration for context: NVIDIA_Corporation--GeForce_GTX_1050--446_14
Loop 1 begins.
Loop 2 begins.
Loop 3 begins.
Loop 4 begins.
Success!
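For reference, the workaround can be wrapped in a small helper. This is only a sketch of my hack, not an official API: the function name is my own invention, and it relies on the undocumented, version-specific observation that a UMat allocation appears to flush the OpenCL allocator's deferred-deallocation queue.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>

// Hypothetical helper (my naming): synchronize the OpenCL queue, then
// allocate a tiny UMat, which -- as observed above with OpenCV 4.1.1 --
// seems to flush the allocator's pending-deallocation queue as a side
// effect of the allocation path.
void finishAndFlushOclCleanupQueue()
{
    cv::ocl::finish(); // Wait for queued OpenCL work to complete.
    // The 1x1 allocation is the hack: its only purpose is to trigger
    // the cleanup-queue flush inside the OpenCL allocator.
    auto const cleanupQueueFlusher = cv::UMat::zeros(1, 1, CV_8UC1);
}
```

Each loop iteration would then call finishAndFlushOclCleanupQueue() instead of cv::ocl::finish() alone, but this remains a workaround built on an implementation detail, not on any guarantee.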
What is the proper / official / API-provided way to make sure that unused memory is actually deallocated after a T-API function call?
Relevant GitHub issue.