OpticalFlowDual_TVL1_Impl::calc keeps allocating memory

Hi

I’m building an application that uses OpenCV’s OpticalFlowDual_TVL1_Impl. It basically takes a sequence of image frames and estimates optical flow motion vectors. While profiling with Nsight Systems, I found that memory is allocated and freed every time I call OpticalFlowDual_TVL1_Impl::calc.

I found where the memory allocation happens in the source code:

    void OpticalFlowDual_TVL1_Impl::calc(InputArray _frame0, InputArray _frame1, InputOutputArray _flow, Stream& stream)
    {
        const GpuMat frame0 = _frame0.getGpuMat();
        const GpuMat frame1 = _frame1.getGpuMat();

        BufferPool pool(stream);
        GpuMat flowx = pool.getBuffer(frame0.size(), CV_32FC1);
        GpuMat flowy = pool.getBuffer(frame0.size(), CV_32FC1);

        calcImpl(frame0, frame1, flowx, flowy, stream);

        GpuMat flows[] = {flowx, flowy};
        cuda::merge(flows, 2, _flow, stream);
    }

The BufferPool’s getBuffer method keeps allocating memory, and the buffer is automatically deallocated when it goes out of scope, every time ‘calc’ is called.

I’ve looked up BufferPool, and it is basically a memory pool that tries to reduce the number of actual memory allocation/deallocation calls.

So I followed the example here: OpenCV: cv::cuda::BufferPool Class Reference

    ...
    setBufferPoolUsage(true);
    setBufferPoolConfig(getDevice(), 1024 * 1024 * 64, 2);

    Stream stream1;

    ...
    cv::Ptr<cv::cuda::OpticalFlowDual_TVL1> motionEstimatorCUDA;
    motionEstimatorCUDA = cv::cuda::OpticalFlowDual_TVL1::create(tau, lambda, theta, nscales, warps, epsilon, iterations, scaleStep, gamma, useInitialFlow);
    ...
    ...
    motionEstimatorCUDA->calc(mat1, mat2, outputFlow);

But it still keeps allocating/deallocating memory whenever pool.getBuffer is called.

What am I missing?

In Nsight Systems, how are you determining that memory allocation is being performed? Calls to cudaMallocPitch? If so, how do you know these are related to flowx/flowy and not coming from somewhere else?

Internally there are a lot of allocations performed, although looking at the code these should only happen the first time OpticalFlowDual_TVL1_Impl::calcImpl() is called.

I’ve just run a quick test below to verify BufferPool is working correctly, and I cannot see any issues with it.

    constexpr int w = 1024 * 10, h = w;
    constexpr int n = 5;
    GpuMat frame0(h, w, CV_8UC1);
    GpuMat frame1(h, w, CV_8UC1);
    DeviceInfo info;

    // Without the BufferPool: every GpuMat constructor allocates device memory.
    {
        const size_t memFreeStart = info.freeMemory();
        for (int i = 0; i < n; i++) {
            const size_t memFreeBeforeAllocation = info.freeMemory();
            GpuMat flowx(frame0.size(), frame0.type());
            GpuMat flowy(frame1.size(), frame1.type());
            const size_t memFreeAfterAllocation = info.freeMemory();
            std::cout << "Memory allocated in loop: " << (memFreeBeforeAllocation - memFreeAfterAllocation) / (1024.0 * 1024.0) << "MB" << std::endl;
        }
        const size_t memFreeEnd = info.freeMemory();
        std::cout << "Memory allocated after loop: " << (memFreeStart - memFreeEnd) / (1024.0 * 1024.0) << "MB" << std::endl;
    }

    // With the BufferPool: getBuffer draws from the preallocated stack.
    {
        setBufferPoolUsage(true);
        setBufferPoolConfig(getDevice(), h * w * 2, 1);
        Stream stream;
        BufferPool pool(stream);
        const size_t memFreeStart = info.freeMemory();
        for (int i = 0; i < n; i++) {
            const size_t memFreeBeforeAllocation = info.freeMemory();
            GpuMat flowx = pool.getBuffer(frame0.size(), frame0.type());
            GpuMat flowy = pool.getBuffer(frame1.size(), frame1.type());
            const size_t memFreeAfterAllocation = info.freeMemory();
            std::cout << "Memory allocated in loop: " << (memFreeBeforeAllocation - memFreeAfterAllocation) / (1024.0 * 1024.0) << "MB" << std::endl;
        }
        const size_t memFreeEnd = info.freeMemory();
        std::cout << "Memory allocated after loop: " << (memFreeStart - memFreeEnd) / (1024.0 * 1024.0) << "MB" << std::endl;
    }

Hi. Thanks for your reply.

Yes, I found that cudaMallocPitch is called twice; the other allocations are fine, they happen just once.

I confirmed it with a debug build so I could step into the source code: getBuffer is calling cudaMallocPitch every time.

getBuffer internally calls GpuMat::create, and it always creates a new array because rows and cols are always zero.

In your code, ‘BufferPool pool(stream)’ is defined before the for-loop, but OpticalFlowDual_TVL1_Impl creates its pool every time ‘calc’ is called. Maybe that is why?

I would check this again: if rows and cols are zero then rows != _rows, as _rows and _cols are the size (frame0.size()) passed into getBuffer:

    pool.getBuffer(frame0.size(), CV_32FC1);

You want to check the status of allocSuccess, but make sure it is called by

        GpuMat flowx = pool.getBuffer(frame0.size(), CV_32FC1);
        GpuMat flowy = pool.getBuffer(frame0.size(), CV_32FC1);

and not by the other create methods inside the OpticalFlowDual_TVL1_Impl::calcImpl() function. If it's false, step into the allocate method to see what the issue is.
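For reference, the relevant logic in GpuMat::create looks roughly like this; this is a condensed paraphrase of the OpenCV 4.x source, not verbatim code:

    void GpuMat::create(int _rows, int _cols, int _type)
    {
        // If the existing buffer already matches, nothing new is allocated.
        if (rows == _rows && cols == _cols && type() == _type && data)
            return;

        if (data)
            release();

        // allocator is the pool's StackAllocator for mats obtained via getBuffer.
        bool allocSuccess = allocator->allocate(this, _rows, _cols, elemSize());
        if (!allocSuccess)
        {
            // The pool allocation failed (e.g. the configured stack is too
            // small), so OpenCV silently falls back to the default allocator,
            // which calls cudaMallocPitch.
            allocator = defaultAllocator();
            allocSuccess = allocator->allocate(this, _rows, _cols, elemSize());
            CV_Assert(allocSuccess);
        }
    }

If allocSuccess is false on the first attempt, the cudaMallocPitch calls you see in Nsight come from that fallback path, which behaves exactly as if the pool were not being used.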

I wondered about that, but in my test code moving the BufferPool creation inside the loop still worked as expected.
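That is, something like this (a sketch of the modified test loop, mimicking what calc does):

    for (int i = 0; i < n; i++) {
        BufferPool pool(stream); // pool constructed inside the loop, as calc does
        GpuMat flowx = pool.getBuffer(frame0.size(), frame0.type());
        GpuMat flowy = pool.getBuffer(frame1.size(), frame1.type());
        // free memory still stays constant after the first iteration
    }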

Hi, thanks for your reply and advice.

After some work on my code following your advice, the BufferPool allocation/deallocation is now working as expected. However, I'm now having a different problem.

I should reveal more of my code to get advice. Here it is:

    class OpenCVOptFlow
    {
    public:
        OpenCVOptFlow(cudaStream_t cudaStream)
        {
            // Optical Flow Parameters
            const double tau = 0.25;
            double lambda = 0.1;
            const double theta = 0.3;
            int nscales = 2;
            const int warps = 4;
            const double epsilon = 0.01;
            int iterations = 2;
            const double scaleStep = 0.5;
            const double gamma = 0;
            const int medianFiltering = 0;
            const bool useInitialFlow = false;

            cv::cuda::setBufferPoolUsage(true);
            cv::cuda::setBufferPoolConfig(cv::cuda::getDevice(), 1024 * 1024, 10);
            myOpenCVCUDAStream = cv::cuda::StreamAccessor::wrapStream(cudaStream);
            myMotionEstimatorCUDA = cv::cuda::OpticalFlowDual_TVL1::create(tau, lambda, theta, nscales, warps, epsilon, iterations, scaleStep, gamma, useInitialFlow);
        }

        void run(unsigned char* I0,
            unsigned char* I1,
            int numRows,
            int numCols,
            int srcPitch,
            float* dstFlow,
            int dstPitch)
        {
            // wrapping native GPU pointers with cv::cuda::GpuMat
            cv::cuda::GpuMat srcMat(numRows, numCols, CV_8U, I0, srcPitch);
            cv::cuda::GpuMat prevMat(numRows, numCols, CV_8U, I1, srcPitch);
            cv::cuda::GpuMat dstMat(numRows, numCols, CV_32FC2, dstFlow, dstPitch);

            // run Optical flow
            myMotionEstimatorCUDA->calc(srcMat, prevMat, dstMat, myOpenCVCUDAStream);
            ...
            ...
        }

    private:
        cv::cuda::Stream myOpenCVCUDAStream;
        cv::Ptr<cv::cuda::OpticalFlowDual_TVL1> myMotionEstimatorCUDA;
    };

With this code, the memory allocation inside OpticalFlowDual_TVL1::calc happens just one time. However, I'm getting an error in this part:

    void OpticalFlowDual_TVL1_Impl::calcImpl(const GpuMat& I0, const GpuMat& I1, GpuMat& flowx, GpuMat& flowy, Stream& stream)
    {
    ...
    ...
            if (!useInitialFlow_)
            {
                u1s[nscales_-1].setTo(Scalar::all(0), stream); // <- cudaErrorInvalidValue
                u2s[nscales_-1].setTo(Scalar::all(0), stream);
            }
    }

The GpuMat::setTo function calls cudaMemset2DAsync or cudaMemset2D depending on the stream, like this, and I'm getting cudaErrorInvalidValue:

    GpuMat& cv::cuda::GpuMat::setTo(Scalar value, Stream& stream)
    {
        CV_DbgAssert( !empty() );
        CV_DbgAssert( depth() <= CV_64F && channels() <= 4 );

        if (value[0] == 0.0 && value[1] == 0.0 && value[2] == 0.0 && value[3] == 0.0)
        {
            // Zero fill

            if (stream)
                CV_CUDEV_SAFE_CALL( cudaMemset2DAsync(data, step, 0, cols * elemSize(), rows, StreamAccessor::getStream(stream)) ); // <- cudaErrorInvalidValue
            else
                CV_CUDEV_SAFE_CALL( cudaMemset2D(data, step, 0, cols * elemSize(), rows) ); // <- cudaErrorInvalidValue

            return *this;
        }

    ...

I tried passing cv::cuda::Stream::Null() to ‘calc’, but the error is the same. What am I doing wrong?

By the way, the OpenCV version is 4.8, compiled 64-bit with VS2022 on Windows 10.

Thank you.

There could be multiple causes; I need to see a complete MRE if you want me to look into why it's failing.

It's not a good idea to have the wrapped stream as a member of your class. Are you sure the CUDA stream you're wrapping inside your class is not destroyed before the call to myMotionEstimatorCUDA->calc? It would be better to create the stream internally or pass it to your run method, as sketched below.
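Something along these lines (a sketch keeping your names, with the stream wrapped per call rather than stored as a member):

    void run(unsigned char* I0,
        unsigned char* I1,
        int numRows,
        int numCols,
        int srcPitch,
        float* dstFlow,
        int dstPitch,
        cudaStream_t cudaStream) // pass the raw stream in per call
    {
        // Wrap the caller's stream here so its lifetime is clearly tied
        // to the cudaStream_t that is valid for this call.
        cv::cuda::Stream stream = cv::cuda::StreamAccessor::wrapStream(cudaStream);

        cv::cuda::GpuMat srcMat(numRows, numCols, CV_8U, I0, srcPitch);
        cv::cuda::GpuMat prevMat(numRows, numCols, CV_8U, I1, srcPitch);
        cv::cuda::GpuMat dstMat(numRows, numCols, CV_32FC2, dstFlow, dstPitch);

        myMotionEstimatorCUDA->calc(srcMat, prevMat, dstMat, stream);
    }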

I’ve checked it with Nsight, and it turns out that the memory allocated by the BufferPool was on a different device. I’m currently using 2 graphics cards, and the pool’s memory was allocated on device 1 while device 0 was selected. Very strange.

I’m quite sure that the CUDA stream is alive during the myMotionEstimatorCUDA->calc call.

Anyway, thank you for your support.

Make sure the device you are manually creating streams and other CUDA objects on is the same as the one OpenCV is using. The memory allocated in the pool should be on the device returned by the call to getDevice(), so I would interrogate this.

The first thing I would do is explicitly set the device you want to use (call cudaSetDevice) before calling any other CUDA code, and then check that everything is allocated on the device you selected.
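For example, a minimal sketch assuming device 0 is the one you want:

    // Select the device explicitly before any other CUDA call.
    cudaError_t err = cudaSetDevice(0);
    CV_Assert(err == cudaSuccess);

    // Point OpenCV at the same device and confirm it took effect.
    cv::cuda::setDevice(0);
    CV_Assert(cv::cuda::getDevice() == 0);

    // Only then configure and use the buffer pool, so the stack is
    // allocated on the device you actually selected.
    cv::cuda::setBufferPoolUsage(true);
    cv::cuda::setBufferPoolConfig(cv::cuda::getDevice(), 1024 * 1024, 10);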