How to use cuda::SparsePyrLKOpticalFlow in a multi-threaded environment

keivan_moazami · July 26, 2022, 6:30am

When I use cuda::SparsePyrLKOpticalFlow in a multi-threaded application, it fails to calculate optical flow properly.
I run it on different cuda::GpuMats and in separate cuda::Streams with separate Algorithm instances.
I create a new instance with the cuda::SparsePyrLKOpticalFlow::create() method in each thread, but I still have the same problem.

cudawarped · July 26, 2022, 12:17pm

If you are getting different results when using separate threads, that would point to some global memory usage (this used to be common in functions which called CUDA NPP under the hood). The other possibility is that you are not calling waitForCompletion() on your stream after cuda::SparsePyrLKOpticalFlow::calc().

In either case it might be worth testing with the default stream (cuda::Stream::Null()) or without streams (which should have the same effect) to see if you get consistent results across threads. If not, it may be caused by something else.
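
One way to run that check is a small harness that executes the same tracking job on several threads and compares every result with the first. The sketch below is a hypothetical helper, not part of OpenCV: resultsAgreeAcrossThreads and the job callable are assumed names, and in practice job would wrap the upload, calc(), download and waitForCompletion() sequence for one fixed pair of frames.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not an OpenCV API): runs `job` once per thread and
// reports whether every thread produced exactly the same output.
// `job` is assumed to return the tracked points flattened as x0, y0, x1, y1, ...
bool resultsAgreeAcrossThreads(const std::function<std::vector<float>()>& job,
                               std::size_t threadCount)
{
    assert(threadCount > 0);
    std::vector<std::vector<float>> results(threadCount);
    std::vector<std::thread> workers;
    workers.reserve(threadCount);

    // Each worker writes only to its own slot, so no locking is needed here.
    for (std::size_t i = 0; i < threadCount; ++i)
        workers.emplace_back([&results, &job, i] { results[i] = job(); });
    for (std::thread& worker : workers)
        worker.join();

    // Compare every thread's output with the first thread's output.
    for (std::size_t i = 1; i < threadCount; ++i)
        if (results[i] != results[0])
            return false;
    return true;
}
```

The comparison is an exact one, on the assumption that identical inputs should give identical outputs; if small floating-point differences turn out to be expected on your setup, a tolerance would be more appropriate. Running the harness once with cuda::Stream::Null() and once with per-thread streams should show whether the streams are what makes the results diverge.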

keivan_moazami · July 27, 2022, 9:02am

Thank you for your response.
I call waitForCompletion() after calling calc(), and I use a separate stream in each thread.
Could it be a bug,
like this one: CUDA GoodFeaturesToTrackDetector is not ThreadSafe ? · Issue #18051 · opencv/opencv · GitHub
Here is the code that I run in separate threads.

cuda::GpuMat GpuFrame;
GpuFrame.upload(frame, mForwardOpticalFlowStream);
cuda::GpuMat GpuLastFrame;
GpuLastFrame.upload(mLastFrame, mForwardOpticalFlowStream);
		
cuda::GpuMat gOldFeaturePoints, gNewFeaturePoints, gStatus, gErrors;
trackerUtils::uploadVector(allFeaturePoints, gOldFeaturePoints, CV_32FC2, mForwardOpticalFlowStream);
mSparseLK->calc(GpuLastFrame, GpuFrame, gOldFeaturePoints, gNewFeaturePoints, gStatus, gErrors, mForwardOpticalFlowStream);

trackerUtils::downloadVector(gStatus, status, CV_8UC1, mForwardOpticalFlowStream);
trackerUtils::downloadVector(gErrors, errors, CV_32FC1, mForwardOpticalFlowStream);
trackerUtils::downloadVector(gNewFeaturePoints, newFeaturePoints, CV_32FC2, mForwardOpticalFlowStream);
mForwardOpticalFlowStream.waitForCompletion();

cuda::GpuMat GpuFrameBackward;
GpuFrameBackward.upload(frame, mBackwardOpticalFlowStream);
cuda::GpuMat GpuLastFrameBackward;
GpuLastFrameBackward.upload(mLastFrame, mBackwardOpticalFlowStream);
cuda::GpuMat gNewBackwardFeaturePoints, gBackwardStatus, gBackwardErrors;
mSparseLKBackward->calc(GpuFrameBackward, GpuLastFrameBackward, gNewFeaturePoints, gNewBackwardFeaturePoints, gBackwardStatus, gBackwardErrors, mBackwardOpticalFlowStream);

trackerUtils::downloadVector(gBackwardStatus, backwardStatus, CV_8UC1, mBackwardOpticalFlowStream);
trackerUtils::downloadVector(gBackwardErrors, backwardErrors, CV_32FC1, mBackwardOpticalFlowStream);
trackerUtils::downloadVector(gNewBackwardFeaturePoints, newBackwardFeaturePoints, CV_32FC2, mBackwardOpticalFlowStream);
mBackwardOpticalFlowStream.waitForCompletion();

cudawarped · July 27, 2022, 9:33am

That doesn’t look like a bug; the routine wasn’t thread safe (the algorithm was using global memory, as I mentioned above) until the author submitted this PR.

Are mSparseLK and mSparseLKBackward created inside each thread?

I’ll try to take a look later.

keivan_moazami · July 27, 2022, 9:36am

mSparseLK and mSparseLKBackward are both in one thread

cudawarped · July 27, 2022, 9:38am

Do you create a separate instance of them in each thread, or create one instance of each and pass those to each thread?

keivan_moazami · July 27, 2022, 9:40am

I create a separate instance of them in each thread

cudawarped · July 27, 2022, 12:13pm

Just checked: SparsePyrLKOpticalFlow is definitely not thread safe. It uses constant and texture memory, both of which are globally defined:

github.com/opencv/opencv_contrib/blob/9d0a451bee4cdaf9d3f76912e5abac6000865f1a/modules/cudaoptflow/src/cuda/pyrlk.cu#L61

          #include "opencv2/core/cuda/filters.hpp"
          #include "opencv2/core/cuda/border_interpolate.hpp"
          
          #include <iostream>
          
          using namespace cv::cuda;
          using namespace cv::cuda::device;
          
          namespace pyrlk
          {
              __constant__ int c_winSize_x;
              __constant__ int c_winSize_y;
              __constant__ int c_halfWin_x;
              __constant__ int c_halfWin_y;
              __constant__ int c_iters;
          
              texture<uchar, cudaTextureType2D, cudaReadModeNormalizedFloat> tex_I8U(false, cudaFilterModeLinear, cudaAddressModeClamp);
              texture<uchar4, cudaTextureType2D, cudaReadModeNormalizedFloat> tex_I8UC4(false, cudaFilterModeLinear, cudaAddressModeClamp);
          
              texture<ushort4, cudaTextureType2D, cudaReadModeNormalizedFloat> tex_I16UC4(false, cudaFilterModeLinear, cudaAddressModeClamp);

There appear to be versions which don’t use texture memory for both short and int types,

github.com/opencv/opencv_contrib/blob/9d0a451bee4cdaf9d3f76912e5abac6000865f1a/modules/cudaoptflow/src/cuda/pyrlk.cu#L774

                  else
                      sparseKernel<cn, PATCH_X, PATCH_Y, false, T> <<<grid, block, 0, stream >>>(prevPts, nextPts, status, err, level, rows, cols);
          
                  cudaSafeCall(cudaGetLastError());
          
                  if (stream == 0)
                      cudaSafeCall(cudaDeviceSynchronize());
              }
          };
          // Specialization to use non texture path because for some reason the texture path keeps failing accuracy tests
          template<int PATCH_X, int PATCH_Y> class sparse_caller<1, PATCH_X, PATCH_Y, unsigned short>
          {
          public:
              typedef typename TypeVec<unsigned short, 1>::vec_type work_type;
              typedef PtrStepSz<work_type> Ptr2D;
              typedef BrdConstant<work_type> BrdType;
              typedef BorderReader<Ptr2D, BrdType> Reader;
              typedef LinearFilter<Reader> Filter;
              static void call(Ptr2D I, Ptr2D J, int rows, int cols, const float2* prevPts, float2* nextPts, uchar* status, float* err, int ptcount,
                  int level, dim3 block, cudaStream_t stream)
              {

so I would guess but cannot confirm that these should be thread safe as long as you use the same window size and number of iterations per thread.
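
To see why separate instances do not help when the parameters differ, here is a CPU-only analogy. It assumes a simplified model in which every instance writes its window size into one shared global before launching, in the same way the __constant__ variables above are shared. ToyTracker, g_winSize and the method names are made up for illustration and are not OpenCV code.

```cpp
#include <cassert>

// Stand-in for a module-level __constant__ variable: one copy per process,
// shared by every instance of the algorithm.
static int g_winSize = 0;

// Toy model of an algorithm object (hypothetical, not OpenCV code).
class ToyTracker
{
public:
    explicit ToyTracker(int winSize) : winSize_(winSize) { assert(winSize > 0); }

    // Step 1 of a call: copy this instance's setting into the shared global.
    void uploadSettings() const { g_winSize = winSize_; }

    // Step 2 of a call: the "kernel" reads the global, not the member.
    int windowSeenByKernel() const { return g_winSize; }

private:
    int winSize_;
};

// Replays one possible interleaving of two threads and returns the window
// size that tracker `a` actually ends up computing with.
int windowUsedAfterInterleaving(const ToyTracker& a, const ToyTracker& b)
{
    a.uploadSettings();             // thread A prepares its call
    b.uploadSettings();             // thread B runs in between and overwrites it
    return a.windowSeenByKernel();  // thread A's kernel now sees B's value
}
```

With window sizes 21 and 11, tracker a ends up computing with 11. With identical settings the overwrite is harmless, which is why using the same window size and iteration count in every thread should take the constants out of the picture. The globally defined textures are a separate concern, since the image bound to them presumably differs from thread to thread.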

So, in conclusion, you could try short or int if applicable, rewrite the routine to avoid textures (or maybe use texture objects), use cuda::Stream::Null(), which will hurt performance, or maybe see how slow Nvidia’s hardware-accelerated dense optical flow is for your problem.

keivan_moazami · July 29, 2022, 2:29pm

I use the same window size and number of iterations in all threads.
What are the short and int types? The pixel data type?

cudawarped · July 29, 2022, 2:33pm

Yes, the pixel data type.

I would start by confirming that using the default stream cv::cuda::Stream::Null() gives you the same result in each thread. If so, see if the execution time of that is acceptable. If not, then try different data types.

keivan_moazami · July 30, 2022, 11:51am

Yes, the problem was resolved by converting the pixel format from CV_8U to CV_16U:
frame.convertTo(shortFrame, CV_16U);
