OpenCV Optical Flow CUDA Naive Implementation Slower than CPU

Performance Issue with OpenCV CUDA Optical Flow Compared to CPU Implementation

Background:

I have compiled OpenCV 4.8 with CUDA support and Python bindings in a virtual environment on a Windows system. My goal was to leverage the GPU acceleration for optical flow calculations to achieve faster performance compared to the CPU-based implementation.

Rig Specifications:

  • RAM: 16 GB
  • GPU: NVIDIA RTX 3070 Ti (8 GB)
  • CPU: AMD Ryzen 7 2700
  • Software:
    • OpenCV Version: 4.8.0
    • CUDA Toolkit Version: 11.2
    • Python Version: 3.8.10
    • Development Environment: VSCode

Test-Video: slow_traffic_small.mp4

Issue Description:

While testing a naive GPU implementation of optical flow using OpenCV’s CUDA functions, I observed slower execution times on the GPU compared to the CPU. This outcome contradicts my expectations based on the hardware capabilities and the nature of the tasks being offloaded to the GPU.

Code Under Test:

The code snippet in question is a part of a sparse optical flow calculation using OpenCV’s CUDA modules, sourced from an online repository: opencv-experiments by CudaWarped. The key part involves capturing video frames, converting them to grayscale, detecting features, and then calculating optical flow for these features over successive frames.

# Simplified code snippet for brevity
cap = cv2.VideoCapture(vidPath)
# Initialization and feature detection omitted for brevity
optFlow = cv2.cuda_SparsePyrLKOpticalFlow.create()
while True:
    ret, frame = cap.read()
    if not ret: break
    # Grayscale conversion, optical flow calculation, and drawing omitted for brevity
cv2.destroyAllWindows()
cap.release()
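
For context, the elided parts of the loop look roughly like the sketch below on the GPU. This is a hedged reconstruction, not the exact notebook code: the detector parameter values (100, 0.3, 7, 7) are the usual tutorial defaults and are assumed here.

import cv2

cap = cv2.VideoCapture(vidPath)          # vidPath as above
ret, old_frame = cap.read()
old_gray_device = cv2.cuda_GpuMat(cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY))

# GPU feature detector and sparse pyramidal LK tracker
detector_device = cv2.cuda.createGoodFeaturesToTrackDetector(cv2.CV_8UC1, 100, 0.3, 7, 7)
optFlow = cv2.cuda_SparsePyrLKOpticalFlow.create()
p0_device = detector_device.detect(old_gray_device)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Grayscale conversion on the CPU, then upload; calc() allocates its output
    # GpuMats every iteration and synchronizes because no stream is passed
    frame_gray_device = cv2.cuda_GpuMat(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    p1_device, st_device, err_device = optFlow.calc(old_gray_device, frame_gray_device,
                                                    p0_device, None)
    old_gray_device = frame_gray_device
    p0_device = p1_device

cap.release()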

Expected vs. Actual Outcome:

  • Expected: Based on documentation and benchmarks, I expected the GPU implementation to outperform the CPU, specifically by around a 56% speedup over the CPU’s execution time.
  • Actual Results:
    • CPU Execution Time: 1.0905214 seconds
    • GPU Execution Time: 1.6272229 seconds
    • Contrary to expectations, the GPU implementation was approximately 48% slower than the CPU.

To confirm that Python OpenCV CUDA is configured and functioning correctly, I also ran another test, from CudaWarped’s optimization repository, involving background subtraction with CUDA, which yielded the expected speed boost over the CPU.

Additional Information:

  • CUDA device information was verified through OpenCV’s cuda.printCudaDeviceInfo(0), confirming the presence and specifications of the NVIDIA GeForce RTX 3070 Ti (a minimal check along these lines is shown after this list).
  • Both the CPU and GPU tests were conducted under the same environmental conditions to ensure comparability.
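
For reference, the kind of check used (standard OpenCV CUDA calls):

import cv2

# Confirm the build and the CUDA device before benchmarking
print(cv2.__version__)                          # 4.8.0 here
print(cv2.cuda.getCudaEnabledDeviceCount())     # should be >= 1
cv2.cuda.printCudaDeviceInfo(0)                 # prints the RTX 3070 Ti details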

Question:

  1. Given the specifications of my rig and the software versions used, what could be contributing to the observed slower performance of the CUDA-based optical flow calculation compared to its CPU counterpart? In particular, can you highlight why there is such a large gap from the expected result, even for a naive implementation?

  2. Are there any potential optimizations or configuration adjustments that I might be missing to fully leverage the GPU acceleration for optical flow calculations in OpenCV?

The naive implementation (Naive CUDA implementation without pre-alloc, streams or other optimizations)

  1. allocates the return arrays on the GPU in each iteration, which is costly, and
  2. calls cudaDeviceSynchronize (a hard sync) on every iteration because you are not passing a CUDA stream. As a result, the timing will be off if you measured it with the code from that notebook, which uses CPU timers, not GPU timers: because of the synchronization, the measurement includes the launch latency (the time between calling optFlow.calc and the kernel actually executing) of every kernel launch. A sketch of pre-allocating the outputs and passing a stream follows this list.
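
A minimal sketch of the difference, assuming old_gray_device, frame_gray_device and p0_device already hold two consecutive grayscale frames and the detected points (illustrative only, not the notebook code):

import cv2

optFlow = cv2.cuda_SparsePyrLKOpticalFlow.create()

# Naive call: the output GpuMats are allocated on every call and, with no stream
# argument, the call synchronizes with the device before returning
p1_device, st_device, err_device = optFlow.calc(old_gray_device, frame_gray_device,
                                                p0_device, None)

# Pre-allocated outputs reused across frames, launched asynchronously on a stream
stream = cv2.cuda.Stream()
p1_device = cv2.cuda_GpuMat(p0_device.size(), p0_device.type())
st_device = cv2.cuda_GpuMat(p0_device.size(), cv2.CV_8U)
err_device = cv2.cuda_GpuMat(p0_device.size(), cv2.CV_32F)
p1_device, st_device, err_device = optFlow.calc(old_gray_device, frame_gray_device,
                                                p0_device, p1_device, status=st_device,
                                                err=err_device, stream=stream)
stream.waitForCompletion()   # synchronize once, when the results are actually needed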

That said, I have no idea if the code will be faster on your RTX 3070 than your Ryzen 7 2700.

Thank you, @cudawarped. Your resources on CUDA optimization in OpenCV are invaluable.

Regarding your answer, I have the following points of confusion that I still don’t understand:

  1. Sparse Optical Flow CPU vs. GPU Performance:
    Your comparison of sparse optical flow on CPU vs. GPU showed a 56% speed up on the GPU using the same naive implementation. Even if my rig is not the same, the performance shouldn’t reverse to a 40% decrease, should it?

  2. Comparison with Background Subtraction Test:
    | That said, I have no idea if the code will be faster on your RTX 3070 than your Ryzen 7 2700.
    As I mentioned, I tested your optimization repository’s naive CPU vs. GPU comparison for background subtraction, and it gave me an 11x speed boost using the following code:

# Note: ProcVid0, frame_device, cols_big, rows_big, lr, check_res and cpu_time_0
# are defined earlier in the notebook; cv is `import cv2 as cv`
bgmog2_device = cv.cuda.createBackgroundSubtractorMOG2()
def ProcFrameCuda0(frame, lr, store_res=False):
    frame_device.upload(frame)
    frame_device_big = cv.cuda.resize(frame_device, (cols_big, rows_big))
    fg_device_big = bgmog2_device.apply(frame_device_big, lr, cv.cuda.Stream_Null())
    fg_device = cv.cuda.resize(fg_device_big, frame_device.size())
    fg_host = fg_device.download()
    if(store_res):
        gpu_res.append(np.copy(fg_host))

gpu_res = []
gpu_time_0, n_frames = ProcVid0(partial(ProcFrameCuda0, store_res=check_res), lr)
print(f'GPU 0 (naive): {n_frames} frames, {gpu_time_0:.2f} ms/frame')
print(f'Speedup over CPU: {cpu_time_0/gpu_time_0:.2f}')

Could the size of the image/frame that we are processing be the cause? In the optical flow case, the test video was 640x360, while in the background subtraction test the image is 1440x2560 (after resizing).

Also, the OpenCV version you were using in the sparse optical flow notebook is 4.1, and I have 4.8. Maybe something changed between versions that is causing this?

My idea was to get the naive implementation working as it should and then investigate optimizations, if that makes sense. :innocent:

The timing on the CPU is not accurate. I wrote that notebook to demonstrate how to call the CUDA version of SparsePyrLKOpticalFlow, not to compare the performance of the two, which is why I explicitly called it “Naive CUDA implementation without pre-alloc, streams or other optimizations”. I’m not even sure why the timing is there, but it should not be used as a reference.

The best way to time CUDA is to use device timers (CUDA Events), but then the comparison with the CPU version becomes unfair.
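
For reference, device-side timing looks roughly like this, assuming cv2.cuda.Event and its elapsedTime helper are exposed in your build, and reusing the pre-allocated objects from the example below:

import cv2

# Time the GPU loop with CUDA events (device timers) instead of time.perf_counter()
start, stop = cv2.cuda.Event(), cv2.cuda.Event()
stream = cv2.cuda.Stream()

start.record(stream)
for frame_gray_device in frames_device:
    optFlow.calc(old_gray_device, frame_gray_device, p0_device, p1_device,
                 status=st_device, err=err, stream=stream)
    frame_gray_device.copyTo(stream, old_gray_device)
    p1_device.copyTo(stream, p0_device)
stop.record(stream)
stop.waitForCompletion()

print(f'GPU time: {cv2.cuda.Event_elapsedTime(start, stop):.2f} ms')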

If you want to use CPU timers, then I would advise removing all the CPU calls from the GPU version and timing the execution over many frames as a whole, e.g.:

# Load all frames (cap, old_frame, feature_params and lk_params are assumed to be
# set up as in the notebook / OpenCV LK tutorial)
frame_gray_device = cv2.cuda_GpuMat()
frames_device = []
frames = []
ret, frame = cap.read()
while(ret):
    frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    frame_gray_device.upload(frame_gray)
    frames.append(frame_gray.copy())
    frames_device.append(frame_gray_device.clone())
    ret, frame = cap.read()

# GPU
old_gray_device = cv2.cuda_GpuMat(cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY))
detector_device =  cv2.cuda.createGoodFeaturesToTrackDetector(cv2.CV_8UC1, feature_params['maxCorners'], \
                                           feature_params['qualityLevel'], feature_params['minDistance'], \
                                           feature_params['blockSize'])
p0_device = detector_device.detect(old_gray_device)
p1_device = cv2.cuda_GpuMat(p0_device.size(), p0_device.type())
st_device = cv2.cuda_GpuMat(p0_device.size(),cv2.CV_8U)
err = cv2.cuda_GpuMat(p0_device.size(),cv2.CV_32F)
optFlow = cv2.cuda_SparsePyrLKOpticalFlow.create()
stream = cv2.cuda.Stream()
t = time.perf_counter()
for frame_gray_device in frames_device:
    _,_,_ = optFlow.calc(old_gray_device,frame_gray_device,p0_device,p1_device, status = st_device, err=err, stream = stream)
    frame_gray_device.copyTo(stream, old_gray_device)
    p1_device.copyTo(stream, p0_device)
stream.waitForCompletion()
etime = (time.perf_counter() - t)
print(etime)

# CPU
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
p0 = cv2.goodFeaturesToTrack(old_gray, mask = None, **feature_params)
t = time.perf_counter()
for frame_gray in frames:
    p1, st, err = cv2.calcOpticalFlowPyrLK(old_gray, frame_gray, p0, None, **lk_params)
    old_gray = frame_gray.copy()
    p0 = p1

etime = (time.perf_counter() - t)
print(etime)

Note: I haven’t checked that the results are correct; the above is just an example of what I mean by removing all the CPU calls.

I would expect that larger images would get more of a performance boost on the GPU, but I haven’t checked this. Once you have your timing sorted, this should be easy for you to verify by resizing the source images before processing.
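
For example, one quick way to test the image-size hypothesis once the timing is sorted (a sketch only: the scale factor is arbitrary, and the features need re-detecting on the resized first frame):

import cv2

# Upscale the preloaded grayscale frames before timing, to test whether larger
# images favour the GPU more (scale factor is arbitrary)
scale = 4
frames_big = [cv2.resize(f, None, fx=scale, fy=scale) for f in frames]
frames_big_device = []
for f in frames_big:
    g = cv2.cuda_GpuMat()
    g.upload(f)
    frames_big_device.append(g)

# Re-detect the features on the resized first frame, then rerun the GPU and CPU
# timing loops above on frames_big_device / frames_big
old_gray_big = frames_big[0]
p0 = cv2.goodFeaturesToTrack(old_gray_big, mask=None, **feature_params)
p0_device = detector_device.detect(cv2.cuda_GpuMat(old_gray_big))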