Performance Issue with OpenCV CUDA Optical Flow Compared to CPU Implementation
Background:
I have compiled OpenCV 4.8 with CUDA support and Python bindings in a virtual environment on a Windows system. My goal was to leverage the GPU acceleration for optical flow calculations to achieve faster performance compared to the CPU-based implementation.
Rig Specifications:
- RAM: 16 GB
- GPU: NVIDIA RTX 3070 Ti (8 GB)
- CPU: AMD Ryzen 7 2700
- Software:
  - OpenCV Version: 4.8.0
  - CUDA Toolkit Version: 11.2
  - Python Version: 3.8.10
  - Development Environment: VSCode
Test-Video: slow_traffic_small.mp4
Issue Description:
While testing a naive GPU implementation of optical flow using OpenCV’s CUDA functions, I observed slower execution times on the GPU compared to the CPU. This outcome contradicts my expectations based on the hardware capabilities and the nature of the tasks being offloaded to the GPU.
Code Under Test:
The code snippet in question is part of a sparse optical flow calculation using OpenCV's CUDA modules, sourced from the opencv-experiments repository by CudaWarped. The key part captures video frames, converts them to grayscale, detects features, and then calculates optical flow for those features over successive frames.
# Simplified code snippet for brevity
import cv2

cap = cv2.VideoCapture(vidPath)
# Initialization and feature detection omitted for brevity
optFlow = cv2.cuda_SparsePyrLKOpticalFlow.create()
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Grayscale conversion, optical flow calculation, and drawing omitted for brevity
cap.release()
cv2.destroyAllWindows()
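For context, the omitted per-frame body follows the usual naive pattern: each frame is uploaded to the device, converted to grayscale there, and the tracked points are downloaded back to the host every iteration. A sketch of one iteration follows; the function and variable names are mine, and I am assuming the cv2.cuda Python bindings behave as shown, so treat it as illustrative rather than the repository's exact code:

```python
def gpu_step(optFlow, prev_gray_gpu, frame_bgr_gpu, prev_pts_gpu):
    """One iteration of the naive loop, assuming cv2.cuda bindings.

    prev_gray_gpu / frame_bgr_gpu / prev_pts_gpu are cv2.cuda_GpuMat
    instances; the BGR frame is assumed to have been uploaded already.
    """
    import cv2  # deferred so the sketch is readable without a CUDA build

    # Colour conversion runs on the device...
    next_gray_gpu = cv2.cuda.cvtColor(frame_bgr_gpu, cv2.COLOR_BGR2GRAY)
    # ...but every frame still costs a host->device upload beforehand and
    # a device->host download of the results afterwards.
    result = optFlow.calc(prev_gray_gpu, next_gray_gpu, prev_pts_gpu, None)
    next_pts_gpu, status_gpu = result[0], result[1]
    return next_gray_gpu, next_pts_gpu.download(), status_gpu.download()
```

I suspect this per-frame upload/download traffic is part of where the naive version loses time, which is partly why I am asking the questions below.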
Expected vs. Actual Outcome:
- Expected: Based on documentation and benchmarks, I anticipated that the GPU implementation would outperform the CPU, expecting roughly a 56% speed-up over the CPU's execution time.
- Actual Results:
- CPU Execution Time: 1.0905214 seconds
- GPU Execution Time: 1.6272229 seconds
- Contrary to expectations, the GPU implementation was approximately 49% slower than the CPU (1.6272 s vs. 1.0905 s).
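For what it's worth, timings of this kind should exclude a warm-up pass, since the first CUDA call pays one-off context-creation and initialization costs. A minimal sketch of such a wall-clock harness (time_pipeline is an illustrative name, not from the original script):

```python
import time

def time_pipeline(run, warmup=1, repeats=3):
    """Return the mean wall-clock time of run(), excluding warm-up calls.

    The warm-up runs absorb one-off costs (CUDA context creation,
    lazy kernel initialization) so they don't pollute the measurement.
    """
    for _ in range(warmup):
        run()
    start = time.perf_counter()
    for _ in range(repeats):
        run()
    return (time.perf_counter() - start) / repeats
```

Something like `time_pipeline(cpu_run)` vs. `time_pipeline(gpu_run)` gives comparable per-run averages; without the warm-up pass the GPU figure would also include one-off initialization.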
To confirm that the Python OpenCV CUDA build is configured and functioning correctly, I also ran a second test, from CudaWarped's optimization repository, involving background subtraction with CUDA; it yielded the expected speed-up over the CPU.
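That sanity check looked roughly like the following. This is a sketch from memory rather than the repository's exact script: run_mog2_sanity_check is my own name, and the -1.0 learning rate and stream argument reflect my understanding of the cv2.cuda.createBackgroundSubtractorMOG2 binding:

```python
def run_mog2_sanity_check(vid_path):
    """Run CUDA MOG2 background subtraction once over a video.

    Returns the number of frames processed; a sketch assuming the
    cv2.cuda bindings are available.
    """
    import cv2  # deferred so the sketch is readable without a CUDA build

    cap = cv2.VideoCapture(vid_path)
    mog = cv2.cuda.createBackgroundSubtractorMOG2()
    gpu_frame = cv2.cuda_GpuMat()
    stream = cv2.cuda.Stream()
    n = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        gpu_frame.upload(frame, stream)              # host -> device
        fg_gpu = mog.apply(gpu_frame, -1.0, stream)  # default learning rate
        fg = fg_gpu.download(stream)                 # device -> host
        n += 1
    cap.release()
    return n
```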
Additional Information:
- CUDA device information was verified through OpenCV's cuda.printCudaDeviceInfo(0), confirming the presence and specifications of the NVIDIA GeForce RTX 3070 Ti.
- Both the CPU and GPU tests were conducted under the same environmental conditions to ensure comparability.
Questions:
- Given my rig's specifications and the software versions used, what could be contributing to the slower performance of the CUDA-based optical flow calculation compared to its CPU counterpart? In particular, why is the gap from the expected result so large, even for a naive implementation?
- Are there any optimizations or configuration adjustments I might be missing to fully leverage GPU acceleration for optical flow calculations in OpenCV?