cv2.VideoCapture will only output host/CPU frames. I am not sure exactly how the hardware acceleration works internally.
cudacodec.VideoReader decodes directly to device/GPU memory. If you build from the master branch, you now have the option to output BGR, BGRA, GRAY or NV12 (YUV), with BGRA as the default. The decoder currently decodes everything to NV12, so if you choose the BGR output format it runs an extra CUDA kernel over each frame to perform the conversion.
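
To make the host/device difference concrete, here is a minimal sketch of reading one frame each way. It assumes an OpenCV build with CUDA and the cudacodec module, and it assumes the Python bindings expose the output formats as cv2.cudacodec.ColorFormat_<NAME> and accept them through reader.set(); check both against your build. The function names and OUTPUT_FORMATS are made up for illustration, and NV12 is left out because I am not certain what its enum is called in the bindings.

```python
# Output formats from the answer above that this sketch handles.
# NV12 is omitted: its enum name in the bindings is not assumed here.
OUTPUT_FORMATS = ("BGR", "BGRA", "GRAY")


def read_host_frame(path):
    """Read one frame with cv2.VideoCapture; the result lives in host memory."""
    import cv2  # imported lazily so this file loads without OpenCV installed

    cap = cv2.VideoCapture(path)
    try:
        ok, frame = cap.read()  # frame is a numpy.ndarray on the CPU
        return frame if ok else None
    finally:
        cap.release()


def read_device_frame(path, output_format="BGRA"):
    """Read one frame with cudacodec.VideoReader; the result is a GpuMat."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    import cv2  # needs an OpenCV build with CUDA and cudacodec

    reader = cv2.cudacodec.createVideoReader(path)
    # Assumption: the enum is exposed as ColorFormat_<NAME> and set() accepts it.
    reader.set(getattr(cv2.cudacodec, f"ColorFormat_{output_format}"))
    ok, gpu_frame = reader.nextFrame()  # gpu_frame stays in device memory
    return gpu_frame if ok else None
```

If you need the decoded frame on the CPU afterwards, gpu_frame.download() copies it back into a numpy array, at the cost of a device-to-host transfer per frame.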