Reading and Writing Videos: Python on GPU with CUDA - VideoCapture and VideoWriter

I’m trying to crop a video, using Python 3+, by reading it frame-by-frame and writing certain frames to a new video.

I want to use the GPU to speed up this process, because for a 1h video it would take my CPU ~24h to complete.

My understanding is,
Reading a video using CPU:

vid = cv2.VideoCapture(vid_path)
fps = int(vid.get(cv2.CAP_PROP_FPS))
total_num_frames = int(vid.get(cv2.CAP_PROP_FRAME_COUNT))
frame_width = int(vid.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT))

Writing a video using CPU:

fourcc = cv2.VideoWriter_fourcc(*'mp4v')
new_vid = cv2.VideoWriter(new_vid_path, fourcc, fps, (frame_width, frame_height))  

Reading a frame on CPU, uploading & downloading it to/from GPU, then writing it using CPU:

ret, frame = vid.read()

gpu_frame = cv2.cuda_GpuMat()
gpu_frame.upload(frame)
frame = gpu_frame.download()

new_vid.write(frame)

Note: I know uploading and downloading here is useless, I wrote it to express how I think the syntax should be used.

Is there a way to do this all on GPU?
Or is there another more efficient way to accomplish my goal?

The quick answer is yes, using cv2.cudacodec.VideoReader(). You can see a comparison here.

The long answer is that you will need to “install” the Nvidia Video Codec SDK before you build OpenCV, and the latest version needs a slight modification to OpenCV to work. Unfortunately Nvidia doesn’t provide links to older versions. If you are interested, I will send you the OpenCV patch when I find it.

Unfortunately cv2.cudacodec.VideoWriter() doesn’t work, so you would need to download the frame, but you could speed this up using pinned memory and streams, or split the video and use threads.


Thanks for the reply!

I’m new to all this Nvidia and building from scratch stuff, but it’d be great if you could send me the OpenCV patch if and when you find it.
Not sure if you can on these forums, so here’s my LinkedIn:

As for VideoWriter, do you have an example of using pinned memory or threads for this use case? I have quite shallow knowledge on those topics.

In my understanding,

  • what you want to do is just cropping some range (in time, not in 2D geometry) from input video
  • the reason why you are currently using GPU is only for boosting the decoding and encoding time (i.e. you do not process the video frames)

If the above conditions are all correct, you should consider using ffmpeg, not OpenCV.

For example, if you want to crop the video, you can do it without decoding or encoding with this command:

ffmpeg -ss 'start time' -t 'duration' -i 'input video' -c:v copy 'output video'

start time: desired initial frame’s timestamp
duration: desired output video length

Unfortunately, you cannot set the start frame and duration by frame number. You must use time (e.g. “01:23.456” for 1 minute and 23.456 seconds). I know this is pretty messy, but that is how ffmpeg is specified. If you do the cropping without transcoding, it works almost like a memory copy. As a result, it is extremely fast (over 1000fps). So, I recommend trying ffmpeg.
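
If you would rather drive this from Python and think in frame numbers, here is a minimal sketch that converts frame indices to timestamps and builds the stream-copy command above. The helper names (`frame_to_timestamp`, `build_copy_command`) and the file names are made up for illustration. It assumes a constant frame rate and that `ffmpeg` is on your PATH. Note that with `-c:v copy` the cut can only start on a keyframe, so the result may not be frame-exact.

```python
import subprocess


def frame_to_timestamp(frame_num, fps):
    """Convert a frame index to HH:MM:SS.mmm (constant frame rate assumed)."""
    total_ms = round(frame_num * 1000 / fps)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"


def build_copy_command(src, dst, start_frame, num_frames, fps):
    """Build the ffmpeg argument list for a cut without re-encoding."""
    return [
        "ffmpeg",
        "-ss", frame_to_timestamp(start_frame, fps),  # start time
        "-t", frame_to_timestamp(num_frames, fps),    # duration
        "-i", src,
        "-c:v", "copy",
        dst,
    ]


# Hypothetical file names: 30 s starting at 0:05:30 of a 25 fps video.
cmd = build_copy_command("input.mp4", "output.mp4", 8250, 750, 25)
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually run ffmpeg
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.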


Thanks for your input!

I simplified my post. Actually here’s what I want to do:

  1. Process (I assume this means decode) frames and decide, by some metric, whether to crop that frame or not
  2. If I decide to crop it, then write that frame (encode?) to a new video.

Would ffmpeg still work? (Can I use ffmpeg-python instead?)

How about using OpenCV CUDA’s VideoReader (GPU) for reading, then vanilla OpenCV’s VideoWriter (CPU) for writing?
(This idea is based on how @cudawarped said that cv2.cudacodec.VideoWriter() doesn’t work)

Ah, you want to make a video of some selected (possibly non-consecutive) frames. Then ffmpeg’s ‘copy’ method can’t help, because encoded video packets depend strongly on the continuity of the frames they contain. (To understand why frame continuity matters here, you need to know how video compression works and what keyframes and non-keyframes are.)

Anyway, to pick arbitrary (non-consecutive) frames from the input video, you must both decode and encode, as you wrote at first.
So, your approach of using NVIDIA’s HW codec is one of the best solutions. (If you do not have an NVIDIA GPU, Intel’s HW codec can be another option.)
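
To make the "decode everything, encode only what you keep" idea concrete, here is a rough sketch of that loop on the CPU. The function names and the `keep` callback are made up for illustration. `crop_video` wires the loop to the plain `cv2.VideoCapture` / `cv2.VideoWriter` calls from the first post. Swapping in a hardware decoder would only change where `read_frame` comes from.

```python
def copy_selected_frames(read_frame, write_frame, keep):
    """Decode every frame, re-encode only those for which keep(frame) is true.

    read_frame() -> (ok, frame) and write_frame(frame) have the same shape as
    cv2.VideoCapture.read and cv2.VideoWriter.write.
    """
    read = written = 0
    while True:
        ok, frame = read_frame()
        if not ok:  # end of stream or decode failure
            break
        read += 1
        if keep(frame):
            write_frame(frame)
            written += 1
    return read, written


def crop_video(src_path, dst_path, keep):
    """Run the loop above with OpenCV's CPU reader and writer."""
    import cv2  # imported here so the loop itself does not depend on OpenCV

    vid = cv2.VideoCapture(src_path)
    fps = vid.get(cv2.CAP_PROP_FPS)
    size = (int(vid.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    try:
        return copy_selected_frames(vid.read, out.write, keep)
    finally:
        vid.release()
        out.release()
```

Every kept frame is re-encoded by the writer, so the output gets its own keyframes and no longer depends on the frames that were dropped.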


Yes, I’d like to pick (possibly) non-consecutive frames.
I’m running Ubuntu 20.04 on an RTX2080, so I’ll try to set up OpenCV with CUDA capabilities and Nvidia’s Video Codec SDK.
Thanks for all the help!
(For anyone else checking this post, I’m currently reading a 1h video of 25 fps in ~3h on CPU)

Hi, what processing are you doing on the GPU and what resolution is your video? I am asking because it doesn’t sound like the decoding is your bottleneck. If it is and you can do the processing on the CPU and you have an Intel CPU and your file is in a raw format (e.g. .h264,.h265…) then I would also recommend using the Intel Quick Sync HW decoder as suggested by ChaoticActivity, i.e.

cap = cv2.VideoCapture(VID_PATH,cv2.CAP_INTEL_MFX)

On the latest Intel CPUs this should be as quick as the Nvidia HW decoder, plus it may be much quicker at writing than using FFmpeg etc.

If you have to use the GPU I would recommend pre-allocating your GpuMat arrays and passing them to your functions as the dst argument if this is possible, i.e. use

frame = cv2.cuda_GpuMat(n,m,cv2.CV_8UC4)

instead of

frame =

because the latter will force memory to be allocated for frame on every invocation, slowing things down significantly.

To “patch” cudacodec to work with Nvidia Video Codec SDK 11.0 you need to add the AV1 codec, see this.

I don’t have an example of using threads with python to hand but you would need to split your video into separate files equal to the number of threads, so you would need to check if this approach is suitable for the processing you are doing.
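
As a rough illustration of the threading idea, here is a sketch that splits the work by frame range instead of into separate files (that swap is my assumption, not what is described above). Each worker would open its own reader and seek to its range. `process_range` is a hypothetical callback that does the actual decoding and processing for one chunk.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_frames, num_workers):
    """Split [0, total_frames) into num_workers contiguous (start, end) ranges."""
    base, extra = divmod(total_frames, num_workers)
    ranges, start = [], 0
    for i in range(num_workers):
        end = start + base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, end))
        start = end
    return ranges


def run_in_threads(total_frames, num_workers, process_range):
    """Call process_range(start, end) per chunk and return results in chunk order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(process_range, start, end)
                   for start, end in split_ranges(total_frames, num_workers)]
        return [future.result() for future in futures]
```

Whether this helps depends on the processing: if a frame's result depends on earlier frames, the chunks are not independent and this split is not suitable.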

To use pinned memory and streams in python see here. Using streams would give you the ability to decode frame 0 on the GPU, perform your processing and start the download in stream 0, then move to decoding frame 1, processing, download in stream 1, then start writing frame 0. In this scenario the downloading and writing of frame 0 may happen at the same time as the decoding or processing on the GPU. Essentially using streams would allow you to perform processing on the CPU and GPU at the same time and avoid the overhead of downloading from the device to the host, if you optimize your code in the right way. Unfortunately this will very much depend on exactly what you are doing and the above may not be optimal for you, for example you may have the resources available to perform both the decode and encode on the CPU because the time spent processing on the GPU is so large.
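
The overlap described above (writing frame N while frame N+1 is still being decoded or processed) can be imitated on the CPU side with a writer thread and a bounded queue. To be clear, this is not CUDA streams or pinned memory, just a sketch of the same pipelining idea. `produce_frames` and `write_frame` are hypothetical stand-ins for your decode/process step and for `VideoWriter.write`.

```python
import queue
import threading


def pipelined_write(produce_frames, write_frame, max_pending=8):
    """Write frames on a background thread while the caller keeps producing."""
    pending = queue.Queue(maxsize=max_pending)  # bounded, so memory use stays flat
    done = object()                             # sentinel marking end of stream
    errors = []

    def writer():
        while True:
            frame = pending.get()
            if frame is done:
                break
            if errors:
                continue  # keep draining so the producer never blocks forever
            try:
                write_frame(frame)
            except Exception as exc:
                errors.append(exc)

    thread = threading.Thread(target=writer)
    thread.start()
    count = 0
    try:
        for frame in produce_frames():
            pending.put(frame)  # blocks if the writer falls too far behind
            count += 1
    finally:
        pending.put(done)
        thread.join()
    if errors:
        raise errors[0]
    return count
```

A single writer thread and a FIFO queue keep the frames in their original order.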

I’m using PyTorch on GPU, and trying to read each frame, and let’s say, see if there’s a cat in the frame. I want to extract all frames of cats, and create a new video with only those frames.
My videos are in 1080p (1920x1080).

To explain my thought process so you can see what the bottleneck is, (hopefully you’re familiar with PyTorch):
OpenCV CUDA loads frames into a PyTorch DataLoader (for which I’ll set num_workers=4 and pin_memory=True), then the DataLoader sends frames to the model.

I sort of see what you mean in your pinned memory and streams explanation. Hopefully PyTorch’s DataLoader can take care of that.

I’m completely lost on the OpenCV CUDA code and how to patch OpenCV CUDA using the AV1 codec to work with the Video Codec SDK 11.0

Note: I’d like to do everything with GPU to keep things future-proof, since I’m unsure if I’ll always use an Intel CPU.

Overall, do you think my pipeline would work out?
I managed to do some testing, the reading part takes ~3h (without the DataLoader), and the writing ~1h.

So this is just for inference? I’m double checking because if it’s for training it’s maybe better to write the frames out so you can randomize the order more efficiently.

Are you sure the decoding is the bottleneck and not the inference or the implementation of your DataLoader? I understand that you want to future proof but it may be an idea to get a benchmark of how long it takes if you read the frames from a folder vs CPU decoding in the DataLoader vs GPU decoding.
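
A benchmark like that does not need much. Here is a minimal sketch that times how fast frames come out of any iterable, so the same function can be pointed at a folder reader, a CPU decoder or a GPU decoder. Both function names are made up for illustration, and `opencv_frames` is just one possible source using the plain CPU `VideoCapture`.

```python
import time


def measure_throughput(frames, max_frames=500):
    """Pull up to max_frames items from an iterable; return (count, seconds, fps)."""
    count = 0
    start = time.perf_counter()
    for _ in frames:
        count += 1
        if count >= max_frames:
            break
    seconds = time.perf_counter() - start
    fps = count / seconds if seconds > 0 else float("inf")
    return count, seconds, fps


def opencv_frames(vid_path):
    """Yield decoded frames from OpenCV's CPU reader."""
    import cv2  # imported here so measure_throughput works without OpenCV

    vid = cv2.VideoCapture(vid_path)
    try:
        while True:
            ok, frame = vid.read()
            if not ok:
                break
            yield frame
    finally:
        vid.release()
```

Running `measure_throughput(opencv_frames(vid_path))` on a few hundred frames, with and without the model in the loop, should show fairly quickly whether decoding is really the slow part.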

In the past I used fastai which was incredibly fast at loading images from disk, not sure if its DataLoader still has many modifications over the standard pytorch one or not but you could check that out.

On a related note, if you are on Linux and your version of FFmpeg is built with Nvidia HW acceleration, you should be able to set the variable below

OPENCV_FFMPEG_CAPTURE_OPTIONS=video_codec;h264_cuvid

and cv::VideoCapture() should perform the video decoding on your GPU.

I should have mentioned that this is currently going to be inefficient because you can’t pass a GpuMat into pytorch. As a result you would have to download each GpuMat containing a decoded frame of your video from the device to the host, and then pass the resulting numpy array to pytorch, where you would have to upload it from the host to the device again. In your original post I assumed the processing would all happen in OpenCV. As you are passing the decoded frames to pytorch you will probably find it more efficient to decode on the CPU, unless the DataLoader can consume faster than your CPU can decode.

Currently in OpenCV the DNN module only takes Mat input but there is a plan to enable it to take GpuMat as well. If/when this is implemented you could export your pytorch model to ONNX so it can be used in OpenCV, then have your whole pipeline in OpenCV.

Yes, just for inference. I’ve manually gotten my own training data.

Honestly, initially I wasn’t sure what my bottleneck was - I just saw a GPU version of VideoCapture and thought it could be the problem.
I do know that inference isn’t the issue.
Possibly the DataLoader “implementation”, since I’m currently not using a DataLoader - just reading the frames using VideoCapture one-by-one and passing a batch to my model.
I’m going to try to get the DataLoader implementation working ASAP.

Could you translate to Python for me? I don’t read C++ at all.
That seems like a really useful tip - I’m on Ubuntu 20.04, so it’s possible.

Thanks for the explanation. Alright, so that scraps my original plan. I’ll just implement: OpenCV (CPU) --> DataLoader --> Model, and see if there’s a significant speedup.
(Currently it’s OpenCV (CPU) --> Manual Batching --> Model)
Seems like a good first step, then I’ll set OPENCV_FFMPEG_CAPTURE_OPTIONS=video_codec;h264_cuvid when I get the Python code.

You said it’d be a good idea to benchmark times for:

  1. Reading frames from a folder
  2. CPU decoding in the DataLoader
  3. GPU decoding.

But to read frames in the folder (1), wouldn’t I first have to get the frames? AKA decode the frames, then encode them again? Wouldn’t that intuitively take longer than just reading the frames with CPU (my current method), and batching them manually?
How would you create a Dataset & DataLoader to read frames as they come in on CPU? (2) (This way, I can still use threads and pinned memory)

I guess (3) would be using the same thing as (2), but with OPENCV_FFMPEG_CAPTURE_OPTIONS=video_codec;h264_cuvid

What do you need translating?

Yes but that would give you a baseline as the DataLoader in pytorch should efficiently load your frames stored as image files in advance.
Regarding 2 I think you can init VideoCapture() in your DataLoader and have it read frames instead of reading from disk in the __getitem__ method. When I suggested the comparison I thought you already had the DataLoader working.

Is this Python? What does ;h264_cuvid do?

This is the Dataset I just implemented, which ended up freezing my computer - I think because I tried saving all 3000 images in a variable.

import sys
import time

import cv2
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class FramesDataset(Dataset):
    '''Custom dataset for frames from a video read by OpenCV'''
    def __init__(self, vid_path, frame_transforms, start_frame, end_frame):
        '''
            vid_path:            Path to original video
            frame_transforms:    Transformation to make on the frame
            start_frame:         -----
            end_frame:           -----
        '''
        self.frames = []
        self.transform = frame_transforms
        vid = cv2.VideoCapture(vid_path)
        # Progress is measured against the requested range, not the whole video
        total_num_frames = end_frame - start_frame
        start = time.time()
        # Load & save frames (every decoded frame is kept in memory)
        for frame_num in range(start_frame, end_frame):
            # Seek to the frame to capture
            vid.set(cv2.CAP_PROP_POS_FRAMES, frame_num)
            ret, frame = vid.read()

            if ret:
                self.frames.append(frame)

                # Print statistics (counting from 1 avoids dividing by zero)
                current_frame = frame_num - start_frame + 1
                elapsed = time.time() - start
                average_process_time = elapsed / current_frame
                num_frames_left = total_num_frames - current_frame

                eta = average_process_time * num_frames_left
                eta_h = int(eta // 60 // 60)
                eta_m = int((eta - eta_h*60*60) // 60)
                eta_s = int(eta % 60)
                time_h = int(elapsed // 60 // 60)
                time_min = int((elapsed - time_h*60*60) // 60)
                time_sec = int(elapsed % 60)
                sys.stdout.write('\rProcessing Frame: {:.2f}% {}/{}     Time Elapsed: {} h {} min {} s     ETA: {} h {} m {} s'
                                 .format(current_frame*100/total_num_frames, current_frame, total_num_frames,
                                         time_h, time_min, time_sec,
                                         eta_h, eta_m, eta_s))
        vid.release()

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, idx):
        sample = self.frames[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample

vid = cv2.VideoCapture(vid_path)
total_num_frames = int(vid.get(cv2.CAP_PROP_FRAME_COUNT))
fps = int(vid.get(cv2.CAP_PROP_FPS))

mean = torch.tensor([0.5], dtype=torch.float32)
std = torch.tensor([0.5], dtype=torch.float32)
frame_transforms = transforms.Compose([transforms.ToTensor(),  # HWC uint8 array -> CHW float tensor
                                       transforms.Resize((720, 720)),
                                       transforms.Normalize(mean, std)])

start_time = '0:05:30'
start_time_h = int(start_time[0])
start_time_m = int(start_time[2:4])
start_time_s = int(start_time[5:7])
start_frame = (start_time_h*60*60 + start_time_m*60 + start_time_s) * fps

end_time = '1:17:50'
end_time_h = int(end_time[0])
end_time_m = int(end_time[2:4])
end_time_s = int(end_time[5:7])
end_frame = (end_time_h*60*60 + end_time_m*60 + end_time_s) * fps

frames_dataset = FramesDataset(vid_path, frame_transforms, start_frame, end_frame)

batch_size = 100
frame_loader = DataLoader(frames_dataset, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True)

If I were to edit it to store frames as .png, it’d take a lot of space.
~1.7MB per frame, I have ~100,000 frames, so ~170GB, which is space I don’t have.

This sets the OPENCV_FFMPEG_CAPTURE_OPTIONS environment variable, which should force VideoCapture() when opened with cv2.CAP_FFMPEG

cap = cv2.VideoCapture(VID_PATH,cv2.CAP_FFMPEG)

to use the Nvidia HW decoder for decoding h264 and h265 if your version of FFmpeg was built with this functionality. This works on Windows but I have not tested it on Linux.
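
In Python, a minimal sketch of the same thing could look like the block below. The value is a `key;value` pair: `video_codec` is the option name and `h264_cuvid` is FFmpeg's Nvidia decoder for H.264 (for H.265 the FFmpeg decoder is called `hevc_cuvid`). The two function names and the file name are made up for illustration. To be safe, the variable is set before OpenCV is imported, and it only has an effect if your FFmpeg build includes that decoder.

```python
import os


def request_nvidia_decoder(codec="h264_cuvid"):
    """Ask OpenCV's FFmpeg backend to use the given decoder for later captures."""
    os.environ["OPENCV_FFMPEG_CAPTURE_OPTIONS"] = f"video_codec;{codec}"


def open_capture(vid_path):
    """Open a capture through the FFmpeg backend so the option above applies."""
    import cv2  # imported only after the environment variable has been set

    return cv2.VideoCapture(vid_path, cv2.CAP_FFMPEG)


request_nvidia_decoder()
# cap = open_capture("input.mp4")  # hypothetical file name
```

Setting the variable in the shell before starting Python (for example with `export`) would be equivalent.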

I think your __init__ function needs to init VideoCapture() and get a length. Then __getitem__ should read() the next frame. I am not a Python or PyTorch expert, so I may be wrong. My suggestion regarding the frames is what I would do on a smaller sample of video to get a baseline. Anyway, good luck with your project, as I think we are diverging from OpenCV by discussing PyTorch Datasets/DataLoaders.
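
A rough sketch of that idea, with the caveats above: the dataset stores only the path and the frame range, opens the capture on first use, and decodes one frame per `__getitem__` call, so nothing is cached in memory. The class name, the `open_capture` parameter and the seek-skipping logic are my own assumptions. You would normally subclass `torch.utils.data.Dataset`; that is left out here so the sketch runs without PyTorch (a DataLoader only needs `__len__` and `__getitem__` for a map-style dataset).

```python
CAP_PROP_POS_FRAMES = 1  # same value as cv2.CAP_PROP_POS_FRAMES


def open_with_opencv(vid_path):
    import cv2  # imported lazily so the class can be tried without OpenCV

    return cv2.VideoCapture(vid_path)


class LazyFramesDataset:
    """Map-style dataset that decodes one frame per __getitem__ call."""

    def __init__(self, vid_path, start_frame, end_frame, transform=None,
                 open_capture=open_with_opencv):
        self.vid_path = vid_path
        self.start_frame = start_frame
        self.end_frame = end_frame
        self.transform = transform
        self.open_capture = open_capture
        self.cap = None         # opened on first use, not in __init__
        self.next_frame = None  # frame index the next read() will return

    def __len__(self):
        return self.end_frame - self.start_frame

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        if self.cap is None:
            self.cap = self.open_capture(self.vid_path)
        frame_num = self.start_frame + idx
        if frame_num != self.next_frame:  # seek only for non-sequential access
            self.cap.set(CAP_PROP_POS_FRAMES, frame_num)
        ok, frame = self.cap.read()
        if not ok:
            self.next_frame = None
            raise RuntimeError(f"could not read frame {frame_num}")
        self.next_frame = frame_num + 1
        return self.transform(frame) if self.transform else frame
```

Because the capture is only opened inside `__getitem__`, each DataLoader worker should end up with its own capture object. With shuffle=True every access becomes a seek, which is slow, so sequential order is the better fit for inference.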

Thanks so much for your help. All this information is really helpful. I’ll try to use the Nvidia HW SDK to speed it up. If it’s not possible, 4h isn’t that long a wait for a video.
Thanks again, and take care.