CUDA operations give black frames

Hello,
I’m trying to rotate frames from my real-time video input. At first I used the CPU version, cv::rotate(), but because it had quite a big impact on FPS I wanted to try the GPU (CUDA) version instead.
But I have a problem: whatever CUDA function I apply after the upload, I get a completely black frame back. If I skip the function and only move the frame to the GPU and back, everything works.

	frame = cv::imread("sample.jpg");

	cv::Mat im_in;
	frame.copyTo(im_in);

	cv::cuda::GpuMat gpu_im;
	gpu_im.upload(im_in);

	cv::Size size = frame.size();
	cv::cuda::rotate(gpu_im, gpu_im, gpu_im.size(), 0); //if I comment that out it gives normal image

	gpu_im.download(frame);

It doesn’t look like the source and destination GpuMat in cv::cuda::rotate() can be the same.

Additionally, when you use a non-zero angle you will need to shift the resulting image, as the rotation is around the origin.

e.g.

GpuMat srcDevice(500, 500, CV_8UC3, { 0,0,0 }), dstDevice;
Mat dstHost;
const Size outSize(250, 250);
const int roiW = 200, roiH = 100;
Rect roi(0, 0, roiW, roiH);
srcDevice(roi).setTo({ 255,255,255 });
srcDevice.download(dstHost);
imshow("Original Frame", dstHost);
waitKey(0);
rotate(srcDevice, dstDevice, outSize, 90, 0, roiW-1);
dstDevice.download(dstHost);
imshow("Rotated Shifted Frame", dstHost);
waitKey(0);
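
To make the shift concrete, here is a minimal sketch in plain C++ (no OpenCV) that estimates the xShift/yShift needed to bring a rotated image back into view. It assumes the mapping implied by the example above, i.e. rotation about the origin followed by a shift, with x' = x*cos(a) + y*sin(a) + xShift and y' = -x*sin(a) + y*cos(a) + yShift. That convention is inferred from the 90 degree example, not taken from the documentation, and `Shift` and `shiftToKeepInView` are hypothetical names rather than OpenCV API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper type, not part of OpenCV.
struct Shift {
    double x;
    double y;
};

// Returns the shift that moves the rotated bounding box of a
// width x height image so that its top-left corner lands at (0, 0).
// ASSUMED mapping (inferred from the example, not from the docs):
//   x' =  x*cos(a) + y*sin(a) + xShift
//   y' = -x*sin(a) + y*cos(a) + yShift
Shift shiftToKeepInView(int width, int height, double angleDeg) {
    assert(width > 0 && height > 0);
    const double a = angleDeg * std::acos(-1.0) / 180.0;
    const double c = std::cos(a);
    const double s = std::sin(a);

    // The four corners of the source image.
    const double xs[4] = {0.0, width - 1.0, 0.0, width - 1.0};
    const double ys[4] = {0.0, 0.0, height - 1.0, height - 1.0};

    // Smallest rotated coordinates; corner (0, 0) always maps to (0, 0).
    double minX = 0.0;
    double minY = 0.0;
    for (int i = 0; i < 4; ++i) {
        minX = std::min(minX, xs[i] * c + ys[i] * s);
        minY = std::min(minY, -xs[i] * s + ys[i] * c);
    }
    // Shifting by the negated minimum brings everything to x', y' >= 0.
    return {0.0 - minX, 0.0 - minY};
}
```

For a 200x100 region rotated by 90 degrees, `shiftToKeepInView(200, 100, 90.0)` gives a shift of approximately (0, 199), which agrees with the `roiW - 1` y-shift used above. The output size still has to be large enough to hold the rotated image.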

I tried with different GpuMats, as well as different rotations and offsets. I set the angle to zero, hoping that would give me the closest result to running without that function.

I tried your code, and it gives some better results - now there’s a white box that rotates correctly. I changed dstHost to my camera frames but the effect is the same - white box.

Is it working for you now if you use your GpuMat instead of srcDevice? If not, can you paste a snippet of your code?

My GpuMat? I only have a normal Mat from VideoCapture, which I used in place of dstHost in your code.
If I understand correctly, srcDevice.download(dstHost) should load my CPU frame into the GPU one?

       // That gives me rotated white boxes

	GpuMat srcDevice(500, 500, CV_8UC3, { 0, 0, 0 }), dstDevice;
	Mat dstHost;
	const Size outSize(250, 250);
	const int roiW = 200, roiH = 100;
	Rect roi(0, 0, roiW, roiH);
	srcDevice(roi).setTo({ 255, 255, 255 });
	srcDevice.download(dstHost);
	imshow("Original Frame", dstHost);
	waitKey(1);
	rotate(srcDevice, dstDevice, outSize, 90, 0, roiW - 1);
	dstDevice.download(dstHost);
	imshow("Rotated Shifted Frame", dstHost);
	waitKey(1);
    // That gives me rotated white boxes
    
	VideoCapture capture;
	Mat frame;
	if(capture.isOpened())
	{
		capture >> frame;
	}
	else
	{
		return;
	}

	GpuMat srcDevice(500, 500, CV_8UC3, { 0, 0, 0 }), dstDevice;

	const Size outSize(250, 250);
	const int roiW = 200, roiH = 100;
	Rect roi(0, 0, roiW, roiH);
	srcDevice(roi).setTo({ 255, 255, 255 });
	srcDevice.download(frame);
	imshow("Original Frame", frame);
	waitKey(1);
	rotate(srcDevice, dstDevice, outSize, 90, 0, roiW - 1);
	dstDevice.download(frame);
	imshow("Rotated Shifted Frame", frame);
	waitKey(1);

    // That gives me nothing - black frames
    
	VideoCapture capture;
	Mat frame;
	if(capture.isOpened())
	{
		capture >> frame;
	}
	else
	{
		return;
	}

	GpuMat srcDevice(500, 500, CV_8UC3, { 0, 0, 0 }), dstDevice;

	 const Size outSize(250, 250);
	 const int roiW = 200, roiH = 100;
	//	Rect roi(0, 0, roiW, roiH);
	//	srcDevice(roi).setTo({ 255, 255, 255 });
	srcDevice.download(frame);
	imshow("Original Frame", frame);
	waitKey(1);
	rotate(srcDevice, dstDevice, outSize, 90, 0, roiW - 1);
	dstDevice.download(frame);
	imshow("Rotated Shifted Frame", frame);
	waitKey(1);

Let me clarify a couple of things to make sure we are on the same page. In order to perform any actions on a GpuMat in OpenCV you must either

  1. create a new GpuMat there, e.g.
    GpuMat srcDevice(500, 500, CV_8UC3, { 0,0,0 })
  2. or upload a host Mat by performing either
    GpuMat srcDevice(srcHost);
    or
    GpuMat srcDevice; srcDevice.upload(srcHost);

Then, when you want to display or save the frame, or perform some action on it on the host, you need to download it back to the host:
Mat srcHost; srcDevice.download(srcHost);

The reason I mention this in such detail, even though your questions imply that you already understand the concept, is that I can see your next question being along the lines of “Why is cv::cuda::rotate slower than cv::rotate?”. This is possible, but even if true it is unlikely to be the reason for your poor performance, as I will try to explain below.

Uploading to and downloading from the device is very costly and is limited by your PCI Express interface, not your GPU. That is, there is a fixed cost every time you upload/download a frame to/from the device, which is independent of how powerful your GPU is. Therefore you want to perform as much work on the device as you can (not just a single rotate) for each upload and download. If you don’t do this, you will most likely find your GPU-based code is slower than your CPU version.

I tested a naive implementation on an i5-8300H paired with a GTX 1060, processing a 1080p video. Each frame was decoded on the CPU, then uploaded to the device, rotated and downloaded back to the host. As expected, this was slower than keeping the frame on the host (CPU: ~7.3ms/frame, GPU: ~11.3ms/frame); example code below.

const int nFrames = 200;
float elSecsCap, elSecsCapRotate, elSecsCapCudaRotate;
{
    VideoCapture cap(srcPath);
    Mat src, dst;
    const int64 startTicks = cv::getTickCount();
    for (int i = 0; i < nFrames; i++) cap.read(src);
    elSecsCap = 1000 * (cv::getTickCount() - startTicks) / (cv::getTickFrequency() * nFrames);
}

{
    VideoCapture cap(srcPath);
    Mat src, dst;
    const int64 startTicks = cv::getTickCount();
    for (int i = 0; i < nFrames; i++) {
        cap.read(src);
        cv::rotate(src, dst, RotateFlags::ROTATE_90_CLOCKWISE);
    }
    elSecsCapRotate = 1000 * (cv::getTickCount() - startTicks) / (cv::getTickFrequency() * nFrames);
}

{
    VideoCapture cap(srcPath);
    Mat src, dst;
    GpuMat srcDevice, dstDevice;
    const int64 startTicks = cv::getTickCount();
    for (int i = 0; i < nFrames; i++) {
        cap.read(src);
        srcDevice.upload(src);
        rotate(srcDevice, dstDevice, srcDevice.size(), -90, srcDevice.cols, 0);
        dstDevice.download(dst);
    }
    elSecsCapCudaRotate = 1000*(cv::getTickCount() - startTicks) / (cv::getTickFrequency()*nFrames);
}

cout << "VideoCapture : " << elSecsCap << "ms per frame." << endl;
cout << "VideoCapture + cv::Rotate : " << elSecsCapRotate << "ms per frame." << endl;
cout << "VideoCapture + upload + cv::cuda::rotate + download: " << elSecsCapCudaRotate << "ms per frame." << endl;

VideoCapture : 5.58578ms per frame.
VideoCapture + cv::Rotate : 7.31747ms per frame.
VideoCapture + upload + cv::cuda::rotate + download: 11.2842ms per frame.
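
Subtracting the capture-only time from the other two figures separates the cost of the rotate step from the cost of decoding. A small sketch of that arithmetic, using only the numbers printed above (`addedCostMs` and `framesPerSecond` are hypothetical helper names, in plain C++ without OpenCV):

```cpp
#include <cassert>

// Time added on top of plain decoding, in milliseconds per frame.
double addedCostMs(double totalMsPerFrame, double captureOnlyMsPerFrame) {
    assert(totalMsPerFrame >= captureOnlyMsPerFrame);
    return totalMsPerFrame - captureOnlyMsPerFrame;
}

// Converts a per-frame time in milliseconds to a frame rate.
double framesPerSecond(double msPerFrame) {
    assert(msPerFrame > 0.0);
    return 1000.0 / msPerFrame;
}
```

With the figures above, `addedCostMs(7.31747, 5.58578)` is about 1.73ms for the CPU rotate, while `addedCostMs(11.2842, 5.58578)` is about 5.70ms for upload + cv::cuda::rotate + download, roughly 3.3 times as much. How that 5.70ms splits between the transfers and the rotate itself was not measured here.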

This means that, as expected, on my setup at least, there is no advantage to moving just the rotate operation to the device. If it were me, I would then try one or more of the following to see if I could leverage the additional performance of the GPU:

  1. Performing more than one action on the device for each upload and download, i.e. not just rotate.
  2. Use CUDA streams to overlap rotate with the upload and/or download operation.
  3. Use cudacodec::VideoReader to decode directly to the device and avoid the upload from host to device.

Ok, so it seems that my problem with black frames after rotation came from me not understanding what the offsets in the cuda::rotate() function were, and because of that my image was rotated outside of the visible area… Now I rotate using cuda::warpAffine() and it’s working correctly.
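
The warpAffine code itself was not posted, so the following is only a sketch of one common way to avoid the off-screen problem: build a 2x3 affine matrix that rotates about the image centre instead of the origin. It uses the formula documented for cv::getRotationMatrix2D with scale 1, written in plain C++ so it can be checked without OpenCV. `Affine2x3`, `rotationAboutCenter` and `transformPoint` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Row-major 2x3 affine matrix: [m0 m1 m2; m3 m4 m5].
using Affine2x3 = std::array<double, 6>;

// Rotation by angleDeg about (cx, cy), same layout as the matrix
// documented for cv::getRotationMatrix2D with scale = 1.
Affine2x3 rotationAboutCenter(double cx, double cy, double angleDeg) {
    assert(std::isfinite(angleDeg));
    const double a = angleDeg * std::acos(-1.0) / 180.0;
    const double alpha = std::cos(a);
    const double beta = std::sin(a);
    return {alpha, beta, (1.0 - alpha) * cx - beta * cy,
            -beta, alpha, beta * cx + (1.0 - alpha) * cy};
}

// Applies the matrix to a point, returning {x', y'}.
std::array<double, 2> transformPoint(const Affine2x3& m, double x, double y) {
    return {m[0] * x + m[1] * y + m[2], m[3] * x + m[4] * y + m[5]};
}
```

Because the translation terms are chosen so that the centre maps to itself, the rotated content stays around the middle of the output rather than swinging out of view. In OpenCV the same six values would go into a 2x3 matrix passed as the transform argument of warpAffine.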

You are right, transferring GPU<->CPU is costly, but in my case the CPU performs much worse.
Previously (on the CPU) my video stream slowed down from 28FPS to 8FPS; now on the GPU it slows down to 17FPS.
It’s much better, but I still hoped for better performance :confused:

I’m not sure why it’s so bad. I’m working on a Jetson Xavier NX, and I don’t think it’s such a weak device that it should perform this poorly…?
Do you perhaps have any ideas why that is and how I can improve the performance?

17FPS sounds pretty good considering the download/upload overhead and the performance of the GPU.

As you have an Nvidia GPU, I would try using cv::cudacodec::VideoReader to reduce the load on the ARM processor and decode the frame directly to a GpuMat, or opening cv::VideoCapture with hardware acceleration if it’s supported on your system, e.g.

VideoCapture cap(srcPath, cv::CAP_FFMPEG, { CAP_PROP_HW_ACCELERATION, cv::VIDEO_ACCELERATION_ANY });