How to correctly use blobFromImages with the CUDA backend/target?

I have a project where a varying number of image samples has to be classified within a given time budget. There are two Tesla A100 GPUs, and a single application will use one of them. Because the processing speed seemed quite (too) low, I started some dedicated tests, which raised a few questions.

Basically, what I observed is that, given a fixed set of images, performing blobFromImages once and then calling .setInput and .forward N times (to measure the average time over N runs) shows a nice speedup when using batch processing.

However, when I perform blobFromImages in every run, it becomes very slow.

I guess this has to do with GPU memory management. Some hypotheses (probably not all of them are correct):

  • when the input blob doesn't change, neither .setInput nor .forward transfers any data to the GPU
  • when the batch-size changes, the GPU will deallocate/allocate memory space
  • when the batch size changes, something expensive is happening in cuDNN
  • the GPU might see that the blob hasn’t changed and perform dramatic optimizations
  • EfficientNet B0 isn’t efficient on Nvidia Tesla A100

I observed more strange behaviour. For example,

  • when using an EfficientNet B0, switching to FP16 gives nearly no speedup.
  • using a VGG-16 is similar in speed to, or even faster than, the EfficientNet B0, even though on CPU the EfficientNet is 4x faster, and in theory it should even be 10-40x faster (according to the FLOPs reported in the literature).
  • using a VGG-16 with the FP16 target gives a 5x speedup compared to VGG-16 FP32, although the maximum speedup should be close to 2x (?!?)

Here are the measured results (100 repetitions):

EfficientNet FP32:
batch = 1 - 8.04 ms per sample
batch = 8 - 3.762 ms per sample
batch = 64 - 3.866 ms per sample

EfficientNet FP16:
batch = 1 - 7.445 ms per sample
batch = 8 - 33.522 ms per sample
batch = 64 - 3.799 ms per sample

VGG-16 FP32:
batch = 1 - 7.901 ms per sample
batch = 8 - 2.442 ms per sample
batch = 64 - 2.047 ms per sample

VGG-16 FP16:
batch = 1 - 1.6 ms per sample
batch = 8 - 0.47 ms per sample
batch = 64 - 0.32 ms per sample

And this is what happens when I compute blobFromImages in every repetition. I am quite sure that the call itself isn’t the expensive part (it is very fast with the --noDnn parameter), but I will verify that in one of the next tests.

Generating a new blob in each iteration (with a random batch size in range [min, batch]):

EfficientNet FP32:
batch = 64 - 12.186 ms per sample

I still have to run another test with a fixed batch size but with blob generation in every iteration.
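To confirm that the blob creation itself is cheap, it can also be timed in isolation. Here is a minimal, self-contained sketch (dummy 224x224 BGR images; scale and mean values taken from the full code below, everything else is just a placeholder):

#include <chrono>
#include <iostream>
#include <vector>
#include <opencv2/dnn.hpp>

int main()
{
	// 64 dummy images with the same geometry as in the tests above
	std::vector<cv::Mat> images(64, cv::Mat::zeros(cv::Size(224, 224), CV_8UC3));
	cv::Mat blob;

	const auto t0 = std::chrono::steady_clock::now();
	cv::dnn::blobFromImages(images, blob, 1.0 / 255.0, cv::Size(224, 224), 125, true, false);
	const auto t1 = std::chrono::steady_clock::now();

	std::cout << "blobFromImages (64 images): "
	          << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms" << std::endl;
	return 0;
}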

Here is my code. HighPerformanceClock::GetCurrentTimestampMicroseconds_static() is just a wrapper around:

const std::chrono::system_clock::time_point& timePoint = GetCurrentTimestamp_static();
std::chrono::system_clock::duration tp = timePoint.time_since_epoch();
return std::chrono::duration_cast<std::chrono::microseconds>(tp).count();

This is the code:

int main(int argc, char* argv[])
{

	for (int i = 0; i < argc; ++i)
	{
		std::cout << argv[i] << " ";
	}
	std::cout << std::endl;

	//LOG_DEBUG(cv::getBuildInformation());

	// --------------------------------------------------------------
	const cv::String configKeys =
		"{help h usage ? |      | print this message   }"
		"{model classifier m |      | model path }"
		"{input i in data dataPath |      | (optional) input image path. If not given, a black dummy image is used. (default = 1) }"
		"{channels c |      | model input channels. }"
		"{width w |      | model input width. }"
		"{height h |      | model input height. }"
		"{fp16 |      | (optional) activate fp16 inference. }"
		"{batch b |      | (optional) batch size (default = 1) }"
		"{repeat r |      | (optional) number of repeated inferences (default = 1) }"
		"{generate g |      | (optional) activate blob generation on each repetition. }"
		"{min |      | (optional) minimum batch size }"
		"{threads nthreads cvThreads|      | (optional) number of threads (default = 1: sequential) }"
		"{noDNN nodnn noDnn |      | (optional) no forward path. }"
		;
	// --------------------------------------------------------------


	cv::CommandLineParser parser(argc, argv, configKeys);
	parser.about("imageClassificationTest");

	if (parser.has("help"))
	{
		parser.printMessage();
		return 0;
	}

	std::cout << "Warning: Only NCHW blobs are currently supported." << std::endl;

	std::string modelPath = "model.onnx";
	std::string imagePath = "";
	int channels = 1;
	int width = 224;
	int height = 224;
	bool fp16 = false;
	int batch = 1;
	int repeat = 1;
	int threads = 1;

	bool generate = false;
	int min = 0;
	int max = 0;

	bool noDnn = false;

	if (parser.has("model")) modelPath = parser.get<std::string>("model");
	if (parser.has("input")) imagePath = parser.get<std::string>("input");
	if (parser.has("channels")) channels = parser.get<int>("channels");
	if (parser.has("width")) width = parser.get<int>("width");
	if (parser.has("height")) height = parser.get<int>("height");
	if (parser.has("fp16")) fp16 = true;
	if (parser.has("batch")) batch = parser.get<int>("batch");
	if (parser.has("repeat")) repeat = parser.get<int>("repeat");
	if (parser.has("threads")) threads = parser.get<int>("threads");

	if (parser.has("generate")) generate = true;
	if (parser.has("min")) min = parser.get<int>("min");

	if (parser.has("noDNN")) noDnn = true;


	cv::setNumThreads(threads);

	if(!(channels == 1 || channels == 3))
	{
		std::cout << "ERROR: channels must be either 1 or 3" << std::endl;
		return 1;
	}

	if(batch < 1)
	{
		std::cout << "ERROR: batch must be >= 1" << std::endl;
		return 1;
	}

	cv::Mat img;
	if(imagePath != "")
	{
		img = cv::imread(imagePath);
	}
	else
	{
		std::cout << "INFO: Using black dummy image." << std::endl;
		img = cv::Mat::zeros(cv::Size(width, height), CV_8UC3);
	}

	if(channels == 1)
	{
		cv::cvtColor(img, img, cv::COLOR_BGR2GRAY);
	}

	const int nDevices = cv::cuda::getCudaEnabledDeviceCount();
	if (nDevices > 0)
	{
		cv::cuda::setDevice(0);
	}


	cv::dnn::Net mNet = cv::dnn::readNetFromONNX(modelPath);

	//std::vector<int> inputShape = { 1,(int)mInputHeightDNN,(int)mInputWidthDNN,  (int)mInputChannelsDNN };
	//int64 flops = mNet.getFLOPS(inputShape);
	//LOG_DEBUG("DNN FLOPS: " << flops << " for input shape: " << inputShape[0] << "x" << inputShape[1] << "x" << inputShape[2] << "x" << inputShape[3]);

	//mNet.dumpToFile("C:/data/DNN_DUMP.dot");

	mNet.setPreferableBackend(cv::dnn::Backend::DNN_BACKEND_CUDA);
	if(fp16) mNet.setPreferableTarget(cv::dnn::Target::DNN_TARGET_CUDA_FP16);
	else mNet.setPreferableTarget(cv::dnn::Target::DNN_TARGET_CUDA);
	

	auto mOutputLayerDNN = GetOutputsNamesDNN(mNet);

	cv::Mat blob;

	std::vector<cv::Mat> inputImages;
	inputImages.reserve(batch);
	for(int i=0; i<batch; ++i)
	{
		inputImages.push_back(img.clone());
	}

	double scale = 1.0/255.0;
	double mean = 125;
	if(batch == 1)
	{
		cv::dnn::blobFromImage(img, blob, scale, cv::Size(width, height), mean, true, false);
	}
	else
	{
		cv::dnn::blobFromImages(inputImages, blob, scale, cv::Size(width, height), mean, true, false);
	}



	try
	{
		std::cout << "Dummy call for initialization" << std::endl;
		{
			mNet.setInput(blob);
			std::vector<cv::Mat> networkOutputs;
			mNet.forward(networkOutputs, mOutputLayerDNN);
		}
		std::cout << "Finished dummy call for initialization. " << std::endl;

		const long long ts_repeat_start = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
		//LOG_TRACE("DNN setInput");
		long long samples = 0;
		for(int i=0; i<repeat; ++i)
		{
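			// regenerate the blob in this iteration; for batch > 1 a random batch size in [min, batch-1] is drawn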
			if (generate || (min > 0))
			{
				if (batch == 1)
				{
					cv::dnn::blobFromImage(img, blob, scale, cv::Size(width, height), mean, true, false);
					samples++;
				}
				else
				{
					auto inputBatch = inputImages;
					int randomBatchSize = rand() % (batch - min) + min;
					if (randomBatchSize > batch) std::cout << "ERROR: Current batch size is: " << randomBatchSize << std::endl;
					//else std::cout << "Current batch: " << randomBatchSize << std::endl;
					inputBatch.resize(randomBatchSize);
					cv::dnn::blobFromImages(inputBatch, blob, scale, cv::Size(width, height), mean, true, false);

					samples += randomBatchSize;
				}
			}
			else
			{
				samples += batch;
			}

			std::vector<cv::Mat> networkOutputs;
		
			const long long ts_dnnInt_start = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
			mNet.setInput(blob);
			const long long ts_dnnInt_end = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();

			const long long ts_dnnForw_start = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();

			if(!noDnn) mNet.forward(networkOutputs, mOutputLayerDNN);
			const long long ts_dnnForw_end = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
		}
		const long long ts_repeat_end = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
	
		long long totaltime = (ts_repeat_end - ts_repeat_start);
		long long totaltime_per_repeat = totaltime / repeat;
	
		long long totaltime_per_sample = totaltime / samples;

		std::cout << "Total time for " << repeat << " repetitions of batch size " << batch << " : " << totaltime/1000.0 << " ms" << std::endl;
		std::cout << "That is " << totaltime_per_sample / 1000.0 << " ms per sample" << std::endl;
	}
	catch (std::exception& e)
	{
		std::cout << "Exception in opencv-dnn forward: " << e.what();
	}


	return 0;
}
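For reference, a typical invocation of this test (binary and model file names are just examples) looks like:

imageClassificationTest --model=efficientnet_b0.onnx --channels=3 --width=224 --height=224 --batch=64 --repeat=100 --fp16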

What is the right way to use batch sizes > 1 with the CUDA backend/target? Should I use a completely fixed batch size, or what else do I have to take care of?


OK, I have just noticed that there is a bug in the path where --generate is set but --min=0.
I will perform an additional test with the bug fixed. So currently, the 12 ms result is for a random batch size.


Update:

I retested two things:

  1. Using a very small (dummy) DNN, inference times are 0.035 ms for batch 64; 0.11 ms for batch 64 with the blob generated in each iteration; and 0.278 ms for batch 64 with a random batch size in each iteration.

  2. For the EfficientNet and the VGG-16, when blobs are generated in each iteration but the batch size stays the same, inference is as fast as reusing the same blob again and again.

To me it now looks like .forward becomes very slow whenever it is called with a different batch size than in the previous run.
Maybe this is because the model is reconfigured on the GPU?
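To isolate this effect, something like the following self-contained sketch could be used (model path, input geometry, scale and mean are placeholders, not my actual setup):

#include <chrono>
#include <iostream>
#include <vector>
#include <opencv2/dnn.hpp>

// measure only setInput + forward for a given blob
static double forwardMs(cv::dnn::Net& net, const cv::Mat& blob)
{
	net.setInput(blob);
	const auto t0 = std::chrono::steady_clock::now();
	cv::Mat out = net.forward();
	const auto t1 = std::chrono::steady_clock::now();
	return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
	cv::dnn::Net net = cv::dnn::readNetFromONNX("model.onnx");
	net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
	net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

	cv::Mat img = cv::Mat::zeros(cv::Size(224, 224), CV_8UC3);
	cv::Mat blob1, blob8;
	cv::dnn::blobFromImages(std::vector<cv::Mat>(1, img), blob1, 1.0 / 255.0, cv::Size(224, 224), 125, true, false);
	cv::dnn::blobFromImages(std::vector<cv::Mat>(8, img), blob8, 1.0 / 255.0, cv::Size(224, 224), 125, true, false);

	forwardMs(net, blob8); // warm-up / initialization
	std::cout << "batch 8 again:     " << forwardMs(net, blob8) << " ms" << std::endl;
	std::cout << "switch to batch 1: " << forwardMs(net, blob1) << " ms" << std::endl; // spike expected if the net is reconfigured
	std::cout << "batch 1 again:     " << forwardMs(net, blob1) << " ms" << std::endl;
	return 0;
}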

What is then the best way to use cv::dnn::Net::forward with varying batch sizes? I see the following candidates:

  1. keeping multiple DNN configurations (for batch = 1, batch = 4, batch = 8, …) in parallel and using the one that fits the currently needed batch size. Disadvantage: lots of memory needed on the GPU.

  2. using a single fixed batch size that is the best trade-off between processing speed per sample and low overhead for actual batches that are smaller than that fixed batch size (for example fix = 8: when the actual batch size is < 8, fill up with dummy images; when the actual batch size is > 8, split it and call forward multiple times; see the sketch after this list). Disadvantage: wasted GPU time when batch < fix.
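For option 2, a rough sketch of such a wrapper could look like this (hypothetical helper, untested; it assumes the net already has the CUDA backend/target set, fixedBatch > 0, and a classification output of one row per batch element):

#include <vector>
#include <opencv2/dnn.hpp>

// Option 2 sketch: always call forward() with a fixed batch size.
// Smaller batches are padded with dummy images, larger ones are split into chunks.
std::vector<cv::Mat> classifyWithFixedBatch(cv::dnn::Net& net, std::vector<cv::Mat> images,
                                            int fixedBatch, double scale, double mean,
                                            int width, int height)
{
	std::vector<cv::Mat> outputs;
	const size_t realCount = images.size();

	// pad up to a multiple of fixedBatch with black dummy images
	while (images.size() % fixedBatch != 0)
		images.push_back(cv::Mat::zeros(cv::Size(width, height), images.front().type()));

	for (size_t offset = 0; offset < images.size(); offset += fixedBatch)
	{
		std::vector<cv::Mat> chunk(images.begin() + offset, images.begin() + offset + fixedBatch);
		cv::Mat blob;
		cv::dnn::blobFromImages(chunk, blob, scale, cv::Size(width, height), mean, true, false);
		net.setInput(blob);
		cv::Mat out = net.forward(); // fixedBatch x numClasses
		for (int r = 0; r < out.rows && outputs.size() < realCount; ++r)
			outputs.push_back(out.row(r).clone()); // drop results for the padded dummy images
	}
	return outputs;
}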
