I have a project where a varying number of image samples has to be classified within a given time budget. There are two Tesla A100 GPUs, and a single application will use one of them. Because the processing speed seemed quite (too) low, I ran some dedicated tests, which raised a few questions.
Basically, what I observed is that, given a fixed set of images, performing blobFromImages once and then calling .setInput and .forward N times (to measure the average time over N runs) shows a nice speedup when using batch processing.
However, when I perform blobFromImages in every run, it becomes very slow.
I guess this has to do with GPU memory management. Some hypotheses (probably not all of them are correct):
- when the input blob doesn't change, neither .setInput nor .forward transfers any data to the GPU
- when the batch size changes, the GPU will deallocate/allocate memory
- when the batch size changes, something expensive happens in cuDNN (a small test sketch for this is right after this list)
- the GPU might detect that the blob hasn't changed and perform drastic optimizations
- EfficientNet B0 isn't efficient on an Nvidia Tesla A100
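To separate the cost of a batch-size change from the cost of blobFromImages, I would try something like the following (untested sketch; it reuses the variables mNet, mOutputLayerDNN, img, scale, mean, width, height and repeat from the full code further down). Two blobs of different batch sizes are built once, and the forward loop just alternates between them, so no blob is created inside the loop:

cv::Mat blobA, blobB;
cv::dnn::blobFromImages(std::vector<cv::Mat>(8, img), blobA, scale, cv::Size(width, height), mean, true, false);
cv::dnn::blobFromImages(std::vector<cv::Mat>(16, img), blobB, scale, cv::Size(width, height), mean, true, false);
for (int i = 0; i < repeat; ++i)
{
// the batch size flips 8 <-> 16 on every iteration, but blob creation stays outside the measurement
mNet.setInput((i % 2 == 0) ? blobA : blobB);
std::vector<cv::Mat> networkOutputs;
mNet.forward(networkOutputs, mOutputLayerDNN);
}

If this loop is about as slow as the random-batch-size run, the cost comes from the shape change itself, not from blob creation.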
There is more strange behavior. For example:
- with EfficientNet B0, switching to FP16 gives nearly no speedup.
- VGG-16 runs at a similar speed to, or even faster than, EfficientNet B0, even though on the CPU the EfficientNet is 4x faster and in theory it should be 10-40x faster (judging by the FLOP counts reported in the literature).
- VGG-16 with the FP16 target gives a 5x speedup over VGG-16 FP32, although the maximum speedup should be close to 2x (?!?)
Here are the measured results (100 repetitions):
Model / target       batch = 1    batch = 8    batch = 64
EfficientNet FP32    8.04         3.762        3.866
EfficientNet FP16    7.445        33.522       3.799
VGG-16 FP32          7.901        2.442        2.047
VGG-16 FP16          1.6          0.47         0.32
(all values in ms per sample)
And this is what happens when I compute blobFromImages in every repetition (result below). I am quite sure that the call itself isn't the expensive part (the loop is very fast with the --noDnn parameter), but I will verify that in one of the next tests; a rough sketch is below.
Generating a new blob in each iteration, with a random batch size in the range [min, batch], EfficientNet FP32:
batch = 64 - 12.186 ms per sample
I still have to run another test with a fixed batch size but blob generation in every iteration.
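To rule out the blobFromImages call itself, I would time it in isolation, roughly like this (untested sketch; inputImages, scale, mean, width, height and repeat are the variables from the full code below). If this stays in the microsecond range per call, blob creation can be ruled out:

const long long ts_blob_start = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
for (int i = 0; i < repeat; ++i)
{
cv::Mat testBlob;
// only the blob creation, no setInput/forward
cv::dnn::blobFromImages(inputImages, testBlob, scale, cv::Size(width, height), mean, true, false);
}
const long long ts_blob_end = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
std::cout << "blobFromImages only: " << (ts_blob_end - ts_blob_start) / (1000.0 * repeat) << " ms per call" << std::endl;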
Here is my code. HighPerformanceClock::GetCurrentTimestampMicroseconds_static() is just a wrapper around:
const std::chrono::system_clock::time_point & timePoint = GetCurrentTimestamp_static();
std::chrono::system_clock::duration tp = timePoint.time_since_epoch();
return std::chrono::duration_cast<std::chrono::microseconds>(tp).count();
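For reference, a minimal stand-in for that class (assuming GetCurrentTimestamp_static() simply returns std::chrono::system_clock::now(); the real class contains more) would be:

#include <chrono>

class HighPerformanceClock
{
public:
// microseconds since epoch, enough to reproduce the timing below
static long long GetCurrentTimestampMicroseconds_static()
{
const std::chrono::system_clock::time_point timePoint = std::chrono::system_clock::now();
const std::chrono::system_clock::duration tp = timePoint.time_since_epoch();
return std::chrono::duration_cast<std::chrono::microseconds>(tp).count();
}
};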
This is the code:
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
// GetOutputsNamesDNN() and HighPerformanceClock are small project helpers (the clock is sketched above).
int main(int argc, char* argv[])
{
for (int i = 0; i < argc; ++i)
{
std::cout << argv[i] << " ";
}
std::cout << std::endl;
//LOG_DEBUG(cv::getBuildInformation());
// --------------------------------------------------------------
const cv::String configKeys =
"{help h usage ? | | print this message }"
"{model classifier m | | model path }"
"{input i in data dataPath | | (optional) input image path. If not given, a black dummy image is used. (default = 1) }"
"{channels c | | model input channels. }"
"{width w | | model input width. }"
"{height h | | model input height. }"
"{fp16 | | (optional) activate fp16 inference. }"
"{batch b | | (optional) batch size (default = 1) }"
"{repeat r | | (optional) number of repeated inferences (default = 1) }"
"{generate g | | (optional) activate blob generation on each repetition. }"
"{min | | (optional) minimum batch size }"
"{threads nthreads cvThreads| | (optional) number of threads (default = 1: sequential) }"
"{noDNN nodnn noDnn | | (optional) no forward path. }"
;
// --------------------------------------------------------------
cv::CommandLineParser parser(argc, argv, configKeys);
parser.about("imageClassificationTest");
if (parser.has("help"))
{
parser.printMessage();
return 0;
}
std::cout << "Warning: Only NCHW blobs are currently supported." << std::endl;
std::string modelPath = "model.onnx";
std::string imagePath = "";
int channels = 1;
int width = 224;
int height = 224;
bool fp16 = false;
int batch = 1;
int repeat = 1;
int threads = 1;
bool generate = false;
int min = 0;
bool noDnn = false;
if (parser.has("model")) modelPath = parser.get<std::string>("model");
if (parser.has("input")) imagePath = parser.get<std::string>("input");
if (parser.has("channels")) channels = parser.get<int>("channels");
if (parser.has("width")) width = parser.get<int>("width");
if (parser.has("height")) height = parser.get<int>("height");
if (parser.has("fp16")) fp16 = true;
if (parser.has("batch")) batch = parser.get<int>("batch");
if (parser.has("repeat")) repeat = parser.get<int>("repeat");
if (parser.has("threads")) threads = parser.get<int>("threads");
if (parser.has("generate")) generate = true;
if (parser.has("min")) min = parser.get<int>("min");
if (parser.has("noDNN")) noDnn = true;
cv::setNumThreads(threads);
if(!(channels == 1 || channels == 3))
{
std::cout << "ERROR: channels must be either 1 or 3" << std::endl;
return 1;
}
if(batch < 1)
{
std::cout << "ERROR: batch must be >= 1" << std::endl;
return 1;
}
cv::Mat img;
if(imagePath != "")
{
img = cv::imread(imagePath);
}
else
{
std::cout << "INFO: Using black dummy image." << std::endl;
img = cv::Mat::zeros(cv::Size(width, height), CV_8UC3);
}
if(channels == 1)
{
cv::cvtColor(img, img, cv::COLOR_BGR2GRAY);
}
const int nDevices = cv::cuda::getCudaEnabledDeviceCount();
if (nDevices > 0)
{
cv::cuda::setDevice(0);
}
cv::dnn::Net mNet = cv::dnn::readNetFromONNX(modelPath);
//std::vector<int> inputShape = { 1,(int)mInputHeightDNN,(int)mInputWidthDNN, (int)mInputChannelsDNN };
//int64 flops = mNet.getFLOPS(inputShape);
//LOG_DEBUG("DNN FLOPS: " << flops << " for input shape: " << inputShape[0] << "x" << inputShape[1] << "x" << inputShape[2] << "x" << inputShape[3]);
//mNet.dumpToFile("C:/data/DNN_DUMP.dot");
mNet.setPreferableBackend(cv::dnn::Backend::DNN_BACKEND_CUDA);
if(fp16) mNet.setPreferableTarget(cv::dnn::Target::DNN_TARGET_CUDA_FP16);
else mNet.setPreferableTarget(cv::dnn::Target::DNN_TARGET_CUDA);
auto mOutputLayerDNN = GetOutputsNamesDNN(mNet);
cv::Mat blob;
std::vector<cv::Mat> inputImages;
inputImages.reserve(batch);
for(int i=0; i<batch; ++i)
{
inputImages.push_back(img.clone());
}
double scale = 1.0/255.0;
double mean = 125;
if(batch == 1)
{
cv::dnn::blobFromImage(img, blob, scale, cv::Size(width, height), mean, true, false);
}
else
{
cv::dnn::blobFromImages(inputImages, blob, scale, cv::Size(width, height), mean, true, false);
}
try
{
std::cout << "Dummy call for initialization" << std::endl;
{
mNet.setInput(blob);
std::vector<cv::Mat> networkOutputs;
mNet.forward(networkOutputs, mOutputLayerDNN);
}
std::cout << "Finished dummy call for initialization. " << std::endl;
const long long ts_repeat_start = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
//LOG_TRACE("DNN setInput");
long long samples = 0;
for(int i=0; i<repeat; ++i)
{
if (generate || (min > 0))
{
if (batch == 1)
{
cv::dnn::blobFromImage(img, blob, scale, cv::Size(width, height), mean, true, false);
samples++;
}
else
{
auto inputBatch = inputImages;
int randomBatchSize = min + rand() % (batch - min + 1); // random batch size in [min, batch]; the +1 also avoids a modulo by zero when min == batch
if (randomBatchSize < 1) randomBatchSize = 1; // an empty batch would make blobFromImages fail
if (randomBatchSize > batch) std::cout << "ERROR: Current batch size is: " << randomBatchSize << std::endl;
//else std::cout << "Current batch: " << randomBatchSize << std::endl;
inputBatch.resize(randomBatchSize);
cv::dnn::blobFromImages(inputBatch, blob, scale, cv::Size(width, height), mean, true, false);
samples += randomBatchSize;
}
}
else
{
samples += batch;
}
std::vector<cv::Mat> networkOutputs;
const long long ts_dnnInt_start = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
mNet.setInput(blob);
const long long ts_dnnInt_end = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
const long long ts_dnnForw_start = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
if(!noDnn) mNet.forward(networkOutputs, mOutputLayerDNN);
const long long ts_dnnForw_end = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
}
const long long ts_repeat_end = HighPerformanceClock::GetCurrentTimestampMicroseconds_static();
long long totaltime = (ts_repeat_end - ts_repeat_start);
long long totaltime_per_repeat = totaltime / repeat;
long long totaltime_per_sample = totaltime / samples;
std::cout << "Total time for " << repeat << " repetitions of batch size " << batch << " : " << totaltime/1000.0 << " ms" << std::endl;
std::cout << "That is " << totaltime_per_sample / 1000.0 << " ms per sample" << std::endl;
}
catch (std::exception& e)
{
std::cout << "Exception in opencv-dnn forward: " << e.what();
}
return 0;
}
What is the right way to use batch sizes > 1 with the CUDA backend/target? Should I use a completely fixed batch size, or is there something else I should take care of?
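To make the "completely fixed batch size" option concrete: I would pad every request up to the maximum batch size and ignore the extra outputs, roughly like this (untested sketch; realImages is a hypothetical vector holding the samples that actually arrived, the remaining names are the same as in the code above):

std::vector<cv::Mat> padded = realImages; // hypothetical: the 1..batch samples of the current request
while ((int)padded.size() < batch)
padded.push_back(cv::Mat::zeros(img.size(), img.type())); // pad with dummy images up to the fixed batch size
cv::dnn::blobFromImages(padded, blob, scale, cv::Size(width, height), mean, true, false);
mNet.setInput(blob);
std::vector<cv::Mat> networkOutputs;
mNet.forward(networkOutputs, mOutputLayerDNN);
// only the first realImages.size() rows of the output would be evaluated

The obvious downside is that small requests would always pay for the full batch.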