Hello:
Why does the detect method of cv2.dnn_DetectionModel take so much longer on the first frame? On a Quadro K1000 GPU the first frame takes more than 2.7 s, against an average of 11 ms over the 213 frames processed. How can this be reduced?
Are you using the CUDA backend?
Yes, I use OpenCV compiled from source with CUDA.
This is my code:
import cv2
net_Detector1 = cv2.dnn.readNet(Weights_file, CFG_file, "darknet")
net_Detector1.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net_Detector1.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
model_Detector1 = cv2.dnn_DetectionModel(net_Detector1)
model_Detector1.setInputParams(size=size_Detector, scale=1/255, swapRB=True)
It depends on your code. If this is the first call to any CUDA function, the delay comes from context creation. If not, the delay is a result of the CUDA DNN libraries being loaded.
I use YOLOv4. How can we call it real time if we have all this latency from the first call and the loading of libraries?
I don’t see how an initialization cost, which is a one-off and could for instance be paid a year before you perform inference, has anything to do with latency or real-time processing.
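One way to keep the two numbers apart when benchmarking is to report the first call separately from the average of the calls that follow. A minimal sketch using only the Python standard library; time_first_and_rest and the stand-in fake_detect are hypothetical names for illustration, and in real code the callable would be something like lambda: model_Detector1.detect(frame):

```python
import time

def time_first_and_rest(fn, n_calls):
    """Call fn() n_calls times; return (first_ms, mean_rest_ms)."""
    if n_calls < 2:
        raise ValueError("need at least 2 calls")
    times_ms = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    first_ms = times_ms[0]
    mean_rest_ms = sum(times_ms[1:]) / (len(times_ms) - 1)
    return first_ms, mean_rest_ms


# Stand-in workload (hypothetical): pays a one-off setup cost on its first call.
_state = {"ready": False}

def fake_detect():
    if not _state["ready"]:
        time.sleep(0.05)   # simulated one-time initialization
        _state["ready"] = True
    time.sleep(0.001)      # simulated per-frame work

first_ms, mean_rest_ms = time_first_and_rest(fake_detect, 5)
print(f"first call: {first_ms:.1f} ms, later calls: {mean_rest_ms:.1f} ms on average")
```

Reporting the steady-state average on its own is what makes the per-frame figure comparable between backends.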
I think the first iteration is expensive because of the data transfer between CPU and GPU (246 MB of weights),
and maybe code compilation too (as with OpenCL).
LOOP 0
Tps cuda :2474.64ms
Tps opencv :401.108ms
[ WARN:0@9.264] global ocl4dnn_conv_spatial.cpp:1923 cv::dnn::ocl4dnn::OCL4DNNConvSpatial<float>::loadTunedConfig OpenCV(ocl4dnn): consider to specify kernel configuration cache directory through OPENCV_OCL4DNN_CONFIG_PATH parameter.
OpenCL program build log: dnn/dummy
Status -11: CL_BUILD_PROGRAM_FAILURE
-cl-no-subgroup-ifp
Error in processing command line: Don't understand command line argument "-cl-no-subgroup-ifp"!
Tps inference :654.125ms
Tps vino :1475.66ms
*******************
LOOP 1
Tps cuda :12.2015ms
Tps opencv :286.404ms
Tps inference :98.1013ms
Tps vino :162.483ms
*******************
LOOP 2
Tps cuda :13.2503ms
Tps opencv :285.352ms
Tps inference :98.0035ms
Tps vino :162.986ms
Initialization is done at the beginning of the code, before inference. I’m talking about the inference time of the first frame.
Thank you for your reply. My understanding is that the weights are transferred when the read function is called (cv2.dnn.readNet(Weights_file, CFG_file, "darknet")), and that cv2.dnn_DetectionModel then lets you perform inference directly.
I don’t think so:
net_cuda = dnn::readNet("hed_deploy.prototxt", "hed_pretrained_bsds.caffemodel");
At this point the weights are in memory.
Then you select the backend:
tpsInference.start();
net_cuda.setPreferableTarget(DNN_TARGET_CUDA);
net_cuda.setPreferableBackend(DNN_BACKEND_CUDA);
tpsInference.stop();
cout << "Tps select cuda :" << tpsInference.getTimeMilli() << "ms" << endl;
tpsInference.reset();
The result is:
Tps select cuda :0.0019ms
Then perform a single inference on a blank frame at the beginning of the code.
Additionally, run cuda::setDevice(0) (0 if you only have one GPU) before calling any other CUDA code, to ensure the context is created before you initialize your network. Then you will know for sure whether the delay comes from context creation, DNN library loading, or a host-to-device memory copy.
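In Python, the warm-up and the separate timings could be sketched roughly as follows. This is an illustration, not a tested recipe for your setup: timed and warm_up are hypothetical helper names, the model is assumed to be the cv2.dnn_DetectionModel from the question, and cv2.cuda.setDevice(0) is assumed to be the first CUDA call in the program, as suggested above.

```python
import time
import numpy as np

def timed(label, fn, log):
    """Run fn(), append (label, elapsed_ms) to log, and return fn's result."""
    start = time.perf_counter()
    result = fn()
    log.append((label, (time.perf_counter() - start) * 1000.0))
    return result

def warm_up(model, input_size, set_device=None):
    """Time an optional device selection and two inferences on a blank frame.

    model      -- any object with a detect(frame) method
    input_size -- (width, height) of the network input
    set_device -- optional callable, meant to run before any other CUDA work
    """
    log = []
    if set_device is not None:
        timed("set device", set_device, log)
    width, height = input_size
    blank = np.zeros((height, width, 3), dtype=np.uint8)  # black BGR image
    timed("first inference", lambda: model.detect(blank), log)
    timed("second inference", lambda: model.detect(blank), log)
    return log

# Assumed usage with the code from the question (needs a CUDA build of OpenCV):
#   import cv2
#   log = warm_up(model_Detector1, size_Detector, lambda: cv2.cuda.setDevice(0))
#   for label, ms in log:
#       print(f"{label}: {ms:.1f} ms")
```

If "set device" takes most of the time, the delay is context creation. If the first inference is still slow afterwards, the remainder comes from library loading or the weight upload.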