Any tips for debuggng performance issues of onnx model?

Hi guys,

I have a WSL2 Ubuntu20.04 with OpenCV 4.6.0 and CUDA 11.8. My inference performance on that machine is 25 frames.
Now, I’ve a docker image with has the same libraries and OpenCV have been compiled the same way, however performance is 15 frames.

I did check all the libraries that my application is linked with and they are the same in terms of versions. I’ve check OpenCV compiled options, the are the same. What can I do to track the issue here? Any advices?

I know for sure that this is not the docker issue, because I’ve observed similar situation on my Windows machine. I’ve compiled OpenCV with GPU support with Python bindings twice, and in one Python venv I am getting 25 frames, and in the other 15 frames. I’ve definitely had to do something different, but what?

The only difference is in performance, detected objects and probabilities are exactly the same, on each platform.

After profiling I am getting following results:
SYSTEM 1

ID name                                                                      count thr          min          max       median          avg       *self*          IPP   %       OpenCL   %
                                                                                               t-min        t-max     t-median        t-avg        total        t-IPP   %     t-OpenCL   %
  1 cv::dnn::dnn4_v20220524::Net::forward#net.cpp:93                            241   1       34.539     1569.779       35.574       43.279    10430.354        0.000   0        0.000   0
                                                                                              34.539     1569.779       35.574       43.279    10430.354        0.000   0        0.000   0
  2 cv::dnn::dnn4_v20220524::blobFromImage#dnn_utils.cpp:27                     241   1        2.027        8.872        2.661        3.036      731.583        0.000   0        0.000   0
                                                                                               2.027        8.872        2.661        3.036      731.583        0.000   0        0.000   0
  3 cv::VideoWriter::operator<<(const cv::Mat&)#cap.cpp:682                     241   1        1.586        3.697        1.953        2.009      484.226        0.000   0        0.000   0
                                                                                               1.586        3.697        1.953        2.009      484.226        0.000   0        0.000   0
  4 cv::VideoCapture::read#cap.cpp:372                                          242   1        0.086       14.356        1.393        1.514      366.407        0.000   0        0.000   0
                                                                                               0.086       14.356        1.393        1.514      366.407        0.000   0        0.000   0
  5 cv::dnn::dnn4_v20220524::Net::setInput#net.cpp:176                          241   1        0.361        0.714        0.396        0.415      100.053        0.000   0        0.000   0
                                                                                               0.361        0.714        0.396        0.415      100.053        0.000   0        0.000   0
  6 cv::blur#box_filter.dispatch.cpp:495                                        641   1        0.011        0.172        0.037        0.037       23.691        0.000   0        0.000   0
                                                                                               0.011        0.172        0.037        0.037       23.691        0.000   0        0.000   0
  7 cv::Mat::copyTo#copy.cpp:307                                                192   1        0.000        3.063        0.003        0.111       21.254        0.000   0        0.000   0
                                                                                               0.000        3.063        0.003        0.111       21.254        0.000   0        0.000   0
  8 cv::minMaxLoc#minmax.cpp:1566                                             24783   1        0.000        0.377        0.000        0.001       16.377        0.000   0        0.000   0
                                                                                               0.000        0.377        0.000        0.001       16.377        0.000   0        0.000   0
  9 cv::VideoWriter::open#cap.cpp:541                                             1   1        7.886        7.886        7.886        7.886        7.886        0.000   0        0.000   0
                                                                                               7.886        7.886        7.886        7.886        7.886        0.000   0        0.000   0
 10 cv::VideoCapture::VideoCapture#cap.cpp:74                                     1   1        1.620        1.620        1.620        1.620        1.620        0.000   0        0.000   0
                                                                                               1.620        1.620        1.620        1.620        1.620        0.000   0        0.000   0

SYSTEM 2

 ID name                                                                      count thr          min          max       median          avg       *self*          IPP   %       OpenCL   %
                                                                                               t-min        t-max     t-median        t-avg        total        t-IPP   %     t-OpenCL   %
  1 cv::dnn::dnn4_v20220524::Net::forward#net.cpp:93                            241   1       58.370     1596.572       60.892       67.375    16237.484        0.000   0        0.000   0
                                                                                              58.370     1596.572       60.892       67.375    16237.484        0.000   0        0.000   0
  2 cv::dnn::dnn4_v20220524::blobFromImage#dnn_utils.cpp:27                     241   1        2.127        8.301        2.901        3.201      771.325        0.000   0        0.000   0
                                                                                               2.127        8.301        2.901        3.201      771.325        0.000   0        0.000   0
  3 cv::VideoWriter::operator<<(const cv::Mat&)#cap.cpp:682                     241   1        1.597        3.186        1.991        2.042      492.145        0.000   0        0.000   0
                                                                                               1.597        3.186        1.991        2.042      492.145        0.000   0        0.000   0
  4 cv::VideoCapture::read#cap.cpp:372                                          242   1        0.128       21.976        1.357        1.522      368.270        0.000   0        0.000   0
                                                                                               0.128       21.976        1.357        1.522      368.270        0.000   0        0.000   0
  5 cv::dnn::dnn4_v20220524::Net::setInput#net.cpp:176                          241   1        0.359        0.872        0.397        0.429      103.314        0.000   0        0.000   0
                                                                                               0.359        0.872        0.397        0.429      103.314        0.000   0        0.000   0
  6 cv::blur#box_filter.dispatch.cpp:495                                        641   1        0.013        0.196        0.038        0.038       24.677        0.000   0        0.000   0
                                                                                               0.013        0.196        0.038        0.038       24.677        0.000   0        0.000   0
  7 cv::Mat::copyTo#copy.cpp:307                                                192   1        0.000        2.679        0.003        0.116       22.244        0.000   0        0.000   0
                                                                                               0.000        2.679        0.003        0.116       22.244        0.000   0        0.000   0
  8 cv::minMaxLoc#minmax.cpp:1566                                             24783   1        0.000        0.229        0.000        0.001       16.185        0.000   0        0.000   0
                                                                                               0.000        0.229        0.000        0.001       16.185        0.000   0        0.000   0
  9 cv::VideoWriter::open#cap.cpp:541                                             1   1        7.511        7.511        7.511        7.511        7.511        0.000   0        0.000   0
                                                                                               7.511        7.511        7.511        7.511        7.511        0.000   0        0.000   0
 10 cv::VideoCapture::VideoCapture#cap.cpp:74                                     1   1        2.522        2.522        2.522        2.522        2.522        0.000   0        0.000   0
                                                                                               2.522        2.522        2.522        2.522        2.522        0.000   0        0.000   0

So problem lies in cv::dnn::dnn4_v20220524::Net::forward but how to debug it further?

just curious, how do you do that ?
(and what do we see above ?)

Hi @berak , thank you for your interest. So, I was looking for a way to profile my application.. Firstly I’ve profiled it on both systems using GNU gprof. Results did not show anything unusual, but gprof only allowed me to see my own functions as OpenCV was not built with option to emit profiling information. It could probably be built like that but further googling led me to Profiling OpenCV Applications · opencv/opencv Wiki · GitHub. Using their instructions I was able to collect profiling data.

What we see above is the 10 most time consuming functions executed within realms of OpenCV. They are sorted in descending order. So the first one is the most time consuming.
As you can see, on both systems executing of application was exactly the same (you can tell that by looking at count column which says how many times given function was called).
However, there is a difference how long cv::dnn::dnn4_v20220524::Net::forwardexecuted. On System 1 average time for one execution was 43 milliseconds, on System 2 it was 67 milliseconds.

So now I know that the problem lies within cv::dnn::dnn4_v20220524::Net::forward, somehow it behaves differently on both systems. I also know that this only happens on CUDA backend (cv::dnn::DNN_BACKEND_CUDA).

My next idea is to profile application using NVIDIA Nsight Compute maybe this will give me a hint what’s going on.

1 Like

how similar are those two systems really? what differences can you identify?

Similarities:
Ubuntu 20.04
CUDA 11.8 CuDNN 8.7
OpenCV 4.6.0 built with the same options (I diffed cv::getBuildInformation())
Application linked with the same libraries (I diffed ldd output).

Differences
System 1 is WSL2 image, System 2 is Docker image.
System 1 has much more libraries pre-installed.
OpenCV built libraries are different at binary level (different size).

I suspect that CMake script must have picked up something different without being explicit what it was.

I can reproduce the issue every time. I’ve built multiple docker images from scratch and used multiple WSL2 instances and I got the same results every time.

I’d wager it’s that.

both of these ARE abstractions. expect (differing) friction losses in the translation from sandbox to host.

if you could get both WSL2 and Docker to work on a single system, I think you will see performance differences even then.

Docker on windows starts a virtual machine just like in MacOS. Hence the delays.
WLS is not a virtualization and it runs faster.
Now, I know that you claim it to have worked on windows, but this is my humble opinion.