I’m trying to use the TIM-VX support in OpenCV to run an ONNX model from Python, but it doesn’t look like it’s actually running on the NPU at all. When I run the model, CPU usage goes to 400% (quad-core device) and the inference time is the same as when I run inference on the CPU directly with onnxruntime, or with OpenCV’s default CPU backend.
I’ve tried both building OpenCV from source and using the prebuilt package from Khadas for the VIM3, with the same result. cv2.getBuildInfo() returns:
General configuration for OpenCV 4.8.1 =====================================
Version control: unknown
Platform:
Timestamp: 2023-12-21T06:21:07Z
Host: Linux 4.9.241 aarch64
CMake: 3.16.3
CMake generator: Unix Makefiles
CMake build tool: /usr/bin/make
Configuration: RELEASE
CPU/HW features:
Baseline: NEON FP16
C/C++:
Built as dynamic libs?: YES
C++ standard: 11
C++ Compiler: /usr/bin/c++ (ver 9.4.0)
C++ flags (Release): -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Winit-self -Wpointer-arith -Wsign-promo -Wuninitialized -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections -fvisibility=hidden -fvisibility-inlines-hidden -Wno-unused-parameter -Wno-undef -Wno-sign-compare -Wno-unused-but-set-variable -Wno-shadow -Wno-suggest-override -Wno-missing-declarations -Wno-switch -O3 -DNDEBUG -DNDEBUG
C++ flags (Debug): -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Winit-self -Wpointer-arith -Wsign-promo -Wuninitialized -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections -fvisibility=hidden -fvisibility-inlines-hidden -Wno-unused-parameter -Wno-undef -Wno-sign-compare -Wno-unused-but-set-variable -Wno-shadow -Wno-suggest-override -Wno-missing-declarations -Wno-switch -g -O0 -DDEBUG -D_DEBUG
C Compiler: /usr/bin/cc
C flags (Release): -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Winit-self -Wpointer-arith -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections -fvisibility=hidden -Wno-unused-parameter -Wno-strict-prototypes -Wno-undef -Wno-sign-compare -Wno-missing-prototypes -Wno-missing-declarations -Wno-strict-aliasing -Wno-unused-but-set-variable -Wno-maybe-uninitialized -Wno-shadow -Wno-switch -O3 -DNDEBUG -DNDEBUG
C flags (Debug): -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Winit-self -Wpointer-arith -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections -fvisibility=hidden -Wno-unused-parameter -Wno-strict-prototypes -Wno-undef -Wno-sign-compare -Wno-missing-prototypes -Wno-missing-declarations -Wno-strict-aliasing -Wno-unused-but-set-variable -Wno-maybe-uninitialized -Wno-shadow -Wno-switch -g -O0 -DDEBUG -D_DEBUG
Linker flags (Release): -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined
Linker flags (Debug): -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined
ccache: NO
Precompiled headers: NO
Extra dependencies: dl m pthread rt
3rdparty dependencies:
OpenCV modules:
To be built: calib3d core dnn features2d flann highgui imgcodecs imgproc ml objdetect photo python3 stitching ts video videoio
Disabled: gapi world
Disabled by dependency: -
Unavailable: java python2
Applications: tests perf_tests apps
Documentation: NO
Non-free algorithms: NO
GUI: GTK2
GTK+: YES (ver 2.24.32)
GThread : YES (ver 2.64.6)
GtkGlExt: NO
VTK support: NO
Media I/O:
ZLib: /usr/lib/aarch64-linux-gnu/libz.so (ver 1.2.11)
JPEG: libjpeg-turbo (ver 2.1.3-62)
WEBP: build (ver encoder: 0x020f)
PNG: /usr/lib/aarch64-linux-gnu/libpng.so (ver 1.6.37)
TIFF: build (ver 42 - 4.2.0)
JPEG 2000: build (ver 2.5.0)
OpenEXR: build (ver 2.3.0)
HDR: YES
SUNRASTER: YES
PXM: YES
PFM: YES
Video I/O:
DC1394: NO
FFMPEG: NO
avcodec: YES (53.7.0)
avformat: YES (53.4.0)
avutil: YES (51.9.1)
swscale: YES (2.0.0)
avresample: NO
GStreamer: NO
v4l/v4l2: YES (linux/videodev2.h)
Parallel framework: pthreads
Trace: YES (with Intel ITT)
Other third-party libraries:
Lapack: NO
Eigen: NO
Custom HAL: YES (carotene (ver 0.0.1))
Protobuf: build (3.19.1)
Flatbuffers: builtin/3rdparty (23.5.9)
Tim-VX: YES
Include path /home/khadas/123/opencv-build/3rdparty/libtim-vx/TIM-VX-1d9c7ab941b3d8d9c4d28d80058402725731e3d6/include
Link libraries: tim-vx
VIVANTE SDK path /usr
OpenCL: YES (no extra features)
Include path: /home/khadas/123/opencv-4.8.1/3rdparty/include/opencl/1.2
Link libraries: Dynamic load
Python 3:
Interpreter: /usr/bin/python3 (ver 3.8.10)
Libraries: /usr/lib/aarch64-linux-gnu/libpython3.8.so (ver 3.8.10)
numpy: /usr/lib/python3/dist-packages/numpy/core/include (ver 1.17.4)
install path: lib/python3.8/site-packages/cv2/python-3.8
Python (for build): /usr/bin/python2.7
Java:
ant: NO
Java: NO
JNI: NO
Java wrappers: NO
Java tests: NO
Install to: /home/khadas/123/opencv-install
-----------------------------------------------------------------
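Tim-VX shows up as YES in that build info, so the backend should at least be compiled in. As an extra sanity check I also query the DNN module for the targets it exposes for the TIM-VX backend; a minimal sketch of that check is below (assuming cv2.dnn.getAvailableTargets works the way I think it does, and that a missing NPU target means OpenCV quietly falls back to the CPU path):

import cv2

# List the targets the DNN module reports for the TIM-VX backend.
# My assumption: if DNN_TARGET_NPU is not in this list, the
# setPreferableBackend/Target calls fall back to the default CPU path.
targets = cv2.dnn.getAvailableTargets(cv2.dnn.DNN_BACKEND_TIMVX)
print('TIM-VX targets:', targets)
print('NPU target available:', cv2.dnn.DNN_TARGET_NPU in targets)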
The code I am using is modified from the detect.py sample in the OpenCV_NPU_Demo repo and looks like this:
import os
import sys
import argparse
from typing import Tuple

import cv2
import numpy as np

from priorbox import PriorBox
from utils import draw
import time

def current_milli_time():
    return round(time.time() * 1000)

def str2bool(v: str) -> bool:
    if v.lower() in ['true', 'yes', 'on', 'y', 't']:
        return True
    elif v.lower() in ['false', 'no', 'off', 'n', 'f']:
        return False
    else:
        raise NotImplementedError

parser = argparse.ArgumentParser(description='A demo for running libfacedetection using OpenCV\'s DNN module.')
# OpenCV DNN
# Location
parser.add_argument('--image', help='Path to the image.')
parser.add_argument('--model', type=str, help='Path to .onnx model file.')
# Inference
parser.add_argument('--conf_thresh', default=0.6, type=float, help='Threshold for filtering out faces with conf < conf_thresh.')
parser.add_argument('--nms_thresh', default=0.3, type=float, help='Threshold for non-max suppression.')
parser.add_argument('--keep_top_k', default=750, type=int, help='Keep the top keep_top_k results.')
# Result
parser.add_argument('--vis', default=True, type=str2bool, help='Set True to visualize the result image.')
parser.add_argument('--save', default='result.jpg', type=str, help='Path to save the result image.')
args = parser.parse_args()

# Build the input tensor
assert os.path.exists(args.image), 'File {} does not exist!'.format(args.image)
img = cv2.imread(args.image, cv2.IMREAD_COLOR)
h, w, _ = img.shape
print('Image size: h={}, w={}'.format(h, w))

image = cv2.resize(img, (256, 256))         # resize to the model's expected input dimensions
image = image.astype(np.float32) / 255.0    # normalize pixel values to [0, 1]
image = image.transpose((2, 0, 1))          # HWC -> CHW
input_data = np.expand_dims(image, axis=0)  # add batch dimension -> NCHW

blob = cv2.dnn.blobFromImage(img)  # unused here; the 'size' param would resize the output to a given shape

# Load the net
print("Loading model...")
net = cv2.dnn.readNet(args.model)
print("...done")

# NPU (TIM-VX backend)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_TIMVX)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_NPU)
# CPU
#net.setPreferableBackend(cv2.dnn.DNN_BACKEND_DEFAULT)
#net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

for i in range(10):
    # Run the net
    tStart = current_milli_time()
    output_names = ['cmap', 'paf']
    # net.setInput(blob)
    net.setInput(input_data)
    cmap, paf = net.forward(output_names)
    tEnd = current_milli_time()
    print("took {}ms".format(tEnd - tStart))
I run it with the following command:
python3 pose.py --image ~/src/test/model_test/pose1.jpg --model=pose_densenet121_body.onnx
When I run it, each inference takes about 1.5 s and CPU usage spikes. Inference for this model should take under 100 ms if it is actually running on the NPU.
Am I doing something wrong in my code, or is there something else going on?
It’s probably worth noting that when I run the provided demo with the provided model (detect.py with yunet_int8.onnx), detection takes about 90 seconds! Something is definitely not working right.
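For reference, the onnxruntime CPU numbers I mentioned at the top were measured roughly like this (a sketch, using a random input of the same 1x3x256x256 shape rather than the real preprocessing):

import time
import numpy as np
import onnxruntime as ort

# CPU-only baseline for comparison with the OpenCV timings above.
sess = ort.InferenceSession('pose_densenet121_body.onnx',
                            providers=['CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name
dummy = np.random.rand(1, 3, 256, 256).astype(np.float32)

for _ in range(10):
    t0 = time.time()
    sess.run(None, {input_name: dummy})
    print('onnxruntime CPU: {:.0f} ms'.format((time.time() - t0) * 1000))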