TIMVX (Amlogic 311D2 NPU on Khadas Vim3)

I’m trying to use the TIMVX support in OpenCV to run an ONNX model from Python, but it doesn’t look like it’s actually running on the NPU at all. When I run the model, CPU usage goes to 400% (quad-core device) and inference takes the same time as when I run it on the CPU directly using onnxruntime, or even using OpenCV on the CPU.

I’ve both built OpenCV from scratch and used a prebuilt package from Khadas for the Vim3. getBuildInfo() returns:

General configuration for OpenCV 4.8.1 =====================================
  Version control:               unknown

  Platform:
    Timestamp:                   2023-12-21T06:21:07Z
    Host:                        Linux 4.9.241 aarch64
    CMake:                       3.16.3
    CMake generator:             Unix Makefiles
    CMake build tool:            /usr/bin/make
    Configuration:               RELEASE

  CPU/HW features:
    Baseline:                    NEON FP16

  C/C++:
    Built as dynamic libs?:      YES
    C++ standard:                11
    C++ Compiler:                /usr/bin/c++  (ver 9.4.0)
    C++ flags (Release):         -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Winit-self -Wpointer-arith -Wsign-promo -Wuninitialized -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -fvisibility-inlines-hidden -Wno-unused-parameter -Wno-undef -Wno-sign-compare -Wno-unused-but-set-variable -Wno-shadow -Wno-suggest-override -Wno-missing-declarations -Wno-switch -O3 -DNDEBUG  -DNDEBUG
    C++ flags (Debug):           -fsigned-char -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Winit-self -Wpointer-arith -Wsign-promo -Wuninitialized -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -fvisibility-inlines-hidden -Wno-unused-parameter -Wno-undef -Wno-sign-compare -Wno-unused-but-set-variable -Wno-shadow -Wno-suggest-override -Wno-missing-declarations -Wno-switch -g  -O0 -DDEBUG -D_DEBUG
    C Compiler:                  /usr/bin/cc
    C flags (Release):           -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Winit-self -Wpointer-arith -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -Wno-unused-parameter -Wno-strict-prototypes -Wno-undef -Wno-sign-compare -Wno-missing-prototypes -Wno-missing-declarations -Wno-strict-aliasing -Wno-unused-but-set-variable -Wno-maybe-uninitialized -Wno-shadow -Wno-switch -O3 -DNDEBUG  -DNDEBUG
    C flags (Debug):             -fsigned-char -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Winit-self -Wpointer-arith -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections    -fvisibility=hidden -Wno-unused-parameter -Wno-strict-prototypes -Wno-undef -Wno-sign-compare -Wno-missing-prototypes -Wno-missing-declarations -Wno-strict-aliasing -Wno-unused-but-set-variable -Wno-maybe-uninitialized -Wno-shadow -Wno-switch -g  -O0 -DDEBUG -D_DEBUG
    Linker flags (Release):      -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined
    Linker flags (Debug):        -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined
    ccache:                      NO
    Precompiled headers:         NO
    Extra dependencies:          dl m pthread rt
    3rdparty dependencies:

  OpenCV modules:
    To be built:                 calib3d core dnn features2d flann highgui imgcodecs imgproc ml objdetect photo python3 stitching ts video videoio
    Disabled:                    gapi world
    Disabled by dependency:      -
    Unavailable:                 java python2
    Applications:                tests perf_tests apps
    Documentation:               NO
    Non-free algorithms:         NO

  GUI:                           GTK2
    GTK+:                        YES (ver 2.24.32)
      GThread :                  YES (ver 2.64.6)
      GtkGlExt:                  NO
    VTK support:                 NO

  Media I/O:
    ZLib:                        /usr/lib/aarch64-linux-gnu/libz.so (ver 1.2.11)
    JPEG:                        libjpeg-turbo (ver 2.1.3-62)
    WEBP:                        build (ver encoder: 0x020f)
    PNG:                         /usr/lib/aarch64-linux-gnu/libpng.so (ver 1.6.37)
    TIFF:                        build (ver 42 - 4.2.0)
    JPEG 2000:                   build (ver 2.5.0)
    OpenEXR:                     build (ver 2.3.0)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

  Video I/O:
    DC1394:                      NO
    FFMPEG:                      NO
      avcodec:                   YES (53.7.0)
      avformat:                  YES (53.4.0)
      avutil:                    YES (51.9.1)
      swscale:                   YES (2.0.0)
      avresample:                NO
    GStreamer:                   NO
    v4l/v4l2:                    YES (linux/videodev2.h)

  Parallel framework:            pthreads

  Trace:                         YES (with Intel ITT)

  Other third-party libraries:
    Lapack:                      NO
    Eigen:                       NO
    Custom HAL:                  YES (carotene (ver 0.0.1))
    Protobuf:                    build (3.19.1)
    Flatbuffers:                 builtin/3rdparty (23.5.9)

  Tim-VX:                        YES
    Include path                 /home/khadas/123/opencv-build/3rdparty/libtim-vx/TIM-VX-1d9c7ab941b3d8d9c4d28d80058402725731e3d6/include
    Link libraries:              tim-vx
    VIVANTE SDK path             /usr

  OpenCL:                        YES (no extra features)
    Include path:                /home/khadas/123/opencv-4.8.1/3rdparty/include/opencl/1.2
    Link libraries:              Dynamic load

  Python 3:
    Interpreter:                 /usr/bin/python3 (ver 3.8.10)
    Libraries:                   /usr/lib/aarch64-linux-gnu/libpython3.8.so (ver 3.8.10)
    numpy:                       /usr/lib/python3/dist-packages/numpy/core/include (ver 1.17.4)
    install path:                lib/python3.8/site-packages/cv2/python-3.8

  Python (for build):            /usr/bin/python2.7

  Java:
    ant:                         NO
    Java:                        NO
    JNI:                         NO
    Java wrappers:               NO
    Java tests:                  NO

  Install to:                    /home/khadas/123/opencv-install
-----------------------------------------------------------------

The code I am using is modified from the detect.py sample in the OpenCV_NPU_Demo repo and looks like this:

import os
import sys
import argparse
from typing import Tuple

import cv2
import numpy as np

from priorbox import PriorBox
from utils import draw
import time

def current_milli_time():
    return round(time.time() * 1000)

def str2bool(v: str) -> bool:
    if v.lower() in ['true', 'yes', 'on', 'y', 't']:
        return True
    elif v.lower() in ['false', 'no', 'off', 'n', 'f']:
        return False
    else:
        raise NotImplementedError

parser = argparse.ArgumentParser(description='A demo for running libfacedetection using OpenCV\'s DNN module.')
# OpenCV DNN
# Location
parser.add_argument('--image', help='Path to the image.')
parser.add_argument('--model', type=str, help='Path to .onnx model file.')
# Inference
parser.add_argument('--conf_thresh', default=0.6, type=float, help='Threshold for filtering out faces with conf < conf_thresh.')
parser.add_argument('--nms_thresh', default=0.3, type=float, help='Threshold for non-max suppression.')
parser.add_argument('--keep_top_k', default=750, type=int, help='Keep keep_top_k for results outputing.')
# Result
parser.add_argument('--vis', default=True, type=str2bool, help='Set True to visualize the result image.')
parser.add_argument('--save', default='result.jpg', type=str, help='Path to save the result image.')
args = parser.parse_args()


# Build the blob
assert os.path.exists(args.image), 'File {} does not exist!'.format(args.image)
img = cv2.imread(args.image, cv2.IMREAD_COLOR)
h, w, _ = img.shape
print('Image size: h={}, w={}'.format(h, w))

image = cv2.resize(img, (256, 256)) # Resize to expected input dimensions
image = image.astype(np.float32) / 255.0 # Normalize pixel values
image = image.transpose((2, 0, 1)) # Change channel order to (C, H, W)
input_data = np.expand_dims(image, axis=0) # Add batch dimension


blob = cv2.dnn.blobFromImage(img) # 'size' param resize the output to the given shape

# Load the net
print("Loading model...")
net = cv2.dnn.readNet(args.model)
print("...done")

# NPU
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_TIMVX)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_NPU)

# CPU
#net.setPreferableBackend(cv2.dnn.DNN_BACKEND_DEFAULT)
#net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

for i in range(10):
    # Run the net
    tStart = current_milli_time()
    output_names = ['cmap', 'paf']
    # net.setInput(blob)
    net.setInput(input_data)
    cmap, paf = net.forward(output_names)
    tEnd = current_milli_time()

    print("took {}ms".format(tEnd - tStart))

I run it with the following command:


python3 pose.py  --image ~/src/test/model_test/pose1.jpg --model=pose_densenet121_body.onnx

When I run it, the output shows each inference taking about 1.5 s and the CPU usage spikes. Inference for this model should take under 100 ms if it’s running on the NPU.
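To separate one-time setup cost from steady-state inference, and to put a number on the CPU spike, here is a minimal sketch using only the standard library. The helper name `measure` and its parameters are hypothetical, not part of OpenCV; `run` is any zero-argument callable.

```python
import time


def measure(run, warmup=1, repeats=5):
    """Call run() repeatedly; return a list of (wall_ms, cpu_ms) per call.

    Hypothetical helper: warm-up calls are executed but not timed, so
    one-time costs (graph setup, allocations) do not skew the numbers.
    """
    for _ in range(warmup):
        run()

    results = []
    for _ in range(repeats):
        wall_start = time.perf_counter()   # monotonic wall clock
        cpu_start = time.process_time()    # CPU time of the whole process
        run()
        cpu_ms = (time.process_time() - cpu_start) * 1000.0
        wall_ms = (time.perf_counter() - wall_start) * 1000.0
        results.append((wall_ms, cpu_ms))
    return results
```

In the script above, `run` could be a small function that calls `net.setInput(input_data)` and then `net.forward(output_names)`. If `cpu_ms` comes out several times larger than `wall_ms`, the work is being spread over the CPU cores; if the NPU were doing the heavy lifting, one would expect `cpu_ms` to be well below `wall_ms`, though that is an expectation rather than a guarantee.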

Am I doing something wrong in my code or is there something else going on?

It’s probably worth noting that when I run the provided demo with the provided model (detect.py with yunet_int8.onnx), detections take about 90 seconds! Something is definitely not working right.

I’m not sure you’re gonna find adequate help here. That looks very specific. Perhaps look through the issues and PRs on opencv’s github. github has “discussions” but those don’t appear to be generally open on opencv’s github.

this is the PR that introduced support for TIMVX: https://github.com/opencv/opencv/pull/21036

does your desired dnn backend show up in cv::dnn::getAvailableBackends()? (see the OpenCV docs for the Deep Neural Network module)

yeah that’s a C++-only function, no python binding. perhaps you could file that as a bug/enhancement, and mention your actual issue as a footnote.

getAvailableTargets(backend) is available in Python though.

if your code executed at all, then the backend should be among [be for be in dir(cv.dnn) if be.startswith("DNN_BACKEND_")] so it should have been in the build. that suggests the issue occurs at runtime.

Yes it is in the list of available backends. The code I ran is:

import cv2

# Get available targets for a specific backend (here, TIMVX)
backend_id = cv2.dnn.DNN_BACKEND_TIMVX
targets = cv2.dnn.getAvailableTargets(backend_id)
print("Available targets for TIMVX backend:", targets)

for be in dir(cv2.dnn):
    if be.startswith("DNN_BACKEND_"):
        print(be)

and the output was:

Available targets for TIMVX backend: [9]
DNN_BACKEND_CANN
DNN_BACKEND_CUDA
DNN_BACKEND_DEFAULT
DNN_BACKEND_HALIDE
DNN_BACKEND_INFERENCE_ENGINE
DNN_BACKEND_OPENCV
DNN_BACKEND_TIMVX
DNN_BACKEND_VKCOM
DNN_BACKEND_WEBNN
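The `[9]` in that output is a numeric target id. A small sketch for turning such ids back into names, assuming the `DNN_TARGET_*` constants exposed by the Python bindings; `constant_names`, `describe_targets` and `demo` are hypothetical helper names, not OpenCV functions.

```python
def constant_names(dnn, prefix):
    """Map integer values of constants such as DNN_TARGET_* to their names."""
    names = {}
    for attr in dir(dnn):
        if attr.startswith(prefix):
            names[int(getattr(dnn, attr))] = attr
    return names


def describe_targets(dnn, backend_id):
    """Return the names of the targets reported for backend_id."""
    target_names = constant_names(dnn, "DNN_TARGET_")
    return [
        target_names.get(int(t), "unknown({})".format(int(t)))
        for t in dnn.getAvailableTargets(backend_id)
    ]


def demo():
    """Run on the device; needs an OpenCV build that defines DNN_BACKEND_TIMVX."""
    import cv2
    print(describe_targets(cv2.dnn, cv2.dnn.DNN_BACKEND_TIMVX))
```

Calling `demo()` on the board prints the target names instead of raw integers. The list from `getAvailableTargets` is the more telling half of the check, since it is queried at runtime rather than read from the names the module defines.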

I’ll take a look on github, but I’m starting to think it just doesn’t work well if at all.

Thanks for the quick reply. I appreciate the help :)

perhaps a matter of drivers.

does that NPU come with examples that use something other than OpenCV to reach the device? then at least you could find out if the device works at all, and it’s “just” an OpenCV issue. in that case, you could try filing an issue about it. if that turns out to be a matter of build options, then that would mean “please improve the documentation”.

if you do that, you could @ the original contributors of the PR that added this backend and target.
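on the driver question, a quick check that does not involve OpenCV at all can be sketched like this. the device path `/dev/galcore` and the module name `galcore` are assumptions (what the Vivante kernel driver commonly uses); adjust both for your image. the function names are made up for illustration.

```python
import os


def check_device_node(path="/dev/galcore"):
    """Report whether a device node exists and is accessible to this user.

    The default path is an assumption, not something confirmed for this board.
    """
    return {
        "path": path,
        "exists": os.path.exists(path),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }


def module_loaded(name="galcore", modules_file="/proc/modules"):
    """Return True if a kernel module called name is listed in modules_file."""
    try:
        with open(modules_file) as f:
            return any(line.split()[0] == name for line in f if line.strip())
    except OSError:
        return False
```

a missing or unreadable node would point at drivers or permissions rather than at OpenCV. a driver compiled into the kernel will not show up in `/proc/modules`, so a `False` from `module_loaded` alone is not conclusive.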