Model Parallelism/Multithreading: DNN model is slower than sequential execution

I am trying to speed up image classification by running multiple copies of the same DNN model on separate threads using Python's multiprocessing library.

However, processing the images serially takes half the time of running them in parallel. A single image takes ~10 ms on my model, so running 4 images sequentially gives the expected total of ~40 ms. Running them in parallel I would expect the 4 images to complete in ~10 ms, but instead they take ~80 ms, twice as long as the serial run.

I am using the following code to benchmark this:

import cv2
import numpy as np
import time
import multiprocessing

print("cpu core count=", multiprocessing.cpu_count())
print("cv num threads=", cv2.getNumThreads())
#print(cv2.getBuildInformation())

class_names = ['back', 'none', 'has']

warmupFile = "1691030495-1691030503213.jpg"
images = [
    cv2.imread("1693945449-1693945454785.jpg", cv2.IMREAD_COLOR),
    cv2.imread("1687735037-1687735041649.jpg", cv2.IMREAD_COLOR),
    cv2.imread("1693945449-1693945453982.jpg", cv2.IMREAD_COLOR),
    cv2.imread("1693102756-1693102758762.jpg", cv2.IMREAD_COLOR),
]

size = 4

# load DNN model
def newModel(onnxFile, imgFile):
    model = cv2.dnn.readNetFromONNX(onnxFile)

    if model.empty():
        print("error loading model")
        exit(1)

    model.setPreferableBackend(cv2.dnn.DNN_BACKEND_DEFAULT)
    model.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU) 

    # warm up the model: the first forward pass is slow
    warmup = cv2.imread(imgFile, cv2.IMREAD_COLOR)
    processimg(model, warmup)

    return model

def processimg(model, img):
    # downscale to 320x240 first; blobFromImage below resizes again to 224x224
    resizeImg = cv2.resize(src=img, dsize=(320, 240),
           fx=0, fy=0, interpolation=cv2.INTER_AREA)

    mean = np.array([0.485, 0.456, 0.406]) * 255.0
    std = [0.229, 0.224, 0.225]

    blob = cv2.dnn.blobFromImage(image=resizeImg, scalefactor=1.0 / 255.0, size=(224,224),
                             mean=mean, swapRB=True, crop=False)

    # blobFromImage has no std argument, so divide by the per-channel std manually
    blob[0] /= np.asarray(std, dtype=np.float32).reshape(3, 1, 1)

    model.setInput(blob)
    prob = model.forward()
    out = softmax(np.squeeze(prob))
    return prob, out


def softmax(x):
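    # numerically stable: subtract the max before exponentiating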
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()


def runjob(i):
    prob, out = processimg(models[i], images[i])

    class_id = np.argmax(prob)
    print("predict=",class_names[class_id], " confidence=", out[class_id]*100,"%"," exe time=",
          (time.time() - start)*1000,"ms")


# initialize models
models = []
for i in range(0,size):
    nextModel = "./models/model-%d.onnx" % i
    models.append(newModel(nextModel, warmupFile)) 


# start processing images

# serial execution
start = time.time()
for i in range(0,size):
    runjob(i)

print("total serial exe time=", (time.time() - start)*1000,"ms")

# parallel execution
start = time.time()
pool = multiprocessing.Pool(processes=size)
pool.map(runjob, range(0,size))  # map blocks until every job has finished

print("total parallel exe time=", (time.time() - start)*1000,"ms")

Running this on my workstation gives the following output.

cpu core count= 20
cv num threads= 20
predict= none  confidence= 98.39194416999817 %  exe time= 9.763717651367188 ms
predict= has  confidence= 99.57807064056396 %  exe time= 19.548892974853516 ms
predict= none  confidence= 95.81201076507568 %  exe time= 28.734445571899414 ms
predict= back  confidence= 99.99748468399048 %  exe time= 41.81838035583496 ms
total serial exe time= 41.91398620605469 ms
predict= has  confidence= 99.57807064056396 %  exe time= 79.77533340454102 ms
predict= back  confidence= 99.99748468399048 %  exe time= 79.78200912475586 ms
predict= none  confidence= 98.39194416999817 %  exe time= 80.55520057678223 ms
predict= none  confidence= 95.81201076507568 %  exe time= 83.93979072570801 ms
total parallel exe time= 84.3496322631836 ms

I preload the models and images, and warm up each model by running an image through it, since the first classification is slow.

As the parallel result is not what I expected, can anyone see what's wrong?

what's the difference between them?

what about batching your images, using
cv2.dnn.blobFromImages?
(and try to use opencv's internal (data-based) parallelization, instead of trying to wrap your own (thread/task-based) around it)

The model files are all copies of the same ONNX file.

I did run a version of the script using a single ONNX model file shared by all threads and got the same results. I wondered if some kind of lock on the model file might have caused the unexpected result, but using duplicate files made no difference.
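For what it's worth, the per-process ownership can be made explicit by loading the net once per worker through the Pool initializer, instead of inheriting the models list. A minimal sketch, reusing processimg, images, class_names and size from the script above; worker_model, init_worker and runjob_worker are made-up names:

worker_model = None   # one Net per worker process, set by the initializer

def init_worker(onnxFile):
    # runs once inside each freshly started worker process
    global worker_model
    worker_model = cv2.dnn.readNetFromONNX(onnxFile)
    worker_model.setPreferableBackend(cv2.dnn.DNN_BACKEND_DEFAULT)
    worker_model.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

def runjob_worker(i):
    prob, out = processimg(worker_model, images[i])
    class_id = int(np.argmax(prob))
    return class_names[class_id], float(out[class_id]) * 100

pool = multiprocessing.Pool(processes=size,
                            initializer=init_worker,
                            initargs=("./models/model-0.onnx",))
print(pool.map(runjob_worker, range(size)))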

I have just tried changing the test to avoid OpenCV and use a simple sleep, with a new runjob() of:

def runjob(i):
    time.sleep(0.01)
    print("job exe time=", (time.time() - start)*1000,"ms")

This gave the following output:

job exe time= 10.088205337524414 ms
job exe time= 20.206689834594727 ms
job exe time= 30.312061309814453 ms
job exe time= 40.41624069213867 ms
total serial exe time= 40.44032096862793 ms
job exe time= 23.56123924255371 ms
job exe time= 23.647069931030273 ms
job exe time= 23.77796173095703 ms
job exe time= 23.76723289489746 ms
total parallel exe time= 24.188995361328125 ms

This result is what I would expect; however, it hints at the problem: there is added overhead from Python setting up the threading. For a simple time.sleep() it adds around 13 ms to the execution of the job.

By modifying my original script and putting a sleep into runjob(), e.g.:

def runjob(i):
    time.sleep(1)
    prob, out = processimg(models[i], images[i])

    class_id = np.argmax(prob)
    print("predict=",class_names[class_id], " confidence=", out[class_id]*100,"%"," exe time=",
          (time.time() - start)*1000,"ms")

This results in the following:

predict= none  confidence= 98.39194416999817 %  exe time= 1010.5865001678467 ms
predict= has  confidence= 99.57807064056396 %  exe time= 2023.1900215148926 ms
predict= none  confidence= 95.81201076507568 %  exe time= 3034.2249870300293 ms
predict= back  confidence= 99.99748468399048 %  exe time= 4048.3639240264893 ms
total serial exe time= 4048.428535461426 ms
predict= none  confidence= 98.39194416999817 %  exe time= 1070.512056350708 ms
predict= back  confidence= 99.99748468399048 %  exe time= 1070.6267356872559 ms
predict= none  confidence= 95.81201076507568 %  exe time= 1071.1612701416016 ms
predict= has  confidence= 99.57807064056396 %  exe time= 1072.0620155334473 ms
total parallel exe time= 1072.5388526916504 ms

So clearly the issue is the overhead Python adds to set up the threading: around 60 ms of extra execution time for setup.
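One way to confirm that is to pay the setup cost before the timer starts: create the pool and run an untimed warm-up map first. A sketch reusing size and runjob from above; note that the per-job prints inside the workers still use the start value captured when the workers were forked, so only the parent-side total below is meaningful:

pool = multiprocessing.Pool(processes=size)
pool.map(runjob, range(0, size))   # untimed warm-up: forks and primes the workers

start = time.time()
pool.map(runjob, range(0, size))   # timed run against already-running workers
print("total parallel exe time=", (time.time() - start)*1000, "ms")

pool.close()
pool.join()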

a more python-oriented forum would be a better place to address that

for multiple threads, you will need a unique dnn instance; the file on disk does not matter

Why is a unique DNN instance needed? What is happening in OpenCV's code for it to care about that?

To clarify: the following code should already give OpenCV two separate instances, since the same DNN model is loaded into two different variables with different underlying addresses.

model1 = cv2.dnn.readNetFromONNX("model.onnx")
model2 = cv2.dnn.readNetFromONNX("model.onnx")

Trying out cv2.dnn.blobFromImages to do batching produces an error:

    prob = model.forward()
           ^^^^^^^^^^^^^^^
cv2.error: OpenCV(4.8.0) /io/opencv/modules/dnn/src/layers/fully_connected_layer.cpp:214: error: (-215:Assertion failed) srcMat.dims == 2 && srcMat.cols == weights.cols && dstMat.rows == srcMat.rows && dstMat.cols == weights.rows && srcMat.type() == weights.type() && weights.type() == dstMat.type() && srcMat.type() == CV_32F && (biasMat.empty() || (biasMat.type() == srcMat.type() && biasMat.isContinuous() && (int)biasMat.total() == dstMat.cols)) in function 'run'

Code as follows:


import cv2
import numpy as np
import time

class_names = ['back', 'none', 'has']

modelFile = "./models/model-0.onnx"

images = [
    cv2.imread("1693945449-1693945454785.jpg", cv2.IMREAD_COLOR),
    cv2.imread("1687735037-1687735041649.jpg", cv2.IMREAD_COLOR),
    cv2.imread("1693945449-1693945453982.jpg", cv2.IMREAD_COLOR),
    cv2.imread("1693102756-1693102758762.jpg", cv2.IMREAD_COLOR),
]

size = 4

# load DNN model
model = cv2.dnn.readNetFromONNX(modelFile)

if model.empty():
    print("error loading model")
    exit(1)

model.setPreferableBackend(cv2.dnn.DNN_BACKEND_DEFAULT)
model.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)


# batch process images
batchImg = []

for i in range(0,size):
    # downscale to 320x240 first; blobFromImages below resizes again to 224x224
    resizeImg = cv2.resize(src=images[i], dsize=(320, 240),
                       fx=0, fy=0, interpolation=cv2.INTER_AREA)
    batchImg.append(resizeImg)

mean = np.array([0.485, 0.456, 0.406]) * 255.0
std = [0.229, 0.224, 0.225]

blob = cv2.dnn.blobFromImages(images=batchImg, scalefactor=1.0 / 255.0, size=(224, 224),
                             mean=mean, swapRB=True, crop=False)

# blobFromImages has no std argument, so divide each image by the per-channel std manually
for i in range(0,size):
    blob[i] /= np.asarray(std, dtype=np.float32).reshape(3, 1, 1)

model.setInput(blob)
prob = model.forward()

you aren’t doing any multithreading. you are doing multiprocessing. that’s processes getting started and data getting copied.

use multithreading instead. everything you’ve been told about “python multithreading being slow” is filthy lies (oversimplifications told to newbies). it’s got limitations (the GIL lets only one thread run Python bytecode at a time), but those don’t apply here, because OpenCV releases the GIL while it runs native code such as forward().
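A minimal sketch of that thread-based variant, reusing size, models, images and runjob from the first script. multiprocessing.pool.ThreadPool offers the same map API as Pool but runs the jobs in threads of one process, so nothing is copied between processes; each thread still sticks to its own Net instance, since runjob(i) uses models[i]:

from multiprocessing.pool import ThreadPool

start = time.time()
with ThreadPool(processes=size) as tpool:
    tpool.map(runjob, range(0, size))   # threads share models/images in-process
print("total threaded exe time=", (time.time() - start)*1000, "ms")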

The error I am getting from cv2.dnn.blobFromImages appears to come from how the ONNX file was saved: the current model has a fixed input dimension of (1, 3, 224, 224).
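If the model was exported from PyTorch (an assumption, the thread does not say), one way to re-save it with a flexible batch dimension is torch.onnx.export with dynamic_axes; net and the file name below are hypothetical:

import torch

# give the first (batch) axis a dynamic size so the same .onnx accepts
# blobs of shape (N, 3, 224, 224) for any N
torch.onnx.export(
    net,                           # the trained torch.nn.Module (hypothetical)
    torch.randn(1, 3, 224, 224),   # dummy input fixing the non-batch dims
    "model-dynamic.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)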

With the ONNX re-exported to handle input batches of size 4, the batched forward pass runs, but it is still slower than running the images sequentially. In fact it is slower than the multiprocessing code.

predict= noprey  confidence= 62.222665548324585 %  exe time= 90.24906158447266 ms
predict= prey  confidence= 93.16756129264832 %  exe time= 90.33393859863281 ms
predict= noprey  confidence= 59.74287986755371 %  exe time= 90.37899971008301 ms
predict= background  confidence= 99.87664222717285 %  exe time= 90.41929244995117 ms
2nd run total exec time= 90.4381275177002 ms
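For reference, per-image predictions can be read out of the batched result like this. A sketch, assuming prob from model.forward() has shape (size, num_classes), and that softmax and a class_names list matching this model's labels are defined as in the first script:

for i in range(0, size):
    p = softmax(prob[i])            # per-image probabilities
    class_id = int(np.argmax(p))
    print("predict=", class_names[class_id],
          " confidence=", p[class_id]*100, "%")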