I am trying to speed up image classification by running multiple copies of the same DNN model in separate processes using Python's multiprocessing library.
However, when I process the images serially, the total execution time is half that of running them in parallel. Processing a single image with my model takes ~10 ms, so running 4 images sequentially gives the expected total of ~40 ms. Running them in parallel, I would expect all 4 images to complete in ~10 ms, but instead it takes ~80 ms, twice as long as the serial run.
I am using the following code to benchmark this:
import cv2
import numpy as np
import time
import multiprocessing
print("cpu core count=", multiprocessing.cpu_count())
print("cv num threads=", cv2.getNumThreads())
#print(cv2.getBuildInformation())
class_names = ['back', 'none', 'has']
warmupFile = "1691030495-1691030503213.jpg"
images = [
    cv2.imread("1693945449-1693945454785.jpg", cv2.IMREAD_COLOR),
    cv2.imread("1687735037-1687735041649.jpg", cv2.IMREAD_COLOR),
    cv2.imread("1693945449-1693945453982.jpg", cv2.IMREAD_COLOR),
    cv2.imread("1693102756-1693102758762.jpg", cv2.IMREAD_COLOR),
]
size = 4
# load DNN model
def newModel(onnxFile, img):
    model = cv2.dnn.readNetFromONNX(onnxFile)
    if model.empty():
        print("error loading model")
        exit(1)
    model.setPreferableBackend(cv2.dnn.DNN_BACKEND_DEFAULT)
    model.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
    # warm up the model's first run
    warmup = cv2.imread(img, cv2.IMREAD_COLOR)
    processimg(model, warmup)
    return model
def processimg(model, img):
    # resize to 320x240
    resizeImg = cv2.resize(src=img, dsize=(320, 240),
                           fx=0, fy=0, interpolation=cv2.INTER_AREA)
    mean = np.array([0.485, 0.456, 0.406]) * 255.0
    std = [0.229, 0.224, 0.225]
    blob = cv2.dnn.blobFromImage(image=resizeImg, scalefactor=1.0 / 255.0, size=(224, 224),
                                 mean=mean, swapRB=True, crop=False)
    # divide by std
    blob[0] /= np.asarray(std, dtype=np.float32).reshape(3, 1, 1)
    model.setInput(blob)
    prob = model.forward()
    out = softmax(np.squeeze(prob))
    return prob, out
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
def runjob(i):
    prob, out = processimg(models[i], images[i])
    class_id = np.argmax(prob)
    print("predict=", class_names[class_id], " confidence=", out[class_id]*100, "%", " exe time=",
          (time.time() - start)*1000, "ms")
# initialize models
models = []
for i in range(0, size):
    nextModel = "./models/model-%d.onnx" % i
    models.append(newModel(nextModel, warmupFile))
# start processing images
# serial execution
start = time.time()
for i in range(0, size):
    runjob(i)
print("total serial exe time=", (time.time() - start)*1000,"ms")
# parallel execution
start = time.time()
pool = multiprocessing.Pool(processes=size)
pool.map(runjob, range(0,size))
print("total parallel exe time=", (time.time() - start)*1000,"ms")
Running this on my workstation (20 cores) gives the following output:
cpu core count= 20
cv num threads= 20
predict= none confidence= 98.39194416999817 % exe time= 9.763717651367188 ms
predict= has confidence= 99.57807064056396 % exe time= 19.548892974853516 ms
predict= none confidence= 95.81201076507568 % exe time= 28.734445571899414 ms
predict= back confidence= 99.99748468399048 % exe time= 41.81838035583496 ms
total serial exe time= 41.91398620605469 ms
predict= has confidence= 99.57807064056396 % exe time= 79.77533340454102 ms
predict= back confidence= 99.99748468399048 % exe time= 79.78200912475586 ms
predict= none confidence= 98.39194416999817 % exe time= 80.55520057678223 ms
predict= none confidence= 95.81201076507568 % exe time= 83.93979072570801 ms
total parallel exe time= 84.3496322631836 ms
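One thing I notice is that my parallel timer starts before the Pool is even created, so the 84 ms includes worker process startup, not just inference. A minimal sketch of how I could separate the two costs (dummy_job is a stand-in of my own that sleeps ~10 ms like one inference):

```python
import multiprocessing
import time

def dummy_job(i):
    # stand-in for runjob: sleep ~10 ms, roughly one inference
    time.sleep(0.01)
    return i

if __name__ == "__main__":
    t0 = time.perf_counter()
    pool = multiprocessing.Pool(processes=4)
    t1 = time.perf_counter()
    results = pool.map(dummy_job, range(4))
    t2 = time.perf_counter()
    pool.close()
    pool.join()
    print("pool startup:", (t1 - t0) * 1000, "ms")
    print("map:", (t2 - t1) * 1000, "ms")
```

If the startup portion dominates, the model inference itself may be parallelizing fine.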
I preload the models and images, and warm up each model by running an image through it, since the first classification is slow.
As the parallel result is not what I expected, can anyone see what's wrong?
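One variable I have not isolated yet: the output shows OpenCV already uses 20 internal threads (cv num threads= 20), so 4 worker processes each spinning up their own OpenCV thread pool could oversubscribe my 20 cores. A sketch (the initializer name is mine) of capping OpenCV to one internal thread per worker via a Pool initializer — the cv2 import is guarded only so the sketch runs standalone:

```python
import multiprocessing

def init_worker():
    # cap OpenCV's internal thread pool inside each worker so the
    # processes don't compete for cores; cv2.setNumThreads is the
    # standard OpenCV call for this
    try:
        import cv2
        cv2.setNumThreads(1)
    except ImportError:
        pass

def job(i):
    # placeholder for the real per-image work
    return i * i

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4, initializer=init_worker) as pool:
        print(pool.map(job, range(4)))
```

Would limiting per-worker threading like this be the right way to test for oversubscription, or is the slowdown coming from somewhere else?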