I'm running OpenCV's DNN face detector with the res10_300x300_ssd_iter_140000_fp16.caffemodel model:

    import cv2

    # Paths to the model files (adjust to wherever you keep them).
    prototxt_path = "deploy.prototxt"
    model_path = "res10_300x300_ssd_iter_140000_fp16.caffemodel"
    net = cv2.dnn.readNetFromCaffe(prototxt_path, model_path)

    cap = cv2.VideoCapture(0)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame_h, frame_w = frame.shape[:2]
        bboxes = []

        # Resize to the network's 300x300 input and subtract the model's BGR channel means.
        blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                                     (300, 300), (104.0, 177.0, 123.0))
        net.setInput(blob)
        detections = net.forward()
        # detections.shape is (1, 1, 200, 7),
        # so detections.shape[2] is 200 and the loop walks the candidate detections.
        for i in range(detections.shape[2]):
            confidence = detections[0, 0, i, 2]
            if confidence > 0.5:
                # Coordinates are normalized to [0, 1]; scale back to pixel values.
                x1 = int(detections[0, 0, i, 3] * frame_w)
                y1 = int(detections[0, 0, i, 4] * frame_h)
                x2 = int(detections[0, 0, i, 5] * frame_w)
                y2 = int(detections[0, 0, i, 6] * frame_h)
                bboxes.append([x1, y1, x2, y2])
                bb_line_thickness = max(1, int(round(frame_h / 200)))
                # Draw a bounding box around each detected face.
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0),
                              bb_line_thickness, cv2.LINE_8)
I used ChatGPT to figure this out, but is there official documentation?
detections[a, b, c, d]

is a 4D tensor, indexed as follows:

- a selects one image in the batch (the bunch of input images).
  - Usually 0, because you feed in one image and the neural network returns results for that one image.
- b is the index of the output channel.
  - For DNNs in OpenCV this is usually 0.
- c is the index of the detected face.
  - Ranges over however many candidate detections the model returns.
- d is an index from 0 to 6 into the detection row:
  - 0 - image ID within the batch (usually 0 here)
  - 1 - class label ID
  - 2 - confidence score
  - 3 - normalized top-left x of the detected face (number between 0 and 1)
  - 4 - normalized top-left y of the detected face (number between 0 and 1)
  - 5 - normalized bottom-right x of the detected face (number between 0 and 1)
  - 6 - normalized bottom-right y of the detected face (number between 0 and 1)

detections[0, 0, i, 2] is the confidence score and has type <class 'numpy.float32'>.
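To see that layout concretely without a camera or model, here is a minimal sketch that fakes a detections tensor with the same (1, 1, N, 7) shape and indexes it exactly as the loop above does. The frame size and box coordinates are made-up values, not anything the model produced:

```python
import numpy as np

# Fake detections tensor mimicking the DetectionOutput layout: (1, 1, N, 7).
# Each row: [image_id, class_id, confidence, x1, y1, x2, y2], coords in [0, 1].
detections = np.zeros((1, 1, 200, 7), dtype=np.float32)
detections[0, 0, 0] = [0, 1, 0.98, 0.25, 0.25, 0.75, 0.5]  # one fake face

frame_w, frame_h = 640, 480  # example frame size (assumption)
for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > 0.5:
        # Scale normalized coordinates back to pixels.
        x1 = int(detections[0, 0, i, 3] * frame_w)
        y1 = int(detections[0, 0, i, 4] * frame_h)
        x2 = int(detections[0, 0, i, 5] * frame_w)
        y2 = int(detections[0, 0, i, 6] * frame_h)
        print(i, confidence, (x1, y1, x2, y2))
        # prints: 0 0.98 (160, 120, 480, 240)
```

Only the one row with confidence above 0.5 is printed; the remaining 199 zero rows are skipped, which is why looping over all 200 candidates is safe.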