Use Dnn.readNetFromModelOptimizer to detect objects

I know how to use Dnn.readNetFromDarknet() to detect objects and find their bounding boxes.
Now I want to use Dnn.readNetFromModelOptimizer() to do the same thing, so
I downloaded yolo-v3-tiny-tf from the Intel Open Model Zoo and converted it to OpenVINO IR files.

I read the yolo-v3-tiny-tf page to understand the definition of the output elements. According to its description, the converted output is as follows:

Converted model

  1. The array of detection summary info, name - conv2d_9/BiasAdd/YoloRegion, shape - 1,255,13,13. The anchor values are 81,82, 135,169, 344,319.
  2. The array of detection summary info, name - conv2d_12/BiasAdd/YoloRegion, shape - 1,255,26,26. The anchor values are 23,27, 37,58, 81,82.

For each case the format is B,N*85,Cx,Cy, where

  • B - batch size
  • N - number of detection boxes per cell
  • Cx, Cy - cell index

A detection box has the format [x, y, h, w, box_score, class_no_1, …, class_no_80], where:

  • (x, y) - coordinates of the box center relative to the cell
  • h, w - raw height and width of the box; apply the exponential function and multiply by the corresponding anchors to get the absolute height and width
  • box_score - confidence of the detection box, in the [0,1] range
  • class_no_1, …, class_no_80 - probability distribution over the classes, in the [0,1] range; multiply by the confidence value to get the confidence of each class
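
To make the layout above concrete, here is a minimal sketch of how one box could be addressed inside such a blob. It is plain Java with no OpenCV dependency. The class name YoloRegionLayout and its helpers are hypothetical, and the sketch assumes the blob is one contiguous float array in row-major order over the four dimensions [1, N*85, gridH, gridW], with the last two dimensions treated as grid rows and columns. It only shows the exp-and-anchor step quoted in the description; whether x, y and the scores still need any further activation depends on the converted model and is not covered here.

```java
/** Sketch: addressing a YoloRegion blob of shape [1, N*85, gridH, gridW] held as a flat float array. */
final class YoloRegionLayout {
    static final int ATTRS = 85; // x, y, h, w, box_score + 80 class scores

    /** Flat offset of attribute attr of box n in grid cell (row, col), batch 0 (assumed layout). */
    static int offset(int n, int attr, int row, int col, int gridH, int gridW) {
        int channel = n * ATTRS + attr;               // position along the N*85 axis
        return (channel * gridH + row) * gridW + col; // row-major over [channel][row][col]
    }

    /** Gathers the 85 values of one box into a contiguous record. */
    static float[] record(float[] blob, int n, int row, int col, int gridH, int gridW) {
        float[] rec = new float[ATTRS];
        for (int a = 0; a < ATTRS; a++) {
            rec[a] = blob[offset(n, a, row, col, gridH, gridW)];
        }
        return rec;
    }

    /** Absolute size from a raw h or w value: exp(raw) * anchor, as the format description states. */
    static float absoluteSize(float raw, float anchor) {
        return (float) Math.exp(raw) * anchor;
    }
}
```

Under this assumption, the 85 values of a single box are not adjacent in memory: consecutive attributes of the same box are gridH * gridW floats apart (169 for the 13x13 output).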

My code is below.
Helper function:

public static String getShape(Mat mat) {
     StringBuilder sb = new StringBuilder("[");
     for(int x = 0; x < mat.dims(); x++) {
         sb.append(mat.size(x)).append(",");
     }
     sb.deleteCharAt(sb.length()-1);
     sb.append("]");
     return sb.toString();
}

Main portion:

    Net net = Dnn.readNetFromModelOptimizer(irXmlFile, irBinFile);
    net.setPreferableBackend(Dnn.DNN_BACKEND_INFERENCE_ENGINE);
    net.setPreferableTarget(Dnn.DNN_TARGET_CPU);
    Mat image = Imgcodecs.imread(imageFile);
    final Scalar scalar =  new Scalar(0);
    sz = new Size(416, 416); 
    final float scale =  1;
    boolean swapRB = true;
    Mat inputBlob = Dnn.blobFromImage(image, scale, sz, scalar, swapRB, false);
    net.setInput(inputBlob);
    outBlobNames = getOutputNames(net);
    log.trace("outBlobNames:{}", outBlobNames);
    List<Mat> result = new ArrayList<>();
    net.forward(result, outBlobNames);
    log.trace("result size:{}", result.size());
    int recordSize = 80; // number of classes; each box record holds recordSize + 5 values
    for(int x = result.size()-1; x >=0 ; x--) {
        Mat level = result.get(x);

        int recNum = level.size(1) / (recordSize+5);
        int targetRows = (int)(level.total()/ (recordSize+5));
        log.trace("{}. layer:{}, level.rows():{}, total:{}, shape:{}, size:{}", x, outBlobNames.get(x), level.height(), level.total(), getShape(level), level.size());
        log.trace("    channels:{}, depth:{}, type:{}, step1:{}, elmSize1:{}, recNum:{}", level.channels(), level.depth(), level.type(), level.step1(), level.elemSize1(), recNum);
        Mat reshape = level.reshape(1,targetRows );
        log.trace("    reshape:{}, size:{}", getShape(reshape), reshape.size());
        for (int j = 0; j < reshape.rows(); ++j) {
            Mat row = reshape.row(j); // size: (1*85)
            float[] data = new float[recordSize+5];
            row.get(0, 0, data);
            float[] data2 = new float[recordSize];
            float boxScore = data[4];
            Mat scores = row.colRange(5, reshape.cols()); 
            scores.get(0,0, data2);
            Core.MinMaxLocResult mm = Core.minMaxLoc(scores);
            Point classIdPoint = mm.maxLoc;
            float  confidence = (float) mm.maxVal* boxScore;
            if (confidence < 0.6) continue;
            float xx = (float) data[0], yy = (float) data[1];
            log.trace("      {}: boxScore:{}, (x,y)={}x{}, classId:{}, confX:{}, row size:{}",
                    j,boxScore,xx, yy, classIdPoint.x, confidence, row.size());
        }
        if (x==1) break; // skip test zero
    }
}

The output is below:

20:08:45.737 INFO    OpenVinoTest: irXmlFile:E:\var\intel\yolo-v3-tiny-tf\FP32\yolo-v3-tiny-tf.xml
20:08:45.739 INFO    OpenVinoTest: irBinFile:E:\var\intel\yolo-v3-tiny-tf\FP32\yolo-v3-tiny-tf.bin
20:08:45.739 INFO    OpenVinoTest: imageFile:E:\yolo\dataset_800x480\images\car\vid01_010663.jpg
20:08:46.073 TRACE   OpenVinoTest: outBlobNames:[conv2d_12/Conv2D/YoloRegion, conv2d_9/Conv2D/YoloRegion]
20:08:46.183 TRACE   OpenVinoTest: result size:2
20:08:46.183 TRACE   OpenVinoTest: 1. layer:conv2d_9/Conv2D/YoloRegion, level.rows():-1, total:43095, shape:[1,255,13,13], size:255x1
20:08:46.187 TRACE   OpenVinoTest:     channels:1, depth:5, type:5, step1:43095, elmSize1:4, recNum:3
20:08:46.187 TRACE   OpenVinoTest:     reshape:[507,85], size:85x507
20:08:46.188 TRACE   OpenVinoTest:       0: boxScore:0.43266684, (x,y)=0.6555209x0.44876736, classId:21.0, confX:0.35278797, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       1: boxScore:0.55940187, (x,y)=0.48518425x0.619738, classId:3.0, confX:0.5095297, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       2: boxScore:0.35651144, (x,y)=0.5082219x0.39975503, classId:78.0, confX:0.31428596, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       3: boxScore:0.54387677, (x,y)=0.692047x0.61677384, classId:54.0, confX:0.4847116, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       4: boxScore:0.67963123, (x,y)=0.3839952x0.54561156, classId:0.0, confX:0.46246928, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       5: boxScore:-0.48908097, (x,y)=0.08473439x0.14021659, classId:71.0, confX:-0.60112673, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       6: boxScore:-1.2083349, (x,y)=-1.1671227x-1.2259784, classId:57.0, confX:-0.79102516, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       7: boxScore:0.10409927, (x,y)=-0.37625268x0.017036445, classId:49.0, confX:0.058245104, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       8: boxScore:0.00115722, (x,y)=0.0015306767x0.0012077022, classId:79.0, confX:1.409815E-4, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       9: boxScore:0.99902374, (x,y)=0.31599534x0.16228594, classId:0.0, confX:0.9940376, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       10: boxScore:0.03768371, (x,y)=0.08264625x0.0804672, classId:54.0, confX:0.018640134, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       11: boxScore:4.1239278E-4, (x,y)=0.14673999x0.013528102, classId:20.0, confX:6.939788E-5, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       12: boxScore:0.020088913, (x,y)=0.022435984x0.024656877, classId:51.0, confX:0.0014823506, row size:85x1
20:08:46.189 TRACE   OpenVinoTest:       13: boxScore:0.0013647558, (x,y)=0.010140199x0.004707809, classId:37.0, confX:2.643366E-4, row size:85x1
20:08:46.190 TRACE   OpenVinoTest:       14: boxScore:0.012278879, (x,y)=0.007483946x0.006147006, classId:72.0, confX:0.0076645594, row size:85x1
20:08:46.190 TRACE   OpenVinoTest:       15: boxScore:0.8039476, (x,y)=0.5736518x0.750716, classId:10.0, confX:0.73497933, row size:85x1
20:08:46.190 TRACE   OpenVinoTest:       16: boxScore:0.001434851, (x,y)=0.0050395555x0.0048673833, classId:76.0, confX:6.316115E-5, row size:85x1
20:08:46.190 TRACE   OpenVinoTest:       17: boxScore:8.649106E-5, (x,y)=0.0012307029x6.8674155E-4, classId:22.0, confX:7.879857E-6, row size:85x1
20:08:46.190 TRACE   OpenVinoTest:       18: boxScore:0.010655698, (x,y)=0.036913924x0.024807066, classId:66.0, confX:9.798995E-4, row size:85x1
20:08:46.190 TRACE   OpenVinoTest:       19: boxScore:3.6505933E-4, (x,y)=1.6786439E-4x3.7726483E-4, classId:53.0, confX:3.9082723E-5, row size:85x1
20:08:46.190 TRACE   OpenVinoTest:       20: boxScore:0.028334405, (x,y)=0.028142175x0.052258987, classId:66.0, confX:0.004819776, row size:85x1
20:08:46.190 TRACE   OpenVinoTest:       21: boxScore:0.037701566, (x,y)=0.017145908x0.022296172, classId:13.0, confX:0.0104359565, row size:85x1
20:08:46.190 TRACE   OpenVinoTest:       22: boxScore:0.0013395402, (x,y)=0.012514944x0.012410511, classId:64.0, confX:2.2230613E-4, row size:85x1
....
....
20:08:46.213 TRACE   OpenVinoTest:       504: boxScore:3.4390436E-4, (x,y)=3.149585E-4x3.4604076E-4, classId:67.0, confX:6.082574E-7, row size:85x1
20:08:46.213 TRACE   OpenVinoTest:       505: boxScore:0.0014405706, (x,y)=0.0012526602x0.0043147514, classId:60.0, confX:1.167454E-5, row size:85x1
20:08:46.213 TRACE   OpenVinoTest:       506: boxScore:2.1452908E-4, (x,y)=2.8849006E-4x4.766963E-4, classId:64.0, confX:9.3569435E-7, row size:85x1

Below is the correct output from Dnn.readNetFromDarknet, using the corresponding yolov3-tiny.weights and its configuration file:

20:22:50.352 INFO     OpenVinoTest: TEST testYoloInDnnDarknet
20:22:50.352 INFO     OpenVinoTest:     cfgFile:E:\yolo\yolo-coco\yolov3-tiny.cfg
20:22:50.352 INFO     OpenVinoTest: weightsFile:E:\yolo\yolo-coco\yolov3-tiny.weights
20:22:50.352 INFO     OpenVinoTest:   imageFile:E:\yolo\dataset_800x480\images\car\vid01_010663.jpg
20:22:50.579 TRACE    OpenVinoTest: result count: 2
20:22:50.579 INFO     OpenVinoTest:     0. layer:yolo_16, level.rows():507, total:43095, shape:[507,85], size:85x507
20:22:50.579 TRACE    OpenVinoTest:       0:  classId:0.0  row size:85x1, [0.05095583, 0.056363925, 0.16034615, 0.15128478, 1.27485455E-5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
20:22:50.579 TRACE    OpenVinoTest:       1:  classId:0.0  row size:85x1, [0.044864077, 0.03675152, 0.16261256, 0.42193738, 2.0352368E-6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
20:22:50.579 TRACE    OpenVinoTest:       2:  classId:0.0  row size:85x1, [0.033654656, 0.03507777, 0.8410822, 0.6181883, 8.072384E-9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

I compared Dnn.readNetFromModelOptimizer and Dnn.readNetFromDarknet using the same weights file and configuration file.

As you can see, the outputs of Dnn.readNetFromModelOptimizer (after my reshape) and Dnn.readNetFromDarknet have the same shape (85x507), but their contents are different.
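
One thing that may explain part of the difference: Mat.reshape only reinterprets the existing buffer and does not reorder it, so a [1,255,13,13] blob reshaped to [507,85] yields rows of 85 consecutive floats rather than one box per row. The sketch below shows one way the values could be regrouped into one row per box. It is plain Java on a flat float array; the class name NchwToRows is hypothetical, the input is assumed to be contiguous in [N*85][gridH][gridW] order, and the output row order (cell by cell, then box within the cell) is my assumption about how the Darknet-style output is arranged. Even after regrouping, the values may not match the Darknet output exactly, since the format description above gives x and y relative to the cell and h and w as raw values.

```java
/** Sketch: regroup a flat [N*attrs][gridH][gridW] blob into one row of attrs values per box. */
final class NchwToRows {
    static float[][] toRows(float[] blob, int boxesPerCell, int attrs, int gridH, int gridW) {
        float[][] rows = new float[gridH * gridW * boxesPerCell][attrs];
        for (int n = 0; n < boxesPerCell; n++) {
            for (int a = 0; a < attrs; a++) {
                for (int y = 0; y < gridH; y++) {
                    for (int x = 0; x < gridW; x++) {
                        // Source: channel-major layout, as assumed for the IR output.
                        int src = ((n * attrs + a) * gridH + y) * gridW + x;
                        // Destination row: cell index first, then box within the cell (assumed order).
                        int dstRow = (y * gridW + x) * boxesPerCell + n;
                        rows[dstRow][a] = blob[src];
                    }
                }
            }
        }
        return rows;
    }
}
```

For the 13x13 output this would be called as NchwToRows.toRows(blob, 3, 85, 13, 13), giving 507 rows of 85 values in which index 4 of each row is that box's box_score.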

My questions are:

  1. The yolo-v3-tiny-tf page says N is the number of detection boxes per cell. By manual computation (255 / 85), N is 3 in this case. However, from the output of Mat level, I cannot tell which rows correspond to these detection boxes.
  2. I suspect something is wrong with the way I read the output of Mat level, but I don't know where the problem is or how to fix it.
    Could anyone please give some suggestions? Many thanks.