OpenCV DNN semantic segmentation doesn't produce the expected result with a DeepLabV3 ONNX model

I am trying to use a deeplabv3.onnx model, exported from PyTorch, with OpenCV DNN. Even though I don't get any compile-time or runtime errors in OpenCV with this model, the implementation doesn't produce the expected segmentation result. I suspect the output blob from the network is not decoded properly, which causes the broken segmentation. I am basically using the OpenCV DNN segmentation.cpp sample, modified slightly to preprocess the input image before passing it to the network. It would be great if you could review the code and advise me. Thanks in advance for your valuable time.

Segmentation.cpp code:

#include <fstream>
#include <sstream>

#include <opencv2/dnn.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>

std::string keys =
    "{ help  h         |                     | Print help message. }"
    "{ model           | deeplabv3.onnx      | Path to the ONNX model file. }"
    "{ config          | <none>              | Path to a model config file. }"
    "{ input i         | opencv-samples/data/vtest.avi | Path to an input image or video file. Skip this argument to capture frames from a camera. }"
    "{ device          | 0                   | Camera device number. }"
    "{ initial_width   | 256                 | Preprocess the input image by initially resizing it to a specific width. }"
    "{ initial_height  | 256                 | Preprocess the input image by initially resizing it to a specific height. }"
    "{ width           | 224                 | Network input width. }"
    "{ height          | 224                 | Network input height. }"
    "{ scale           | 1.0                 | Scale factor the input image is multiplied by. }"
    "{ rgb             | true                | Indicate that the model works with RGB input images instead of BGR ones. }"
    "{ mean            | 0.485 0.456 0.406   | Mean values subtracted from the input image channels, delimited by spaces. }"
    "{ std             | 0.229 0.224 0.225   | Preprocess the input image by dividing it by the standard deviation. }"
    "{ framework f     |                     | Optional name of the model's origin framework. Detected automatically if not set. }"
    "{ classes         |                     | Optional path to a text file with names of classes. }"
    "{ colors          |                     | Optional path to a text file with a color for every class. "
                                              "Every color is represented by three values from 0 to 255 in BGR channel order. }"
    "{ backend         | 0                   | Choose one of the computation backends: "
                                              "0: automatically (by default), "
                                              "1: Halide language (http://halide-lang.org/), "
                                              "2: Intel's Deep Learning Inference Engine (https://software.intel.com/openvino-toolkit), "
                                              "3: OpenCV implementation }"
    "{ target          | 1                   | Choose one of the target computation devices: "
                                              "0: CPU target (by default), "
                                              "1: OpenCL, "
                                              "2: OpenCL fp16 (half-float precision), "
                                              "3: VPU }";

using namespace cv;
using namespace dnn;

std::vector<std::string> classes;
std::vector<Vec3b> colors;

void showLegend();

void colorizeSegmentation(const Mat &score, Mat &segm);

int main(int argc, char** argv)
{
    CommandLineParser parser(argc, argv, keys);
    parser.about("Semantic segmentation deep learning networks using OpenCV.");

    int rszWidth = parser.get<int>("initial_width");
    int rszHeight = parser.get<int>("initial_height");
    float scale = parser.get<float>("scale");
    Scalar mean = parser.get<Scalar>("mean");
    Scalar std = parser.get<Scalar>("std");
    bool swapRB = parser.get<bool>("rgb");
    int inpWidth = parser.get<int>("width");
    int inpHeight = parser.get<int>("height");
    String model = parser.get<String>("model");
    String config = parser.get<String>("config");
    String framework = parser.get<String>("framework");
    int backendId = parser.get<int>("backend");
    int targetId = parser.get<int>("target");

    if (parser.has("help"))
    {
        parser.printMessage();
        return 0;
    }

    // Open file with classes names.
    if (parser.has("classes"))
    {
        std::string file = parser.get<String>("classes");
        std::ifstream ifs(file.c_str());
        if (!ifs.is_open())
            CV_Error(Error::StsError, "File " + file + " not found");
        std::string line;
        while (std::getline(ifs, line))
        {
            classes.push_back(line);
        }
    }

    // Open file with colors.
    if (parser.has("colors"))
    {
        std::string file = parser.get<String>("colors");
        std::ifstream ifs(file.c_str());
        if (!ifs.is_open())
            CV_Error(Error::StsError, "File " + file + " not found");
        std::string line;
        while (std::getline(ifs, line))
        {
            std::istringstream colorStr(line.c_str());

            Vec3b color;
            for (int i = 0; i < 3 && !colorStr.eof(); ++i)
                colorStr >> color[i];
            colors.push_back(color);
        }
    }

    if (!parser.check())
    {
        parser.printErrors();
        return 1;
    }

    CV_Assert(!model.empty());
    //! [Read and initialize network]
    Net net = readNet(model, config, framework);
    net.setPreferableBackend(backendId);
    net.setPreferableTarget(targetId);
    //! [Read and initialize network]

    // Create a window
    static const std::string kWinName = "Deep learning semantic segmentation in OpenCV";
    namedWindow(kWinName, WINDOW_NORMAL);

    //! [Open a video file or an image file or a camera stream]
    VideoCapture cap;
    if (parser.has("input"))
        cap.open(parser.get<String>("input"));
    else
        cap.open(parser.get<int>("device"));
    //! [Open a video file or an image file or a camera stream]

    // Process frames.
    Mat frame, blob;
    while (waitKey(1) < 0)
    {
        cap >> frame;
        if (frame.empty())
        {
            waitKey();
            break;
        }

        if (rszWidth != 0 && rszHeight != 0)
        {
            resize(frame, frame, Size(rszWidth, rszHeight), 0, 0, INTER_NEAREST);
        }

        //! [Create a 4D blob from a frame]
        blobFromImage(frame, blob, scale, Size(inpWidth, inpHeight), mean, swapRB, false);
        //! [Create a 4D blob from a frame]

        // Check std values.
        if (std.val[0] != 0.0 && std.val[1] != 0.0 && std.val[2] != 0.0)
        {
            // Divide blob by std.
            divide(blob, std, blob);
        }

        //! [Set input blob]
        net.setInput(blob);
        //! [Set input blob]
        //! [Make forward pass]
        Mat score = net.forward();
        //! [Make forward pass]

        Mat segm;
        colorizeSegmentation(score, segm);

        resize(segm, segm, frame.size(), 0, 0, INTER_NEAREST);
        addWeighted(frame, 0.1, segm, 0.9, 0.0, frame);

        // Put efficiency information.
        std::vector<double> layersTimes;
        double freq = getTickFrequency() / 1000;
        double t = net.getPerfProfile(layersTimes) / freq;
        std::string label = format("Inference time: %.2f ms", t);
        putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));

        imshow(kWinName, frame);
        if (!classes.empty())
            showLegend();
    }
    return 0;
}

void colorizeSegmentation(const Mat &score, Mat &segm)
{
    const int rows = score.size[2];
    const int cols = score.size[3];
    const int chns = score.size[1];

    if (colors.empty())
    {
        // Generate colors.
        colors.push_back(Vec3b());
        for (int i = 1; i < chns; ++i)
        {
            Vec3b color;
            for (int j = 0; j < 3; ++j)
                color[j] = (colors[i - 1][j] + rand() % 256) / 2;
            colors.push_back(color);
        }
    }
    else if (chns != (int)colors.size())
    {
        CV_Error(Error::StsError, format("Number of output classes does not match "
                                         "number of colors (%d != %zu)", chns, colors.size()));
    }

    Mat maxCl = Mat::zeros(rows, cols, CV_8UC1);
    Mat maxVal(rows, cols, CV_32FC1, score.data);
    for (int ch = 1; ch < chns; ch++)
    {
        for (int row = 0; row < rows; row++)
        {
            const float *ptrScore = score.ptr<float>(0, ch, row);
            uint8_t *ptrMaxCl = maxCl.ptr<uint8_t>(row);
            float *ptrMaxVal = maxVal.ptr<float>(row);
            for (int col = 0; col < cols; col++)
            {
                if (ptrScore[col] > ptrMaxVal[col])
                {
                    ptrMaxVal[col] = ptrScore[col];
                    ptrMaxCl[col] = (uchar)ch;
                }
            }
        }
    }

    segm.create(rows, cols, CV_8UC3);
    for (int row = 0; row < rows; row++)
    {
        const uchar *ptrMaxCl = maxCl.ptr<uchar>(row);
        Vec3b *ptrSegm = segm.ptr<Vec3b>(row);
        for (int col = 0; col < cols; col++)
        {
            ptrSegm[col] = colors[ptrMaxCl[col]];
        }
    }
}

void showLegend()
{
    static const int kBlockHeight = 30;
    static Mat legend;
    if (legend.empty())
    {
        const int numClasses = (int)classes.size();
        if ((int)colors.size() != numClasses)
        {
            CV_Error(Error::StsError, format("Number of output classes does not match "
                                             "number of labels (%zu != %zu)", colors.size(), classes.size()));
        }
        legend.create(kBlockHeight * numClasses, 200, CV_8UC3);
        for (int i = 0; i < numClasses; i++)
        {
            Mat block = legend.rowRange(i * kBlockHeight, (i + 1) * kBlockHeight);
            block.setTo(colors[i]);
            putText(block, classes[i], Point(0, kBlockHeight / 2), FONT_HERSHEY_SIMPLEX, 0.5, Vec3b(255, 255, 255));
        }
        namedWindow("Legend", WINDOW_NORMAL);
        imshow("Legend", legend);
    }
}

Python code to convert the pretrained torchvision model to ONNX:

import os
import torch
import torch.onnx
from torchvision import models


def get_pytorch_onnx_model(original_model):
    # directory where the converted model will be saved
    onnx_model_path = "models"
    # file name of the converted model
    onnx_model_name = "deeplabv3_resnet101.onnx"

    # create the directory for the converted model
    os.makedirs(onnx_model_path, exist_ok=True)

    # get the full path to the converted model
    full_model_path = os.path.join(onnx_model_path, onnx_model_name)

    # generate a dummy model input with the expected shape (N, C, H, W)
    generated_input = torch.randn(1, 3, 224, 224)

    # export the model to ONNX format
    torch.onnx.export(
        original_model,
        generated_input,
        full_model_path,
        verbose=True,
        input_names=["input"],
        output_names=["output"],
        opset_version=11
    )

    return full_model_path


def main():
    # initialize the pretrained PyTorch DeepLabV3 ResNet-101 model
    original_model = models.segmentation.deeplabv3_resnet101(pretrained=True)

    # get the path to the converted into ONNX PyTorch model
    full_model_path = get_pytorch_onnx_model(original_model)
    print("PyTorch ResNet-100 model was successfully converted: ", full_model_path)


if __name__ == "__main__":
    main()
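
For reference, torchvision's deeplabv3_resnet101 predicts 21 Pascal VOC classes, so the first output of this export should be a 4-D blob of shape 1 x 21 x 224 x 224. Below is a minimal sanity check (a sketch, assuming the model path produced by the script above) that prints what OpenCV actually receives, so the decoding in colorizeSegmentation() can be verified against the real shape:

// sanity_check.cpp -- a minimal sketch. Loads the exported ONNX file, feeds a
// dummy frame, and prints the dimensions of the output blob.
#include <opencv2/dnn.hpp>
#include <iostream>

int main()
{
    cv::dnn::Net net = cv::dnn::readNet("models/deeplabv3_resnet101.onnx");
    cv::Mat frame(224, 224, CV_8UC3, cv::Scalar::all(0));  // dummy input image
    cv::Mat blob = cv::dnn::blobFromImage(frame, 1.0, cv::Size(224, 224));
    net.setInput(blob);
    cv::Mat score = net.forward();
    std::cout << "dims: " << score.dims << " shape:";      // expect 4 dims
    for (int i = 0; i < score.dims; ++i)
        std::cout << " " << score.size[i];                 // expect 1 21 224 224
    std::cout << std::endl;
    return 0;
}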

Can you explain a bit more?

Is there any PyTorch code for this model to compare? There is.

OpenCV version? Example image / output?

By the way, OpenCV's mean values are in [0…255], not in [0…1]!
(They are subtracted before the scaling.)
This means you have to use

        blobFromImage(frame, blob, scale, Size(inpWidth, inpHeight), mean * 255, swapRB, false);

In the same way, you have to multiply your std values by 255 before the division.
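
Putting both corrections together, the preprocessing part of the loop would look roughly like this (a sketch against the code above; mean and std still hold the [0…1] torchvision values parsed from the command line):

        //! [Create a 4D blob from a frame]
        // mean was given in [0..1] (torchvision convention), but blobFromImage
        // subtracts it from 8-bit pixel values before scaling, so bring it
        // into the [0..255] range first.
        blobFromImage(frame, blob, scale, Size(inpWidth, inpHeight), mean * 255, swapRB, false);
        //! [Create a 4D blob from a frame]

        // Check std values.
        if (std.val[0] != 0.0 && std.val[1] != 0.0 && std.val[2] != 0.0)
        {
            // Likewise divide by std expressed in [0..255]:
            // (pixel - mean*255) / (std*255) == (pixel/255 - mean) / std
            divide(blob, std * 255, blob);
        }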