The OUTPUT of the ONNX model shows a strange image

I export a model that I trained in Python to ONNX and load it with ONNX Runtime in C++. But when I convert the output to an OpenCV Mat, the result looks quite different from what it should be.
The input of the model is 1,3,224,224. When I examine the ONNX model with Netron, its output is 1,512,28,28. OpenCV shows this output as a 28x28 pixel image, and the colors are also quite strange: dark and saturated.
I just asked a question on Stack Overflow about this. They advised me to transpose the image, but I couldn't find how to do this with OpenCV. The Python code that shows images during training is:

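from PIL import Image

def tensor_to_image(tensor):
    # scale [0, 1] floats to [0, 255] and drop the batch dimension: (1, C, H, W) -> (C, H, W)
    img = (255 * tensor).cpu().detach().squeeze(0).numpy()
    # reorder CHW -> HWC, the layout PIL expects
    img = img.clip(0, 255).transpose(1, 2, 0).astype("uint8")
    return Image.fromarray(img)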

But in C++ I couldn't do this with OpenCV.
All I want is for the output to have the shape 1,3,224,224. Should I do this during model training? Or is it possible to do this with OpenCV?
Can you help me please?

3*224*224 != 512*28*28
so you can't simply reshape / transform it.
(there seems to be some wrong assumption here)
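
A quick way to check the point above is to count elements: a reshape or transpose only rearranges values, so both shapes must hold the same number of them. A minimal sketch (the helper name can_reshape is made up for illustration):

```python
from math import prod

def can_reshape(src_shape, dst_shape):
    """Return True only if both shapes hold the same number of elements."""
    return prod(src_shape) == prod(dst_shape)

image_shape = (1, 3, 224, 224)     # the shape the poster wants
feature_shape = (1, 512, 28, 28)   # the shape Netron reports

print(prod(image_shape), prod(feature_shape))
print(can_reshape(feature_shape, image_shape))
```

Since the counts differ, no combination of reshape and transpose can produce the image shape from this output.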

please tell us more about your model. what does it do ? what is it for ? what is the “meaning” of the output ? what exact output you get from the python code ? how does your current c++ attempt look like ?

I use this code for training my model

As you can see, the code uses the pretrained VGG model, VGG = tv.models.vgg19(pretrained=True).features, instead of training on a dataset. That's why the output at the end looks like 1,512,28,28. In C++, I load this model using ONNX Runtime. I guess what I need to do here is transpose or reshape, but I can't get this part to work. With matplotlib in Python, we can easily display the output as an image, but I couldn't find anything like that in OpenCV.

This is my Model’s schema :

In the C++ part, I adapted the code from Mut1nyJD/styletransferonnx on GitHub (simple inference for NeuralStyleTransfer models) to OpenCV. That code used the FreeImage library, but I converted the bitmap to a Mat with this code:
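int bpp = FreeImage_GetBPP(dib); // bits per pixel, e.g. 24 for an 8-bit BGR bitmap
printf("bpp = %d, pitch = %u\n", bpp, FreeImage_GetPitch(dib));
// CV_MAKETYPE expects a depth constant such as CV_8U, not a bit count; this assumes a 24-bit bitmap
cv::Mat img(FreeImage_GetHeight(dib), FreeImage_GetWidth(dib), CV_8UC3, FreeImage_GetBits(dib), FreeImage_GetPitch(dib));
cv::Mat upright;
cv::flip(img, upright, 0); // FreeImage stores scanlines bottom-up

cv::imshow("img", upright);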

I tried this, too:

net = cv::dnn::readNetFromONNX(stylepath);
size_t h = image.rows;
size_t w = image.cols;

Mat inputBlob;
inputBlob = cv::dnn::blobFromImage(image, 1.0, Size(width, height), Scalar(103.939, 116.779, 123.680), false, false);

Mat out = net.forward();

std::vector<cv::Mat> Styled2;
imagesFromBlob(out, Styled2);
Mat Styled;
Styled /= 255;

But it crashes. I guess I must reshape after ‘Mat out = net.forward();’

what is out.size (yes, w/o braces !) here ?
please report output of:

cout << out.size << endl;

a generative nn for style transfer should result in a single image, what are those 512 smallish maps about ? it does not make any sense to me.

In the sample code you provided:
out = out.reshape(3, out.shape[2], out.shape[3])
out[0] += 103.939
out[1] += 116.779
out[2] += 123.68
out /= 255
out = out.transpose(1, 2, 0)

It works in Python. Anyway, what I need is the C++ equivalent of out = out.transpose(1, 2, 0).

But in OpenCV C++, transpose for a Mat only works as cv::transpose(src, dst), which swaps the rows and columns of a 2D matrix. I guess that's my main problem.
Actually, I think if I could do out.transpose(1,2,0) the problem would be solved.
Also, reshape in OpenCV C++ does not work like out = out.reshape(3, out.shape[2], out.shape[3]).
It only seems to accept 2 arguments.
Like out = out.reshape(1,2)…
I also tried out = out.reshape(1,{width,height}) in C++ but it didn't work.
The notebook code I sent you works in Python. But it's really weird that the output is 1,512,28,28. Even if I perform the transpose, I still can't understand how the output gets converted to a 512x512 image.
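
For what it's worth, out.transpose(1, 2, 0) on a (3, H, W) array just interleaves the three channel planes into an H x W x 3 image. Below is a rough NumPy sketch of that equivalence, assuming the network really returns 3 channels (the function name chw_to_hwc is made up for illustration). In C++ the analogous route would presumably be to wrap each plane in its own cv::Mat and call cv::merge, but that only makes sense once the output is 1,3,H,W; it cannot turn 512 feature maps into an image.

```python
import numpy as np

def chw_to_hwc(blob):
    """Convert a (1, 3, H, W) blob to an (H, W, 3) array by stacking channel planes."""
    if blob.ndim != 4 or blob.shape[0] != 1 or blob.shape[1] != 3:
        raise ValueError(f"expected shape (1, 3, H, W), got {blob.shape}")
    planes = [blob[0, c] for c in range(3)]   # three (H, W) channel planes
    return np.stack(planes, axis=-1)          # interleave them: (H, W, 3)

# Tiny dummy blob so the layout change is easy to inspect.
blob = np.arange(1 * 3 * 2 * 4, dtype=np.float32).reshape(1, 3, 2, 4)
hwc = chw_to_hwc(blob)
same = np.array_equal(hwc, blob[0].transpose(1, 2, 0))
print(hwc.shape, same)
```

The shape check is deliberate: a 1,512,28,28 blob is rejected instead of being silently squeezed into something that only looks like an image.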

By the way, I am sharing the information you asked about the model:
Size : 1 x 512 x 26 x 40
Type: 5
Channels : 1
Columns: -1
Rows: -1
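
For reading the numbers above: OpenCV packs depth and channel count into the Mat type integer, and a Mat with more than two dimensions reports rows and cols as -1 (the real extents are in out.size). A small sketch of the decoding, assuming the OpenCV 4.x depth codes; the helper name decode_mat_type is invented here:

```python
# Depth codes as defined in OpenCV 4.x headers.
DEPTH_NAMES = {0: "CV_8U", 1: "CV_8S", 2: "CV_16U", 3: "CV_16S",
               4: "CV_32S", 5: "CV_32F", 6: "CV_64F", 7: "CV_16F"}

def decode_mat_type(mat_type):
    """Split an OpenCV Mat type integer into (depth name, channel count)."""
    depth = mat_type & 7              # lowest 3 bits hold the depth code
    channels = (mat_type >> 3) + 1    # remaining bits hold channels - 1
    return DEPTH_NAMES[depth], channels

print(decode_mat_type(5))
```

So type 5 with 1 channel is a 4D blob of 32-bit floats; the 512 lives in the shape, not in the Mat's channel count.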

can you put the onnx somewhere, so ppl can try ?

with the t7 or with your onnx ?

indeed, numpy's transpose / reshape work differently than opencv's.
while there are ways to mend this from c++ (split/merge/transposeND)
it's all no use, as long as the numbers don't fit. there's no way to make [1,3,224,224] from [1,512,28,40]. something's amiss here

I think you are right. Actually, I may have misunderstood because I don’t have enough knowledge on this subject. Maybe I’m missing something when exporting from pytorch to onnx. I am posting the pth and onnx files I obtained. Can you please have a look?

Finally I realized that this model outputs feature maps. I tried ResNet50 the same way and got the same kind of result. I had thought it would output an image directly. Of course, now I have to figure out how to process these feature maps in OpenCV C++.

do you have a link for this ? and how should those be interpreted ?

checked your data on colab:

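import cv2
import numpy as np

nn = cv2.dnn.readNet("onnxmodel.onnx")
inp = np.ones((1,3,224,224), np.float32) #dummy
nn.setInput(inp)  # the net needs an input before forward()
out = nn.forward()
print(out.shape)
#(1, 512, 7, 7)
!dot style.txt -Tsvg -ostyle.svg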

same result from torch, btw:

import torch
import torchvision as tv

VGG = tv.models.vgg19(pretrained=True).features
dummy = torch.randn((1,3,224,224),dtype=torch.float32)
out = VGG(dummy)
print (out.shape)
# torch.Size([1, 512, 7, 7])

Unfortunately I couldn't find anything on how to interpret it. As far as I understand, the output has 512 channels, and the other two dimensions may relate to the style and content loss.
Actually, I think this style transfer training code is very classic, so it was surprising that I couldn't find any information about it. I couldn't understand how the styletransfer.ipynb file I sent at the beginning converts the result to an image so easily. The only thing I understand is that the notebook already knows how to interpret a feature map. I can't figure out how to adapt this to C++ OpenCV.
Now I have to explore the content of the data I get after net.forward() with the cv::dnn module in C++.
The only thing I can find regarding the VGG style transfer engine is the following description:
"The next parameters are the heart of the algorithm. They deeply affect the result of combining style and content, because they define what IS the style and content.

To compute style and content, we use a deep neural network called VGG. Here is how VGG looks like inside.
To compute the style of an image, we run the image through VGG. Then, we look at what values flowed across the layers (named conv1_1, conv1_2, etc.) and keep some of them.

For the content, we usually keep the first layers. The last convolutional layer from the second block is always a good bet.

For the style, a mix of the first conv layers (they contain color and texture information) and last layers (they contain complex features like trees, shapes, eyes, etc…) gives the best results.

The names of layers you can add to the following lists in order to define the style and content loss are:

  • conv_n where n is the index of the convolutional layer from the input.
  • relu_n where n is the index of the rectified linear layer from the input.
  • pool_n where n is the index of the max pooling layer from the input.
  • bn_n where n is the index of the batch normalization layer from the input.

For example, convolutional layers are indexed from conv_1 to conv_16."

source : style-transfer/Pytorch_Style_Transfer.ipynb at master · jeremycochoy/style-transfer · GitHub
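
To make the naming scheme quoted above concrete, here is a small sketch that walks a VGG19-style layer list and assigns conv_n / relu_n / pool_n names. The configuration is the usual VGG19 one (16 conv layers, 5 max-pools, no batch norm); the function name and the assumption that every conv is directly followed by a ReLU are mine, not taken from the notebook.

```python
# Usual VGG19 feature configuration: numbers are conv output channels, "M" is a max-pool.
VGG19_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
             512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]

def name_layers(cfg):
    """Assign conv_n / relu_n / pool_n names, counting each layer type from the input."""
    names = []
    conv_idx = 0
    pool_idx = 0
    for item in cfg:
        if item == "M":
            pool_idx += 1
            names.append(f"pool_{pool_idx}")
        else:
            conv_idx += 1
            names.append(f"conv_{conv_idx}")
            names.append(f"relu_{conv_idx}")  # assumed: a ReLU follows every conv
    return names

names = name_layers(VGG19_CFG)
print(len(names), names[:3], names[-1])
```

The last entry is a pooling layer, which fits the observation elsewhere in the thread that the exported model ends in a max-pool rather than in anything that produces an image.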

apologies for our kismet spambot misbehaving AGAIN !

(it wrongly flagged your posts as spam, then you fell into some panic mode, and repeated the last post 4 times (understandably !), i took liberty to remove 3 of them, hope, it’s ok …)

Yes, I panicked when my posts suddenly disappeared. I’m sorry for posting too quickly.

By the way, the 512 in the output shape (1,512,somenumber,somenumber) is independent of the input shape. More precisely, the number 512 says nothing about the input size of the image. Even when I make the input shape 1,3,224,224 it still gives 512 channels. I'm quite confused.
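
One way to see why the 512 never changes: in the VGG19 feature stack the 3x3 convolutions (padding 1) keep height and width, each 2x2 max-pool halves them, and the channel count is fixed by the last conv layer. A rough sketch under those assumptions (the function name is hypothetical):

```python
def vgg_features_shape(height, width, num_pools=5, out_channels=512):
    """Estimate the (N, C, H, W) output of a VGG-style feature stack for batch size 1."""
    for _ in range(num_pools):
        height //= 2   # 2x2 max-pool with stride 2, rounding down
        width //= 2
    return (1, out_channels, height, width)

print(vgg_features_shape(224, 224))
print(vgg_features_shape(224, 224, num_pools=3))
```

With all five pools a 224x224 input gives 7x7 maps, matching the torch check earlier in the thread. A 28x28 map would correspond to stopping after three pools, which is only a guess about how the exported model was cut.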

the graphics is nice !
and it shows, that “our model” here ends with the 7x7x512 maxpool (no fc layers)

and i took a closer look at the training notebook:
normally i’d expect to see some inference like:

result = VGG(input)

but the result is nowhere used here. instead it seems to use (& manipulate) the input image “by reference”, very weird …
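
A possible explanation, offered as an assumption since the notebook itself is not shown here: in optimization-based style transfer the network is frozen and the image pixels are the thing being optimized, so there is no separate "result = VGG(input)" image to export. The toy sketch below illustrates only that idea; the one-weight "feature extractor", the function names and the numbers are all made up and have nothing to do with real VGG features.

```python
def features(pixels, weight=2.0):
    """Stand-in for a frozen feature extractor: it is never trained, only evaluated."""
    return [weight * p for p in pixels]

def optimize_image(start, target_features, steps=200, lr=0.05, weight=2.0):
    """Gradient descent on the pixels so their features approach target_features."""
    image = list(start)
    for _ in range(steps):
        feats = features(image, weight)
        # loss = sum((weight * p - t)^2), so d loss / d p = 2 * weight * (weight * p - t)
        grads = [2.0 * weight * (f - t) for f, t in zip(feats, target_features)]
        image = [p - lr * g for p, g in zip(image, grads)]
    return image

result = optimize_image(start=[0.0, 0.0, 0.0], target_features=[1.0, 0.5, 0.0])
print(result)
```

If the notebook works this way, exporting VGG.features to ONNX can only ever give feature maps; the stylized picture is the optimized input tensor, not a network output.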