DNN Action Recognition

Hi,
I know that the DNN module can load a 3D ResNet to perform action recognition or classification in videos.
I want to use other models such as I3D or R(2+1)D for the same task, but this does not seem to work (the ONNX file loads correctly, but the program crashes during inference).
Is this possible in any way with the OpenCV DNN module, or is it planned for a future version?
Thanks

can you be more specific?
what are the errors?
which framework was used to train it?
link to the code?
maybe a netron graph?
how do you feed your data into it?

such things are done on a “case by case” basis.
if we cannot help from here, raise a github issue.

Thank you Berak, and sorry for my late reply.

I am not able to modify or share the ONNX files, but here are the netron graphs for ResNet, R3D (instead of I3D) and R2plus1D.

The code is:

void test_simple()
{
	std::string f_classes("../models/action_recognition_fr.txt");
	vector<string>v_classes;
	std::ifstream ifs(f_classes);
	std::string line;
	while (std::getline(ifs, line))
	{
		v_classes.push_back(line);
	}
	
	//std::string f_onnx("../models/resnet-34_kinetics.onnx"); // OK
	std::string f_onnx("../models/r3d18.onnx"); // NOT OK
	//std::string f_onnx("../models/r2p1d18.onnx"); // NOT OK

	Net net = readNetFromONNX(std::string(f_onnx));
	net.setPreferableTarget(DNN_TARGET_CPU);
	net.setPreferableBackend(DNN_BACKEND_OPENCV);

	std::string f_video("e:/divx/boxe.mkv");
	VideoCapture cap(f_video);

	int sample_duration = 16;
	int sample_size = 112;

	vector<Mat>v_frames;
	v_frames.clear();

	Mat frame_bidon;
	cap >> frame_bidon;

	for (int i = 0; i < sample_duration; i++)
	{
		Mat frame, frame_resized, frame_f;
		cap >> frame;
		resize(frame, frame_resized, Size(sample_size, sample_size));
		frame_resized.convertTo(frame_f, CV_32FC3);
		v_frames.push_back(frame_f);
	}

	// mean must be a cv::Scalar: a bare (a, b, c) is the C++ comma operator and collapses to the last value
	Mat blob = blobFromImages(v_frames, 1.0, Size(112, 112), Scalar(114.7748, 107.7354, 99.4750), true, true, CV_32F);
	int sz[] = { 1,blob.size[1], blob.size[0], blob.size[2], blob.size[3] };
	Mat newblob = Mat(5, sz, CV_32F, blob.ptr<float>(0));

	Sleep(300);

	net.setInput(newblob);

	Sleep(300);
	
	cout << "Fwd..." << endl;

	Mat score = net.forward();

	cout << "Done." << endl;

	Point pmin, pmax;
	double vmin, vmax;
	minMaxLoc(score, &vmin, &vmax, &pmin, &pmax);
	cout << "Action : " << v_classes[pmax.x] << endl;
}

and the outputs for Resnet, R3D, R2plus1D

PS E:\Work\Z-Presta\Code-THL\VideoAnalysis-opencv\bin> .\ActionReco-v452.exe (Resnet)
Fwd...
Done.
Action : jouer du trombone

PS E:\Work\Z-Presta\Code-THL\VideoAnalysis-opencv\bin> .\ActionReco-v452.exe (R3D)
Fwd...
OpenCV: terminate handler is called! The last OpenCV error is:
OpenCV(4.5.2) Error: Assertion failed (srcMat.dims == 2 && srcMat.cols == weights.cols && dstMat.rows == srcMat.rows && dstMat.cols == weights.rows && srcMat.type() == weights.type() && weights.type() == dstMat.type() && srcMat.type() == CV_32F && (biasMat.empty() || (biasMat.type() == srcMat.type() && biasMat.isContinuous() && (int)biasMat.total() == dstMat.cols))) in cv::dnn::FullyConnectedLayerImpl::FullyConnected::run, file E:\Develop\opencv-4.5.2\modules\dnn\src\layers\fully_connected_layer.cpp, line 180

PS E:\Work\Z-Presta\Code-THL\VideoAnalysis-opencv\bin> .\ActionReco-v452.exe (R2Plus1D)
Fwd...
OpenCV: terminate handler is called! The last OpenCV error is:
OpenCV(4.5.2) Error: Assertion failed (total(os[i]) > 0) in cv::dnn::dnn4_v20210301::Net::Impl::getLayerShapesRecursively, file E:\Develop\opencv-4.5.2\modules\dnn\src\dnn.cpp, line 3534

I do not know which framework was used to train them, but even without the training phase and with random weights, the network should not cause an OpenCV exception.

I do not think the problem comes from the data, because everything works fine with the first network (ResNet).

Thank you for your help !

Edit: OpenCV 4.5.2 / Visual Studio 2017 / Windows 10

I'm not 100% sure, but this looks bad:

even if it does not crash on ResNet, imo you cannot “swizzle” channels and time/batch from

[B,C,H,W] to [1,C,B,H,W] that easily.

I remember needing a Permute layer.

and, without seeing code, it is impossible to say whether those dimensions are correct for other networks …
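to illustrate the point: a plain reshape from [T,C,H,W] to [1,C,T,H,W] only relabels the axes over the same memory buffer, while a real permute physically moves the data. a quick numpy sketch (shapes taken from the code above; numpy assumed available, not part of the original posts):

```python
import numpy as np

# Fake a blobFromImages output: 16 frames, 3 channels, 112x112 -> [T, C, H, W]
t, c, h, w = 16, 3, 112, 112
blob = np.arange(t * c * h * w, dtype=np.float32).reshape(t, c, h, w)

# Wrong: viewing the same buffer as [1, C, T, H, W] only relabels the axes;
# the memory layout still stores the channels of each frame together.
wrong = blob.reshape(1, c, t, h, w)

# Right: physically swap the T and C axes, then add the batch axis.
right = blob.transpose(1, 0, 2, 3)[np.newaxis, ...]

print(np.array_equal(wrong, right))  # False: same shape, different data order
```

the shapes match, so the network accepts both, but only the transposed version feeds the model an actual channel-major clip.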

[edit]
ok, could reproduce the problem with the r2p1d18 network on colab:

import numpy as np, cv2

net = cv2.dnn.readNet("r2p1d18.onnx")
dat = np.ones((1,3,16,112,112),np.float32)
net.setInput(dat)
res = net.forward()


----> 6 res = net.forward()
      7 print(res)

error: OpenCV(4.5.3-dev) /content/opencv/modules/dnn/src/dnn.cpp:3564: error: (-215:Assertion failed) total(os[i]) > 0 in function 'getLayerShapesRecursively'


Thank you Berak for your advice about swapping the channels. Here is my new code:

void test_simple()
{
	std::string f_classes("../models/action_recognition_fr.txt");
	vector<string>v_classes;
	std::ifstream ifs(f_classes);
	std::string line;
	while (std::getline(ifs, line))
	{
		v_classes.push_back(line);
	}
	
	//std::string f_onnx("../models/resnet-34_kinetics.onnx"); // OK
	//std::string f_onnx("../models/r3d18.onnx"); // NOT OK
	std::string f_onnx("../models/r2p1d18.onnx"); // NOT OK

	Net net = readNetFromONNX(std::string(f_onnx));
	net.setPreferableTarget(DNN_TARGET_CPU);
	net.setPreferableBackend(DNN_BACKEND_OPENCV);

	std::string f_video("e:/divx/boxe.mkv");
	VideoCapture cap(f_video);

	int sample_duration = 16;
	int sample_size = 112;

	vector<Mat>v_frames;
	v_frames.clear();

	Mat frame_bidon;
	cap >> frame_bidon;

	for (int i = 0; i < sample_duration; i++)
	{
		Mat frame, frame_resized, frame_f;
		cap >> frame;
		resize(frame, frame_resized, Size(sample_size, sample_size));
		frame_resized.convertTo(frame_f, CV_32FC3);
		v_frames.push_back(frame_f);
	}

	// mean must be a cv::Scalar: a bare (a, b, c) is the C++ comma operator and collapses to the last value
	Mat blob = blobFromImages(v_frames, 1.0, Size(112, 112), Scalar(114.7748, 107.7354, 99.4750), true, true, CV_32F);
	
	// permutation
	Net permute;
	LayerParams lp;
	int order[] = { 1, 0, 2, 3 };
	lp.set("order", DictValue::arrayInt<int*>(&order[0], 4));
	permute.addLayerToPrev("perm", "Permute", lp);
	permute.setInput(blob);
	Mat input0 = permute.forward().clone();

	int sz[] = { 1, blob.size[1],blob.size[0], blob.size[2], blob.size[3] };
	cout << sz[0] << "," << sz[1] << "," << sz[2] << "," << sz[3] << "," << sz[4] << endl; // OK
	Mat newblob = Mat(5, sz, CV_32F, input0.ptr<float>(0));

	Sleep(300);
	net.setInput(newblob);
	Sleep(300);
	
	cout << "Fwd..." << endl;
	Mat score = net.forward();
	cout << "Done." << endl;

	Point pmin, pmax;
	double vmin, vmax;
	minMaxLoc(score, &vmin, &vmax, &pmin, &pmax);
	cout << "Action : " << v_classes[pmax.x] << endl;
}

With ResNet, I now obtain the correct classification, “Hit someone (Boxing)”, instead of “play the trombone”. That means this correction was necessary.

Unfortunately, R3D and R2plus1D still give the same exceptions (and you confirmed this for R(2+1)D).

yeah, could not get any of those to run, either.

btw, the paper mentions “equidistant” sampling of frames from an action video;
sampling 16 sequential frames will give worse results.
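for illustration, one possible way to pick equidistant indices (a pure-python sketch, not taken from the paper; `sample_duration` matches the code above):

```python
def equidistant_indices(total_frames, sample_duration=16):
    """Pick sample_duration frame indices spread evenly over the whole clip,
    instead of the first sample_duration consecutive frames."""
    if total_frames <= sample_duration:
        return list(range(total_frames))
    step = total_frames / sample_duration
    return [int(i * step) for i in range(sample_duration)]

print(equidistant_indices(160))  # every 10th frame: 0, 10, 20, ..., 150
```

you would then seek/grab only those frames from the VideoCapture instead of reading the first 16.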

OK, that reassures me. Do you think it is a bug, or is it due to a layer that is not yet implemented? What should I do?

(and thanks for the advice, I'll keep it in mind).

yes. raise an issue on the OpenCV github repository.

Issue raised here (if anyone wants to follow):

https://github.com/opencv/opencv/issues/21270
