DNN Action Recognition

Hi,
I know that the DNN module can load a 3D ResNet to perform action recognition or classification in videos.
I want to use other models such as I3D or R(2+1)D for the same task, but this does not seem to work (the ONNX file loads correctly, but the program crashes during inference).
Is this possible in any way with the OpenCV DNN module, or is it planned for a future version?
Thanks

can you be more specific?
what are the errors?
which framework was used to train it?
link to the code?
maybe a netron graph?
how do you feed your data into it?

such things are done on a “case by case” basis.
if we cannot help from here, raise a github issue.

Thank you Berak, and sorry for my late reply.

I am not able to modify or share the ONNX files, but here are the netron graphs for ResNet, R3D (instead of I3D) and R2plus1D.

The code is:

void test_simple()
{
	std::string f_classes("../models/action_recognition_fr.txt");
	vector<string>v_classes;
	std::ifstream ifs(f_classes);
	std::string line;
	while (std::getline(ifs, line))
	{
		v_classes.push_back(line);
	}
	
	//std::string f_onnx("../models/resnet-34_kinetics.onnx"); // OK
	std::string f_onnx("../models/r3d18.onnx"); // NOT OK
	//std::string f_onnx("../models/r2p1d18.onnx"); // NOT OK

	Net net = readNetFromONNX(std::string(f_onnx));
	net.setPreferableTarget(DNN_TARGET_CPU);
	net.setPreferableBackend(DNN_BACKEND_OPENCV);

	std::string f_video("e:/divx/boxe.mkv");
	VideoCapture cap(f_video);

	int sample_duration = 16;
	int sample_size = 112;

	vector<Mat>v_frames;
	v_frames.clear();

	Mat frame_bidon;
	cap >> frame_bidon;

	for (int i = 0; i < sample_duration; i++)
	{
		Mat frame, frame_resized, frame_f;
		cap >> frame;
		resize(frame, frame_resized, Size(sample_size, sample_size));
		frame_resized.convertTo(frame_f, CV_32FC3);
		v_frames.push_back(frame_f);
	}

	// mean must be a cv::Scalar: a bare (a, b, c) is the C++ comma operator and collapses to the last value
	Mat blob = blobFromImages(v_frames, 1.0, Size(112, 112), Scalar(114.7748, 107.7354, 99.4750), true, true, CV_32F);
	int sz[] = { 1,blob.size[1], blob.size[0], blob.size[2], blob.size[3] };
	Mat newblob = Mat(5, sz, CV_32F, blob.ptr<float>(0));

	Sleep(300);

	net.setInput(newblob);

	Sleep(300);
	
	cout << "Fwd..." << endl;

	Mat score = net.forward();

	cout << "Done." << endl;

	Point pmin, pmax;
	double vmin, vmax;
	minMaxLoc(score, &vmin, &vmax, &pmin, &pmax);
	cout << "Action : " << v_classes[pmax.x] << endl;
}

and the outputs for Resnet, R3D, R2plus1D

PS E:\Work\Z-Presta\Code-THL\VideoAnalysis-opencv\bin> .\ActionReco-v452.exe (Resnet)
Fwd...
Done.
Action : jouer du trombone

PS E:\Work\Z-Presta\Code-THL\VideoAnalysis-opencv\bin> .\ActionReco-v452.exe (R3D)
Fwd...
OpenCV: terminate handler is called! The last OpenCV error is:
OpenCV(4.5.2) Error: Assertion failed (srcMat.dims == 2 && srcMat.cols == weights.cols && dstMat.rows == srcMat.rows && dstMat.cols == weights.rows && srcMat.type() == weights.type() && weights.type() == dstMat.type() && srcMat.type() == CV_32F && (biasMat.empty() || (biasMat.type() == srcMat.type() && biasMat.isContinuous() && (int)biasMat.total() == dstMat.cols))) in cv::dnn::FullyConnectedLayerImpl::FullyConnected::run, file E:\Develop\opencv-4.5.2\modules\dnn\src\layers\fully_connected_layer.cpp, line 180

PS E:\Work\Z-Presta\Code-THL\VideoAnalysis-opencv\bin> .\ActionReco-v452.exe (R2Plus1D)
Fwd...
OpenCV: terminate handler is called! The last OpenCV error is:
OpenCV(4.5.2) Error: Assertion failed (total(os[i]) > 0) in cv::dnn::dnn4_v20210301::Net::Impl::getLayerShapesRecursively, file E:\Develop\opencv-4.5.2\modules\dnn\src\dnn.cpp, line 3534

I do not know which framework was used to train them, but even without the training phase and with random weights, the network should not cause an OpenCV exception.

I do not think the problem comes from the data, because everything works fine with the first network (ResNet).

Thank you for your help !

Edit: OpenCV 4.5.2 / Visual Studio 2017 / Windows 10

I'm not 100% sure, but this looks bad:

even if it does not crash on ResNet, imo you cannot “swizzle” channels and time/batch from

[B,C,H,W] to [1,C,B,H,W] that easily.

I remember needing a Permute layer.

and, without seeing code, it is impossible to say whether those dimensions are correct for other networks …
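to illustrate the point: a plain reshape from [T,C,H,W] to [1,C,T,H,W] only relabels the axes over the same memory buffer, while a real permute physically moves the data. a quick numpy sketch (shapes taken from the code above; numpy assumed available, not part of the original posts):

```python
import numpy as np

# Fake a blobFromImages output: 16 frames, 3 channels, 112x112 -> [T, C, H, W]
t, c, h, w = 16, 3, 112, 112
blob = np.arange(t * c * h * w, dtype=np.float32).reshape(t, c, h, w)

# Wrong: viewing the same buffer as [1, C, T, H, W] only relabels the axes;
# the memory layout still stores the channels of each frame together.
wrong = blob.reshape(1, c, t, h, w)

# Right: physically swap the T and C axes, then add the batch axis.
right = blob.transpose(1, 0, 2, 3)[np.newaxis, ...]

print(np.array_equal(wrong, right))  # False: same shape, different data order
```

the shapes match, so the network accepts both, but only the transposed version feeds the model an actual channel-major clip.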

[edit]
ok, could reproduce the problem with the r2p1d18 network on colab:

import numpy as np, cv2

net = cv2.dnn.readNet("r2p1d18.onnx")
dat = np.ones((1,3,16,112,112),np.float32)
net.setInput(dat)
res = net.forward()


----> 6 res = net.forward()
      7 print(res)

error: OpenCV(4.5.3-dev) /content/opencv/modules/dnn/src/dnn.cpp:3564: error: (-215:Assertion failed) total(os[i]) > 0 in function 'getLayerShapesRecursively'


Thank you Berak for your advice about swapping the channels. Here is my new code:

void test_simple()
{
	std::string f_classes("../models/action_recognition_fr.txt");
	vector<string>v_classes;
	std::ifstream ifs(f_classes);
	std::string line;
	while (std::getline(ifs, line))
	{
		v_classes.push_back(line);
	}
	
	//std::string f_onnx("../models/resnet-34_kinetics.onnx"); // OK
	//std::string f_onnx("../models/r3d18.onnx"); // NOT OK
	std::string f_onnx("../models/r2p1d18.onnx"); // NOT OK

	Net net = readNetFromONNX(std::string(f_onnx));
	net.setPreferableTarget(DNN_TARGET_CPU);
	net.setPreferableBackend(DNN_BACKEND_OPENCV);

	std::string f_video("e:/divx/boxe.mkv");
	VideoCapture cap(f_video);

	int sample_duration = 16;
	int sample_size = 112;

	vector<Mat>v_frames;
	v_frames.clear();

	Mat frame_bidon;
	cap >> frame_bidon;

	for (int i = 0; i < sample_duration; i++)
	{
		Mat frame, frame_resized, frame_f;
		cap >> frame;
		resize(frame, frame_resized, Size(sample_size, sample_size));
		frame_resized.convertTo(frame_f, CV_32FC3);
		v_frames.push_back(frame_f);
	}

	// mean must be a cv::Scalar: a bare (a, b, c) is the C++ comma operator and collapses to the last value
	Mat blob = blobFromImages(v_frames, 1.0, Size(112, 112), Scalar(114.7748, 107.7354, 99.4750), true, true, CV_32F);
	
	// permutation
	Net permute;
	LayerParams lp;
	int order[] = { 1, 0, 2, 3 };
	lp.set("order", DictValue::arrayInt<int*>(&order[0], 4));
	permute.addLayerToPrev("perm", "Permute", lp);
	permute.setInput(blob);
	Mat input0 = permute.forward().clone();

	int sz[] = { 1, blob.size[1],blob.size[0], blob.size[2], blob.size[3] };
	cout << sz[0] << "," << sz[1] << "," << sz[2] << "," << sz[3] << "," << sz[4] << endl; // OK
	Mat newblob = Mat(5, sz, CV_32F, input0.ptr<float>(0));

	Sleep(300);
	net.setInput(newblob);
	Sleep(300);
	
	cout << "Fwd..." << endl;
	Mat score = net.forward();
	cout << "Done." << endl;

	Point pmin, pmax;
	double vmin, vmax;
	minMaxLoc(score, &vmin, &vmax, &pmin, &pmax);
	cout << "Action : " << v_classes[pmax.x] << endl;
}

With ResNet, I now obtain the correct classification, “Hit someone (Boxing)”, instead of “play the trombone”. That means this correction was necessary.

Unfortunately, R3D and R2plus1D still give the same exceptions (and you confirmed this for R(2+1)D).

yeah, could not get any of those to run, either.

btw, the paper mentions “equidistant” sampling of frames from an action video;
sampling 16 sequential frames will give worse results.
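for illustration, one possible way to pick equidistant indices (a pure-python sketch, not taken from the paper; `sample_duration` matches the code above):

```python
def equidistant_indices(total_frames, sample_duration=16):
    """Pick sample_duration frame indices spread evenly over the whole clip,
    instead of the first sample_duration consecutive frames."""
    if total_frames <= sample_duration:
        return list(range(total_frames))
    step = total_frames / sample_duration
    return [int(i * step) for i in range(sample_duration)]

print(equidistant_indices(160))  # every 10th frame: 0, 10, 20, ..., 150
```

you would then seek/grab only those frames from the VideoCapture instead of reading the first 16.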

OK, that reassures me. Do you think it is a bug, or is it due to a layer that is not yet implemented? What should I do?

(and thanks for the advice, I'll keep it in mind).

yes. raise an issue on the OpenCV github repository.

Issue raised here (if anyone wants to follow):

https://github.com/opencv/opencv/issues/21270
