What I want to do
I'm optimizing the deployment of some computer-vision deep learning models. Some of the models need normalized input, so the normalization step in preprocessing became an optimization target. Pixel values lie in [0, 255], so my first method is simply to multiply each value by 1 / 255.0. After some searching I found the lookup table (LUT) approach, which in theory should be faster than floating-point arithmetic. I wrote the code below to compare the two methods:
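To make the two methods concrete without OpenCV, here is a minimal sketch in plain C++. The names kFactor, normalizeByMultiply, buildTable and normalizeByTable are made up for this illustration and are not OpenCV APIs; the sketch only shows the idea and says nothing about how convertTo or cv::LUT are implemented internally.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scale factor that maps an 8-bit pixel value into [0, 1].
const float kFactor = 1 / 255.0;

// Method 1: one float multiplication per pixel.
std::vector<float> normalizeByMultiply(const std::vector<std::uint8_t>& pixels)
{
    std::vector<float> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        out[i] = float(pixels[i]) * kFactor;
    }
    return out;
}

// Build the 256-entry table once; entry v holds the normalized value of pixel v.
std::array<float, 256> buildTable()
{
    std::array<float, 256> table{};
    for (int v = 0; v < 256; ++v) {
        table[v] = float(v) * kFactor;
    }
    return table;
}

// Method 2: one table lookup per pixel, no arithmetic inside the loop.
std::vector<float> normalizeByTable(const std::vector<std::uint8_t>& pixels,
                                    const std::array<float, 256>& table)
{
    std::vector<float> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        out[i] = table[pixels[i]];
    }
    return out;
}
```

Because each table entry is computed with the same expression as method 1, both functions should return identical values for the same input; only the per-pixel work differs.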
Test code
#include "opencv2/imgcodecs/imgcodecs.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include <chrono>
#include <cstdio>
#include <cstring>
#include <dirent.h>
#include <iostream>
#include <string>
#include <vector>

// Collect the names of regular files in `path` whose extension equals `suffix`
// (for example ".jpg"). Returns the number of directory entries visited.
int getFiles(const std::string& path, std::vector<std::string>& files, const std::string& suffix)
{
    int iFileCnt = 0;
    DIR* dirptr = opendir(path.c_str());
    if (dirptr == NULL) {
        return 0;
    }
    struct dirent* dirp;
    while ((dirp = readdir(dirptr)) != NULL) {
        // strrchr returns NULL for names without a '.', so check before comparing.
        const char* ext = strrchr(dirp->d_name, '.');
        if (dirp->d_type == DT_REG && ext != NULL && strcmp(ext, suffix.c_str()) == 0) {
            files.push_back(dirp->d_name);
        }
        ++iFileCnt;
    }
    closedir(dirptr);
    return iFileCnt;
}

int main(int argc, char* argv[])
{
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <picture_dir> [loop_count]" << std::endl;
        return 1;
    }
    std::string pic_dir = argv[1];
    int loop_count = 10;
    if (argc >= 3) {
        loop_count = std::stoi(argv[2]);
    }
    float FACTOR = 1 / 255.0;
    std::vector<cv::Size> sizes = {
        {299, 299},
        {416, 416},
        {512, 512},
        {640, 640},
        {960, 540},
        {1920, 1080}
    };
    // std::vector<cv::Size> sizes = {
    //     {1920, 1080},
    //     {960, 540},
    //     {640, 640},
    //     {512, 512},
    //     {416, 416},
    //     {299, 299}
    // };

    // Lookup table: entry i holds the normalized value of pixel value i.
    cv::Mat table(1, 256, CV_32FC1);
    auto ptr = table.ptr<float>(0);
    for (int i = 0; i < 256; ++i) {
        ptr[i] = float(i) * FACTOR;
    }

    // Load all test pictures up front so that decoding is not timed.
    std::vector<std::string> pic_files;
    getFiles(pic_dir, pic_files, ".jpg");
    if (pic_files.empty()) {
        std::cerr << "no .jpg files found in " << pic_dir << std::endl;
        return 1;
    }
    std::vector<cv::Mat> image_mats(pic_files.size());
    for (size_t i = 0; i < pic_files.size(); ++i) {
        std::string one_pic_path = pic_dir + "/" + pic_files[i];
        image_mats[i] = cv::imread(one_pic_path);
    }

    for (auto& one_size : sizes) {
        std::cout << "size: " << one_size << std::endl;
        double time_1 = 0;
        double time_2 = 0;
        for (auto& one_mat : image_mats) {
            cv::Mat tmp_image;
            cv::resize(one_mat, tmp_image, one_size);
            for (int i = 0; i < loop_count; ++i) {
                // Method 1: convertTo with a scale factor.
                auto t_1_1 = std::chrono::steady_clock::now();
                cv::Mat out_1;
                tmp_image.convertTo(out_1, CV_32FC3, FACTOR);
                auto t_1_2 = std::chrono::steady_clock::now();
                time_1 += std::chrono::duration<double, std::milli>(t_1_2 - t_1_1).count();

                // Method 2: lookup table.
                auto t_2_1 = std::chrono::steady_clock::now();
                cv::Mat out_2;
                cv::LUT(tmp_image, table, out_2);
                auto t_2_2 = std::chrono::steady_clock::now();
                time_2 += std::chrono::duration<double, std::milli>(t_2_2 - t_2_1).count();

                // The "diff part": untimed check that both outputs agree.
                auto diff = cv::sum(out_1 - out_2);
                if (diff[0] > 1E-3) {
                    std::cout << diff << std::endl;
                }
            }
        }
        size_t count = loop_count * image_mats.size();
        auto average_time_1 = time_1 / count;
        auto average_time_2 = time_2 / count;
        auto promote_percent = (average_time_1 - average_time_2) / average_time_1 * 100;
        // %zu matches size_t, and a literal percent sign must be written as %%.
        printf("total pic num: %zu, loop %d times\n", pic_files.size(), loop_count);
        printf("method_1, total %f ms, average %f ms\n", time_1, average_time_1);
        printf("method_2, total %f ms, average %f ms, promote: %.2f%%\n", time_2, average_time_2,
               promote_percent);
        printf("\n");
    }
    return 0;
}
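A side note on the equality check in the test code: cv::sum(out_1 - out_2) adds up signed differences, and diff[0] looks only at the first channel, so positive and negative errors could cancel each other. The sketch below uses plain C++ vectors and two hypothetical helpers, signedSum and maxAbsDiff, to show a stricter comparison based on the largest absolute difference. It illustrates the idea only and is not part of the benchmark.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of signed element-wise differences (what a plain sum of a - b measures).
// Both buffers are assumed to have the same length.
float signedSum(const std::vector<float>& a, const std::vector<float>& b)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        total += a[i] - b[i];
    }
    return total;
}

// Largest absolute element-wise difference; errors cannot cancel here.
// Both buffers are assumed to have the same length.
float maxAbsDiff(const std::vector<float>& a, const std::vector<float>& b)
{
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = std::fabs(a[i] - b[i]);
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

For the buffers {0, 0.5, 1} and {0.1, 0.4, 1}, the signed sum is close to zero although two elements differ by about 0.1, while maxAbsDiff reports that 0.1. With OpenCV matrices, cv::norm(out_1, out_2, cv::NORM_INF) should give the same kind of measure over all channels.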
Weird performance
I want to measure the speed difference between the two methods at different input sizes; the outputs of the two methods should be equal. I used 128 pictures of various sizes for the test. Here are the strange results:
1. Result of the code above
size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total 38.872174 ms, average 0.030369 ms
method_2, total 330.688332 ms, average 0.258350 ms, promote: -750.71%
size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total 103.708926 ms, average 0.081023 ms
method_2, total 689.972421 ms, average 0.539041 ms, promote: -565.30%
size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total 267.989430 ms, average 0.209367 ms
method_2, total 450.809036 ms, average 0.352195 ms, promote: -68.22%
size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total 757.269510 ms, average 0.591617 ms
method_2, total 551.951118 ms, average 0.431212 ms, promote: 27.11%
size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total 1095.167540 ms, average 0.855600 ms
method_2, total 760.330269 ms, average 0.594008 ms, promote: 30.57%
size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total 4944.142104 ms, average 3.862611 ms
method_2, total 3471.176202 ms, average 2.711856 ms, promote: 29.79%
2. Comment out the diff part:
//auto diff = cv::sum(out_1 - out_2);
//if (diff[0] > 1E-3) {
// std::cout << diff << std::endl;
//}
size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total 246.356823 ms, average 0.192466 ms
method_2, total 361.859598 ms, average 0.282703 ms, promote: -46.88%
size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total 516.542233 ms, average 0.403549 ms
method_2, total 719.191240 ms, average 0.561868 ms, promote: -39.23%
size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total 839.599260 ms, average 0.655937 ms
method_2, total 342.608080 ms, average 0.267663 ms, promote: 59.19%
size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total 1384.348467 ms, average 1.081522 ms
method_2, total 524.382672 ms, average 0.409674 ms, promote: 62.12%
size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total 1796.153597 ms, average 1.403245 ms
method_2, total 688.210851 ms, average 0.537665 ms, promote: 61.68%
size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total 7707.945924 ms, average 6.021833 ms
method_2, total 3812.262622 ms, average 2.978330 ms, promote: 50.54%
3. Keep the diff part but reverse the sizes vector
std::vector<cv::Size> sizes = {
{1920, 1080},
{960, 540},
{640, 640},
{512, 512},
{416, 416},
{299, 299}
};
...
auto diff = cv::sum(out_1 - out_2);
if (diff[0] > 1E-3) {
std::cout << diff << std::endl;
}
size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total 4933.384896 ms, average 3.854207 ms
method_2, total 3563.611341 ms, average 2.784071 ms, promote: 27.77%
size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total 887.353187 ms, average 0.693245 ms
method_2, total 917.995079 ms, average 0.717184 ms, promote: -3.45%
size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total 492.562282 ms, average 0.384814 ms
method_2, total 525.089826 ms, average 0.410226 ms, promote: -6.60%
size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total 181.900041 ms, average 0.142109 ms
method_2, total 159.691528 ms, average 0.124759 ms, promote: 12.21%
size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total 77.030586 ms, average 0.060180 ms
method_2, total 221.307936 ms, average 0.172897 ms, promote: -187.30%
size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total 38.139366 ms, average 0.029796 ms
method_2, total 112.203023 ms, average 0.087659 ms, promote: -194.19%
4. Comment out the diff part and reverse the sizes vector
std::vector<cv::Size> sizes = {
{1920, 1080},
{960, 540},
{640, 640},
{512, 512},
{416, 416},
{299, 299}
};
...
//auto diff = cv::sum(out_1 - out_2);
//if (diff[0] > 1E-3) {
// std::cout << diff << std::endl;
//}
size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total 8021.875493 ms, average 6.267090 ms
method_2, total 3849.222334 ms, average 3.007205 ms, promote: 52.02%
size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total 605.553580 ms, average 0.473089 ms
method_2, total 477.145896 ms, average 0.372770 ms, promote: 21.21%
size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total 268.076975 ms, average 0.209435 ms
method_2, total 169.015667 ms, average 0.132043 ms, promote: 36.95%
size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total 117.419851 ms, average 0.091734 ms
method_2, total 94.436479 ms, average 0.073778 ms, promote: 19.57%
size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total 73.963177 ms, average 0.057784 ms
method_2, total 221.397616 ms, average 0.172967 ms, promote: -199.33%
size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total 38.046131 ms, average 0.029724 ms
method_2, total 113.839007 ms, average 0.088937 ms, promote: -199.21%
Question
I know the CPU's state can fluctuate, so timings will never be exactly reproducible. But why does code outside the timed regions (the diff check, and the order of the sizes vector) have such a large influence on the measured times of convertTo and LUT?
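One detail of the test code that may matter: out_1 and out_2 are created empty inside the timed regions, so each measurement includes allocating the output buffer as well as the conversion itself. Whether that explains the numbers above is an assumption, not something verified here. A way to check it is to time only the computation, with the output allocated beforehand, a few untimed warm-up runs, and the median instead of the sum. The sketch below does this in plain C++; medianMillis and scaleInto are hypothetical names for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Run `op` `warmup` times untimed, then `runs` times timed, and return the
// median duration in milliseconds. `op` should write into a buffer that the
// caller allocated beforehand, so allocation stays outside the timed region.
template <typename Op>
double medianMillis(Op op, int warmup, int runs)
{
    if (runs < 1) {
        return 0.0;
    }
    for (int i = 0; i < warmup; ++i) {
        op();
    }
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        op();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Example workload: scale 8-bit pixels into an already allocated float buffer.
// `dst` must be at least as long as `src`.
void scaleInto(const std::vector<std::uint8_t>& src, std::vector<float>& dst, float factor)
{
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = float(src[i]) * factor;
    }
}
```

A call such as medianMillis([&] { scaleInto(src, dst, factor); }, 3, 20) times only the loop. In the OpenCV version, the equivalent would be to declare out_1 and out_2 once outside the loops, so that convertTo and cv::LUT can reuse a buffer of matching size and type instead of allocating a new one on every iteration.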