Weird speed performance of convertTo and LUT

Chester_Z · September 8, 2021, 9:54pm

What I want to do

I’m optimizing the deployment of some computer-vision deep-learning models. Some of my models need their input to be normalized, so the normalization step in preprocessing became an optimization target. Pixel values lie in [0, 255], so my first method is to multiply them by 1 / 255.0. After some searching I found LUT, which in theory should be faster than floating-point arithmetic. I wrote the test program below to compare the two methods.
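For reference, the two approaches can be sketched in plain C++ without OpenCV. This is only a minimal illustration of the idea: the names normalize_multiply, build_table and normalize_lut are made up for this sketch, and it says nothing about how convertTo or cv::LUT are implemented internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Method 1: convert each 8-bit pixel to float and multiply by the scale.
std::vector<float> normalize_multiply(const std::vector<std::uint8_t>& src, float scale)
{
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = static_cast<float>(src[i]) * scale;
    }
    return dst;
}

// Build the 256-entry table once; entry v holds float(v) * scale.
std::vector<float> build_table(float scale)
{
    std::vector<float> table(256);
    for (int v = 0; v < 256; ++v) {
        table[v] = static_cast<float>(v) * scale;
    }
    return table;
}

// Method 2: replace each pixel by its precomputed table entry.
std::vector<float> normalize_lut(const std::vector<std::uint8_t>& src,
                                 const std::vector<float>& table)
{
    assert(table.size() == 256);
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = table[src[i]];
    }
    return dst;
}
```

Both functions do one load and one store per pixel; the table version trades the int-to-float conversion and multiplication for an indexed read from a 1 KB table, so whether it is actually faster depends on the hardware and on how well each loop vectorizes.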

Test code

#include "opencv2/imgcodecs/imgcodecs.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include <chrono>
#include <cstdio>   // printf
#include <cstring>  // strcmp, strrchr
#include <dirent.h>
#include <iostream>
#include <string>
#include <vector>

int getFiles(const std::string path, std::vector<std::string>& files, std::string suffix)
{
    int iFileCnt = 0;
    DIR* dirptr = NULL;
    struct dirent* dirp;

    if ((dirptr = opendir(path.c_str())) == NULL) {
        return 0;
    }
    while ((dirp = readdir(dirptr)) != NULL) {
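        // Compare only the last extension; names without a '.' are skipped
        // instead of passing NULL to strcmp.
        const char* ext = strrchr(dirp->d_name, '.');
        if ((dirp->d_type == DT_REG) && ext != NULL && 0 == strcmp(ext, suffix.c_str())) {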
            files.push_back(dirp->d_name);
        }
        ++iFileCnt;
    }
    closedir(dirptr);

    return iFileCnt;
}

int main(int argc, char* argv[])
{
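    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <picture_dir> [loop_count]" << std::endl;
        return 1;
    }
    std::string pic_dir = argv[1];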
    int loop_count = 10;
    if (argc >= 3) {
        loop_count = std::stoi(argv[2]);
    }
    float FACTOR = 1 / 255.0;

    std::vector<cv::Size> sizes = {
        {299, 299},
        {416, 416},
        {512, 512},
        {640, 640},
        {960, 540},
        {1920, 1080}
    };
    // std::vector<cv::Size> sizes = {
    //     {1920, 1080},
    //     {960, 540},
    //     {640, 640},
    //     {512, 512},
    //     {416, 416},
    //     {299, 299}
    // };

    cv::Mat table(1, 256, CV_32FC1);
    auto ptr = table.ptr<float>(0);
    for (int i = 0; i < 256; ++i) {
        ptr[i] = float(i) * FACTOR;
    }

    std::vector<std::string> pic_files;
    getFiles(pic_dir, pic_files, ".jpg");
    std::vector<cv::Mat> image_mats(pic_files.size());
    for (int i = 0; i < pic_files.size(); ++i) {
        std::string one_pic_path = pic_dir + "/" + pic_files[i];
        image_mats[i] = cv::imread(one_pic_path);
    }

    for (auto& one_size : sizes) {
        std::cout << "size: " << one_size << std::endl;

        double time_1 = 0;
        double time_2 = 0;
        for (auto& one_mat : image_mats) {
            cv::Mat tmp_image;
            cv::resize(one_mat, tmp_image, one_size);
            for (int i = 0; i < loop_count; ++i) {
                auto t_1_1 = std::chrono::steady_clock::now();
                cv::Mat out_1;
                tmp_image.convertTo(out_1, CV_32FC3, FACTOR);
                auto t_1_2 = std::chrono::steady_clock::now();
                time_1 += std::chrono::duration<double, std::milli>(t_1_2 - t_1_1).count();

                auto t_2_1 = std::chrono::steady_clock::now();
                cv::Mat out_2;
                cv::LUT(tmp_image, table, out_2);
                auto t_2_2 = std::chrono::steady_clock::now();
                time_2 += std::chrono::duration<double, std::milli>(t_2_2 - t_2_1).count();

                auto diff = cv::sum(out_1 - out_2);
                if (diff[0] > 1E-3) {
                    std::cout << diff << std::endl;
                }
            }
        }
        size_t count = loop_count * image_mats.size();
        auto average_time_1 = time_1 / count;
        auto average_time_2 = time_2 / count;
        auto promote_percent = (average_time_1 - average_time_2) / average_time_1 * 100;
        printf("total pic num: %zu, loop %d times\n", pic_files.size(), loop_count);
        printf("method_1, total  %f  ms, average  %f ms\n", time_1, average_time_1);
        printf("method_2, total  %f  ms, average  %f ms, promote: %.2f%%\n", time_2, average_time_2,
               promote_percent);
        printf("\n");
    }

    return 0;
}

Weird performance

What I want to test is the speed difference between the two methods at different input sizes; the outputs of the two methods should be equal. I used 128 pictures of different sizes for the test. Here are the odd results:

1. Result of the code above

size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total  38.872174  ms, average  0.030369 ms
method_2, total  330.688332  ms, average  0.258350 ms, promote: -750.71%

size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total  103.708926  ms, average  0.081023 ms
method_2, total  689.972421  ms, average  0.539041 ms, promote: -565.30%

size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total  267.989430  ms, average  0.209367 ms
method_2, total  450.809036  ms, average  0.352195 ms, promote: -68.22%

size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total  757.269510  ms, average  0.591617 ms
method_2, total  551.951118  ms, average  0.431212 ms, promote: 27.11%

size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total  1095.167540  ms, average  0.855600 ms
method_2, total  760.330269  ms, average  0.594008 ms, promote: 30.57%

size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total  4944.142104  ms, average  3.862611 ms
method_2, total  3471.176202  ms, average  2.711856 ms, promote: 29.79%

2. Comment out the diff part:

//auto diff = cv::sum(out_1 - out_2);
//if (diff[0] > 1E-3) {
//    std::cout << diff << std::endl;
//}

size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total  246.356823  ms, average  0.192466 ms
method_2, total  361.859598  ms, average  0.282703 ms, promote: -46.88%

size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total  516.542233  ms, average  0.403549 ms
method_2, total  719.191240  ms, average  0.561868 ms, promote: -39.23%

size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total  839.599260  ms, average  0.655937 ms
method_2, total  342.608080  ms, average  0.267663 ms, promote: 59.19%

size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total  1384.348467  ms, average  1.081522 ms
method_2, total  524.382672  ms, average  0.409674 ms, promote: 62.12%

size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total  1796.153597  ms, average  1.403245 ms
method_2, total  688.210851  ms, average  0.537665 ms, promote: 61.68%

size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total  7707.945924  ms, average  6.021833 ms
method_2, total  3812.262622  ms, average  2.978330 ms, promote: 50.54%

3. Uncomment the diff part but reverse the sizes vector

std::vector<cv::Size> sizes = {
    {1920, 1080},
    {960, 540},
    {640, 640},
    {512, 512},
    {416, 416},
    {299, 299}
};

...

auto diff = cv::sum(out_1 - out_2);
if (diff[0] > 1E-3) {
   std::cout << diff << std::endl;
}

size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total  4933.384896  ms, average  3.854207 ms
method_2, total  3563.611341  ms, average  2.784071 ms, promote: 27.77%

size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total  887.353187  ms, average  0.693245 ms
method_2, total  917.995079  ms, average  0.717184 ms, promote: -3.45%

size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total  492.562282  ms, average  0.384814 ms
method_2, total  525.089826  ms, average  0.410226 ms, promote: -6.60%

size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total  181.900041  ms, average  0.142109 ms
method_2, total  159.691528  ms, average  0.124759 ms, promote: 12.21%

size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total  77.030586  ms, average  0.060180 ms
method_2, total  221.307936  ms, average  0.172897 ms, promote: -187.30%

size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total  38.139366  ms, average  0.029796 ms
method_2, total  112.203023  ms, average  0.087659 ms, promote: -194.19%

4. Comment out the diff part and reverse the sizes vector

std::vector<cv::Size> sizes = {
    {1920, 1080},
    {960, 540},
    {640, 640},
    {512, 512},
    {416, 416},
    {299, 299}
};

...

//auto diff = cv::sum(out_1 - out_2);
//if (diff[0] > 1E-3) {
//   std::cout << diff << std::endl;
//}

size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total  8021.875493  ms, average  6.267090 ms
method_2, total  3849.222334  ms, average  3.007205 ms, promote: 52.02%

size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total  605.553580  ms, average  0.473089 ms
method_2, total  477.145896  ms, average  0.372770 ms, promote: 21.21%

size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total  268.076975  ms, average  0.209435 ms
method_2, total  169.015667  ms, average  0.132043 ms, promote: 36.95%

size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total  117.419851  ms, average  0.091734 ms
method_2, total  94.436479  ms, average  0.073778 ms, promote: 19.57%

size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total  73.963177  ms, average  0.057784 ms
method_2, total  221.397616  ms, average  0.172967 ms, promote: -199.33%

size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total  38.046131  ms, average  0.029724 ms
method_2, total  113.839007  ms, average  0.088937 ms, promote: -199.21%

Question

I know the CPU's state fluctuates, so timings will never be exactly repeatable. But why does code outside the timed region have such an influence on convertTo and LUT?

crackwitz · September 8, 2021, 10:33pm

OpenCV uses multithreading, but only if there is enough data to warrant the effort of spawning threads and distributing the work. If not, the call runs single-threaded.

That will give you “uneven” numbers, because different code paths ran for different input sizes.
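One way to make such measurements less sensitive to what happens outside the timed region is to run a few untimed warm-up calls first and report the median of the timed calls instead of their sum. The helper below is a minimal sketch of that idea in plain C++; the name median_time_ms and its parameters are made up here, and it assumes that one-time costs such as first-touch page faults or thread start-up contribute to the variation, which this thread has not confirmed.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Returns the median duration, in milliseconds, of `runs` timed calls to `fn`.
// `warmup` untimed calls run first so that one-time costs are not attributed
// to the operation being measured.
template <typename Fn>
double median_time_ms(Fn&& fn, int warmup, int runs)
{
    assert(warmup >= 0 && runs > 0);
    for (int i = 0; i < warmup; ++i) {
        fn();
    }
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    // Median: the middle sample (the upper middle one when `runs` is even).
    auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

In the test program, each timed call would be wrapped in a lambda, for example one that calls tmp_image.convertTo(out_1, CV_32FC3, FACTOR), with out_1 and out_2 declared once outside the loop so that a matching output buffer can be reused rather than allocated inside the timed region. To check the multithreading explanation directly, calling cv::setNumThreads(1) before the measurements should force every size onto the single-threaded path.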
