Weird speed performance of convertTo and LUT

What I want to do

I'm working on deployment optimization for computer-vision deep learning models. Some of my models need normalized input, so the preprocessing normalization step became an optimization target. Pixel values lie in [0, 255], so the first step is to multiply them by 1 / 255.0; that is my first method. After some searching I found LUT, which in theory should be faster than floating-point arithmetic. So I wrote the code below to compare the two methods:

Test code

#include "opencv2/imgcodecs/imgcodecs.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include <chrono>
#include <cstdio>   // printf
#include <cstring>  // C string functions used in getFiles
#include <dirent.h>
#include <iostream>
#include <string>
#include <vector>

int getFiles(const std::string path, std::vector<std::string>& files, std::string suffix)
{
    int iFileCnt = 0;
    DIR* dirptr = NULL;
    struct dirent* dirp;

    if ((dirptr = opendir(path.c_str())) == NULL) {
        return 0;
    }
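    while ((dirp = readdir(dirptr)) != NULL) {
        // Use the last '.' in the name; names without an extension are skipped.
        const char* ext = strrchr(dirp->d_name, '.');
        if ((dirp->d_type == DT_REG) && ext != NULL && 0 == strcmp(ext, suffix.c_str())) {
            files.push_back(dirp->d_name);
            ++iFileCnt;  // count only the files that were actually collected
        }
    }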
    closedir(dirptr);

    return iFileCnt;
}

int main(int argc, char* argv[])
{
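    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <picture_dir> [loop_count]" << std::endl;
        return 1;
    }
    std::string pic_dir = argv[1];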
    int loop_count = 10;
    if (argc >= 3) {
        loop_count = std::stoi(argv[2]);
    }
    float FACTOR = 1 / 255.0;

    std::vector<cv::Size> sizes = {
        {299, 299},
        {416, 416},
        {512, 512},
        {640, 640},
        {960, 540},
        {1920, 1080}
    };
    // std::vector<cv::Size> sizes = {
    //     {1920, 1080},
    //     {960, 540},
    //     {640, 640},
    //     {512, 512},
    //     {416, 416},
    //     {299, 299}
    // };

    cv::Mat table(1, 256, CV_32FC1);
    auto ptr = table.ptr<float>(0);
    for (int i = 0; i < 256; ++i) {
        ptr[i] = float(i) * FACTOR;
    }

    std::vector<std::string> pic_files;
    getFiles(pic_dir, pic_files, ".jpg");
    std::vector<cv::Mat> image_mats(pic_files.size());
    for (int i = 0; i < pic_files.size(); ++i) {
        std::string one_pic_path = pic_dir + "/" + pic_files[i];
        image_mats[i] = cv::imread(one_pic_path);
    }

    for (auto& one_size : sizes) {
        std::cout << "size: " << one_size << std::endl;

        double time_1 = 0;
        double time_2 = 0;
        for (auto& one_mat : image_mats) {
            cv::Mat tmp_image;
            cv::resize(one_mat, tmp_image, one_size);
            for (int i = 0; i < loop_count; ++i) {
                auto t_1_1 = std::chrono::steady_clock::now();
                cv::Mat out_1;
                tmp_image.convertTo(out_1, CV_32FC3, FACTOR);
                auto t_1_2 = std::chrono::steady_clock::now();
                time_1 += std::chrono::duration<double, std::milli>(t_1_2 - t_1_1).count();

                auto t_2_1 = std::chrono::steady_clock::now();
                cv::Mat out_2;
                cv::LUT(tmp_image, table, out_2);
                auto t_2_2 = std::chrono::steady_clock::now();
                time_2 += std::chrono::duration<double, std::milli>(t_2_2 - t_2_1).count();

                auto diff = cv::sum(out_1 - out_2);
                if (diff[0] > 1E-3) {
                    std::cout << diff << std::endl;
                }
            }
        }
        size_t count = loop_count * image_mats.size();
        auto average_time_1 = time_1 / count;
        auto average_time_2 = time_2 / count;
        auto promote_percent = (average_time_1 - average_time_2) / average_time_1 * 100;
        printf("total pic num: %zu, loop %d times\n", pic_files.size(), loop_count);
        printf("method_1, total  %f  ms, average  %f ms\n", time_1, average_time_1);
        printf("method_2, total  %f  ms, average  %f ms, promote: %.2f%%\n", time_2, average_time_2,
               promote_percent);
        printf("\n");
    }

    return 0;
}

Weird performance

What I want to test is the speed difference between the two methods at different input sizes; the outputs of the two methods should be equal. I used 128 pictures of various sizes for the test. Here are the odd results:
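A side note on the equality check in the test code: `cv::sum(out_1 - out_2)` followed by `diff[0] > 1E-3` looks only at the signed sum of the first channel, so positive and negative differences can cancel and the other channels are never inspected. The sketch below shows the effect on plain float vectors; `signed_diff_sum` and `max_abs_diff` are hypothetical helper names. In OpenCV, something like `cv::norm(out_1, out_2, cv::NORM_INF)` should give a comparable maximum-difference measure.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Signed sum of element-wise differences, analogous to summing (a - b).
// Positive and negative errors can cancel each other out.
double signed_diff_sum(const std::vector<float>& a, const std::vector<float>& b)
{
    double sum = 0.0;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        sum += static_cast<double>(a[i]) - static_cast<double>(b[i]);
    }
    return sum;
}

// Largest absolute element-wise difference.
// Returns infinity when the sizes differ, and 0 only when all elements match.
double max_abs_diff(const std::vector<float>& a, const std::vector<float>& b)
{
    if (a.size() != b.size()) {
        return std::numeric_limits<double>::infinity();
    }
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = std::fabs(static_cast<double>(a[i]) - static_cast<double>(b[i]));
        worst = std::max(worst, d);
    }
    return worst;
}
```

For the inputs {0, 1} and {1, 0} the signed sum is 0 although the vectors clearly differ, while the maximum absolute difference is 1.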

1. Result of the code above

size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total  38.872174  ms, average  0.030369 ms
method_2, total  330.688332  ms, average  0.258350 ms, promote: -750.71%

size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total  103.708926  ms, average  0.081023 ms
method_2, total  689.972421  ms, average  0.539041 ms, promote: -565.30%

size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total  267.989430  ms, average  0.209367 ms
method_2, total  450.809036  ms, average  0.352195 ms, promote: -68.22%

size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total  757.269510  ms, average  0.591617 ms
method_2, total  551.951118  ms, average  0.431212 ms, promote: 27.11%

size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total  1095.167540  ms, average  0.855600 ms
method_2, total  760.330269  ms, average  0.594008 ms, promote: 30.57%

size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total  4944.142104  ms, average  3.862611 ms
method_2, total  3471.176202  ms, average  2.711856 ms, promote: 29.79%

2. Comment out the diff part

//auto diff = cv::sum(out_1 - out_2);
//if (diff[0] > 1E-3) {
//    std::cout << diff << std::endl;
//}
size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total  246.356823  ms, average  0.192466 ms
method_2, total  361.859598  ms, average  0.282703 ms, promote: -46.88%

size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total  516.542233  ms, average  0.403549 ms
method_2, total  719.191240  ms, average  0.561868 ms, promote: -39.23%

size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total  839.599260  ms, average  0.655937 ms
method_2, total  342.608080  ms, average  0.267663 ms, promote: 59.19%

size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total  1384.348467  ms, average  1.081522 ms
method_2, total  524.382672  ms, average  0.409674 ms, promote: 62.12%

size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total  1796.153597  ms, average  1.403245 ms
method_2, total  688.210851  ms, average  0.537665 ms, promote: 61.68%

size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total  7707.945924  ms, average  6.021833 ms
method_2, total  3812.262622  ms, average  2.978330 ms, promote: 50.54%

3. Uncomment the diff part but reverse the sizes vector

std::vector<cv::Size> sizes = {
        {1920, 1080},
        {960, 540},
        {640, 640},
        {512, 512},
        {416, 416},
        {299, 299}
    };

...

auto diff = cv::sum(out_1 - out_2);
if (diff[0] > 1E-3) {
   std::cout << diff << std::endl;
}

size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total  4933.384896  ms, average  3.854207 ms
method_2, total  3563.611341  ms, average  2.784071 ms, promote: 27.77%

size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total  887.353187  ms, average  0.693245 ms
method_2, total  917.995079  ms, average  0.717184 ms, promote: -3.45%

size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total  492.562282  ms, average  0.384814 ms
method_2, total  525.089826  ms, average  0.410226 ms, promote: -6.60%

size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total  181.900041  ms, average  0.142109 ms
method_2, total  159.691528  ms, average  0.124759 ms, promote: 12.21%

size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total  77.030586  ms, average  0.060180 ms
method_2, total  221.307936  ms, average  0.172897 ms, promote: -187.30%

size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total  38.139366  ms, average  0.029796 ms
method_2, total  112.203023  ms, average  0.087659 ms, promote: -194.19%

4. Comment out the diff part and reverse the sizes vector

std::vector<cv::Size> sizes = {
        {1920, 1080},
        {960, 540},
        {640, 640},
        {512, 512},
        {416, 416},
        {299, 299}
    };

...

//auto diff = cv::sum(out_1 - out_2);
//if (diff[0] > 1E-3) {
//   std::cout << diff << std::endl;
//}

size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total  8021.875493  ms, average  6.267090 ms
method_2, total  3849.222334  ms, average  3.007205 ms, promote: 52.02%

size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total  605.553580  ms, average  0.473089 ms
method_2, total  477.145896  ms, average  0.372770 ms, promote: 21.21%

size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total  268.076975  ms, average  0.209435 ms
method_2, total  169.015667  ms, average  0.132043 ms, promote: 36.95%

size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total  117.419851  ms, average  0.091734 ms
method_2, total  94.436479  ms, average  0.073778 ms, promote: 19.57%

size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total  73.963177  ms, average  0.057784 ms
method_2, total  221.397616  ms, average  0.172967 ms, promote: -199.33%

size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total  38.046131  ms, average  0.029724 ms
method_2, total  113.839007  ms, average  0.088937 ms, promote: -199.21%

Question

I know the CPU's state can fluctuate, so timings will not be exactly reproducible. But why does code outside the timed region have such a strong influence on the measured speed of convertTo and LUT?

OpenCV uses multithreading, but only if there is enough data to warrant the effort of spawning threads and distributing the work. If not, the call runs single-threaded.

That will give you "uneven" numbers, because different things happen from one case to the next.
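One way to make such measurements less sensitive to what happens around them is to warm up before timing and to report a median instead of an accumulated total. The helper below is a generic sketch, not something taken from the thread; `median_time_ms` is a hypothetical name and `work` stands for whatever call is being measured.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `work` untimed `warmup_runs` times, then times `runs` executions
// and returns the median duration in milliseconds (0 if runs <= 0).
template <typename Work>
double median_time_ms(Work&& work, int warmup_runs, int runs)
{
    if (runs <= 0) {
        return 0.0;
    }
    // Warm-up: lets caches, allocators and thread pools settle before timing.
    for (int i = 0; i < warmup_runs; ++i) {
        work();
    }
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    // The median is less affected by occasional slow outliers than the mean.
    auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

In the test program from the question, `work` could be a lambda that calls `tmp_image.convertTo(out_1, CV_32FC3, FACTOR)` with `out_1` declared outside the lambda, so that the output buffer can be reused between runs. To check whether threading explains the difference, one could also call `cv::setNumThreads(1)` before measuring and compare the numbers; whether that removes the effect seen here is an assumption that would need to be verified.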