OpenMP burning CPU cycles: cv::resize() is very slow unless cv::imshow() is used

That’s right: cv::resize() takes 2ms when its output is then shown using cv::imshow(). Removing the imshow() call slows resize down to 10 to 20ms.

What could be going on?

wow, magic :slight_smile:

please add details (OS, OpenCV version, GUI backend) and an MRE
so we can cross-check, ty.

Minimal Reproducible Example

It turns out that doing the resize in a separate thread is necessary to reproduce the slow resize behavior. In the main thread, resize is fast; in another thread it is much slower, even when the main thread is not doing anything.

#include <chrono>
#include <cstdio>
#include <opencv2/core/mat.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/videoio.hpp>
#include <opencv2/highgui.hpp>
#include <thread>

void resize(cv::Mat &img_in, cv::Mat &img_out) {
    auto start = std::chrono::steady_clock::now();

    // Downscale to 640x480 with area interpolation; this is the call being timed.
    cv::resize(img_in, img_out, cv::Size(640, 480), 0, 0, cv::INTER_AREA);

    auto end = std::chrono::steady_clock::now();
    long elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    printf("took %ldms\n", elapsed_ms);
}

void loop() {
    cv::Mat img_in(1080, 1920, CV_8UC3, cv::Scalar(255, 255, 255));
    cv::Mat img_out(480, 640, CV_8UC3);
    
    for (;;) {
        resize(img_in, img_out);
    }
}

int main() {
    bool use_thread = true;

    if (use_thread) {
        std::thread thread(loop);
        thread.join();
    } else {
        loop();
    }
}

My output with use_thread = true:

...
took 26ms
took 26ms
took 27ms
took 26ms
took 26ms
took 27ms
took 26ms
took 26ms
took 19ms
took 27ms

My output with use_thread = false:

...
took 3ms
took 2ms
took 3ms
took 3ms
took 3ms
took 2ms
took 3ms
took 2ms
took 3ms
took 3ms
took 3ms
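
The timings above are printed as whole milliseconds, so the difference between a 2ms and a 3ms run may be smaller than it looks. Below is a minimal sketch of a fractional-millisecond timer that uses only the standard library. The name time_ms is a hypothetical helper, not part of the original example, and it has not been run against the resize loop here.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: calls fn once and returns the elapsed wall time
// in milliseconds as a double, so sub-millisecond detail is not truncated.
template <typename Fn>
double time_ms(Fn &&fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

In the example it could presumably wrap the call as time_ms([&] { cv::resize(img_in, img_out, cv::Size(640, 480), 0, 0, cv::INTER_AREA); }) and print the result with "%.3f".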

OS Information

System:
  Kernel: 5.15.0-107-generic x86_64 bits: 64 compiler: gcc v: 11.4.0 Desktop: Cinnamon 6.0.4
    tk: GTK 3.24.33 wm: muffin vt: 7 dm: LightDM 1.30.0 Distro: Linux Mint 21.3 Virginia
    base: Ubuntu 22.04 jammy

OpenCV version is 4.9.0 from NixOS with GTK2 enabled.

If I’m interpreting the profile correctly, in the slow version (1.6s runtime) 97% of the time is spent in gomp_team_barrier_wait_end and gomp_barrier_wait_end.
Only 2.7% is spent in cv::ResizeArea_Invoker<unsigned char, float>::operator()(cv::Range const&) const.

In the fast version (0.5s runtime), 9.4% is spent in cv::ResizeArea and 90% in the gomp functions.

A workaround is to reduce the number of threads to something reasonable, such as 10, using cv::setNumThreads().
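
A minimal sketch of that workaround, assuming the cap of 10 mentioned above is appropriate for the machine. capped_thread_count is a hypothetical helper, not an OpenCV function. The only OpenCV call involved is cv::setNumThreads(int), shown in a comment so the block compiles without OpenCV.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper (not part of OpenCV): clamp the detected hardware
// thread count to a chosen cap. std::thread::hardware_concurrency() may
// return 0 when the count is unknown, so fall back to 1 in that case.
inline int capped_thread_count(unsigned hardware_threads, int cap) {
    assert(cap >= 1);
    if (hardware_threads == 0) return 1;
    return static_cast<int>(
        std::min<unsigned>(hardware_threads, static_cast<unsigned>(cap)));
}

// Intended use at the top of main(), before the first cv::resize() call:
//   const int n = capped_thread_count(std::thread::hardware_concurrency(), 10);
//   cv::setNumThreads(n);  // declared in <opencv2/core/utility.hpp>
```

Whether 10 is the right cap depends on the machine; the value is taken from the post above and is not a tuned recommendation.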

There are no gui functions in my minimal example.

Does the example reproduce for you?