MergeMertens slows down in multithreading application


In an application with 4 independent threads, I have the problem that MergeMertens gets slower the more threads are active.

If I run MergeMertens in a single thread, it takes 3400 ms.
If I run the same code in 4 independent threads at the same time, each thread takes 8300 ms.

I am using an Intel Core i9-10900K with 64 GB RAM, so the performance should be sufficient.
I wrote my example in C# with Emgu, but that should not be the problem either.

Can this be reproduced?
Does anyone have an explanation for me?

Here is the link to my test code: HDR

OpenCV already uses threads to accelerate computations.

Thank you for your answer, but unfortunately that was not my question. Maybe I did not describe the problem clearly enough.

My app disables OpenCV's multithreading:

    CvInvoke.NumThreads = 1;

I have 1 processing routine:

    private void Calc()
    {
        Image<Rgb, byte> img1 = new Image<Rgb, byte>(@"HDR1.bmp");
        Image<Rgb, byte> img2 = new Image<Rgb, byte>(@"HDR2.bmp");
        Image<Rgb, byte> img3 = new Image<Rgb, byte>(@"HDR3.bmp");

        // Start the stopwatch once the images are loaded.
        System.Diagnostics.Stopwatch watch = System.Diagnostics.Stopwatch.StartNew();

        List<Mat> array = new List<Mat>() { img1.Mat, img2.Mat, img3.Mat };

        var step1 = watch.ElapsedMilliseconds;

        // HDR computation
        using (VectorOfMat images = new VectorOfMat(array.ToArray()))
        using (MergeMertens mergeMertens = new MergeMertens())
        {
            Mat hdrResult = new Mat();
            mergeMertens.Process(images, hdrResult);
        }

        var step2 = watch.ElapsedMilliseconds;
        System.Diagnostics.Debug.Print("Step1: {0} Step2: {1}", step1, step2);
    }

If I start the routine once, the evaluation takes approx. 4 s:

    Thread thread1 = new Thread(Calc);

If I start 4 threads at the same time, the time is 7-8 s:

    Thread thread1 = new Thread(Calc);
    Thread thread2 = new Thread(Calc);
    Thread thread3 = new Thread(Calc);
    Thread thread4 = new Thread(Calc);

Since my processor is strong enough, I would have expected a time loss, but not a factor of 2x.

For this I am looking for an explanation :slight_smile:

i rather think, this IS the problem.

i have no idea, what this does in c#, but it does NOT disable opencv’s internal parallelism in C++
(i also do not understand, how you want to start 3 threads after disabling this …)

in general, opencv has a lot of data-parallel optimization builtin. wrapping yet another thread-parallel algorithm around it rarely improves the situation

Thank you for your Info.

CvInvoke.NumThreads Property:
Get or set the number of threads that are used by parallelized OpenCV functions.

I would like to convert 4 different images at once. It is important that the evaluation runs as stable as possible.
If MultiThreading is active, then the evaluation fluctuates extremely, because too many threads are involved.
If I switch NumThreads to 1, then the values are much more stable.

How would you handle this in C++? Maybe this info will help me.

perform this experiment:

  1. you start no threads of your own. you do not set NumThreads. you operate on one image. measure the time.

  2. you start no threads of your own. you do set NumThreads, to 1. you operate on one image. measure the time.

restart the program for each test. do not put both tests in the same program.

what are the times?

it was but my answer may not make sense to you.

do you understand that opencv may be starting its own threads, many of them? don’t be sure that NumThreads necessarily has an effect. it’s supposed to, but don’t rely on it.

do you understand that if you start four calls, and each call uses several threads (probably as many as your CPU has cores), that WILL create more threads than can be executed independently on your CPU?

besides… you need to be a lot more exact in stating what you measure. did you measure the time for each individual call, and those four times are around 8 seconds each? or did you measure the time to complete all four calls?

if one call takes 4 seconds, and 4 calls in parallel take 8 seconds total, you saved time already, because 4x 4 seconds would have been 16 seconds.

all of this is multithreading basics.
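To make the distinction between per-call time and total time concrete, here is a minimal C# sketch. It uses only the standard library; `Work` is a hypothetical stand-in for the real routine (for example `Calc`), and the loop size is an arbitrary assumption.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class TimingSketch
{
    // Hypothetical stand-in for the real workload (e.g. the Calc routine).
    static double Work()
    {
        double x = 0;
        for (int i = 0; i < 50_000_000; i++) x += Math.Sqrt(i);
        return x;
    }

    static void Main()
    {
        const int n = 4;
        long[] perCallMs = new long[n];
        double[] results = new double[n];
        Thread[] threads = new Thread[n];

        // Wall-clock time for all calls together.
        Stopwatch total = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            int idx = i; // copy the loop variable so each lambda sees its own index
            threads[i] = new Thread(() =>
            {
                // Time for this individual call only.
                Stopwatch sw = Stopwatch.StartNew();
                results[idx] = Work();
                perCallMs[idx] = sw.ElapsedMilliseconds;
            });
            threads[i].Start();
        }
        foreach (Thread t in threads) t.Join();
        total.Stop();

        for (int i = 0; i < n; i++)
            Console.WriteLine($"Call {i}: {perCallMs[i]} ms");
        Console.WriteLine($"All {n} calls: {total.ElapsedMilliseconds} ms wall time");
    }
}
```

If the per-call times are each about 8 s and the wall time is also about 8 s, the four calls really ran side by side; compare that wall time against four times the single-call time to see the actual gain.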

Thank you for your feedback!
I understand that one call may generate several threads. I am trying to understand whether this is controllable or an as-is situation.

I always measure the time of one call. 4 calls in parallel → each takes 8s → my total time is about 8 seconds.

To your question: the application was restarted for every test.

  1. NumThreads not set, 1 single operation: 2827ms
  2. NumThreads = 1, 1 single operation: 4369ms
  3. NumThreads = 1, 4 parallel operations of different images: 8037ms, 8051ms, 8111ms, 8129ms
  4. NumThreads not set, 4 parallel operations of different images: 5080ms, 8173ms, 9660ms, 9660ms

My interpretation is:

  • NumThreads has a function because of the difference between 1. + 2.
  • I can also see that 4. overloads the CPU because of 4x (max threads). That's why the results of 3. are more stable

But I do not understand why 2 is much slower because in theory I have enough cores in my system. I am not sure if I am doing something wrong.

Is there a way to find out how many threads are generated by one call?
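One rough way to approach this from C# is to sample the operating-system thread count of the process while the call runs. The sketch below assumes only the standard `System.Diagnostics.Process` API; `Work` is a hypothetical placeholder for the real call (for example `mergeMertens.Process(...)`). The number covers every OS thread in the process, including .NET runtime threads, so only the difference against the baseline is meaningful, and native worker threads may stay alive after the call if the library keeps a thread pool.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class ThreadCountSketch
{
    // Number of OS threads currently alive in this process (fresh snapshot).
    static int CurrentOsThreadCount()
    {
        using (Process p = Process.GetCurrentProcess())
            return p.Threads.Count;
    }

    // Hypothetical placeholder: put the call to be observed here.
    static void Work()
    {
        Thread.Sleep(500);
    }

    static void Main()
    {
        var ready = new ManualResetEventSlim(false);
        var done = new ManualResetEventSlim(false);
        int baseline = 0, peak = 0;

        // Monitor thread: takes a baseline (which already includes itself),
        // then samples every 20 ms until the work has finished.
        Thread monitor = new Thread(() =>
        {
            baseline = peak = CurrentOsThreadCount();
            ready.Set();
            while (!done.Wait(20))
                peak = Math.Max(peak, CurrentOsThreadCount());
        });
        monitor.Start();
        ready.Wait();

        Work();

        done.Set();
        monitor.Join();

        Console.WriteLine($"Baseline threads: {baseline}");
        Console.WriteLine($"Peak threads during call: {peak}");
        Console.WriteLine($"Extra threads observed: {peak - baseline}");
    }
}
```

Because this is sampling, short-lived threads can be missed; a process viewer or profiler that lists threads per process gives a second opinion.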