OpenCV performance issues

The task at hand is simple: take a RAW image from a camera, debayer it, and multiply every pixel by some value (say 1.5) because the image is a bit dark. Easy.

So I wrote a benchmark in C++ to test the performance. In a loop, my benchmark test

  • invokes cvtColor with COLOR_BayerBG2RGB on the RAW input image and
  • invokes convertScaleAbs with alpha=1.5 and beta=0.0 on the RGB888 image.

I measure how many times per second I can run each operation on its own, and both combined. The input image is 20 megapixels (5000x4000), 8 bit.
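A minimal sketch of how such a measurement could be taken, using only the C++ standard library. The helper name measureFps and the fixed iteration count are assumptions made for illustration, not the benchmark code described above.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: calls op() `iterations` times and returns
// the number of calls per second, measured with a monotonic clock.
template <typename Op>
double measureFps(Op&& op, int iterations) {
    assert(iterations > 0);
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        op();
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return iterations / elapsed.count();
}
```

With OpenCV available, the operation under test would be passed as a lambda, for example measureFps([&] { cv::cvtColor(raw, rgb, cv::COLOR_BayerBG2RGB); }, 200), where raw and rgb are cv::Mat objects allocated before the loop.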

To my surprise, cvtColor was blazingly fast, even though debayering is not a simple task. The result was 400fps. It didn’t take long until I noticed that debayering was using all of my 24 logical CPU cores. Calling setNumThreads(12) improved the speed even further, to 480fps.

The next surprise was that convertScaleAbs is relatively slow. It uses only a single CPU core. The result was just 127fps. Together, both operations clock in at only 100fps (12 cores) and 95fps (24 cores). As you can see, convertScaleAbs is the limiting factor here.

Using OpenCV’s parallel_for_, I stitched together a parallel version of convertScaleAbs. It splits the image into as many row ranges as getNumThreads() returns and processes them in parallel. It achieves 900fps with both 24 and 12 cores. I don’t know why it performs the same regardless of the number of cores used. Probably I’m hitting some sort of memory bandwidth limit.
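The row-range idea can be sketched without OpenCV. The version below swaps parallel_for_ for plain std::thread and works on a raw 8-bit buffer, so it is not the code measured above; the names scaleRows and scaleParallel are hypothetical. It computes saturate(|src * alpha + beta|) per byte, which mirrors what convertScaleAbs does, although rounding of exact .5 ties may differ from OpenCV.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Scales rows [rowBegin, rowEnd): dst = saturate(|src * alpha + beta|).
void scaleRows(const std::uint8_t* src, std::uint8_t* dst, std::size_t rowBytes,
               int rowBegin, int rowEnd, float alpha, float beta) {
    const std::size_t begin = static_cast<std::size_t>(rowBegin) * rowBytes;
    const std::size_t end = static_cast<std::size_t>(rowEnd) * rowBytes;
    for (std::size_t i = begin; i < end; ++i) {
        const float v = std::fabs(src[i] * alpha + beta);
        // Clamp first, then round, so the result always fits into 8 bits.
        dst[i] = static_cast<std::uint8_t>(std::lround(std::min(v, 255.0f)));
    }
}

// Splits the image into at most nThreads contiguous row bands, one thread each.
void scaleParallel(const std::vector<std::uint8_t>& src, std::vector<std::uint8_t>& dst,
                   int rows, std::size_t rowBytes, float alpha, float beta, int nThreads) {
    assert(rows >= 0 && nThreads > 0);
    assert(src.size() == static_cast<std::size_t>(rows) * rowBytes);
    dst.resize(src.size());
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t) {
        const int rowBegin = static_cast<int>(static_cast<long long>(rows) * t / nThreads);
        const int rowEnd = static_cast<int>(static_cast<long long>(rows) * (t + 1) / nThreads);
        if (rowBegin == rowEnd) continue;  // more threads than rows
        workers.emplace_back(scaleRows, src.data(), dst.data(), rowBytes,
                             rowBegin, rowEnd, alpha, beta);
    }
    for (auto& w : workers) w.join();  // bands are disjoint, so no locking is needed
}
```

Each band touches its own rows only, so the work is embarrassingly parallel. That also means the loop does very little arithmetic per byte moved, which fits the suspicion that throughput stops scaling once memory traffic saturates.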

Together, the parallel debayering and the parallel scaling clock in at 270fps (24 cores) and 300fps (12 cores). This is roughly what we would expect: 1/( 1/480 + 1/900 ) = 313.
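The estimate above treats the two stages as running back to back, so their per-frame times add up. A small sketch of that model, with the hypothetical name combinedFps:

```cpp
#include <cassert>
#include <vector>

// Combined rate of stages that run one after another on each frame:
// per-frame times add up, so the rates combine harmonically.
double combinedFps(const std::vector<double>& stageFps) {
    assert(!stageFps.empty());
    double secondsPerFrame = 0.0;
    for (double fps : stageFps) {
        assert(fps > 0.0);
        secondsPerFrame += 1.0 / fps;
    }
    return 1.0 / secondsPerFrame;
}
```

Calling combinedFps({480.0, 900.0}) reproduces the figure of about 313. The same model gives roughly 100 for the single-threaded scaling case (480 and 127) and roughly 277 for 24 cores (400 and 900), both close to the measured 100fps and 270fps.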

The first issue I have with OpenCV is the implicit multithreading.
Which functions will automatically be parallel? Which functions should I (and am I allowed to) wrap inside a parallel_for_? Why has cvtColor been chosen to be parallelized while other functions have not? As you can clearly see, a parallel version of convertScaleAbs has benefits. Is cvtColor completely parallelized, or just the debayering? Where can I read about it?
By default, OpenCV seems to be overambitious. Reducing the number of threads actually improves performance, so the defaults seem to be a poor choice for my system (my CPU is a Ryzen 9 5900X with 12 cores and 24 threads).
Also, I find the lack of control disturbing. Eventually I might run 4 image processing pipelines in parallel. The overall throughput may be satisfactory, but latency may not. A single debayering operation of one pipeline will hog the entire threadpool for a while. It may be advantageous to have small independent thread pools per pipeline. But I see no way to specify a thread pool or the number of threads to be used by a cvtColor operation. Can this be achieved?

If it is possible, I would like to take explicit control over multithreading.
I would like to have a separate thread pool for each processing pipeline, and I would like to control how many threads each pipeline uses.
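At the application level, the wish above could look like the following sketch: a small fixed-size worker pool that each pipeline owns, built from the C++ standard library only. The class name PipelinePool and its forRowBands method are hypothetical and not part of OpenCV. The sketch assumes one caller per pool at a time and tasks that do not throw.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical per-pipeline pool with a fixed number of worker threads.
class PipelinePool {
public:
    explicit PipelinePool(std::size_t nThreads) {
        const std::size_t n = std::max<std::size_t>(1, nThreads);
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }
    ~PipelinePool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_all();
        for (auto& w : workers_) w.join();
    }
    PipelinePool(const PipelinePool&) = delete;
    PipelinePool& operator=(const PipelinePool&) = delete;

    // Splits [0, rows) into one band per worker, runs fn(begin, end) on each
    // band, and blocks until every band has finished.
    void forRowBands(int rows, const std::function<void(int, int)>& fn) {
        assert(rows >= 0);
        const long long n = static_cast<long long>(workers_.size());
        {
            std::lock_guard<std::mutex> lock(mutex_);
            for (long long i = 0; i < n; ++i) {
                const int begin = static_cast<int>(rows * i / n);
                const int end = static_cast<int>(rows * (i + 1) / n);
                if (begin == end) continue;
                ++pending_;
                tasks_.push([&fn, begin, end] { fn(begin, end); });
            }
        }
        wake_.notify_all();
        std::unique_lock<std::mutex> lock(mutex_);
        done_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stop requested and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --pending_;
            }
            done_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable done_;
    std::queue<std::function<void()>> tasks_;
    int pending_ = 0;
    bool stop_ = false;
    std::vector<std::thread> workers_;
};
```

Two pipelines would each construct their own pool, for example PipelinePool poolA(6) and PipelinePool poolB(6), and call forRowBands for per-pixel operations such as the scaling step. This only gives real control if OpenCV's internal threading is turned down at the same time; per its documentation, cv::setNumThreads(1) should make OpenCV functions run sequentially. Whether cvtColor can be split this way is a separate question: debayering reads neighbouring rows, so naive row bands would need overlapping borders.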

I already tried to use OpenCL from multiple threads. If I call setNumThreads(12) and have 3 of my own threads spinning on debayering, I observe that 14 cores are being used. My guess is the following: setNumThreads(12) creates a thread pool of 11 threads that are used in addition to the threads that start the debayering. So in total we end up with 11 + 3 = 14 threads. That’s somewhat counter-intuitive. I have also already gotten some OpenCL-related error messages when using multiple threads. Is OpenCV expecting me to process all data in a single thread?

Any hints are appreciated. Thanks in advance!