Is it the case that the in-place RGB->BGR color conversion routine in OpenCV saves some memory, but takes longer? If yes, can anyone explain why?
My application calls the cv::cvtColor(srcMat, dstMat, cv::COLOR_RGB2BGR) routine in OpenCV (version 4.2.0). In an effort to make the application faster, I tried the in-place version of this routine (by invoking it with the same Mat object for source and destination). I expected the speed to slightly improve, since the in-place version does not allocate new memory.
To test my expectation, I ran my application in a loop over 10,000 250x250 RGB images. To my surprise, my application became slower when the in-place version was used. In fact, I saw that the larger the image (500x500 vs 250x250), the greater the difference between the in-place and regular version.
Is this expected? If so, is it because the in-place version does a swap operation (more statements) and the regular version is only a copy operation?
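To make the hypothesis concrete, here is a minimal sketch of the two loop shapes I have in mind, written in plain C++ on a flat byte buffer. The function names are hypothetical and this is not OpenCV's actual implementation, which may be vectorized and may handle the in-place case in a completely different way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Out-of-place: read each 3-byte pixel from src and write it reversed to dst.
// src and dst are assumed to be distinct buffers.
void rgb_to_bgr_copy(const std::vector<std::uint8_t>& src,
                     std::vector<std::uint8_t>& dst) {
    assert(src.size() % 3 == 0);
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); i += 3) {
        dst[i]     = src[i + 2];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i];
    }
}

// In-place: swap the first and third byte of each pixel in the same buffer.
void rgb_to_bgr_inplace(std::vector<std::uint8_t>& buf) {
    assert(buf.size() % 3 == 0);
    for (std::size_t i = 0; i < buf.size(); i += 3) {
        std::swap(buf[i], buf[i + 2]);
    }
}
```

Both functions produce the same pixel values. Whether one is faster than the other depends on the compiler, the CPU and the buffer size, so this only illustrates the question and does not answer it.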
Would anyone be willing to try to reproduce this behavior? It can be done easily by timing the following snippet in two ways: 1) use the snippet as is, and 2) follow the brief instructions in its comments to get the in-place version.
// Read image
Mat srcMat = imread(filename);
// Comment out this line for the in-place version
Mat dstMat;
for (int i = 0; i < 10000; i++)
    // Use srcMat instead of dstMat in the in-place version
    cv::cvtColor(srcMat, dstMat, cv::COLOR_RGB2BGR);
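For anyone reproducing this, a small timing helper along these lines could be used to get a per-call figure for each variant. It is only a sketch using the standard library; the name mean_microseconds is hypothetical and it does no warm-up or outlier handling.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `body` `iterations` times and returns the
// mean wall-clock time per call in microseconds.
double mean_microseconds(int iterations, const std::function<void()>& body) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        body();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

The loop above would then be timed with something like `mean_microseconds(10000, [&] { cv::cvtColor(srcMat, dstMat, cv::COLOR_RGB2BGR); });`, once for each variant.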
I can’t say if it’s “expected”. you’d have to read the source to find truths.
it’s not “unexpected”.
I can speculate that it’s a concurrent function, i.e. uses multiple threads.
CPUs contain cache. cache contains “cache lines”. that’s the granularity of the cache. when one thread reads and another writes to memory in the same cache line, you get “cache contention”.
random article I found that covers “false sharing”: Understanding Cache Contention and Cache Profiling Metrics - Oracle® Developer Studio 12.6: Performance Analyzer Tutorials
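a minimal sketch of false sharing in plain C++, to show the effect being speculated about. it has nothing to do with OpenCV's code. the struct and function names are made up, and the 64-byte cache line size is an assumption that holds on most x86 CPUs but not everywhere.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>

// Two counters adjacent in memory; they very likely share one cache line.
struct PackedCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Same counters, each aligned to 64 bytes (assumed cache line size),
// so they land on separate cache lines.
struct PaddedCounters {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

// Two threads each increment their own counter n times.
// Returns the sum of both counters after the threads finish.
template <typename Counters>
std::uint64_t hammer(Counters& c, std::uint64_t n) {
    assert(n > 0);
    auto bump = [n](std::atomic<std::uint64_t>& x) {
        for (std::uint64_t i = 0; i < n; ++i) {
            x.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::thread t1(bump, std::ref(c.a));
    std::thread t2(bump, std::ref(c.b));
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

both variants compute the same result. if you time hammer() on each struct with a large n, the packed one is usually slower, because the two cores keep invalidating each other's copy of the shared cache line even though they never touch the same variable.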
I would hope that for such small array sizes it does not distribute the work across multiple threads…
you can try cv::setNumThreads (see OpenCV: Utility and system functions and macros).
this also lists the handful of backends OpenCV can use for parallelism. it seems that for OpenMP you might have to use an environment variable.
it would be a good idea to use a destination matrix different from source matrix, and keep them both around to be reused (written into) on the next iteration. that saves allocation/deallocation, if you’re worried about that.
most OpenCV functions (particularly cvtColor) will resize an argument if its size is unsuitable (leave it otherwise) so you don’t have to think about giving output matrices a size when they’re filled from OpenCV functions anyway.
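the "resize only if unsuitable" behaviour can be sketched with a plain buffer type. this is a hypothetical stand-in, not cv::Mat itself: the Buffer struct, its create() method and the fixed 3 bytes per pixel are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an output image with 3 bytes per pixel.
struct Buffer {
    std::vector<std::uint8_t> data;
    int rows = 0;
    int cols = 0;
    int allocations = 0;  // counts how often storage was (re)allocated

    // (Re)allocate only when the requested shape differs from the current one.
    void create(int r, int c) {
        assert(r >= 0 && c >= 0);
        if (r == rows && c == cols) {
            return;  // shape already suitable: keep the existing storage
        }
        data.assign(static_cast<std::size_t>(r) * c * 3, 0);
        rows = r;
        cols = c;
        ++allocations;
    }
};
```

calling create(250, 250) on the same Buffer in every iteration allocates once, on the first call. a destination that is kept around between iterations therefore pays for allocation only when the image size changes.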
with setNumThreads(1), the timing is still different: in-place ~120 µs, regular 73 µs.
with setUseOptimized(false), the inequality remains.
I don’t know what it does when it’s told to work in-place. the code might do something unexpected.
Thank you @crackwitz.
Following up, I have now tried a destination matrix that persists across iterations (as you suggested). The performance of this new version is better than the in-place version (where srcMat and dstMat are the same), but it’s still slightly worse than the regular version. I’m still investigating all of this, but so far, it seems that not allocating new memory for the destination matrix isn’t improving performance.
unclear. more precision please.
My application is ~5% slower when dstMat persists across iterations (but is a different object from srcMat), compared to when a new dstMat is allocated on every iteration.