Better times in sequential vs parallel code?

I’m working on a project using OpenCV, where I apply a filter to an image and measure the time it takes for this process. I then collect the top 1% of the best results among different algorithms. However, I’m facing an issue where the execution times of the sequential and parallel algorithms seem contradictory.

The algorithms are being executed in the color_rgb.simd.hpp file in the cvtBGRtoGray function at line 1233. To measure the time, I used high_resolution_clock.

This is the sequential code:



uchar media;
long last_byte = (width * height) * 3;
for (long i = 0, j = 0; i < last_byte; i += 3, j++) {
    media = (src_data[i] + src_data[i+1] + src_data[i+2]) / 3;
    dst_data[j] = media;
}

And this is the parallel code:


const ponto *inicio = reinterpret_cast<const ponto*>(src_data);
const ponto *fim = reinterpret_cast<const ponto*>(src_data + (width * height) * 3);
uchar *destino = dst_data;

std::transform(std::execution::par_unseq, inicio, fim, destino, [](const ponto &p) {
    uchar media = (p.r + p.g + p.b) / 3;
    return media;
});

I conducted tests with images of different sizes. Here are the results for a 256x256 pixel image:

Sequential Algorithm: 10,388 ns
Parallel Algorithm: 10,834 ns

For a 625x615 pixel image:

Sequential Algorithm: 63,999 ns
Parallel Algorithm: 62,299 ns

For a 1920x1080 pixel image:

Sequential Algorithm: 323,428 ns
Parallel Algorithm: 327,229 ns

For a 7680x4320 pixel image:

Sequential Algorithm: 8,641,884 ns
Parallel Algorithm: 8,782,140 ns

For a 15,360x8,640 pixel image:

Sequential Algorithm: 35,218,707 ns
Parallel Algorithm: 35,070,783 ns

The results are surprising: the parallel algorithm often takes longer than the sequential one, which seems contradictory, and even in the cases where the parallel version is faster, the improvement is negligible. I would like to understand why this is happening and whether there is any optimization I can make to improve the performance of the parallel algorithm.

Any ideas about what might be causing this discrepancy in execution times? I appreciate any help or guidance in advance.

“longer”?

your numbers are indistinguishable, statistically speaking.

any proper course on parallel computing will explain to you that splitting work requires at least a minimal amount of coordination. that is not free.

I turned your several lines of prose-encoded data into a table. well, chatgpt did. if it made a mistake, I guess that’s acceptable.

I also added a column that reports parallel_time / sequential_time - 1

Image Size (pixels)   Sequential Algorithm (ns)   Parallel Algorithm (ns)   Parallel Slower (%)
256x256                              10,388                      10,834                 4.29%
625x615                              63,999                      62,299                -2.66%
1920x1080                           323,428                     327,229                 1.18%
7680x4320                         8,641,884                   8,782,140                 1.62%
15,360x8,640                     35,218,707                  35,070,783                -0.42%

Can you describe the hardware you are running on? For example, does it have 4 cores, and were you expecting approximately a 4x speedup? Based on the numbers, the parallel and sequential versions run at about the same speed, which would make sense if you only had one core. If you have more than one core, I agree that something doesn't seem right.

This is the spec of the machine it is running on:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 140
Model name: 11th Gen Intel(R) Core™ i7-1165G7 @ 2.80GHz
Stepping: 1
CPU MHz: 2800.000
CPU max MHz: 4700.0000
CPU min MHz: 400.0000
BogoMIPS: 5606.40
Virtualization: VT-x
L1d cache: 192 KiB
L1i cache: 128 KiB
L2 cache: 5 MiB
L3 cache: 12 MiB
NUMA node0 CPU(s): 0-7

what did you measure and how did you measure it? hard facts please.

Although I said in the original post that I use high_resolution_clock, that was in an early phase. For the final results I posted here, I use steady_clock.

This is how I measure: in the function cvtBGRtoGray in the file color_rgb.simd.hpp, I start the timer inside the `if (depth == CV_8U)` branch. I then comment out OpenCV's original line, the call to `CvtColorLoop(...)` where OpenCV natively processes the image, and also comment out the sequential code. The parallel code then does the work, and after it I stop the timer and compute the elapsed time.

To time the sequential code, I comment out the parallel code and uncomment the sequential code.