Hi, folks!
I’m grappling with a tricky OpenCV threading issue. I’m Fedora’s QA team lead, and I maintain Fedora’s deployment of openQA, a screenshot match-based automated testing tool. openQA’s test runner backend, os-autoinst, uses OpenCV for low-level image handling. It’s written in Perl, but uses a small internal C++ library called “tinycv” to wrap OpenCV.
There have been issues in the past with os-autoinst processes blowing up due to a signal handling issue. The parent perl process can wind up sending SIGTERM or SIGCHLD signals to the OpenCV threads, which they don’t know how to handle.
So in os-autoinst we try to block these signals with a signal processing mask before spawning OpenCV threads, then unblock them again afterwards (so the parent process can receive these signals, which we need later). The current way we implement this is to set a sigprocmask, use cv::setNumThreads to set a cap on the number of threads, and then pre-create all those threads using a parallel_for_ loop, so they all have the sigprocmask set. Then we unset the sigprocmask. The code for pre-creating the threads is in ppmclibs/tinycv_impl.cc at commit 6c67d2feff2af0c4d85e325eb2bee6dbdffa6562 of the os-autoinst/os-autoinst GitHub repository; it’s called right after the sigprocmask is set, and the sigprocmask is unset right after it’s called.
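To make the intent concrete, here is a minimal sketch of the block, spawn, unblock pattern. It is not the tinycv code: it uses plain std::thread workers instead of OpenCV's parallel_for_ pool, and the helper names (term_and_chld_blocked, spawn_with_blocked_signals) are made up for illustration. It relies on the POSIX rule that a new thread inherits the signal mask of the thread that creates it.

```cpp
#include <csignal>
#include <pthread.h>
#include <thread>
#include <vector>

// Hypothetical helper: true if SIGTERM and SIGCHLD are both blocked
// in the calling thread. Passing nullptr as the new set only queries the mask.
static bool term_and_chld_blocked() {
    sigset_t current;
    sigemptyset(&current);
    pthread_sigmask(SIG_SETMASK, nullptr, &current);
    return sigismember(&current, SIGTERM) == 1 &&
           sigismember(&current, SIGCHLD) == 1;
}

// Hypothetical helper: blocks SIGTERM/SIGCHLD, spawns `count` workers
// (which inherit the blocked mask), restores the caller's mask, and
// returns how many workers saw both signals blocked.
static int spawn_with_blocked_signals(int count) {
    sigset_t to_block, previous;
    sigemptyset(&to_block);
    sigaddset(&to_block, SIGTERM);
    sigaddset(&to_block, SIGCHLD);
    pthread_sigmask(SIG_BLOCK, &to_block, &previous);

    std::vector<int> blocked(count, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < count; ++i) {
        // Each worker writes only its own slot, so no locking is needed.
        workers.emplace_back([&blocked, i] {
            blocked[i] = term_and_chld_blocked() ? 1 : 0;
        });
    }

    // Restore the caller's mask so it can receive the signals again.
    pthread_sigmask(SIG_SETMASK, &previous, nullptr);

    for (auto &worker : workers) worker.join();
    int total = 0;
    for (int flag : blocked) total += flag;
    return total;
}
```

Any thread created after the mask is restored inherits the unblocked mask instead, which is the failure mode we seem to be hitting with the late tbb thread.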
In Fedora’s deployment, we use the system opencv, which is a build of 4.9.0. It is configured to use tbb and built against the system tbb, which is 2021.11.0.
On that deployment, we are frequently hitting a case where this doesn’t quite work right. The thread pre-creation loop works and the threads it creates have the mask set, but then, later (about 30 seconds later in the case I examined), after we unset the sigprocmask, one more thread gets created by tbb. We assume this is an OpenCV thread, as we don’t think anything else is using tbb, but I can’t prove this yet, as I have not managed to reproduce the issue even once when running under strace to trace the thread creation (which is another mystery, as it otherwise occurs fairly often). The extra thread doesn’t have the sigprocmask set, so it causes the crash if it receives a SIGTERM or SIGCHLD.
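As a possible way to check thread masks without strace, here is a rough, Linux-only sketch that reads the SigBlk field from /proc/self/task/<tid>/status for every thread in the current process. The function name blocked_masks_by_thread is hypothetical, and this is not something os-autoinst does today; it is only meant to show how a thread missing the mask could be spotted from inside the process.

```cpp
#include <dirent.h>
#include <fstream>
#include <map>
#include <string>

// Hypothetical helper (Linux only): maps each thread ID of this process
// to the hex string of its blocked-signal mask, as reported by the
// "SigBlk:" line in /proc/self/task/<tid>/status.
static std::map<std::string, std::string> blocked_masks_by_thread() {
    std::map<std::string, std::string> result;
    DIR *dir = opendir("/proc/self/task");
    if (!dir) return result;  // no procfs available: return an empty map

    while (dirent *entry = readdir(dir)) {
        std::string tid = entry->d_name;
        if (tid == "." || tid == "..") continue;

        std::ifstream status("/proc/self/task/" + tid + "/status");
        std::string line;
        while (std::getline(status, line)) {
            if (line.rfind("SigBlk:", 0) == 0) {  // line starts with "SigBlk:"
                std::size_t start = line.find_first_not_of(" \t", 7);
                result[tid] = (start == std::string::npos) ? "" : line.substr(start);
                break;
            }
        }
    }
    closedir(dir);
    return result;
}
```

On x86_64 and aarch64 Linux, SIGTERM is signal 15 and SIGCHLD is signal 17, so a thread with both blocked should show the 0x4000 and 0x10000 bits set in its SigBlk value; a thread where those bits are clear would be a candidate for the late, unmasked one.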
We cannot figure out what’s going on here. I’ve been poking at it in os-autoinst issue #2549, “isotovideo frequently crashing in the signal handler stuff (especially on aarch64)”, but not getting very far. When I started digging into exactly what happens when we call cv::setNumThreads, it seems to get pretty fuzzy; it seems like opencv and tbb both have caps, they may have separate hard and soft caps, and it’s not clear to me whether the count includes the parent process or not.
os-autoinst sets the cap to whichever of cv::getNumThreads() or cv::getNumberOfCPUs() - 1 is lower. (This is explained as being “To avoid running into TBB’s soft limit which seems to be one thread less than the number of physical CPU threads (see TBB function calc_workers_soft_limit)”.)
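In other words, the cap calculation amounts to something like the sketch below. The function name thread_cap is hypothetical and the two arguments are stand-ins for the values OpenCV reports, so this shows the arithmetic only, not the actual os-autoinst code.

```cpp
#include <algorithm>

// Hypothetical helper mirroring the rule described above:
// `opencv_threads` stands in for cv::getNumThreads(),
// `cpus` stands in for cv::getNumberOfCPUs().
static int thread_cap(int opencv_threads, int cpus) {
    // Take whichever is lower: OpenCV's current thread count,
    // or one less than the CPU count (the assumed TBB soft limit).
    return std::min(opencv_threads, cpus - 1);
}
```

With OpenCV available, the result would be passed on as cv::setNumThreads(thread_cap(cv::getNumThreads(), cv::getNumberOfCPUs())). For a machine reporting 64 CPUs and at least 63 OpenCV threads, this gives 63.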
On the box I’ve been testing it on, that number is 63, which is cv::getNumberOfCPUs() - 1. One thing I found is that the pre-creation loop seems to create only 62 new threads, because the first iteration of the parallel_for_ loop appears to run on the parent. So I thought that was the issue. But if I have the loop attempt to create one more thread, it seems to deadlock (the os-autoinst test suite times out, presumably because the process never completes startup). But since that’s the case, I’m baffled about how an extra thread seems to get created later.
Does anyone have any idea about what’s going on here and how to fix it? Or, failing that, any alternative ideas for dealing with the signal handling issue which might sidestep this problem? There’s a lot more detail in the issue linked above. Thanks a lot!