Unable to get hardware acceleration for YuNet on Windows in OpenCV-Python

Hi there,

I’m building a small project around OpenCV in Python. The goal is to explore computer vision inside a custom-made Python library. I have also built a small GUI that uses that library.

I’m trying to implement the YuNet face detector. It works fine, but it appears to run on the CPU rather than on the GPU.

I’m using the OpenCV transparent API, which means that I receive a UMat frame and do most of the processing on UMat objects, which are supposed to run on the GPU through the OpenCL bindings.

But the examples I found on the internet do not seem to use UMat; they use the regular ndarray (MatLike) object. In my code, self.detector.detect returns a UMat that needs to be converted back to a MatLike with faces.get(), since we cannot loop over a UMat object.

Here’s the code for information:

class YUNetFaceDetectionFilter(ImageProcessingDecorator):
    """A class representing a YUnet DNN face detection filter for image processing."""

    def __init__(self, wrapped: ImageProcessingStrategy) -> None:
        """Initialize the YUNetFaceDetectionFilter.

        Args:
            wrapped (ImageProcessingStrategy): The wrapped image processing strategy.
        """
        super().__init__(wrapped)
        self.detector = cv2.FaceDetectorYN.create(
            "data/face_detection_yunet_2023mar.onnx", "", (0, 0)
        )

    def process(self, _frame: Image) -> UMat:
        """Process an image.

        Args:
            frame (UMat): The image to process.

        Returns:
            UMat: The processed image.
        """
        # face_detection_yunet_2023mar.onnx
        frame = super().process(_frame)
        frame_mat = frame.get()
        height, width, _ = frame_mat.shape
        self.detector.setInputSize((width, height))

        _, faces = self.detector.detect(frame)

        if faces is None:  # type: ignore
            return frame

        try:
            for face in faces.get():
                # bounding box
                box = list(map(int, face[:4]))
                color = (0, 255, 0)
                cv2.rectangle(frame, box, color, 2)

                # confidence
                confidence = face[-1]
                confidence = "{:.2f}".format(confidence)
                position = (box[0], box[1] - 10)
                cv2.putText(
                    frame,
                    confidence,
                    position,
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.5,
                    color,
                    1,
                    cv2.LINE_AA,
                )
        except TypeError:
            pass

        return frame

The class is used in a layered system where we can apply multiple filters to a frame. Here, however, the YUNetFaceDetectionFilter is used alone, without any other filters.
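For reference, the loop above only reads face[:4] and face[-1]. As far as I know from the FaceDetectorYN documentation, each row holds 15 values: the box (x, y, width, height), five landmark points, and the score. Here is a minimal sketch of a parser for one row; the names Face and parse_face are hypothetical helpers, not part of OpenCV.

```python
from typing import NamedTuple, Sequence, Tuple


class Face(NamedTuple):
    """One detection row, split into named parts (hypothetical helper type)."""

    box: Tuple[int, int, int, int]  # x, y, width, height
    # Landmark order: right eye, left eye, nose tip, right and left mouth corners.
    landmarks: Tuple[Tuple[int, int], ...]
    score: float


def parse_face(row: Sequence[float]) -> Face:
    """Convert one 15-value detection row into a Face."""
    if len(row) != 15:
        raise ValueError(f"expected 15 values per face, got {len(row)}")
    x, y, w, h = (int(v) for v in row[:4])
    # Values 4..13 are five (x, y) landmark pairs.
    landmarks = tuple((int(row[i]), int(row[i + 1])) for i in range(4, 14, 2))
    return Face((x, y, w, h), landmarks, float(row[14]))
```

It accepts a row of the ndarray obtained from faces.get(), or a plain list of 15 numbers.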

And the actual CPU usage (screenshot not included):

This gives me an actual frame rate of about 15 FPS.

For comparison, I implemented the Haar Cascade Detection:

class HaarCascadeFaceDetectionFilter(ImageProcessingDecorator):
    """A class representing a Haar cascade face detection filter for image processing."""

    def __init__(self, wrapped: ImageProcessingStrategy) -> None:
        """Initialize the HaarCascadeFaceDetectionFilter.

        Args:
            wrapped (ImageProcessingStrategy): The wrapped image processing strategy.
        """
        super().__init__(wrapped)
        self.face_cascade = cv2.CascadeClassifier(
            "data/lbpcascade_frontalface.xml"  # type: ignore
        )

    def process(self, _frame: Image) -> UMat:
        """Process an image.

        Args:
            frame (UMat): The image to process.

        Returns:
            UMat: The processed image.
        """
        frame = super().process(_frame)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = self.face_cascade.detectMultiScale(gray, 1.3, 5)
        for x, y, w, h in faces:
            cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
        return frame

There the GPU is used much more, the CPU usage remains low, and the actual frame rate is 30 FPS, which is the maximum for my webcam.

I suspect the bottleneck is somewhere in the following code:

        frame_mat = frame.get()
        height, width, _ = frame_mat.shape
        self.detector.setInputSize((width, height))

        _, faces = self.detector.detect(frame)

        if faces is None:  # type: ignore
            return frame

        try:
            for face in faces.get():

But most of the examples I found do not seem to use the transparent API.
If anybody has an idea, I would be glad to hear it.
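To check which of the suspected steps is actually slow, one option is to time each of them separately. This is only a sketch: StageTimer is a hypothetical helper built on the standard library, not an OpenCV API. Also note that OpenCL work may be queued asynchronously, so part of the cost can show up at the next call that waits for results, such as get().

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per labelled stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # label -> total seconds
        self.counts = defaultdict(int)    # label -> number of measurements

    @contextmanager
    def stage(self, label):
        """Time the body of a with-block and add it to the given label."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[label] += time.perf_counter() - start
            self.counts[label] += 1

    def mean_ms(self, label):
        """Average duration of a stage in milliseconds (0.0 if never run)."""
        n = self.counts[label]
        return 1000.0 * self.totals[label] / n if n else 0.0


# Possible use inside process(), assuming a timer created once in __init__:
#     with self.timer.stage("download"):
#         frame_mat = frame.get()
#     with self.timer.stage("detect"):
#         _, faces = self.detector.detect(frame)
#     print(self.timer.mean_ms("download"), self.timer.mean_ms("detect"))
```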

Thank you very much.

The CPU/GPU usage for the HaarCascade filter (screenshot not included):

You can clearly see that CPU usage remains constant while there’s a spike in GPU usage when running the application.

https://docs.opencv.org/4.x/df/d20/classcv_1_1FaceDetectorYN.html

You could at least try to specify target_id=1 (cv2.dnn.DNN_TARGET_OPENCL) to make it use OpenCL internally.

IMO, using UMat as input for this is just ‘wishful thinking’; the code works on plain (CPU) Mats.
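If the detector does work on plain Mats, one way to act on that is to download the frame once and pass the resulting ndarray to detect, so the result does not need a second get(). A minimal sketch follows; detect_faces is a hypothetical helper, and the only assumption about the detector is that it offers setInputSize and detect like cv2.FaceDetectorYN.

```python
def detect_faces(detector, frame):
    """Run a FaceDetectorYN-like detector on a frame that may be a cv2.UMat.

    Returns the detection rows, or an empty list when no face is found.
    """
    # Download at most once: cv2.UMat exposes get(), an ndarray does not.
    frame_mat = frame.get() if hasattr(frame, "get") else frame
    height, width = frame_mat.shape[:2]
    detector.setInputSize((width, height))

    # Pass the CPU-side array, so the rows can be iterated directly.
    _, faces = detector.detect(frame_mat)
    return [] if faces is None else faces
```

In process(), the loop would then read `for face in detect_faces(self.detector, frame):` and keep drawing on the original frame. Whether this changes the frame rate would need to be measured.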

Hello,

That seems to improve the situation. Most of the work is now done on the GPU.

One thing I don’t understand is that it still runs at a very low frame rate, about 8 FPS. The image looks sluggish.

The following warning has also appeared:

[ WARN:0@1.129] global ocl4dnn_conv_spatial.cpp:1933 cv::dnn::ocl4dnn::OCL4DNNConvSpatial<float>::loadTunedConfig OpenCV(ocl4dnn): consider to specify kernel configuration cache directory through OPENCV_OCL4DNN_CONFIG_PATH parameter.
OpenCL program build log: dnn/dummy
Status -66: CL_INVALID_COMPILER_OPTIONS
-cl-no-subgroup-ifp -D AMD_DEVICE

This might explain the problem. An incompatibility, perhaps?
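The first line of the warning asks for a kernel configuration cache directory through OPENCV_OCL4DNN_CONFIG_PATH. A minimal sketch of setting it from Python is below; the helper name and the directory are hypothetical, and I have not confirmed that this affects either the frame rate or the CL_INVALID_COMPILER_OPTIONS message.

```python
import os
from pathlib import Path


def set_ocl4dnn_cache(cache_dir):
    """Create cache_dir if needed and point OPENCV_OCL4DNN_CONFIG_PATH at it.

    Hypothetical helper: call it before the detector is created
    (safest is before `import cv2`) so the variable is visible in time.
    """
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["OPENCV_OCL4DNN_CONFIG_PATH"] = str(path)
    return path


# Example (the directory name is an arbitrary choice):
#     set_ocl4dnn_cache("~/.cache/opencv_ocl4dnn")
#     import cv2
```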

Hello there!

I managed to achieve 16 FPS with the following settings:

self.detector = cv2.FaceDetectorYN.create(
    "data/face_detection_yunet_2023mar.onnx",
    "",
    (0, 0),
    backend_id=cv2.dnn.DNN_BACKEND_DEFAULT,
    target_id=cv2.dnn.DNN_TARGET_OPENCL,
)

Using OpenCL as the target seems to work well; I can now see that most of the processing is done by the GPU.

I don’t know if we could improve this even more.

All of this is running on a laptop with an 11th-gen Intel i7-11850H and a discrete GPU, so I don’t know whether we can expect more from it.

Still doing some investigation.

Regards,
