OpenCV's T-API ufacedetect.cpp sample is very slow and CPU intensive with OpenCL "on"

Hello,

The “ufacedetect.cpp” sample from the T-API folder in the samples folder runs very slowly after compilation. While the compiled program is running, OpenCL is displayed as “ON”, but there is no difference in performance and the work is not offloaded from the CPU.
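To put a number on "CPU intensive", one option is to compare wall-clock time with process CPU time around the per-frame detection call. The following is only a minimal sketch under my own assumptions, not code from the sample: `Timing`, `measure`, `placeholder_work` and `report` are hypothetical names, and `placeholder_work` is a stand-in for the real detection call. It relies on `std::clock` reporting process CPU time, as it does on POSIX systems such as FreeBSD.

```cpp
#include <chrono>
#include <cstdio>
#include <ctime>
#include <functional>

// Wall-clock and process CPU time consumed by one call, in milliseconds.
struct Timing {
    double wall_ms;
    double cpu_ms;
};

// Runs `work` once and reports how long it took (wall clock) and how much
// CPU time the process used in the meantime (POSIX: summed over all threads).
Timing measure(const std::function<void()>& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_ms = std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    t.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}

// Hypothetical stand-in for the per-frame work. In the sample, the lambda
// passed to measure() would wrap the cascade detection call instead.
void placeholder_work() {
    volatile double acc = 0.0;
    for (int i = 0; i < 20000000; ++i) {
        acc = acc + i * 0.5;
    }
}

// Prints both timings and their ratio for one measured call.
void report(const char* label, const Timing& t) {
    const double ratio = t.wall_ms > 0.0 ? t.cpu_ms / t.wall_ms : 0.0;
    std::printf("%s: wall %.1f ms, cpu %.1f ms, cpu/wall %.2f\n",
                label, t.wall_ms, t.cpu_ms, ratio);
}
```

Calling `report("detect", measure(placeholder_work));` once per frame, with the placeholder swapped for the actual detection call, would show whether the CPU is doing the work: a cpu/wall ratio near or above 1 suggests the computation stays on the CPU, while a ratio well below 1 would suggest the process is mostly waiting on the device.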

System information (version)
opencv-4.5.5_9
OS: FreeBSD 13.1-RELEASE-p1 amd64
Resolution: 3840x2160
DE: Plasma 5.24.6
WM: KWin
Theme: [Plasma], Breeze [GTK2/3]
Icons: [Plasma], breeze-dark [GTK2/3]
Terminal: konsole
CPU: AMD FX-8350 (8) @ 3.991GHz
GPU: Ellesmere [Radeon RX 580]
Memory: 11708MiB / 32684MiB
Compiler => FreeBSD clang version 13.0.0 (git@github.com:llvm/llvm-project.git llvmorg-13.0.0-0-gd7b669b3a303)

Here is the output for clinfo:

Here is the output for opencv_version --opencl:

 4.5.5
OpenCL Platforms: 
    Clover
        dGPU: AMD Radeon RX 580 Series (POLARIS10, DRM 3.35.0, 13.1-RELEASE-p1, LLVM 13.0.1) (OpenCL 1.1 Mesa 21.3.8)
    Portable Computing Language
        CPU: AMD FX(tm)-8350 Eight-Core Processor            (OpenCL 1.2 pocl HSTR: pthread-x86_64-portbld-freebsd13.1-bdver2)
Current OpenCL device: 
    Type = dGPU
    Name = AMD Radeon RX 580 Series (POLARIS10, DRM 3.35.0, 13.1-RELEASE-p1, LLVM 13.0.1)
    Version = OpenCL 1.1 Mesa 21.3.8
    Driver version = 21.3.8
    Address bits = 64
    Compute units = 36
    Max work group size = 256
    Local memory size = 32 KB
    Max memory allocation size = 3 GB 204 MB 819 KB 204 B
    Double support = Yes
    Half support = Yes
    Host unified memory = No
    Device extensions:
        cl_khr_byte_addressable_store
        cl_khr_global_int32_base_atomics
        cl_khr_global_int32_extended_atomics
        cl_khr_local_int32_base_atomics
        cl_khr_local_int32_extended_atomics
        cl_khr_int64_base_atomics
        cl_khr_int64_extended_atomics
        cl_khr_fp64
        cl_khr_extended_versioning
    Has AMD Blas = No
    Has AMD Fft = No
    Preferred vector width char = 16
    Preferred vector width short = 8
    Preferred vector width int = 4
    Preferred vector width long = 2
    Preferred vector width float = 4
    Preferred vector width double = 2
    Preferred vector width half = 0

Thanks for any help.