OpenCV 4.11.0 on bare metal

I tried the pretrained yolov4_tiny model at 320x320. On my PC, running on the CPU with OpenCV 4.11.0, it takes 380-690 ms.
On my Zynq (single-core CPU at 750 MHz, DDR at 533 MHz, L1 and L2 caches on, reads and writes via SD card, no FPGA acceleration) it is much slower. Without NEON instructions, a build compiled with -O0 took 120 s and -O2 took 60 s. Compiling with -O3 and NEON vectorization took 30 s.
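To put those numbers side by side, here is a small sketch of the arithmetic. The timings are the ones reported above; the function and variable names are made up for illustration. It is only a rough comparison, since the PC's clock and core count are not given.

```python
# Timings as reported in this post, in seconds.
pc_range_s = (0.380, 0.690)  # PC, CPU only: fastest and slowest run
zynq_s = {"O0": 120.0, "O2": 60.0, "O3 + NEON": 30.0}


def speedup(baseline_s, measured_s):
    """Return how many times faster measured_s is than baseline_s."""
    return baseline_s / measured_s


def zynq_speedups(timings=zynq_s, baseline="O0"):
    """Speedup of each Zynq build relative to the chosen baseline build."""
    return {name: speedup(timings[baseline], t) for name, t in timings.items()}


def slowdown_vs_pc(zynq_time_s, pc=pc_range_s):
    """Slowdown of a Zynq run against the slowest and the fastest PC run."""
    return zynq_time_s / pc[1], zynq_time_s / pc[0]


if __name__ == "__main__":
    for name, s in zynq_speedups().items():
        print(f"{name}: {s:.1f}x vs O0")
    low, high = slowdown_vs_pc(zynq_s["O3 + NEON"])
    print(f"best Zynq build: about {low:.0f}x to {high:.0f}x slower than the PC")
```

So each step roughly halves the time (2x for O2, 4x for O3 with NEON, both against O0), and the best Zynq build is still about 43x to 79x slower than the PC.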

Next I plan to try a TensorFlow Lite model to see how slow it is.