I think you are saying that your performance is good enough from a throughput standpoint (10 fps), but unacceptable from a latency standpoint. Also you are getting lower performance in Python vs. C++, and in Python CAP_PROP_BUFFERSIZE seems to have no effect, which makes your latency even worse.
My thoughts, in no particular order:
- I would prefer to use the V4L2 interface to talk to the camera so you can directly control all of the settings - specifically frame rate, image format (if applicable), and the number of buffers (see the first sketch after this list).
- Your latency is a combination of the number of buffers and the processing time. I would do whatever you can to reduce the number of buffers in play - if you aren’t able to limit the number of buffers directly (either through V4L2 or some interface that honors your requested number of buffers), I would consider the following:
- Limit your frame rate, if the camera supports it. If your camera has a 10 fps mode, and you are able to process frames at slightly better than 10 fps, then you should be able to keep buffered frames from piling up.
- If you can’t limit the frame rate from the camera, drop frames as they are received until you are done processing the current frame (second sketch below). This isn’t the best for overall performance, because you are spending bandwidth transferring frames you don’t need, but it will help with latency.
- Starve the camera by delaying the return of the buffers. This might be out of your control if you aren’t able to edit/compile the code that handles the camera, but basically you want to delay the call to VIDIOC_QBUF. If you are forced to have 10 buffers, hold 8 of them on the user side and only cycle the other 2 back to the driver, so that the camera only ever has 2 in flight at a time - effectively decreasing the number of buffers to 2 (or whatever you want). There’s a sketch of this below too.
- Set up two (or more?) threads to process incoming images. It will take a little bit of coordination, but for example: one thread reading images from the camera, and two threads processing them - one doing even-numbered frames, the other doing odd-numbered frames. Then perhaps one thread doing the downscaling/annotation/display (or whatever makes sense). The last sketch below shows the shape of it.
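To make the V4L2 suggestion concrete, here is a rough, untested sketch of setting the pixel format, frame rate, and buffer count directly through the V4L2 ioctls. It assumes /dev/video0 and an MJPEG 640x480 mode at 10 fps - substitute whatever your camera actually advertises (`v4l2-ctl --all` and `v4l2-ctl --list-formats-ext` will tell you).

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>
#include <cstdio>

int main() {
    int fd = open("/dev/video0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    // 1. Pick the pixel format and resolution explicitly.
    v4l2_format fmt{};
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.fmt.pix.width = 640;
    fmt.fmt.pix.height = 480;
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_MJPEG;
    fmt.fmt.pix.field = V4L2_FIELD_NONE;
    if (ioctl(fd, VIDIOC_S_FMT, &fmt) < 0) perror("VIDIOC_S_FMT");

    // 2. Ask the driver for a 10 fps frame interval (1/10 s per frame).
    v4l2_streamparm parm{};
    parm.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    parm.parm.capture.timeperframe.numerator = 1;
    parm.parm.capture.timeperframe.denominator = 10;
    if (ioctl(fd, VIDIOC_S_PARM, &parm) < 0) perror("VIDIOC_S_PARM");

    // 3. Request only 2 buffers so at most one stale frame can queue up.
    //    The driver may round this up; check req.count afterwards.
    v4l2_requestbuffers req{};
    req.count = 2;
    req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    req.memory = V4L2_MEMORY_MMAP;
    if (ioctl(fd, VIDIOC_REQBUFS, &req) < 0) perror("VIDIOC_REQBUFS");
    printf("driver granted %u buffers\n", req.count);

    close(fd);
    return 0;
}
```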
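If you stay with cv::VideoCapture instead, the drop-frames idea can be as simple as a reader thread that keeps overwriting a single "latest frame" slot while the heavy processing picks up whatever is newest whenever it is ready. Another untested sketch; the window name and the Esc-to-quit bit are just placeholders:

```cpp
#include <opencv2/opencv.hpp>
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

int main() {
    cv::VideoCapture cap(0, cv::CAP_V4L2);      // force the V4L2 backend
    if (!cap.isOpened()) return 1;

    std::mutex m;
    cv::Mat latest;
    std::atomic<bool> running{true};

    // Reader thread: drain the camera as fast as frames arrive and keep
    // only the newest one, so stale frames never queue up behind you.
    std::thread reader([&] {
        cv::Mat frame;
        while (running && cap.read(frame)) {
            std::lock_guard<std::mutex> lock(m);
            frame.copyTo(latest);
        }
        running = false;
    });

    // Processing loop: grab whatever is newest whenever we're ready for it.
    while (running) {
        cv::Mat work;
        {
            std::lock_guard<std::mutex> lock(m);
            if (!latest.empty()) latest.copyTo(work);
        }
        if (work.empty()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            continue;
        }
        // ... expensive processing on `work` goes here ...
        cv::imshow("latest", work);
        if (cv::waitKey(1) == 27) running = false;   // Esc to quit
    }
    reader.join();
    return 0;
}
```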
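And if you are stuck with a driver that insists on a large buffer count, the starve-the-driver trick looks roughly like this: request the buffers and mmap them all, but only ever queue two of them, re-queueing each one after you have finished with it. Again just a sketch - format negotiation is omitted (do it as in the first example), error handling is minimal, and NUM_ACTIVE is just a name I made up:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/videodev2.h>
#include <cerrno>
#include <cstdio>
#include <vector>

static int xioctl(int fd, unsigned long req, void* arg) {
    int r;
    do { r = ioctl(fd, req, arg); } while (r < 0 && errno == EINTR);
    return r;
}

int main() {
    const unsigned NUM_ACTIVE = 2;          // buffers the driver may hold at once
    int fd = open("/dev/video0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    // Request buffers; the driver may force a larger count on us.
    v4l2_requestbuffers req{};
    req.count = 10;
    req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    req.memory = V4L2_MEMORY_MMAP;
    if (xioctl(fd, VIDIOC_REQBUFS, &req) < 0) { perror("REQBUFS"); return 1; }

    // Map every buffer, but only queue NUM_ACTIVE of them to the driver.
    std::vector<void*> mem(req.count);
    std::vector<size_t> len(req.count);
    for (unsigned i = 0; i < req.count; ++i) {
        v4l2_buffer buf{};
        buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index = i;
        if (xioctl(fd, VIDIOC_QUERYBUF, &buf) < 0) { perror("QUERYBUF"); return 1; }
        len[i] = buf.length;
        mem[i] = mmap(nullptr, buf.length, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, buf.m.offset);
        if (i < NUM_ACTIVE)                 // the rest stay "held" on our side
            xioctl(fd, VIDIOC_QBUF, &buf);
    }

    v4l2_buf_type type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    xioctl(fd, VIDIOC_STREAMON, &type);

    for (int n = 0; n < 100; ++n) {
        v4l2_buffer buf{};
        buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;
        if (xioctl(fd, VIDIOC_DQBUF, &buf) < 0) { perror("DQBUF"); break; }

        // ... process mem[buf.index] (buf.bytesused bytes) here ...

        // Only now give the buffer back, so the driver never owns more
        // than NUM_ACTIVE buffers and stale frames cannot accumulate.
        xioctl(fd, VIDIOC_QBUF, &buf);
    }

    xioctl(fd, VIDIOC_STREAMOFF, &type);
    for (unsigned i = 0; i < req.count; ++i) munmap(mem[i], len[i]);
    close(fd);
    return 0;
}
```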
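For the threading suggestion, something along these lines (also untested): one reader thread alternates frames between two workers, each with its own single-slot mailbox. Mailbox and worker are made-up names, and where the downscale/display step goes is up to you.

```cpp
#include <opencv2/opencv.hpp>
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

struct Mailbox {
    std::mutex m;
    std::condition_variable cv;
    cv::Mat frame;
    bool has_frame = false;
};

// Worker: wait for a frame in its mailbox, process it, repeat.
static void worker(Mailbox& box, std::atomic<bool>& running) {
    while (running) {
        cv::Mat work;
        {
            std::unique_lock<std::mutex> lock(box.m);
            box.cv.wait(lock, [&] { return box.has_frame || !running; });
            if (!box.has_frame) return;          // woken up for shutdown
            work = std::move(box.frame);
            box.has_frame = false;
        }
        // ... expensive per-frame processing on `work` goes here ...
    }
}

int main() {
    cv::VideoCapture cap(0, cv::CAP_V4L2);
    if (!cap.isOpened()) return 1;

    std::atomic<bool> running{true};
    Mailbox box[2];
    std::thread w0(worker, std::ref(box[0]), std::ref(running));
    std::thread w1(worker, std::ref(box[1]), std::ref(running));

    cv::Mat frame;
    for (long n = 0; n < 300 && running && cap.read(frame); ++n) {
        Mailbox& b = box[n % 2];                 // even frames -> w0, odd -> w1
        {
            std::lock_guard<std::mutex> lock(b.m);
            frame.copyTo(b.frame);               // overwrite if that worker is behind
            b.has_frame = true;
        }
        b.cv.notify_one();
    }

    running = false;
    for (auto& b : box) {
        { std::lock_guard<std::mutex> lock(b.m); }  // ensure the worker is either
        b.cv.notify_one();                          // blocked in wait() or sees the flag
    }
    w0.join();
    w1.join();
    return 0;
}
```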
Also be thoughtful about / aware of how the NEON resources are being used. On the hardware I’m familiar with, the NEON instructions share compute resources in a way that can stall other threads/cores(?), I think even if the “other” is only doing normal floating-point (non-NEON) ops. I don’t remember the details, and it might not apply to your architecture…but what I recall is that using NEON instructions on one thread hurt the performance of another thread. (I’m pretty fuzzy on this, so don’t take it at face value.)