Improving FPS: optimization and multi-threading

I’m working on code that captures images from an instrument, analyzes them, then displays the images on a monitor with some information (lines, numbers, text) added to the frames. The images are coming from a USB camera, 2592 x 1944. To analyze I need all the pixels, but for subsequent display it’s OK to shrink by a linear factor of 2 to 3.

I first wrote code that grabbed frames with cv.VideoCapture(), did the analysis with numpy and scipy, made a reduced-size frame for display with cv.resize(), drew geometry on the frames, annotated them with cv.putText() and lastly displayed them with cv.imshow(). It was very slow (2 frames/sec) and the latency was seconds.

This sped up to about 4 frames/sec when I changed the backend to V4L2. I then did some experiments to try and understand what was limiting the speed.

If cv.VideoCapture() is in a busy loop, I can get 18 frames/sec from the camera. So I put this into a thread by itself. I then put the frame analysis in a second thread, and the imshow() in a third thread. (Hardware is 4 cores.)

In the end, I was able to get about 10 frames/sec, with 260% load on a 4-core CPU. That’s not quite as good as I would like, but it’s OK. (Note: I’m using opencv compiled for all cores, so cv.getNumThreads() returns 4 and cv.checkHardwareSupport(100) returns True, where 100==CV_CPU_NEON.)

The problem: there is far too much latency – about 2 seconds (not a typo!). So if I jiggle the instrument, it takes 2 seconds at 10fps before the image reacts. I’m not sure where that latency is coming from - reducing the buffer size in VideoCapture() does not help.

My question: is this latency coming from the python layer? Or will shifting to C++ not fix it?

Before someone says “it’s not possible on your hardware, you need better hardware”, I have proof that the hardware is sufficient. I have downloaded and installed guvcview on the same system (Raspberry Pi 4 B, four 64-bit NEON cores). guvcview uses the same V4L2 library as opencv, and displays 17 fps from the same camera, in real time at 2592 x 1944 with no observable latency, using one core pegged to 100%.


latency is to be expected if you can’t process as quickly as the camera makes frames. it makes them at its own pace. you can’t throttle it. if you try, frames will queue up.

you must read every frame. if you don’t have the time to process it, discard it.
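a rough sketch of "read every frame, discard what you can't use", in case it helps. one thread does nothing but read, and keeps only the newest frame; the consumer asks for the newest frame whenever it is ready. the class name `LatestFrameReader` and its methods are made up for this example, and `read` is assumed to be any callable that returns `(ok, frame)` the way `cap.read()` does.

```python
import threading


class LatestFrameReader:
    """Hypothetical helper: drains a frame source and keeps only the newest frame."""

    def __init__(self, read):
        self._read = read              # callable returning (ok, frame), e.g. cap.read
        self._cond = threading.Condition()
        self._frame = None
        self._seq = 0                  # number of frames read so far
        self._running = True
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while True:
            ok, frame = self._read()   # blocks until the source delivers a frame
            with self._cond:
                if not ok or not self._running:
                    self._running = False
                    self._cond.notify_all()
                    return
                self._frame = frame    # overwrite: older unread frames are dropped
                self._seq += 1
                self._cond.notify_all()

    def latest(self, last_seq=0, timeout=None):
        """Wait for a frame newer than last_seq; return (seq, frame)."""
        with self._cond:
            self._cond.wait_for(
                lambda: self._seq > last_seq or not self._running, timeout
            )
            return self._seq, self._frame

    def stop(self):
        with self._cond:
            self._running = False
        self._thread.join()
```

usage would be something like `reader = LatestFrameReader(cap.read)`, then in the analysis loop `seq, frame = reader.latest(seq)`. if the returned `seq` did not advance, the source has stopped. the gap between consecutive `seq` values tells you how many frames were discarded.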

obviously your camera produces 17-18 fps. that is not in question. the processing you do is in question.

My main issue is latency. The 10 frames/second rate which I reached in the python code is acceptable for my purposes, although not as good as guvcview which happily runs at 18 fps.

Yesterday I wrote a C++ version of my opencv code, and got about the same 10 fps rate as the python version. However I was able to fix the latency problem in the C++ version, by setting CAP_PROP_BUFFERSIZE = 1. If I increase that value (max was 10, I think) that speeds things up to about 12 fps, but creates terrible latency.

For reasons that I do not understand, setting CAP_PROP_BUFFERSIZE in the python version seems to have no effect.

The default buffer size in the V4L2 driver is 4, with a max value of 10. At my frame rates, these create 400 - 1000 msec of lag.
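The arithmetic behind those numbers, as a small sketch. It assumes every buffer in the queue is full and that frames leave the queue at the processing rate (about 10 fps here); the function name `queue_lag_ms` is just for illustration.

```python
def queue_lag_ms(buffers, fps):
    """Worst-case lag, in milliseconds, added by a full queue of `buffers`
    frames that is drained at `fps` frames per second."""
    return 1000.0 * buffers / fps


print(queue_lag_ms(4, 10))   # default of 4 buffers at 10 fps -> 400.0
print(queue_lag_ms(10, 10))  # maximum of 10 buffers at 10 fps -> 1000.0
```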

I think you are saying that your performance is good enough from a throughput standpoint (10 fps), but unacceptable from a latency standpoint. Also you are getting lower performance in Python vs C++, and in Python CAP_PROP_BUFFERSIZE seems to have no effect, which makes your latency even worse.

My thoughts, in no particular order.

  1. I would prefer to use the V4L2 interface to control the camera so you can directly control all of the settings - specifically frame rate, image format (if applicable) and number of buffers.
  2. Your latency is a combination of the number of buffers and the processing time. I would do whatever you can to reduce the number of buffers in play - if you aren’t able to limit the number of buffers directly (either through V4L2 or some interface that honors your requested number of buffers) I would consider doing:
  • Limit your frame rate, if the camera supports it. If your camera has a 10 FPS mode, and you are able to process frames at slightly higher than 10FPS, then you should be able to avoid buffered frames piling up.
  • If you can’t limit the frame rate from the camera, drop frames as they are received until you are done processing the current frame. This isn’t the best for overall performance because you are using bandwidth to transfer frames that you don’t need, but it will help with latency.
  • Starve the camera by delaying the return of the buffers. This might be out of your control if you aren’t able to edit/compile the code that handles the camera, but basically you want to delay the call to VIDIOC_QBUF. If you are forced to have 10 buffers, you hold 8 of them on the user side and feed them back to the camera so that the camera only ever has 2 at a time, effectively decreasing the number of buffers to 2 (or whatever you want).
  3. Set up two (or more?) threads to process incoming images. It will take a little bit of coordination, but one thread reading images from the camera, and two threads processing them - one doing even numbered frames, the other doing odd numbered frames, for example. Then one thread doing the downscaling/annotation/display, perhaps. (or whatever makes sense)
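
A minimal sketch of the multi-worker idea from the last item, assuming Python. Instead of a strict even/odd split it uses a two-worker thread pool, which hands alternate frames to whichever worker is free, and it yields results in frame order. The name `analyze_in_order` and the `analyze` callable are hypothetical. As far as I know, threads only help here if the analysis spends its time in NumPy/OpenCV calls that release the GIL, so measure before relying on it.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def analyze_in_order(frames, analyze, workers=2):
    """Run `analyze` on frames using `workers` threads, yielding
    (sequence_number, result) in the original frame order.
    At most `workers` frames are in flight, so no backlog builds up here."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = deque()
        for seq, frame in enumerate(frames):
            pending.append((seq, pool.submit(analyze, frame)))
            if len(pending) >= workers:
                done_seq, future = pending.popleft()
                yield done_seq, future.result()   # blocks until the oldest frame is done
        while pending:                            # drain what is left at end of stream
            done_seq, future = pending.popleft()
            yield done_seq, future.result()
```

The display thread can then consume `for seq, result in analyze_in_order(frame_source, analyze)`, where `frame_source` is any iterable of frames (for example a generator that always yields the newest frame from the capture thread).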

Also be thoughtful / aware about how the NEON resources are being used. On the hardware I’m familiar with, the NEON instructions share compute resources in some way that can stall other threads/cores(?), I think even if the “other” is only doing normal floating point (non-NEON) ops. I don’t remember the details, and it might not apply to your architecture…but what I recall is that using NEON instructions on one thread hurt the performance of another thread. (I’m pretty fuzzy on this, so don’t take it at face value.)