FastAPI web service for parallel processing of images

I’m building a service with FastAPI that accepts POST requests containing zip files, and those zip files contain images. I have to process the zip files, extract faces from each image (face detection is done with YuNet), and store all the faces from an image in a separate folder. I also maintain a unique-faces folder: I apply recognition techniques and store the unique faces in unique_faces_folder. I have the flow working, but how can I handle concurrent requests? I also used Celery tasks for multiprocessing, but when I load test the service with k6 at 10 concurrent users for 10 minutes, 88% of requests fail. I have also set the FastAPI application to run with 5 workers. Is there any solution for this? I want to achieve at least 50 requests per second.
Yes, I know it depends on the number of images and the zip file.
What I have is a zip file of 10 images, each image containing 15–16 faces.
I’ve used concurrent.futures for multiprocessing of a single zip file (and I’m using only CPU).

Any suggestions that improve the performance are highly appreciated.

I don’t know anything about fastapi or what your workflow looks like in Python, etc.

My first thought is about how you are storing the zip files and the unzipped images. If you can use a ramdisk for storage that might help. Or if not, then can you unzip them to memory instead of unzipping to files on disk?

How many processors / cores do you have to work with?

When you say 88% failed requests, do you mean the post requests fail / don’t get serviced in time? Can you prioritize servicing of the post request (just receiving the data) and defer processing of the data (unzipping, detecting faces)?
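
That accept-then-defer split can be sketched with a plain `asyncio.Queue`. This is just an illustration of the pattern, not FastAPI-specific code; the payload and the worker body here are placeholders for the real unzip/detect work:

```python
import asyncio

async def worker(queue, results):
    # Consumer: pulls deferred jobs off the queue and does the heavy work later.
    while True:
        payload = await queue.get()
        if payload is None:          # sentinel: shut down
            queue.task_done()
            break
        results.append(f"processed {len(payload)} bytes")  # stand-in for unzip + detect
        queue.task_done()

async def main():
    queue = asyncio.Queue(maxsize=100)   # bounded: back-pressure instead of piling up work
    results = []
    consumer = asyncio.create_task(worker(queue, results))
    # The request handler would only enqueue the raw bytes and return immediately.
    for _ in range(5):
        await queue.put(b"x" * 1024)
    await queue.put(None)
    await queue.join()
    await consumer
    return results

if __name__ == "__main__":
    print(asyncio.run(main()))
```

The point of the bounded queue is that when the backlog fills up, new requests wait (or can be rejected) instead of silently exhausting memory.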

Have you profiled your workflow to understand what is taking the most time, and to get an idea of what your maximum throughput is? For example, how many files per second can you unzip and run face detection on if that’s the only thing you are doing? Maybe 50 per second (or whatever your target is) isn’t achievable even if everything is perfectly streamlined?

Hi @Steve_in_Denver,
First, regarding saving the zip file: we get a POST request at the server endpoint, extract the form data, and save the zip file contents temporarily. The code is below:

async def face_recognition(request: Request):
    form_data = await request.form()
    logging.info('buid is: %s', request.query_params.get('buid'))
    # Note: save_zip_file is awaited while building the arguments, so the upload
    # is written to disk before the handler returns; only the processing itself
    # runs in the background.
    asyncio.create_task(background_task(
        await save_zip_file(form_data["zip_file"]), 
        json.loads(form_data["hashmap"]),
        request.query_params.get('identification'),
        request.query_params.get('buid')
    ))

    return {"message": "Background task started"}

The above code fetches the arguments from the request.

async def background_task(zip_path, hashmap_dict, identification, buid):
    logging.info('Starting background task: Detection, Verification and Identification')
    await asyncio.to_thread(image_rekognition, zip_path, hashmap_dict, identification, buid)

async def save_zip_file(uploaded_file):
    # Save the zip file temporarily
    filename = secure_filename(uploaded_file.filename)
    zip_path = os.path.join(common_path, filename)
    async with aiofiles.open(zip_path, 'wb') as f:
        contents = await uploaded_file.read()
        await f.write(contents)
        print(zip_path)
    return zip_path


def image_rekognition(zip_path, hashmap, identification, buid):
    logging.info('Detect faces starting')
    trip_id = str(uuid.uuid4())
    # update_poll_time(buid, trip_id, identification)
    detect_faces2(zip_path, trip_id, hashmap, buid)

def detect_faces2(zip_path, trip_id, hashmap, buid):
    try:
        # Extract the zip file
        # NOTE: this extracts into a single shared image_directory, so
        # concurrent requests will overwrite and delete each other's files
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(image_directory)

        image_files = [file for file in os.listdir(image_directory) if file.endswith(('.jpg', '.jpeg', '.png'))]

        detect_bounding_boxes(image_files, trip_id, hashmap, buid)

        shutil.rmtree(image_directory)
        os.remove(zip_path)  # Remove the temporarily saved zip file

    except Exception as e:
        report_exception(e)
        logging.error("Exception occurred in detect_faces2 function: %s", e)

This code temporarily saves the files in img_directory.
As for the storage part: we are saving them on disk (not in RAM) because of multiple requests at once. Each zip file alone can be 25 MB, hundreds of requests should be handled, and the process is running on a normal EC2 instance with 2 cores and 8 GB of RAM.
As for profiling: yes, I have used cProfile and htop to track resource consumption. I learned that threads are not being handled correctly, and I need to do something about that. Yes, prioritizing the POST request will help, but then all the zip files will pile up on the instance and it may take time to process all the images. At present I have no clue how to test the existing flow correctly; I need someone who is good at FastAPI, as I’m just getting my hands on it.
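
For reference, a minimal cProfile harness for timing one pass of a flow like this looks as follows (the `pipeline` function here is only a placeholder for the real unzip → decode → detect steps):

```python
import cProfile
import io
import pstats

def pipeline(n):
    # Placeholder for the real work: unzip -> decode -> detect faces.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
pipeline(100_000)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())  # top 10 functions by cumulative time
```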

I don’t know anything about fastapi, so I’m no help there. I’m just considering what you are trying to achieve at a system level - hundreds of 25 MB zip files simultaneously, unzipping and running face detection on them. You want to achieve “at least 50 per second” - 50 post requests per second, and the full pipeline of unzipping and processing the images? Continuously?

If so, that sounds like a tall order to me. 50*25 is 1.25 GB, so just to receive the data you need a high performance network connection, and saving that much data to disk that fast is a tall order as well. I’m not up on the current mass storage performance levels, but that sounds like a lot to ask. (just receiving and saving the zip files alone, that is.)

Outside of fastapi or your network POST events etc. Just individual system level tests using system level tools or special-purpose programs:

  1. Run some tests to see how fast you can write files to disk. How long does it take to write 1.25 GB to disk?
  2. Run some tests unzipping files. Do it from a ramdisk to remove the drive read bottleneck. How long does it take to unzip 1.25 GB of zip files?
  3. Run some tests to open and decode JPEG images. Again, read from a ramdisk to minimize the IO bottleneck. How long does it take to decode 1.25 GB of zipped jpegs?
  4. Run some tests using the face detection on the decoded JPEG images. How long does it take to run the face detection on 1.25 GB of zipped and decoded JPEGS?

I’d run all of these tests using one core and with nothing else running on the system. If I were a betting man I’d wager that the resource requirements for your task are much higher than the resources you have, no matter how you use them or arrange your pipeline.
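
As a sketch of what test 2 might look like, here is a standalone unzip benchmark that stays entirely in memory so disk IO doesn’t skew the number. The payload is synthetic zeroed data stored uncompressed, so treat the result as an upper bound, not your real throughput:

```python
import io
import time
import zipfile

def bench_unzip(payload_mb=5, files=10):
    """Unzip `payload_mb` MB of stored data entirely in memory; return MB/s."""
    chunk = bytes(payload_mb * 1024 * 1024 // files)  # zeroed placeholder "image"
    buf = io.BytesIO()
    # JPEGs are already compressed, so ZIP_STORED models zip-as-a-container.
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for i in range(files):
            zf.writestr(f"img_{i}.jpg", chunk)
    start = time.perf_counter()
    with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
        total = sum(len(zf.read(name)) for name in zf.namelist())
    elapsed = time.perf_counter() - start
    return (total / (1024 * 1024)) / elapsed

print(f"unzip throughput: {bench_unzip():.0f} MB/s")
```

The same shape of harness works for steps 3 and 4: swap the `zf.read` loop for a JPEG decode or a face-detection call on the bytes.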

Some thoughts:

  1. I wouldn’t be writing anything to disk, period. You’ve got 8 GB of RAM to work with? Receive the ZIP file into ram, unzip to ram, and decode the jpegs to ram, never touching the disk.
  2. Why are you using zip files of jpegs? Are you achieving any meaningful compression? Are you just using the zip as a container for multiple images? If so, maybe tar is a better choice - or zip with minimal compression. (I’m concerned that the unzipping process is taking up precious compute time for minimal benefit.)
  3. You might need more cores, more memory, lower throughput requirements, or a mix of the three.
  4. Are there any other resources available to you (SIMD, GPU, more cores/memory?)
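
To illustrate point 1 above: the whole unzip step can stay in RAM with `io.BytesIO`. The decode step is only sketched in comments here, since it depends on your OpenCV/YuNet setup:

```python
import io
import zipfile

def process_zip_in_memory(zip_bytes):
    """Unzip entirely in RAM; never touch the disk."""
    results = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.lower().endswith((".jpg", ".jpeg", ".png")):
                continue
            data = zf.read(name)  # raw JPEG bytes, still in RAM
            # Decode in RAM too, e.g. with OpenCV (which YuNet already needs):
            #   img = cv2.imdecode(np.frombuffer(data, np.uint8), cv2.IMREAD_COLOR)
            #   faces = detector.detect(img)
            results.append((name, len(data)))
    return results

# Build a tiny in-memory zip to demonstrate:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.jpg", b"\xff\xd8fake")
    zf.writestr("notes.txt", b"skip me")
print(process_zip_in_memory(buf.getvalue()))
```

This also sidesteps the shared-extraction-directory problem, since each request’s data lives only in its own buffers.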

TL;DR: Determine the best performance of the individual steps, improve or eliminate the slow ones. Adjust your requirements and/or resources to match reality.


Yes, I too know achieving those numbers is very difficult (I don’t know if it is possible on a single instance). Basically, the flow is: mobile devices capture the images, zip them, and send them to the server, and there will be many devices out there sending requests at the same time. I also need to perform detection, recognition, and attendance marking in the whole flow. I’m going to experiment with the Celery task scheduler, as it follows a distributed-queue mechanism. If you have any suggestions on multiprocessing (I mean in parallel) which would work out in this case, even with a slight performance gain, please let me know. I’ve tried threading, concurrent.futures, Celery (have to try, but I need another instance), Dask, and joblib for parallelizing the tasks.
If there are any left out, let me know.
Your suggestions are highly appreciated.
Thank you.

Frankly, you need to stop thinking about any of these things for now. Pretend they don’t exist. First you need to figure out what your resource needs are for the problem you are trying to solve. All the parallel / threading / concurrent-futures buzzwords are good for is making efficient use of the resources you have. They aren’t magic, and they don’t make your system faster than it is.

What you need to do is understand the performance of each step at a basic level. Forget about threads and parallelism and focus on the resource requirements to achieve what you want to achieve. If those resources are available, then you can think about how to organize them to achieve your goals.

Goal: Receive 50 zip files (25 mb each) per second, unzip them, decode the jpegs, and run a face detection routine on the decoded images.

  1. How many MB per second can you receive?
  2. How many MB per second can you unzip?
  3. How many MB of jpegs can you decode per second?
  4. How many MB per second can you process?

Once you can answer these questions you will have a better idea where you stand. You might be surprised how fucked you are.

I’ll make a few assumptions to illustrate my point:

  1. You need to receive and process 1GB/sec of zipped jpegs.
  2. The zip file is the same size as the unzipped jpegs (jpegs are already compressed, I suspect you aren’t getting any benefit from zipping them.)
  3. The jpegs are compressed 10:1 (this is a fairly typical ratio for jpegs)

This means that you need to decode about 1 GB of jpegs (10GB of raw images) per second.

Using data from this link: Analysis of JPEG Decoding Speeds – Brian C. Becker

and assuming you are using a typical/standard jpeg decoder, you would need approximately 100 cores to decode 1 GB of jpegs per second (10 GB decoded size). The article is old, the processor was a 2.4 GHz Core 2 Duo, so maybe you can do it with 50 cores, which is still a lot.
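
The arithmetic behind that estimate, with the per-core decode rate as the explicit assumption (swap in whatever your own benchmark measures):

```python
# Back-of-envelope core estimate. The per-core decode rate is the big
# assumption here -- plug in whatever your own benchmark measures.
target_zip_mb_per_s = 1000          # ~50 requests/s x ~25 MB, rounded to 1 GB/s
jpeg_compression_ratio = 10         # raw pixel bytes : jpeg bytes
raw_mb_per_s = target_zip_mb_per_s * jpeg_compression_ratio

decode_mb_per_s_per_core = 100      # assumed: raw-output MB/s for one older core

cores_needed = raw_mb_per_s / decode_mb_per_s_per_core
print(f"~{cores_needed:.0f} cores to decode {raw_mb_per_s} MB/s of raw image data")
```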

I’ll assume that 50 cores is out of the question for you. So now what? Notice (from the article) that libjpeg_simd was 5x faster than the typical ones? Can you use that? If so, maybe you can get down to 10 cores. Still too many cores? Can you use a GPU - there are apparently GPU based jpeg decoders that (I imagine) are even faster.

Note that so far all we are talking about is decoding jpegs from ram to ram and it looks like we need at least 5 modern CPU cores to do it (and likely more.) You have 2, and no amount of celery can turn 2 cores into 5. (you’re gonna need more cores.)

If you add in the unzipping step, and you insist on storing the zip files and jpegs on disk, I’d guess you will be able to support 5–10% of your desired load on one node. If you need 20 nodes (each with 2 cores and 8 GB of RAM) to handle the load, is that an option for you? Maybe you could have a dedicated node for receiving / dispatching the POST data to the worker nodes?