Frankly, you need to stop thinking about any of these things for now. Pretend they don’t exist. First you need to figure out what your resource needs are for the problem you are trying to solve. All that parallelism / threading / concurrent futures / the other buzzwords are good for is making efficient use of the resources you already have. They aren’t magic, and they don’t make your system faster than it is.
What you need to do is understand the performance of each step at a basic level. Forget about threads and parallelism and focus on the resource requirements to achieve what you want to achieve. If those resources are available, then you can think about how to organize them to achieve your goals.
Goal: Receive 50 zip files (25 mb each) per second, unzip them, decode the jpegs, and run a face detection routine on the decoded images.
- How many MB per second can you receive?
- How many MB per second can you unzip?
- How many MB of jpegs can you decode per second?
- How many MB of decoded images per second can you run face detection on?
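A quick way to get those numbers is to time each step on your own data. Here's a minimal sketch for the unzip step, using a synthetic in-memory zip; swap in one of your real 25 MB files for a meaningful number, and time the decode and detection steps the same way with your actual decoder and detector:

```python
# Rough throughput benchmark for the unzip step. The payload is
# random bytes standing in for jpeg data (already-compressed, so
# deflate gains little -- just like your real files).
import io
import os
import time
import zipfile

# Build a ~25 MB zip of incompressible (jpeg-like) random bytes.
payload = os.urandom(25 * 1024 * 1024)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("image.jpg", payload)

rounds = 5
start = time.perf_counter()
for _ in range(rounds):
    with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
        data = zf.read("image.jpg")
elapsed = time.perf_counter() - start

mb = rounds * len(payload) / 1e6
print(f"unzip throughput: {mb / elapsed:.0f} MB/s")
```

Multiply the MB/s you measure by the number of cores you're willing to spend on that step and compare against your target rate.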
Once you can answer these questions you will have a better idea where you stand. You might be surprised how fucked you are.
I’ll make a few assumptions to illustrate my point:
- You need to receive and process 1GB/sec of zipped jpegs.
- The zip file is the same size as the unzipped jpegs (jpegs are already compressed, I suspect you aren’t getting any benefit from zipping them.)
- The jpegs are compressed 10:1 (this is a fairly typical ratio for jpegs)
This means that you need to decode about 1 GB of jpegs (10GB of raw images) per second.
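Spelling out the arithmetic under those assumptions (note that 50 × 25 MB is really about 1.25 GB/sec; I'm rounding down to 1 GB/sec to keep the numbers simple):

```python
# Back-of-envelope resource budget. All numbers are the stated
# assumptions above, not measurements.
files_per_sec = 50
mb_per_file = 25            # zipped size
compression_ratio = 10      # jpeg -> raw pixels, assumed 10:1

zipped_mb_per_sec = files_per_sec * mb_per_file        # 1250 MB/s in
jpeg_mb_per_sec = zipped_mb_per_sec                    # zip buys ~nothing on jpegs
raw_mb_per_sec = jpeg_mb_per_sec * compression_ratio   # pixels you must produce

print(zipped_mb_per_sec, raw_mb_per_sec)  # 1250 12500
```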
Using data from this link: Analysis of JPEG Decoding Speeds – Brian C. Becker
and assuming you are using a typical/standard jpeg decoder, you would need approximately 100 cores to decode 1 GB of jpegs per second (10 GB decoded size). The article is old (the processor was a 2.4 GHz Core 2 Duo), so maybe you can do it with 50 modern cores, which is still a lot.
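The core count falls out of a one-line division. The per-core rate below is my loose reading of the article's ballpark, not a measurement; replace it with a number you benchmark on your own hardware:

```python
# Core-count estimate. per_core_jpeg_mb_per_sec is an assumed
# ballpark for a typical decoder on an old core -- measure your own.
target_jpeg_mb_per_sec = 1000      # ~1 GB/s of compressed jpegs
per_core_jpeg_mb_per_sec = 10      # assumption, not a measurement

cores_old = target_jpeg_mb_per_sec / per_core_jpeg_mb_per_sec
cores_modern = cores_old / 2       # rough "modern cores ~2x faster" guess

print(cores_old, cores_modern)     # 100.0 50.0
```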
I’ll assume that 50 cores is out of the question for you. So now what? Notice (from the article) that libjpeg_simd was 5x faster than the typical decoders? Can you use that? If so, maybe you can get down to 10 cores. Still too many? Can you use a GPU? There are apparently GPU-based jpeg decoders that (I imagine) are even faster.
Note that so far all we are talking about is decoding jpegs from RAM to RAM, and it looks like we need at least 5 modern CPU cores to do it (and likely more). You have 2, and no amount of Celery can turn 2 cores into 5. (You’re gonna need more cores.)
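To be clear about what parallelism does buy you: it spreads work across the cores that exist, nothing more. A sketch with a process pool, using zlib decompression as a CPU-bound stand-in for your jpeg decoder (the pool pattern is the same with the real thing):

```python
# Spread a CPU-bound "decode" step across however many cores the
# machine actually has. decode() is a stand-in for a jpeg decoder.
import os
import zlib
from concurrent.futures import ProcessPoolExecutor

def decode(blob: bytes) -> int:
    # Stand-in for jpeg decoding: CPU-bound decompression.
    return len(zlib.decompress(blob))

if __name__ == "__main__":
    # 8 fake "files" of ~1 MB of compressible data each.
    jobs = [zlib.compress(b"pixel data " * 100_000) for _ in range(8)]
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        sizes = list(pool.map(decode, jobs))
    print(sizes)
```

With 2 cores you get at best ~2x the single-core throughput out of this, minus pool overhead; the pool cannot manufacture the other 3 cores.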
If you add in the unzipping step, and you insist on storing the zip files and jpegs on disk, I’d guess you will be able to support 5-10% of your desired load on one node. If you need 20 nodes (each with 2 cores and 8 GB of RAM) to handle the load, is that an option for you? Maybe you could have a dedicated node for receiving / dispatching the POST data to the worker nodes?
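That dispatcher can start out as dumb round-robin; a minimal sketch (the worker addresses are made-up placeholders, and the actual send is left as a comment):

```python
# Round-robin dispatch: the receiving node cycles incoming POST
# payloads across worker nodes. Addresses are hypothetical.
from itertools import cycle

workers = cycle(["worker-01:8000", "worker-02:8000", "worker-03:8000"])

def dispatch(zip_blob: bytes) -> str:
    """Pick the next worker; a real version would POST zip_blob to it."""
    target = next(workers)
    # e.g. requests.post(f"http://{target}/ingest", data=zip_blob)
    return target

print([dispatch(b"") for _ in range(4)])
# ['worker-01:8000', 'worker-02:8000', 'worker-03:8000', 'worker-01:8000']
```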