welcome.
before you start engineering, do a prototype using cheap/borrowed webcams. you gotta get a feel for the mess you’re getting into.
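something like this is enough for a first prototype. a minimal sketch with opencv; the device indices 0 and 2 are placeholders (yours will differ), and it assumes both cameras deliver the same resolution:

```python
# preview two cheap webcams side by side.
import cv2
import numpy as np

left = cv2.VideoCapture(0)   # placeholder device index
right = cv2.VideoCapture(2)  # placeholder device index

while left.isOpened() and right.isOpened():
    ok_l, frame_l = left.read()
    ok_r, frame_r = right.read()
    if not (ok_l and ok_r):
        break
    # hstack assumes both frames have the same height
    cv2.imshow("stereo preview", np.hstack([frame_l, frame_r]))
    if cv2.waitKey(1) & 0xFF == 27:  # esc quits
        break

left.release()
right.release()
cv2.destroyAllWindows()
```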
if you care to, implement a little 3d scene you can play with quickly and cheaply. you can create virtual cameras with arbitrarily specified parameters to generate synthetic views, and you know exactly what the results should look like, since you control all the camera/transformation matrices.
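a toy version of that idea, assuming numpy: project known 3d points through a made-up virtual camera and check the pixel coordinates against the math. every number here is arbitrary.

```python
import numpy as np

# intrinsics: 800 px focal length, principal point at the center of a 640x480 image
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# extrinsics: a "right eye" shifted 60 mm along x, no rotation
R = np.eye(3)
t = np.array([[-0.06], [0.0], [0.0]])

# corners of a 20 cm square sitting 1 m in front of the rig (world coords, meters)
X = np.array([[-0.1, -0.1, 1.0],
              [ 0.1, -0.1, 1.0],
              [ 0.1,  0.1, 1.0],
              [-0.1,  0.1, 1.0]]).T  # shape (3, N)

# pinhole projection: x ~ K [R|t] X
x = K @ (R @ X + t)
uv = (x[:2] / x[2]).T  # pixel coordinates, one row per point
print(uv)  # you know exactly what these should be, so your pipeline is testable
```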
hardware: start by specifying a desired depth resolution at a desired distance. the depth resolution of a stereo pair isn’t uniform; consider the disparity map. one pixel of disparity near infinity corresponds to a huge depth change, while one pixel of disparity on the tip of your nose is nearly nothing.
that then implies constraints on baseline (distance between the cameras), sensor resolution, and field of view. these three properties are somewhat “tradeable” against each other.
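a back-of-envelope sketch of that tradeoff: for an ideal rectified pair, depth is Z = f·B/d (f in pixels, B the baseline, d the disparity), so the depth step per pixel of disparity is roughly Z²/(f·B). all inputs below are placeholders; plug in your own.

```python
import math

width_px = 1920       # horizontal resolution (placeholder)
fov_deg = 60.0        # horizontal field of view (placeholder)
Z = 2.0               # working distance, meters (placeholder)
dZ = 0.005            # desired depth resolution at Z, meters (placeholder)

# focal length in pixels, from resolution and field of view
f_px = (width_px / 2) / math.tan(math.radians(fov_deg) / 2)

# baseline needed so one pixel of disparity equals dZ at distance Z
B = Z**2 / (f_px * dZ)
print(f"f = {f_px:.0f} px, baseline = {B:.3f} m, disparity at {Z} m = {f_px * B / Z:.0f} px")
```

widen the field of view and f_px drops, so you need more baseline or more pixels to hold the same depth resolution. that’s the trade.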
desired distance also constrains your choice of lens. you’ll want focus that’s adjustable but then stays fixed (if it’s servo focus, it must be manually settable, not autofocus). cameras with focus fixed at infinity only make sense if your working distance is far. consider depth of field/focus: you may have to choose a narrow aperture, and that in turn implies strong lighting… or maybe your situation is relaxed in those dimensions.
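a quick depth-of-field sanity check with the usual thin-lens approximations (every value here is an example, not a recommendation):

```python
f = 0.006    # focal length, meters (placeholder: a 6 mm lens)
N = 2.8      # f-number / aperture (placeholder)
c = 3e-6     # circle of confusion, meters (placeholder: ~pixel pitch of a small sensor)
s = 2.0      # focus distance, meters (placeholder)

H = f**2 / (N * c) + f                 # hyperfocal distance
near = (H * s) / (H + (s - f))         # near limit of acceptable focus
denom = H - (s - f)
far = (H * s) / denom if denom > 0 else float("inf")
print(f"hyperfocal {H:.2f} m, in focus from {near:.2f} m to {far:.2f} m")
```

stopping down (bigger N) pushes the hyperfocal distance closer and widens the in-focus band, which is exactly the “narrow aperture, therefore strong lighting” coupling above.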
cheap webcams with infinity focus can be disassembled. their built-in lenses do have adjustable focus, but you’ll have to break a loctite seal on the lens barrel in order to turn it.
make sure the cameras/lenses are mounted rigidly, so they don’t move at all and don’t shake/vibrate from minor disturbances. you don’t want to recalibrate the rig every time someone sneezes.
don’t be afraid of high-res cameras. you can always scale the images down, and the extra pixels give you headroom for the numerical error introduced by undistortion and rectification (both are resampling operations).
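for example (a hedged sketch; the calibration values below are placeholders, in practice they come out of cv2.calibrateCamera / cv2.stereoRectify on your own rig):

```python
import cv2
import numpy as np

size = (1920, 1080)  # capture at high resolution
K = np.array([[1600.0, 0.0, 960.0],   # placeholder intrinsics
              [0.0, 1600.0, 540.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.1, 0.05, 0.0, 0.0, 0.0])  # placeholder distortion (mild barrel)
R = np.eye(3)   # rectification rotation (placeholder)
P = K.copy()    # new projection matrix (placeholder)

frame = np.zeros((size[1], size[0], 3), np.uint8)  # stand-in for a captured frame

# build the remap tables once, reuse for every frame
map1, map2 = cv2.initUndistortRectifyMap(K, dist, R, P, size, cv2.CV_16SC2)
rectified = cv2.remap(frame, map1, map2, interpolation=cv2.INTER_LINEAR)

# downscale last, after the resampling ops, so the extra pixels absorb their error
small = cv2.resize(rectified, (960, 540), interpolation=cv2.INTER_AREA)
```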
you want a sharp picture, but you don’t want in-camera software sharpening… or software noise suppression. that just destroys high-frequency texture along with the high-frequency noise. good lighting helps, once again.
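many uvc webcams expose their sharpening as a control you can turn down in software. whether this works depends on the camera and the capture backend, so read the value back to verify (on linux, `v4l2-ctl -l` lists what controls a camera actually exposes):

```python
import cv2

cap = cv2.VideoCapture(0)  # placeholder device index
cap.set(cv2.CAP_PROP_SHARPNESS, 0)  # 0 = off, if the camera/backend supports it
print("sharpness now:", cap.get(cv2.CAP_PROP_SHARPNESS))  # verify it stuck
```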
you want both cameras to expose at the same instant. if you can, get cameras that can be linked together (one runs free and triggers the other synchronously) or whose exposures can be triggered in software.
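with plain usb webcams and no trigger line, a common workaround is opencv’s grab/retrieve split: grab() latches a frame on each device back to back, retrieve() decodes afterwards. this narrows the skew between the two cameras, it does not eliminate it; a hardware trigger is strictly better. a minimal sketch:

```python
import cv2

left = cv2.VideoCapture(0)   # placeholder device index
right = cv2.VideoCapture(2)  # placeholder device index

left.grab()    # latch the left frame
right.grab()   # latch the right frame as soon after as possible
ok_l, frame_l = left.retrieve()   # decoding happens after both latches
ok_r, frame_r = right.retrieve()
```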
look at the effect of “rolling shutter” and decide whether you need global shutter. if you use cams with rolling shutter, make sure to hold your calibration patterns still, or you’ll get errors proportional to how much the pattern moved during readout.
consider the data rate of the video feeds. if they’re USB (USB 2.0 high speed at 480 Mbit/s in particular, but USB 3 too), plug them into separate USB controllers so they don’t have to share bandwidth. if they do share, picture quality suffers: a camera that can’t get full bandwidth will negotiate lower resolution, a lower frame rate, or awful compression.
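you can at least inspect (and try to influence) what the driver negotiated; support for setting these properties varies by backend, so always read the values back. on linux, `lsusb -t` shows which devices hang off which controller. a hedged sketch:

```python
import cv2

cap = cv2.VideoCapture(0)  # placeholder device index
# requesting a compressed format (mjpg) may let two cameras fit one usb 2.0 bus
cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*"MJPG"))
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

fourcc = int(cap.get(cv2.CAP_PROP_FOURCC))
print("negotiated:", fourcc.to_bytes(4, "little").decode(errors="replace"),
      cap.get(cv2.CAP_PROP_FRAME_WIDTH), "x", cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
```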