Running DNN models in different threads on different GPU devices

I do this inside each DNN algorithm class constructor:

this->cuda_id = cuda_id;
net = cv::dnn::readNetFromCaffe(model_deploy, model_bin);

and initialize them inside each thread's run function. My PC has four RTX 3080 GPUs. When the batch count is large (e.g. > 4), the process crashes. However, it works fine with batch = 16 on a single GPU, and running four independent single-GPU processes on different devices also works.
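For reference, the setup looks roughly like this (a simplified sketch; the `DnnWorker` class name, the exact placement of the CUDA calls, and the model paths are placeholders, not my real code):

```cpp
#include <opencv2/dnn.hpp>
#include <opencv2/core/cuda.hpp>
#include <string>

// Rough sketch of one per-GPU worker. One instance is created per
// device, and run() is executed in its own std::thread.
class DnnWorker {
public:
    DnnWorker(int cuda_id,
              const std::string& model_deploy,
              const std::string& model_bin) {
        this->cuda_id = cuda_id;
        // Net is created in the constructor (called from the main thread).
        net = cv::dnn::readNetFromCaffe(model_deploy, model_bin);
    }

    // Called inside each thread's run function.
    void run() {
        // Select this worker's GPU, then enable the CUDA backend/target.
        cv::cuda::setDevice(cuda_id);
        net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
        net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

        // ... per-thread inference loop:
        // net.setInput(blob); cv::Mat out = net.forward(); ...
    }

private:
    int cuda_id;
    cv::dnn::Net net;
};
```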

What else needs to be done to run this in a single process with multiple threads?