DNN inference with CUDA Stream

Is there any way to run DNN inference using the CUDA Streams interface? I would really like to add a NN filter to an existing pipeline without having to download the GpuMat or split the pipeline into two asynchronous calls.

How would you use CUDA streams if you cannot pass GpuMats directly to the DNN module? (See GpuMat as input/output to cv::dnn::Net · Issue #16433 · opencv/opencv · GitHub.)

Yes, I have been reading those threads. It seems the best way forward is to work with __cuda_array_interface__ and use CuPy to get the data into a PyTorch tensor. I'm not yet sure whether I will have to wait for the OpenCV stream to complete before starting the PyTorch processing; ideally I would push the PyTorch inference onto the existing OpenCV stream.
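A rough sketch of the __cuda_array_interface__ route: since cv2.cuda_GpuMat does not implement the protocol itself, you can wrap it in a small adapter that builds the interface dict from the GpuMat's device pointer, pitch, and type, and then hand that wrapper to CuPy or PyTorch for zero-copy consumption. The `CudaArrayInterfaceWrapper` name here is my own; it assumes your OpenCV Python build exposes `cudaPtr()`, `step`, `depth()`, `channels()`, and `size()` on GpuMat (available in recent 4.x builds), and it does not solve the stream-synchronization question on its own (interface version 3 does define an optional `"stream"` key that consumers are supposed to synchronize with, which is where an OpenCV stream handle could be passed).

```python
import numpy as np

# OpenCV depth codes -> numpy typestr (CV_8U=0, CV_8S=1, CV_16U=2,
# CV_16S=3, CV_32S=4, CV_32F=5, CV_64F=6)
_DEPTH_TO_TYPESTR = {0: "|u1", 1: "|i1", 2: "<u2", 3: "<i2",
                     4: "<i4", 5: "<f4", 6: "<f8"}


class CudaArrayInterfaceWrapper:
    """Expose a cv2.cuda_GpuMat through __cuda_array_interface__ (v3).

    Sketch only: the wrapper must not outlive the GpuMat (it holds a
    reference to keep the device buffer alive), and stream_ptr, if given,
    should be the raw cudaStream_t handle (e.g. from Stream.cudaPtr())
    so a consumer can order its work after the producing stream.
    """

    def __init__(self, gpu_mat, stream_ptr=None):
        w, h = gpu_mat.size()              # GpuMat.size() is (width, height)
        ch = gpu_mat.channels()
        typestr = _DEPTH_TO_TYPESTR[gpu_mat.depth()]
        itemsize = np.dtype(typestr).itemsize
        # GpuMat rows are pitched, so row stride must come from step,
        # not from width * channels * itemsize.
        if ch > 1:
            shape = (h, w, ch)
            strides = (gpu_mat.step, ch * itemsize, itemsize)
        else:
            shape = (h, w)
            strides = (gpu_mat.step, itemsize)
        self._gpu_mat = gpu_mat            # keep the device buffer alive
        iface = {
            "version": 3,
            "shape": shape,
            "typestr": typestr,
            "strides": strides,
            "data": (gpu_mat.cudaPtr(), False),  # (device ptr, read-only flag)
        }
        if stream_ptr is not None:
            iface["stream"] = stream_ptr
        self.__cuda_array_interface__ = iface
```

With the wrapper in place, something like `cupy.asarray(wrapper)` or `torch.as_tensor(wrapper, device="cuda")` should view the same device memory without a copy; whether PyTorch honors the `"stream"` entry or you still need an explicit `stream.waitForCompletion()` before `forward()` is exactly the open question above.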