I plan to rewrite an application to make use of CUDA cores on a Jetson Nano. I previously rewrote much of it with a pipelining design (mostly turning per-element references into equivalent matrix operations) to run under OpenCL, but recently discovered that the Nano's GPU capabilities can only be accessed via CUDA. Since the use of streams and cores is new to me, I'd like a sanity check on how I could make use of streams. At the top level, I'm counting on parallelism being applied automatically within matrices when I use the OpenCV matrix and image processing functions, but I'm also interested in parallel execution of different kernels, to each of which I would assign a stream to complete some pipeline of processing, described at a high level below. I don't quite follow how cores would be divvied up in such a scenario.
- Stream 0
- Transfer source image
- (pipeline forks to Stream 1 and Stream 2 below)
- Stream 1 (after Stream 0 done)
- Foveal image generation (simply a window from source)
- Foveal optic flow estimation
- Foveal edge detection
- (pipeline forks to Streams 3 … n below)
- Stream 2 (after Stream 0 done)
- Peripheral vision image generation (simply an image resize to smaller)
- Peripheral vision optic flow estimation
- (no edge detection)
- (pipeline forks to Streams n+1 … m below)
- Stream 3 … n (after Stream 1 done; all async wrt one another, but reading/writing some shared data)
- TBD # of foveal color patch discovery and updates
- Return of foveal color patch and edge list
- Stream n+1 … m (after Stream 2 done, all async wrt one another, but reading/writing to some shared data)
- TBD # of peripheral vision color patch discovery and updates
- Return of peripheral vision color patch and edge list
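In CUDA terms, here is roughly how I imagined expressing the forks above with events (just a skeleton sketch; the stream/event names are mine and the actual kernels are elided):

```cpp
#include <cuda_runtime.h>

int main() {
    cudaStream_t s0, s1, s2;
    cudaStreamCreate(&s0); cudaStreamCreate(&s1); cudaStreamCreate(&s2);
    cudaEvent_t srcReady, fovealDone, periphDone;
    cudaEventCreate(&srcReady); cudaEventCreate(&fovealDone);
    cudaEventCreate(&periphDone);

    // Stream 0: cudaMemcpyAsync(..., s0) to transfer the source image
    cudaEventRecord(srcReady, s0);

    // Streams 1 and 2 wait on the transfer, then run independently
    cudaStreamWaitEvent(s1, srcReady, 0);
    // ... foveal window / optic flow / edge kernels launched on s1 ...
    cudaEventRecord(fovealDone, s1);

    cudaStreamWaitEvent(s2, srcReady, 0);
    // ... peripheral resize / optic flow kernels launched on s2 ...
    cudaEventRecord(periphDone, s2);

    // Streams 3..n would each cudaStreamWaitEvent(si, fovealDone, 0)
    // before the patch-discovery work, and similarly n+1..m on periphDone
    cudaDeviceSynchronize();
    return 0;
}
```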
Does this approach make any sense, or am I way off base here on how to use streams? Will each stream get assigned to a separate core? I'm expecting multiple streams to make use of the same set of functions in parallel, applied to different parts of a source image. Is that OK? I expect I'll probably assign more than 128 streams for the latter two stream sets described above, which exceeds the Nano's capability, but probably not by much.
Hi, I would say you are off base, possibly because you are trying to apply OpenCL concepts to CUDA, but as I am not familiar with OpenCL I can't be sure. A CUDA stream is simply a queue of work for execution on the GPU. In "general" with OpenCV you would execute kernels on the GPU which work on entire images, and you "would/could" use multiple streams to queue work for several images at once, where the GPU decides which kernel to apply to which image in an optimum way.
e.g. you may have 2 CPU threads, each performing a simple resize on the GPU on their own set of images, as below.
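Something along these lines (a minimal sketch, assuming OpenCV built with CUDA; the image sizes and thread setup are just placeholders):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudawarping.hpp>   // cv::cuda::resize
#include <thread>
#include <vector>

// Each CPU thread queues upload -> resize -> download on its own stream,
// so the GPU is free to overlap work coming from the two threads.
void resizeSet(const std::vector<cv::Mat>& imgs, std::vector<cv::Mat>& out) {
    cv::cuda::Stream stream;             // this thread's own stream
    cv::cuda::GpuMat d_src, d_dst;
    out.resize(imgs.size());
    for (size_t i = 0; i < imgs.size(); ++i) {
        d_src.upload(imgs[i], stream);   // async H2D copy on this stream
        cv::cuda::resize(d_src, d_dst, cv::Size(), 0.5, 0.5,
                         cv::INTER_LINEAR, stream);
        d_dst.download(out[i], stream);  // async D2H copy on this stream
    }
    stream.waitForCompletion();          // sync only this stream
}

int main() {
    std::vector<cv::Mat> setA(8, cv::Mat(1080, 1920, CV_8UC3)),
                         setB(8, cv::Mat(1080, 1920, CV_8UC3));
    std::vector<cv::Mat> outA, outB;
    std::thread t1(resizeSet, std::cref(setA), std::ref(outA));
    std::thread t2(resizeSet, std::cref(setB), std::ref(outB));
    t1.join(); t2.join();
    return 0;
}
```

Note that for the uploads/downloads to be truly asynchronous, the host buffers should be pinned (e.g. allocated via cv::cuda::HostMem); with ordinary pageable cv::Mat memory the copies fall back to synchronous behaviour.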
Ignoring the fact that you would never do this (perform an upload/download for a simple resize) due to the memory overhead, you could perform this without the stream argument. In that case the GPU would only be able to process an upload, download or resize on one image from one thread at a time. If however you use streams, the GPU may for example be able to schedule an upload/download on an image in CPU thread 1 which overlaps with an upload/download/resize operation on an image in CPU thread 2.
Another good reason for using streams in OpenCV is to prevent automatic device synchronization after every kernel/memory operation you execute, which is the default behaviour when no stream argument is given or the default stream is used.
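For example (a sketch, assuming OpenCV built with the cudaarithm module; the matrix names are placeholders):

```cpp
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>    // cv::cuda::add

int main() {
    cv::cuda::GpuMat d_a(480, 640, CV_32F, 1.0f),
                     d_b(480, 640, CV_32F, 2.0f), d_sum;

    // Default stream: blocks the CPU until the operation completes.
    cv::cuda::add(d_a, d_b, d_sum);

    // Explicit stream: returns immediately; work is queued on `stream`
    // and the CPU can do other things until we choose to synchronize.
    cv::cuda::Stream stream;
    cv::cuda::add(d_a, d_b, d_sum, cv::noArray(), -1, stream);
    stream.waitForCompletion();
    return 0;
}
```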
Each stream will be executed on as many cores as necessary, most likely the whole device.
Multiple streams will allow the device to schedule the work on the GPU in an optimum way, where you can to some extent overlap kernels with each other (if they are small enough) and with memory operations (also memory operations with other memory operations if you have sufficient copy engines). Therefore you could process different parts of the image in separate streams, but you may not want to unless the work is completely separate.
Additionally, you can leverage multiple streams to overlap memory operations and kernels in the same CPU thread. It may help to read through an example I wrote a few years ago to get an idea of how this can work.
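The basic pattern looks roughly like this (plain CUDA runtime, a sketch rather than that exact example): split the data into chunks and issue copy-in/kernel/copy-out per chunk on round-robin streams, so the copies in one stream can overlap the kernel in another. Pinned host memory is required for the copies to actually run asynchronously.

```cpp
#include <cuda_runtime.h>

__global__ void process(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                 // placeholder for real work
}

int main() {
    const int N = 1 << 20, nStreams = 4, chunk = N / nStreams;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned: needed for async copies
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&s[i]);

    // Copy/kernel/copy for chunk i all queue on stream i; the H2D copy
    // for chunk i+1 can overlap the kernel for chunk i on the device.
    for (int i = 0; i < nStreams; ++i) {
        int off = i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```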
I hope this helps, if not please let me know.