some material on CUDA streams. it's not necessarily beginner-friendly, so feel free to ignore it or browse it out of curiosity:
- https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
- https://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf
- https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
according to this, a Quadro T2000 has 3.6 TFLOP/s of FP32 throughput, so that's still far more raw compute than a CPU offers.
( @cudawarped please don’t compare to the latest and greatest. newbies lack the context to understand such judgments. this hardware is plenty powerful. it’s absolutely not the issue here. )
your whole problem is not realizing that transferring data and commands to the GPU takes time.
getting a GPU to perform is all about dealing with latency.
those are the basics.
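to put rough numbers on that, here is a back-of-envelope sketch in plain python (no GPU needed). only the 3.6 TFLOP/s figure comes from above. the PCIe bandwidth, the per-call overhead, the image size and the "10 operations per pixel" workload are all assumptions i picked for illustration, not measurements, and `gpu_round_trip` is just a made-up helper name. real kernels rarely reach peak throughput either, so treat this as a way to think about the problem, not as a benchmark.

```python
# rough latency model. every constant except GPU_FLOPS is an assumed
# ballpark figure chosen for illustration, not a measurement.
PCIE_BYTES_PER_S = 12e9   # assumed effective host<->device bandwidth
CALL_OVERHEAD_S = 10e-6   # assumed fixed cost per copy or kernel launch
GPU_FLOPS = 3.6e12        # FP32 peak quoted above for the Quadro T2000


def gpu_round_trip(n_bytes, flops):
    """Return (upload, compute, download) times in seconds for one operation."""
    copy = CALL_OVERHEAD_S + n_bytes / PCIE_BYTES_PER_S
    compute = CALL_OVERHEAD_S + flops / GPU_FLOPS
    return copy, compute, copy


# hypothetical workload: one 1920x1080 float32 image, ~10 operations per pixel
pixels = 1920 * 1080
upload, compute, download = gpu_round_trip(n_bytes=pixels * 4, flops=pixels * 10)
total = upload + compute + download
transfer_share = (upload + download) / total

print(f"upload   {upload * 1e3:.3f} ms")
print(f"compute  {compute * 1e3:.3f} ms")
print(f"download {download * 1e3:.3f} ms")
print(f"share of time spent on transfers: {transfer_share:.0%}")
```

under these assumptions nearly all of the time goes into moving the image back and forth, not into computing on it. that is why you keep data on the GPU across several operations, and why streams (see the links at the top) exist to overlap copies with compute.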
please don’t just jump into GPU programming. the issues you run into will be tiresome, because they wouldn’t come up at all with a proper approach.
start with the basics of GPU programming. that means learning what’s special about GPUs and what to pay attention to.
you do not start by coming up with your own questions and poking around for answers. you look for structured teaching. get experts to tell you what you need to know. as a newbie you can’t know what you need to know, and you aren’t expected to.
hell, nvidia has tons of documentation and tutorials and introductions. they wouldn’t make any money if their educational content sucked. please go and look for good learning material.