Processing an image stack with a CUDA function

Hi,

I have a stack of images (DICOM) to which I would like to apply the CUDA version of the fastNlMeansDenoising filter. Can someone suggest a good example of how to perform this operation? I'm looking to process the stack as a batch, because applying the CUDA function to one image after another is not efficient and takes longer than using the CPU.
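
For reference, this is roughly what I do for one image at the moment (just a sketch; the image is a placeholder for one DICOM slice converted to 8-bit grayscale, the filter strength is arbitrary, and it assumes an OpenCV build with CUDA support):

```python
import cv2
import numpy as np

# placeholder for one DICOM slice converted to 8-bit grayscale
img = np.random.randint(0, 256, (512, 512), dtype=np.uint8)

gpu_src = cv2.cuda_GpuMat()
gpu_src.upload(img)                                     # host -> device transfer
gpu_dst = cv2.cuda.fastNlMeansDenoising(gpu_src, 10.0)  # CUDA non-local means, h = 10
result = gpu_dst.download()                             # device -> host transfer
```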

Thank you

i’m having some doubts about whether this is possible.

the batch version only exists for the CPU
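
for reference, a rough sketch of that CPU-only batch call, cv2.fastNlMeansDenoisingMulti(): it denoises one target frame using a temporal window of neighbouring frames (the parameter values here are just placeholders):

```python
import cv2
import numpy as np

# placeholder stack of 8-bit grayscale frames
stack = [np.random.randint(0, 256, (512, 512), dtype=np.uint8) for _ in range(9)]

# denoise the frame at index 4 using a 5-frame temporal window centred on it (CPU only)
result = cv2.fastNlMeansDenoisingMulti(
    srcImgs=stack,
    imgToDenoiseIndex=4,
    temporalWindowSize=5,
    h=10,
    templateWindowSize=7,
    searchWindowSize=21)
```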

Hi, when you say it's not efficient, do you mean that it does not take advantage of the dependency between consecutive frames (which I assume fastNlMeansDenoisingMulti() does), or do you just mean it is slower than the CPU version? If it is the latter, you may be able to optimize your code.

My experience with fastNlMeansDenoising() on the GPU is that it takes more time to process a single image on the GPU than on the CPU. My job is to process between 500 and 1000 images. I see my problem as similar to processing a video stream. Can I gain anything by treating it as a video stream?

How are you timing that? Are you including the upload/download to the GPU in the timing? Which GPU and CPU are you comparing? Are you timing a single call? The time for the first run on the GPU is always orders of magnitude greater than for subsequent ones. Are you using C++ or Python?

Yes, my timing for now is on a single call. Do you think that if I loop over my stack it can be faster? The latest tests were on an i9 CPU and a Quadro T2000 GPU.

You may find this post useful

Essentially, there could be a lot of things at play, but I would first try timing the second or third run on the GPU, as it should be substantially faster than the first.
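
Something along these lines is what I mean, just a rough sketch (the filter strength is arbitrary); the point is that the first round trip also pays the one-off CUDA initialisation cost, so time several runs:

```python
import time
import cv2
import numpy as np

img = np.random.randint(0, 256, (512, 512), dtype=np.uint8)  # placeholder frame

for i in range(3):
    t0 = time.perf_counter()
    gpu_src = cv2.cuda_GpuMat()
    gpu_src.upload(img)                                  # host -> device
    gpu_dst = cv2.cuda.fastNlMeansDenoising(gpu_src, 10.0)
    result = gpu_dst.download()                          # device -> host
    print(f"run {i}: {time.perf_counter() - t0:.4f} s")  # run 0 includes one-off CUDA init cost
```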

Additionally, your GPU is on the lower end of Nvidia's line-up. For example, an RTX 3080 has roughly 8x the floating-point performance of your card (a gap that would be crazy in the CPU world; I would really struggle to find something with 1/8 the performance of an i9). I mention this because you may not see such a great performance increase going from a top-end Intel chip to a low-end Nvidia GPU.

Thank you for the reply. The specs I gave you are for my development system. The target system has a different type of CPU, probably with lower performance.

some material on cuda streams, which is not necessarily beginner-friendly, so feel free to ignore it, or browse out of curiosity:
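
to give a rough idea of the shape this takes in the python bindings (a sketch only, assuming a CUDA-enabled build; the filter strength is a placeholder, and truly asynchronous uploads would also need pinned host memory). the point is just that uploads, kernels and downloads can be queued on streams instead of synchronizing after every single image:

```python
import cv2
import numpy as np

# placeholder stack of 8-bit grayscale frames
stack = [np.random.randint(0, 256, (512, 512), dtype=np.uint8) for _ in range(100)]

n = 2                                                  # two streams used round-robin
streams = [cv2.cuda.Stream() for _ in range(n)]
gpu_src = [cv2.cuda_GpuMat() for _ in range(n)]
gpu_dst = [None] * n
results = [None] * len(stack)

for i, img in enumerate(stack):
    s = i % n
    if i >= n:                                         # stream s still holds frame i - n
        streams[s].waitForCompletion()
        results[i - n] = gpu_dst[s].download()
    gpu_src[s].upload(img, streams[s])                 # queued transfer (async only with pinned memory)
    gpu_dst[s] = cv2.cuda.fastNlMeansDenoising(
        gpu_src[s], 10.0, stream=streams[s])           # kernel queued on the same stream

for i in range(max(len(stack) - n, 0), len(stack)):    # drain the frames still in flight
    s = i % n
    streams[s].waitForCompletion()
    results[i] = gpu_dst[s].download()
```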

according to this, a Quadro T2000 has 3.6 Tflop/s of FP32, so that’s still orders of magnitude better than a CPU.

( @cudawarped please don’t compare to the latest and greatest. newbies lack the context to understand such judgments. this hardware is plenty powerful. it’s absolutely not the issue here. )

your whole problem is not realizing that transfer of data and commands takes time.

getting a GPU to perform is all about dealing with latency.

those are the basics.

please don’t jump straight into GPU programming. the issues you will run into will be tiresome, because they wouldn’t come up at all with a proper approach.

start with the basics of GPU programming. that means learning what’s special about GPUs and what to pay attention to.

you do not start by coming up with your own questions and poking around for answers. you look for structured teaching. get experts to tell you what you need to know. as a newbie you can’t know what you need to know, and you aren’t expected to.

hell, nvidia has tons of documentation and tutorials and introductions. they wouldn’t make any money if their educational content sucked. please go and look for good learning material.

Fair enough, that was an aside; the real culprits are detailed in the post I linked to. In my defense, the latest i9 is the premium offering from Intel, so I would argue that it is not fair to compare its performance to anything other than a premium offering from Nvidia when determining if the GPU is suitable for your workload.

I would urge caution against comparing the Tflop/s of a CPU vs. a GPU, as this is highly algorithm-dependent, and even more so in OpenCV, where a lot of the CUDA algorithms were written for deprecated (old) compute capabilities, use the NPP routines, which can have poor performance, and in some cases have to examine GPU results on the CPU halfway through to determine the next round of GPU computation.

Thank you for all your answers. I had a first success using a for loop and applying the GPU processing successively: 88 seconds compared to 3.5 seconds on the GPU with the same dataset. This is a good start.
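
In case it helps someone later, this is roughly the loop I ended up with (a sketch; the stack here is a placeholder for my DICOM slices and the filter strength is arbitrary):

```python
import cv2
import numpy as np

# placeholder stack; in my case each entry is one DICOM slice as 8-bit grayscale
stack = [np.random.randint(0, 256, (512, 512), dtype=np.uint8) for _ in range(500)]

gpu_src = cv2.cuda_GpuMat()                       # allocated once, reused for every slice
gpu_src.upload(stack[0])
cv2.cuda.fastNlMeansDenoising(gpu_src, 10.0)      # warm-up: the first call pays the init cost

denoised = []
for img in stack:
    gpu_src.upload(img)                           # reuse the same device buffer
    gpu_dst = cv2.cuda.fastNlMeansDenoising(gpu_src, 10.0)
    denoised.append(gpu_dst.download())
```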