Processing an image stack with a CUDA function

Hi,

I have a stack of images (DICOM) to which I would like to apply the CUDA version of the fastNlMeansDenoising filter. Can someone suggest a good example of how to perform this operation? I'm looking to process the stack as a batch, because applying the CUDA function to one image after another is not efficient and takes longer than using the CPU.
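
For reference, this is roughly what I do for one image at the moment (just a sketch; the image is a placeholder for one DICOM slice converted to 8-bit grayscale, the filter strength is arbitrary, and it assumes an OpenCV build with CUDA support):

```python
import cv2
import numpy as np

# placeholder for one DICOM slice converted to 8-bit grayscale
img = np.random.randint(0, 256, (512, 512), dtype=np.uint8)

gpu_src = cv2.cuda_GpuMat()
gpu_src.upload(img)                                     # host -> device transfer
gpu_dst = cv2.cuda.fastNlMeansDenoising(gpu_src, 10.0)  # CUDA non-local means, h = 10
result = gpu_dst.download()                             # device -> host transfer
```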

Thank you

i’m having some doubts about whether this is possible.

the batch version only exists for the CPU
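
for reference, a rough sketch of that CPU-only batch call, cv2.fastNlMeansDenoisingMulti(): it denoises one target frame using a temporal window of neighbouring frames (the parameter values here are just placeholders):

```python
import cv2
import numpy as np

# placeholder stack of 8-bit grayscale frames
stack = [np.random.randint(0, 256, (512, 512), dtype=np.uint8) for _ in range(9)]

# denoise the frame at index 4 using a 5-frame temporal window centred on it (CPU only)
result = cv2.fastNlMeansDenoisingMulti(
    srcImgs=stack,
    imgToDenoiseIndex=4,
    temporalWindowSize=5,
    h=10,
    templateWindowSize=7,
    searchWindowSize=21)
```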

Hi, when you say it's not efficient, do you mean that it does not take advantage of the dependency between consecutive frames (which I assume fastNlMeansDenoisingMulti() does), or do you just mean it is slower than the CPU version? If it is the latter, you may be able to optimize your code.

My experience with fastNlMeansDenoising() on the GPU is that it takes more time to process a single image on the GPU than on the CPU. My job is to process between 500 and 1000 images. I see my problem as similar to processing a video stream. Can I gain anything by treating it as a video stream?

How are you timing that? Are you including the upload/download to the GPU in the timing? Which GPU and CPU are you comparing? Are you timing a single call? The time for the first run on the GPU is always orders of magnitude greater than for subsequent ones. Are you using C++ or Python?

Yes, my timing for now is on a single call. Do you think that if I loop over my stack it can be faster? The latest tests were on an i9 CPU and a Quadro T2000 GPU.

You may find this post useful

Essentially, there could be a lot of things at play, but I would first try timing the second or third run on the GPU, as it should be substantially faster than the first.
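
Something along these lines is what I mean, just a rough sketch (the filter strength is arbitrary); the point is that the first round trip also pays the one-off CUDA initialisation cost, so time several runs:

```python
import time
import cv2
import numpy as np

img = np.random.randint(0, 256, (512, 512), dtype=np.uint8)  # placeholder frame

for i in range(3):
    t0 = time.perf_counter()
    gpu_src = cv2.cuda_GpuMat()
    gpu_src.upload(img)                                  # host -> device
    gpu_dst = cv2.cuda.fastNlMeansDenoising(gpu_src, 10.0)
    result = gpu_dst.download()                          # device -> host
    print(f"run {i}: {time.perf_counter() - t0:.4f} s")  # run 0 includes one-off CUDA init cost
```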

Additionally, your GPU is on the lower end of Nvidia's line-up. For example, an RTX 3080 has roughly 8x the floating-point performance of your card (a gap that would be crazy in the CPU world; I would really struggle to find something with 1/8 the performance of an i9). I mention this because you may not see such a great performance increase going from a top-end Intel chip to a low-end Nvidia GPU.

Thank you for the reply. The specs I gave you are for my development system. The target system has a different type of CPU, probably with lower performance.

some material on cuda streams, which is not necessarily beginner-friendly, so feel free to ignore it, or browse out of curiosity:
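
to give a rough idea of the shape this takes in the python bindings (a sketch only, assuming a CUDA-enabled build; the filter strength is a placeholder, and truly asynchronous uploads would also need pinned host memory). the point is just that uploads, kernels and downloads can be queued on streams instead of synchronizing after every single image:

```python
import cv2
import numpy as np

# placeholder stack of 8-bit grayscale frames
stack = [np.random.randint(0, 256, (512, 512), dtype=np.uint8) for _ in range(100)]

n = 2                                                  # two streams used round-robin
streams = [cv2.cuda.Stream() for _ in range(n)]
gpu_src = [cv2.cuda_GpuMat() for _ in range(n)]
gpu_dst = [None] * n
results = [None] * len(stack)

for i, img in enumerate(stack):
    s = i % n
    if i >= n:                                         # stream s still holds frame i - n
        streams[s].waitForCompletion()
        results[i - n] = gpu_dst[s].download()
    gpu_src[s].upload(img, streams[s])                 # queued transfer (async only with pinned memory)
    gpu_dst[s] = cv2.cuda.fastNlMeansDenoising(
        gpu_src[s], 10.0, stream=streams[s])           # kernel queued on the same stream

for i in range(max(len(stack) - n, 0), len(stack)):    # drain the frames still in flight
    s = i % n
    streams[s].waitForCompletion()
    results[i] = gpu_dst[s].download()
```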

according to this, a Quadro T2000 has 3.6 Tflop/s of FP32, so that’s still orders of magnitude better than a CPU.

( @cudawarped please don’t compare to the latest and greatest. newbies lack the context to understand such judgments. this hardware is plenty powerful. it’s absolutely not the issue here. )

your whole problem is not realizing that transfer of data and commands takes time.

getting a GPU to perform is all about dealing with latency.

those are the basics.

please don’t jump straight into GPU programming. the issues you will run into will be tiresome, because they wouldn’t come up at all with a proper approach.

start with the basics of GPU programming. that means learning what’s special about GPUs and what to pay attention to.

you do not start by coming up with your own questions and poking around for answers. you look for structured teaching. get experts to tell you what you need to know. as a newbie you can’t know what you need to know, and you aren’t expected to.

hell, nvidia has tons of documentation and tutorials and introductions. they wouldn’t make any money if their educational content sucked. please go and look for good learning material.

Fair enough, that was an aside; the real culprits are detailed in the post I linked to. In my defense, the latest i9 is the premium offering from Intel, so I would argue that it is not fair to compare its performance to anything other than a premium offering from Nvidia when determining if the GPU is suitable for your workload.

I would urge caution against comparing the Tflop/s of a CPU vs. a GPU, as this is highly algorithm-dependent, and even more so in OpenCV, where a lot of the CUDA algorithms were written for deprecated (old) compute capabilities, use the NPP routines, which can have poor performance, and in some cases have to examine GPU results on the CPU halfway through to determine the next round of GPU computation.

Thank you for all your answers. I had a first success using a for loop and applying the GPU processing successively: 88 seconds compared to 3.5 seconds on the GPU with the same dataset. This is a good start.
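
In case it helps someone later, this is roughly the loop I ended up with (a sketch; the stack here is a placeholder for my DICOM slices and the filter strength is arbitrary):

```python
import cv2
import numpy as np

# placeholder stack; in my case each entry is one DICOM slice as 8-bit grayscale
stack = [np.random.randint(0, 256, (512, 512), dtype=np.uint8) for _ in range(500)]

gpu_src = cv2.cuda_GpuMat()                       # allocated once, reused for every slice
gpu_src.upload(stack[0])
cv2.cuda.fastNlMeansDenoising(gpu_src, 10.0)      # warm-up: the first call pays the init cost

denoised = []
for img in stack:
    gpu_src.upload(img)                           # reuse the same device buffer
    gpu_dst = cv2.cuda.fastNlMeansDenoising(gpu_src, 10.0)
    denoised.append(gpu_dst.download())
```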