Processing image stack with CUDA function

You may find this post useful

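There could be a lot of things at play, but I would first try timing the second or third run on the GPU, as it should be substantially faster than the first. The first run typically includes one-time costs such as compilation and GPU initialization, so it is not representative of steady-state performance.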
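To make the "time the later runs" suggestion concrete, here is a minimal sketch in plain Python. It is not your code: `time_runs` and `WarmupWorkload` are hypothetical names I made up, and the workload only simulates a slow first call with `time.sleep`. The optional `sync` argument is a placeholder for whatever "wait for the GPU to finish" call your framework provides. GPU kernel launches are usually asynchronous, so without such a call you may only be timing the launch.

```python
import time


def time_runs(fn, n_runs=3, sync=None):
    """Call fn() n_runs times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        if sync is not None:
            # Placeholder: block until queued GPU work has finished,
            # otherwise only the (asynchronous) launch is measured.
            sync()
        timings.append(time.perf_counter() - start)
    return timings


class WarmupWorkload:
    """Hypothetical stand-in for a GPU call: slow one-time setup, then fast."""

    def __init__(self, setup_s=0.2, work_s=0.01):
        self.setup_s = setup_s
        self.work_s = work_s
        self.ready = False

    def __call__(self):
        if not self.ready:
            # Simulates one-time costs (compilation, GPU initialization).
            time.sleep(self.setup_s)
            self.ready = True
        # Simulates the actual per-call work.
        time.sleep(self.work_s)


if __name__ == "__main__":
    timings = time_runs(WarmupWorkload(), n_runs=3)
    for i, t in enumerate(timings, start=1):
        print(f"run {i}: {t:.4f} s")
```

With your real function in place of `WarmupWorkload`, compare run 2 or 3 against the CPU timing rather than run 1.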

Additionally, your GPU is on the lower end of Nvidia’s lineup. For example, an RTX 3080 has ~8x the floating-point performance of your card. (That spread is crazy compared to CPUs: I would really struggle to find one with 1/8 the performance of an i9.) I mention this because you may not see such a great performance increase going from a top-end Intel chip to a low-end Nvidia GPU.