Thanks for the tips, @cudawarped!
The main problem with my code was the slow startup time of the GPU. If I run the code on a set of images, the first one takes a long time, and the rest are much faster (roughly 30 times).
In my case the other optimizations had little impact on performance. That said, they are good practice in CUDA programming (like pre-allocating GPU memory for images), and they can have an impact on larger operations…
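For anyone benchmarking something similar, here is a minimal sketch of how the slow first call can be kept out of the measurements. It assumes the extra time on the first image is a one-time initialization cost. `time_with_warmup` and `process` are hypothetical names for illustration, where `process` stands for whatever per-image GPU function you are timing; neither is part of OpenCV.

```python
import time


def time_with_warmup(process, images):
    """Run one untimed warm-up call, then time `process` on every image.

    `process` is a placeholder for your own per-image function (hypothetical).
    Returns (warmup_seconds, per_image_seconds).
    """
    if not images:
        return None, []

    # Warm-up: the first call pays any one-time startup cost,
    # so it is measured separately from the per-image timings.
    start = time.perf_counter()
    process(images[0])
    warmup_seconds = time.perf_counter() - start

    per_image_seconds = []
    for image in images:
        start = time.perf_counter()
        process(image)
        per_image_seconds.append(time.perf_counter() - start)

    return warmup_seconds, per_image_seconds
```

Comparing `warmup_seconds` with the values in `per_image_seconds` should show whether the startup cost dominates. If `process` launches asynchronous GPU work, it needs to wait for that work to finish before returning, otherwise the timer only measures the launch.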