Processing image stack with CUDA function

How are you timing that? Are you including the upload to and download from the GPU in the timing? Which GPU and CPU are you comparing? Are you timing a single call? The first run on the GPU is often orders of magnitude slower than subsequent ones, since it includes one-time initialization such as CUDA context creation. Are you using C++ or Python?
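To illustrate the single-call point, here is a minimal sketch of a timing harness that reports the first call separately from the steady-state average. It assumes Python, which may not be what you are using. `benchmark`, `make_fake_kernel`, and the `sync` hook are hypothetical names made up for this example, and the fake kernel only simulates a one-time setup cost with a sleep; it is not real GPU code.

```python
import time


def benchmark(fn, warmup=1, runs=10, sync=None):
    """Return (first_call_seconds, mean_steady_state_seconds) for fn().

    warmup counts the untimed-in-the-average calls, including the first one.
    sync is an optional callable that blocks until asynchronous work is done.
    """
    def timed_call():
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for pending asynchronous work before stopping the clock
        return time.perf_counter() - start

    first = timed_call()                      # includes any one-time initialization
    for _ in range(max(warmup - 1, 0)):       # extra warm-up calls, results discarded
        timed_call()
    steady = [timed_call() for _ in range(runs)]
    return first, sum(steady) / len(steady)


def make_fake_kernel(init_cost=0.05):
    """Hypothetical stand-in for a GPU function: slow on the first call only."""
    state = {"initialized": False}

    def kernel():
        if not state["initialized"]:
            time.sleep(init_cost)             # simulated one-time setup cost
            state["initialized"] = True
        sum(range(1000))                      # simulated per-call work

    return kernel


first, steady = benchmark(make_fake_kernel(), runs=20)
print(f"first call: {first * 1e3:.3f} ms, steady state: {steady * 1e3:.3f} ms")
```

With a real GPU function you would pass your own call as `fn` and, if the library launches work asynchronously, a blocking synchronize call as `sync`. Whether the upload and download belong inside `fn` depends on what you want to compare against the CPU version.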