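are you measuring only a single iteration of this ?
(kernels need to be compiled, caches warmed, etc., so the first run is much slower than later ones)
are there more gpu ops in your pipeline ?
(uploading/downloading between cpu and gpu is expensive, so it only pays off if several ops run on the gpu between transfers)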
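
to rule out the first point, something like the sketch below could help. it is only a minimal timing helper, not tied to any particular gpu library: `benchmark`, `warmup` and `iterations` are made-up names for illustration, and `fn` stands for whatever call you are timing. it runs a few untimed warm-up calls first, then reports the median of the timed ones.

```python
import statistics
import time


def benchmark(fn, warmup=3, iterations=20):
    """Time fn() over several runs, discarding the first `warmup` calls.

    `fn` is a placeholder for the operation under test (an assumption,
    not part of any real API).
    """
    # Warm-up calls absorb one-time costs (kernel compilation, cache fills,
    # lazy initialisation) so they do not distort the measurement.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    # The median is less sensitive to occasional outliers than the mean.
    return statistics.median(samples), samples


if __name__ == "__main__":
    # Stand-in workload; replace with the real call being measured.
    median_s, _ = benchmark(lambda: sum(range(100_000)))
    print(f"median: {median_s * 1e3:.3f} ms")
```

note that many gpu apis launch work asynchronously, so `fn` should wait for the result (e.g. synchronize or download it) before returning, otherwise you only time the launch. and if you put the upload/download inside `fn`, you are measuring the transfers too, which is exactly the second point above.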