If you are concerned about numpy allocating a lot of temporary arrays, just write your kernels as plain Python loops and compile them with @numba.njit.
Or write OpenCL kernels and run them with pyopencl.
Or use OpenCV functions with cv.UMat, but then you’re back to intermediate results between calls for a bunch of stuff, same as with numpy.
I hear there are GPU-accelerated flavors of numpy too (CuPy, for example).
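
To make the numba option concrete, here is a rough sketch of a fused kernel. The function name `magnitude`, the array shapes, and the gradient-magnitude formula are just my own illustration, not anything from your code. The `try`/`except` fallback is also my addition, so the snippet still runs (slowly) if numba isn't installed.

```python
import numpy as np

try:
    from numba import njit  # JIT-compiles the decorated function to machine code
except ImportError:
    # Assumed fallback: without numba, run the same loops as plain Python.
    def njit(func):
        return func


@njit
def magnitude(gx, gy, out):
    # Single pass over the data: each output element is computed from
    # scalars, so no intermediate arrays are allocated.
    for i in range(gx.shape[0]):
        for j in range(gx.shape[1]):
            out[i, j] = np.sqrt(gx[i, j] * gx[i, j] + gy[i, j] * gy[i, j])
    return out


rng = np.random.default_rng(0)
gx = rng.random((120, 160), dtype=np.float32)
gy = rng.random((120, 160), dtype=np.float32)
out = np.empty_like(gx)
magnitude(gx, gy, out)

# The numpy version for comparison: gx * gx, gy * gy and their sum
# are each materialized as a full-size temporary array.
reference = np.sqrt(gx * gx + gy * gy)
```

The first call pays the compilation cost, so time the second call if you benchmark it.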