Calling matrix multiplication from Python

If you are concerned about NumPy producing a lot of temporary arrays, just write your kernels as plain Python loops and decorate them with @numba.njit, which compiles them to machine code.
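To make that concrete, here is a rough sketch of a fused kernel, assuming you want something like `a @ b + c` without the temporary that `a @ b` would allocate. The name `matmul_add` and the preallocated `out` argument are made up for illustration, and the `ImportError` fallback is only there so the snippet still runs (slowly) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without Numba, fall back to a no-op
    # decorator so the code still runs as ordinary (slow) Python.
    def njit(func):
        return func


@njit
def matmul_add(a, b, c, out):
    """Hypothetical fused kernel: writes a @ b + c into out, no temporaries."""
    n, k = a.shape
    m = b.shape[1]
    for i in range(n):
        for j in range(m):
            acc = c[i, j]          # start from c instead of adding it afterwards
            for p in range(k):
                acc += a[i, p] * b[p, j]
            out[i, j] = acc
    return out


a = np.arange(6.0).reshape(2, 3)
b = np.arange(12.0).reshape(3, 4)
c = np.ones((2, 4))
out = np.empty((2, 4))
matmul_add(a, b, c, out)
```

The naive triple loop will not beat a tuned BLAS for large plain matrix products; the point is only that whatever you fuse into the loop body never hits memory as a separate array.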

Or write OpenCL kernels and run them with pyopencl.

Or use OpenCV functions with cv.UMat, but then you’re back to an intermediate result for every function call, same as with NumPy.

I hear there are GPU-accelerated flavors of NumPy too (CuPy, for example).