Thanks for the answer, but this is not an option. Matrix A's purpose is to serve as data storage in GPU memory that I only need to upload once and then process row-wise. Unfortunately I can't process A.row(i+1) before A.row(i) has been processed, because B is modified after every row, like B = B + myFunction(A.row(i)*B).
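To make the dependency concrete, here is a minimal CPU-side sketch of that recurrence using plain std::vector instead of GpuMat. myFunction here is only a hypothetical stand-in (the real one is not shown); the point is that each iteration reads the B produced by the previous one, so the loop cannot be batched into a single gemm:

```cpp
#include <vector>
#include <cstddef>

using Row = std::vector<double>;
using Mat = std::vector<Row>;

// Row-vector (1xN) times matrix (NxN) -> row-vector (1xN).
Row rowTimesMat(const Row& r, const Mat& B) {
    const std::size_t n = r.size();
    Row out(n, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t k = 0; k < n; ++k)
            out[j] += r[k] * B[k][j];
    return out;
}

// Hypothetical stand-in for myFunction: lifts the 1xN product back to
// NxN by repeating it in every row. The real function is unknown; only
// its side effect on B matters for the argument.
Mat myFunction(const Row& v) {
    return Mat(v.size(), v);
}

// The sequential recurrence from the comment: row i+1 of A multiplies
// the B that row i just updated, so the iterations must run in order.
void processRows(const Mat& A, Mat& B) {
    for (const Row& a : A) {
        Mat upd = myFunction(rowTimesMat(a, B));
        for (std::size_t i = 0; i < B.size(); ++i)
            for (std::size_t j = 0; j < B.size(); ++j)
                B[i][j] += upd[i][j];
    }
}
```

Because of this read-after-write chain on B, a bulk A*B computes every row against the *initial* B and gives a different answer.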
You are right: the bigger the matrices, the more efficient cuda::gemm is compared to cv::gemm. Doing the same with A[1024,8192]*B[8192,8192], it is 20x faster row-wise and 50x faster in bulk on my GPU. But my data is as big as it is and can't be processed in bulk, and I still don't understand why a small matrix multiplication of already-uploaded matrices is 20 times slower on a modern GPU than on an old single-threaded CPU.