Thanks for the answer, but this is not an option. Matrix A's purpose is to serve as data storage in GPU memory that I only need to upload once and then process row-wise. Unfortunately I can't process A.row(i+1) before A.row(i) has been processed, because B is modified after every row, like B = B + myFunction(A.row(i)*B).
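To make the dependency concrete, here is a minimal CPU-side sketch of that recurrence using plain std::vector instead of GpuMat. myFunction here is only a hypothetical stand-in (the real one is not shown); the point is that each iteration reads the B produced by the previous one, so the loop cannot be batched into a single gemm:

```cpp
#include <vector>
#include <cstddef>

using Row = std::vector<double>;
using Mat = std::vector<Row>;

// Row-vector (1xN) times matrix (NxN) -> row-vector (1xN).
Row rowTimesMat(const Row& r, const Mat& B) {
    const std::size_t n = r.size();
    Row out(n, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t k = 0; k < n; ++k)
            out[j] += r[k] * B[k][j];
    return out;
}

// Hypothetical stand-in for myFunction: lifts the 1xN product back to
// NxN by repeating it in every row. The real function is unknown; only
// its side effect on B matters for the argument.
Mat myFunction(const Row& v) {
    return Mat(v.size(), v);
}

// The sequential recurrence from the comment: row i+1 of A multiplies
// the B that row i just updated, so the iterations must run in order.
void processRows(const Mat& A, Mat& B) {
    for (const Row& a : A) {
        Mat upd = myFunction(rowTimesMat(a, B));
        for (std::size_t i = 0; i < B.size(); ++i)
            for (std::size_t j = 0; j < B.size(); ++j)
                B[i][j] += upd[i][j];
    }
}
```

Because of this read-after-write chain on B, a bulk A*B computes every row against the *initial* B and gives a different answer.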
You are right: the bigger the matrices, the more efficient cuda::gemm is compared to cv::gemm. Doing the same with A[1024,8192]*B[8192,8192], it is 20x faster row-wise and 50x faster in bulk on my GPU. But my data is as big as it is and can't be processed in bulk, and I still don't understand why a small matrix multiplication of already-uploaded matrices is 20 times slower on a modern GPU than on an old single-threaded CPU.