CUDA GEMM Matrix Multiply is very slow

Hello,

After finally managing to build OpenCV 4.10.0 with CUDA support and experimenting a bit, I was able to successfully accelerate several things, especially element-wise matrix operations, using CUDA.

But when it comes to matrix multiplication (NOT element-wise), CUDA seems to be damn slow. It takes about 1 second to multiply a batch of rows with a matrix on the GPU, while it takes less than a millisecond on my very old Intel Core i7-2600K CPU. I know that my motherboard and its PCIe slots are not fast enough to fully exploit my NVIDIA GeForce RTX 3060 GPU.

I did a speed test: five times, I multiplied 1024 1x4 rows with a 4x8 matrix (one row at a time) and measured the time. The results are:


GPU using GEMM:	0.974147	0.918417	0.922382	0.923081	0.941534	
CPU using GEMM:	0.000531	0.0004981	0.0004958	0.0004959	0.0004959	
CPU using * Operator:	0.0006888	0.0006762	0.0006743	0.0006861	0.0006755

That means the CPU is about 2000 times faster than the GPU.

Source:

#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/cudaarithm.hpp>

using namespace std;
using namespace cv;
using namespace cuda;

int main()
{
    int M = 1024;        // rows of A
    int N = 4;           // cols of A / rows of B
    int O = 8;           // cols of B
    int type = CV_64FC1; // yes I need double, but float won't make it faster anyway
    int maxIter = 5;     // benchmark repetitions
    
    Mat cA = Mat(M, N, type);
    Mat cB = Mat(N, O, type);    
    Mat cC,cDummy;

    randu(cA, Scalar(-0.5), Scalar(0.5));
    randu(cB, Scalar(-0.5), Scalar(0.5));

    // upload the inputs to the GPU once, outside the timed loops
    GpuMat gA = GpuMat(cA);
    GpuMat gB = GpuMat(cB);
    GpuMat gC, gDummy;   // gDummy stays empty: with beta = 0 no third term is added

    TickMeter tm;   

    cout << endl << "GPU using GEMM:\t";    
    for (int i = 0; i < maxIter; i++)
    {
        tm.reset(); tm.start();
        for (int m = 0; m < M; m++)  // one cuda::gemm launch per row
            cuda::gemm(gA.row(m), gB, 1.0, gDummy, 0, gC);
        tm.stop();
        cout << tm.getTimeSec() << "\t";
    }

    cout << endl << "CPU using GEMM:\t";
    for (int i = 0; i < maxIter; i++)
    {
        tm.reset(); tm.start();
        for (int m = 0; m < M; m++)
            cv::gemm(cA.row(m), cB, 1.0, cDummy, 0, cC);
        tm.stop();
        cout << tm.getTimeSec() << "\t";
    }

    cout << endl << "CPU using * Operator:\t";
    for (int i = 0; i < maxIter; i++)
    {
        tm.reset(); tm.start();
        for (int m = 0; m < M; m++)
            cC = cA.row(m) * cB;
        tm.stop();
        cout << tm.getTimeSec() << "\t";
    }        

    cuda::printCudaDeviceInfo(getDevice());
}

I need double precision, and CV_32FC1 won’t make it faster anyway. Some might notice that I could multiply the whole matrix gA with gB instead of iterating over the rows, but in my case I need to do it row by row (the code above is just an example). Ironically, multiplying the whole matrix gA with gB takes about the same time as multiplying a single row of gA with gB.

So why is the GPU that slow in this case? Aren’t GPUs designed for parallel matrix multiplication? How can I make it faster? If the answer involves CUDA streams, please give me an example, because I know matrix algebra but not the hardware-related CUDA details.

Output of cuda::printCudaDeviceInfo(getDevice()):

*** CUDA Device Query (Runtime API) version (CUDART static linking) *** 

Device count: 1

Device 0: "NVIDIA GeForce RTX 3060"
  CUDA Driver Version / Runtime Version          12.60 / 12.20
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 12287 MBytes (12884246528 bytes)
  GPU Clock Speed:                               1.81 GHz
  Max Texture Dimension Size (x,y,z)             1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
  Max Layered Texture Size (dim) x layers        1D=(32768) x 2048, 2D=(32768,32768) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 5 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
      Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version  = 12.60, CUDA Runtime Version = 12.20, NumDevs = 1

> Ironically, multiplying the whole matrix gA with gB takes about the same time as multiplying a single row of gA with gB.

It is much faster if you multiply the whole matrix at once. Multiplying a single row at a time is a very inefficient way of using the GPU.
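For reference, the single-pass version only changes the GPU loop in your program; a minimal sketch reusing the same variables:

    cout << endl << "GPU using GEMM:\t";
    for (int i = 0; i < maxIter; i++)
    {
        tm.reset(); tm.start();
        // one cuda::gemm call for all 1024 rows: (1024x4) * (4x8) -> 1024x8
        cuda::gemm(gA, gB, 1.0, gDummy, 0, gC);
        tm.stop();
        cout << tm.getTimeSec() << "\t";
    }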

See my times below from an RTX 3070 Ti (mobile), where the second "GPU using GEMM" line is calculated in a single pass.

GPU using GEMM: 0.796625        0.751285        0.772211        0.742591        0.724888
GPU using GEMM: 0.0008568       0.0008351       0.0007413       0.0007073       0.0006382
CPU using GEMM: 0.0002673       0.0002283       0.000221        0.0002238       0.000214
CPU using * Operator:   0.0003206       0.0003053       0.0003073       0.0003089       0.0003199 

Note: For a GPU like the RTX 3060, a 1024x1024 matrix is fairly small by today’s standards; GPUs work best when they have a lot of uninterrupted work to do. You can observe this if you increase the matrix sizes. Below are the results for an 8192x8192 matrix.

GPU using GEMM: 6.41995 6.16948 6.24548 6.16218 5.94419
GPU using GEMM: 0.0008902       0.0009698       0.0007414       0.0007812       0.0009502
CPU using GEMM: 0.0021444       0.0017757       0.0018039       0.0018446       0.0017067
CPU using * Operator:   0.0026671       0.0021779       0.0024081       0.0022823       0.0022718

Thanks for the answer, but this is not an option. Matrix A serves as data storage in GPU memory that I only need to upload once and then process row-wise. Unfortunately I can’t process A.row(i+1) before A.row(i) is finished, because B is modified after every row, like B = B + myFunction(A.row(i) * B).
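To make the pattern concrete, here is a rough sketch of how I imagine that dependent loop would look when queued on a single cv::cuda::Stream (the stream part is what I was asking about; myFunctionAndUpdate is a hypothetical placeholder for the GPU-side update B = B + myFunction(...); a single stream executes its work in order, so the row dependency is preserved, though I don’t know whether this removes the per-call overhead):

    cv::cuda::Stream stream;
    cv::cuda::GpuMat gTmp;

    for (int m = 0; m < M; m++)
    {
        // gTmp = A.row(m) * B, queued asynchronously on "stream"
        cuda::gemm(gA.row(m), gB, 1.0, gDummy, 0, gTmp, 0, stream);

        // hypothetical GPU-side update, queued on the same stream:
        // gB = gB + myFunction(gTmp)
        myFunctionAndUpdate(gTmp, gB, stream);
    }
    stream.waitForCompletion(); // synchronize with the CPU once, at the end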

You are right: the bigger the matrices, the more efficient it is to use cuda::gemm instead of cv::gemm, and doing the same with A[1024,8192] * B[8192,8192] is 20 times faster row-wise and 50x faster in bulk on my GPU. But my data is as big as it is and can’t be processed in bulk, and I still don’t understand why a small matrix multiplication of already-uploaded matrices is so much slower on a modern GPU than on an old single-threaded CPU.

In my previous reply I didn’t notice the dimensions you were using; I thought both matrices were the same size.

I’ll answer your question, but I’ll be glossing over a few details of how the GPU works, as they don’t alter the overall message. The really quick answer is that you are using a massively parallel processor to perform very, very little work. You’re calculating only 8 values (1x4 * 4x8 = 1x8).

A slightly longer but still brief answer is that your GPU is composed of 26 SMs (streaming multiprocessors), each of which is capable of calculating 512 output values at any one time (take this with a pinch of salt: it’s algorithm dependent, and not strictly true either, since only 32 threads per warp execute at any one instant, but up to 1536 threads can be resident on an SM to hide latency). Your matrix multiplication is therefore only using 1/26th of the available compute, and it is using even that inefficiently, because it calculates 8 of the 512 values an SM is capable of producing at any one time.
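As a back-of-envelope check using those figures (the 26 SMs and 512 values per SM are the rough numbers from above, not exact hardware limits), one gemm call in your loop touches a vanishingly small fraction of the GPU:

    #include <cstdio>

    int main()
    {
        const int sms          = 26;    // SM count quoted above for the RTX 3060
        const int valuesPerSm  = 512;   // rough per-SM output capacity quoted above
        const int outputValues = 1 * 8; // a 1x4 * 4x8 product yields 8 values

        // fraction of the GPU's concurrent output capacity one gemm call uses
        double pct = 100.0 * outputValues / (sms * valuesPerSm);
        std::printf("%.4f%% of available compute\n", pct); // ~0.06%
    }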

Note: an RTX 5090 has 170 SMs, so your situation does not improve if you get a faster GPU.

See below for a more in-depth introduction to GEMM calculations on the GPU.
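As a flavour of what such an introduction covers, here is a minimal, illustrative naive GEMM kernel (one thread per output element; this is a teaching sketch, not how cuBLAS actually implements GEMM). With the 1x4 * 4x8 shapes from the question, only 8 threads in a single block do useful work while the rest of the GPU idles:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Naive GEMM, one thread per output element: C (MxO) = A (MxN) * B (NxO), row-major.
    __global__ void naiveGemm(const double* A, const double* B, double* C,
                              int M, int N, int O)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= M || col >= O) return;

        double sum = 0.0;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * O + col];
        C[row * O + col] = sum;
    }

    int main()
    {
        const int M = 1, N = 4, O = 8; // the shapes from the original question
        double hA[M * N], hB[N * O], hC[M * O];
        for (int i = 0; i < M * N; ++i) hA[i] = 1.0;
        for (int i = 0; i < N * O; ++i) hB[i] = 1.0;

        double *dA, *dB, *dC;
        cudaMalloc(&dA, sizeof(hA));
        cudaMalloc(&dB, sizeof(hB));
        cudaMalloc(&dC, sizeof(hC));
        cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

        // A 1x8 result needs only 8 of this block's 256 threads, on one SM.
        dim3 block(16, 16);
        dim3 grid((O + block.x - 1) / block.x, (M + block.y - 1) / block.y);
        naiveGemm<<<grid, block>>>(dA, dB, dC, M, N, O);

        cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
        std::printf("C[0] = %f\n", hC[0]); // 4.0 for all-ones inputs
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }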