Hello,
after finally managing to build OpenCV 4.10.0 with CUDA support and experimenting with it a bit, I was able to successfully accelerate several things, especially element-wise matrix operations, using CUDA.
But when it comes to matrix multiplication (NOT element-wise), CUDA seems to be damn slow. It takes about 1 second to multiply 1024 single rows with a small matrix on the GPU, while the same work takes less than a millisecond on my very old Intel Core i7-2600K CPU. I know that my motherboard and its PCIe slots are not the fastest, so they cannot fully exploit my NVIDIA GeForce RTX 3060 GPU.
I did a speed test: in each of five iterations I multiplied 1024 individual 1x4 rows with a 4x8 matrix and measured the time per iteration. The results (in seconds) are:
GPU using GEMM: 0.974147 0.918417 0.922382 0.923081 0.941534
CPU using GEMM: 0.000531 0.0004981 0.0004958 0.0004959 0.0004959
CPU using * Operator: 0.0006888 0.0006762 0.0006743 0.0006861 0.0006755
That means the CPU is roughly 2000 times faster than the GPU here (about 0.92 s vs. 0.0005 s per iteration).
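Transfer time should not be the explanation, because in the code below the matrices are uploaded to the GPU once, outside the timed loops. If someone still suspects the PCIe transfer, I assume it could be timed separately with a hypothetical snippet like this (reusing the cA from the listing below):

TickMeter tmUpload;
tmUpload.start();
GpuMat gTest;
gTest.upload(cA); // 1024 x 4 CV_64FC1 values = 32 KB host-to-device
tmUpload.stop();
// note: the very first CUDA call also pays for context creation
cout << "upload of cA: " << tmUpload.getTimeSec() << " s" << endl;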
source:
#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/cudaarithm.hpp>
using namespace std;
using namespace cv;
using namespace cuda;
int main()
{
    int M = 1024;
    int N = 4;
    int O = 8;
    int type = CV_64FC1; // yes, I need double, but float won't make it faster anyway
    int maxIter = 5;

    // host matrices with random values
    Mat cA = Mat(M, N, type);
    Mat cB = Mat(N, O, type);
    Mat cC, cDummy;
    randu(cA, Scalar(-0.5), Scalar(0.5));
    randu(cB, Scalar(-0.5), Scalar(0.5));

    // upload to the GPU once, outside the timed loops
    GpuMat gA = GpuMat(cA);
    GpuMat gB = GpuMat(cB);
    GpuMat gC, gDummy;

    TickMeter tm;

    // 1) GPU: one cuda::gemm call per row of gA
    cout << endl << "GPU using GEMM:\t";
    for (int i = 0; i < maxIter; i++)
    {
        tm.reset(); tm.start();
        for (int m = 0; m < M; m++)
            cuda::gemm(gA.row(m), gB, 1.0, gDummy, 0, gC);
        tm.stop();
        cout << tm.getTimeSec() << "\t";
    }

    // 2) CPU: one cv::gemm call per row of cA
    cout << endl << "CPU using GEMM:\t";
    for (int i = 0; i < maxIter; i++)
    {
        tm.reset(); tm.start();
        for (int m = 0; m < M; m++)
            cv::gemm(cA.row(m), cB, 1.0, cDummy, 0, cC);
        tm.stop();
        cout << tm.getTimeSec() << "\t";
    }

    // 3) CPU: Mat multiplication operator per row of cA
    cout << endl << "CPU using * Operator:\t";
    for (int i = 0; i < maxIter; i++)
    {
        tm.reset(); tm.start();
        for (int m = 0; m < M; m++)
            cC = cA.row(m) * cB;
        tm.stop();
        cout << tm.getTimeSec() << "\t";
    }

    cuda::printCudaDeviceInfo(getDevice());
}
I need double precision, and CV_32FC1 does not make it faster anyway. Some might also notice that I could multiply the whole matrix gA with gB instead of iterating over the rows, but in my real use case I need to do it row by row (the code above is just an example). Ironically, multiplying the whole matrix gA with gB takes about the same time as multiplying a single row of gA with gB.
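For clarity, by "the whole matrix" I mean a single call roughly like this, using the same variables as in the listing above:

tm.reset(); tm.start();
// one 1024x4 * 4x8 multiplication instead of 1024 separate 1x4 * 4x8 multiplications
cuda::gemm(gA, gB, 1.0, gDummy, 0, gC);
tm.stop();
cout << endl << "GPU whole matrix:\t" << tm.getTimeSec();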
So why is the GPU that slow in this case? Aren't GPUs designed for parallel matrix multiplications? How can I make it faster? If the answer involves CUDA streams, please give me an example, because I know matrix algebra but not the hardware-related CUDA details.
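To make the question concrete: I assume a streamed version of my inner loop would look roughly like the sketch below (untested, just my reading of the Stream parameter of cuda::gemm), but I don't know whether that is the right approach or how much it would actually help:

Stream stream; // cv::cuda::Stream
tm.reset(); tm.start();
for (int m = 0; m < M; m++)
    // enqueue every per-row multiplication on the same stream instead of
    // blocking on the default stream after each call; gC is overwritten
    // every iteration, just like in my benchmark above
    cuda::gemm(gA.row(m), gB, 1.0, gDummy, 0, gC, 0, stream);
stream.waitForCompletion(); // synchronize once, after everything is queued
tm.stop();
cout << endl << "GPU using GEMM + Stream:\t" << tm.getTimeSec();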
Output of cuda::printCudaDeviceInfo(getDevice()):
*** CUDA Device Query (Runtime API) version (CUDART static linking) ***
Device count: 1
Device 0: "NVIDIA GeForce RTX 3060"
CUDA Driver Version / Runtime Version 12.60 / 12.20
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 12287 MBytes (12884246528 bytes)
GPU Clock Speed: 1.81 GHz
Max Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
Max Layered Texture Size (dim) x layers 1D=(32768) x 2048, 2D=(32768,32768) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 5 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.60, CUDA Runtime Version = 12.20, NumDevs = 1