Unified Memory problem

Hi, I’m testing unified memory; this is my first CUDA program with OpenCV. The problem is that after the memory initialization and some cv::cuda computation (resize and cvtColor), I try to access the buffer from the CPU, but it doesn’t change. Below are the code and output for checking.
Code:

    void *image_Proc, *image_Res, *image_Cuda;
    // allocate unified memory space for image
    if (cudaSuccess != cudaMallocManaged(&image_Proc, maxImageByteSize))
        return (-1);
    if (cudaSuccess != cudaMallocManaged(&image_Res, maxImageByteSize))
        return (-2);
    if (cudaSuccess != cudaMallocManaged(&image_Cuda, maxImageByteSize))
        return (-3);

        try
        {
            for (int i = 0; i < fileNames.size(); ++i)
            {
                if (i == 0)
                {
                    cv::Mat loadImg = cv::imread(fileNames[i], cv::IMREAD_UNCHANGED); //img loaded is 1024*1024*3

                    cv::Mat imgCuda_n(loadImg.rows, loadImg.cols, loadImg.type(), image_Cuda);                                                         
                    cv::Mat imgProc_n(loadImg.rows, loadImg.cols, loadImg.type(), image_Proc); 
                    cv::Mat imgRes_n(loadImg.rows, loadImg.cols, loadImg.type(), image_Res);
                    // same memory used
                    cv::cuda::GpuMat imgCuda(loadImg.rows, loadImg.cols, loadImg.type(), image_Cuda);
                    cv::cuda::GpuMat imgProc(loadImg.rows, loadImg.cols, loadImg.type(), image_Proc); 
                    cv::cuda::GpuMat imgRes(loadImg.rows, loadImg.cols, loadImg.type(), image_Res);

                    cudaDeviceSynchronize();
                    // printing addresses
                    cout << "image_Proc address: " << &image_Proc << "\timgProc address: " << &imgProc << "\timgProc_n address: " << &imgProc_n << endl;
                    cout << "image_Res address: " << &image_Res << "\timgRes address: " << &imgRes << "\timgRes_n address: " << &imgRes_n  << endl;

                    imgCuda.upload(loadImg); //load image

                    //print dimensions before 
                    cout << "########\n\tBEFORE\n#########\n";
                    cout << "imgCuda_n dim: " << imgCuda_n.cols << " " << imgCuda_n.channels() << endl;
                    cout << "imgProc_n dim: " << imgProc_n.cols << " " << imgProc_n.channels() << endl;
                    cout << "imgRes_n dim: " << imgRes_n.cols << " " << imgRes_n.channels() << endl;
                    cout << "imgCuda dim: " << imgCuda.cols << " " << imgCuda.channels() << endl;
                    cout << "imgProc dim: " << imgProc.cols << " " << imgProc.channels() << endl;
                    cout << "imgRes dim: " << imgRes.cols << " " << imgRes.channels() << endl;

I’m also uploading the output.
The addresses of the variables are different (I don’t know if they should be equal).
The sizes of the Mat and GpuMat objects are correct before, but after, the ones for imgProc_n and imgRes_n aren’t.

image_Proc address: 0x7fffa96f7f30      imgProc address: 0x7fffa96f8090  imgProc_n address: 0x7fffa96f82f0
image_Res address: 0x7fffa96f7f38       imgRes address: 0x7fffa96f80d0       imgRes_n address: 0x7fffa96f8350
########
    BEFORE
#########
imgCuda_n dim: 1024 3
imgProc_n dim: 1024 3
imgRes_n dim: 1024 3
imgCuda dim: 1024 3
imgProc dim: 1024 3
imgRes dim: 1024 3
loading /home/edo/face/detection/build/testImg/00792.png
########
    AFTER
#########
imgCuda_n dim: 1024 3
imgProc_n dim: 1024 3
imgRes_n dim: 1024 3
imgCuda dim: 1024 3
imgProc dim: 512 1
imgRes dim: 512 3

I may be wrong (your naming convention is hard to follow), but it looks like you are comparing the addresses of the Mat/GpuMat objects with the addresses of the memory that they point to?

What are you trying to check with this? The addresses of the memory should all be the same, as you initialized the Mat/GpuMat objects with them. I would think the more important check would be whether the managed memory works for you.

Just out of interest, why are you using managed memory instead of explicitly uploading/downloading? Personally, I have never used it, as it seems you could easily incur a performance penalty by relying on the memory being where you want it when you want it.

The output of the addresses was just a test; I check using the dimensions of the Mat and GpuMat afterwards. For example, imgProc and imgProc_n(ormal) should both have 512 cols and 1 channel; the result is correct for imgProc but not for its corresponding CPU Mat (imgProc_n). I want to operate on the GPU and not need to download the Mat after the computations; performance-wise, I noticed a great improvement over the previous upload/download version. For instance, to use managed memory, should I add some flags for CUDA compilation?

The code for AFTER is not shown?

I am surprised at that, as they essentially do the same thing.

I’m sorry, I forgot a piece of code:

                    cout << "loading " << fileNames[i] << endl;

                    opencvProcess(imgCuda, imgRes, imgProc);
                    cudaDeviceSynchronize();

                    cout << "########\n\tAFTER\n#########\n";
                    cout << "imgCuda_n dim: " << imgCuda_n.cols << " " << imgCuda_n.channels() << endl;
                    cout << "imgProc_n dim: " << imgProc_n.cols << " " << imgProc_n.channels() << endl;
                    cout << "imgRes_n dim: " << imgRes_n.cols << " " << imgRes_n.channels() << endl;
                    cout << "imgCuda dim: " << imgCuda.cols << " " << imgCuda.channels() << endl;
                    cout << "imgProc dim: " << imgProc.cols << " " << imgProc.channels() << endl;
                    cout << "imgRes dim: " << imgRes.cols << " " << imgRes.channels() << endl;

and here is the computation using the GPU:

void opencvProcess(const cv::cuda::GpuMat imgCuda, cv::cuda::GpuMat &imgRes, cv::cuda::GpuMat &imgProc)
{
    cv::cuda::resize(imgCuda, imgRes, cv::Size(512, 512), cv::INTER_TAB_SIZE2, cv::INTER_CUBIC);
    cv::cuda::cvtColor(imgRes, imgProc, cv::COLOR_BGR2GRAY);
}

You haven’t included the code for the Mats, i.e. which function do you use to convert imgCuda_n to imgProc_n?

Additionally:

  1. You’re using the wrong signature for cv::cuda::resize; it works because fx and fy are ignored when you pass a size, see the docs.
  2. You haven’t passed a stream, so you don’t need cudaDeviceSynchronize();
  3. The reason you may have seen a speed-up when using managed memory is most likely because you pre-allocated imgRes and imgProc. You could have done this without managed memory, which would be much cleaner and less error-prone:
     GpuMat imgRes(loadImg.size(), loadImg.type());
     GpuMat imgProc(loadImg.size(), loadImg.type());
    
  4. I am still not clear on what you are trying to do.

Right now I have posted all the code involving OpenCV and CUDA.

  1. Looking at the docs, fx and fy are calculated from the size that I pass; the default is 0, and from the docs, if 0, they are calculated as (double)dsize.width/src.cols.
  2. Thanks, I’ll update the code.
  3. Yes, in fact in the previous version I didn’t pre-allocate.
  4. I’m trying to use unified memory, just to see how it works.

But I still don’t understand why it doesn’t work.

This is true, but you are passing cv::INTER_TAB_SIZE2 and cv::INTER_CUBIC as fx and fy, which are ignored. If you want to pass the interpolation method cv::INTER_CUBIC, you need to call

cv::cuda::resize(imgCuda, imgRes, cv::Size(512, 512), 0, 0, cv::INTER_CUBIC);

Nothing that you have posted indicates that it doesn’t work.

The output shows that imgProc_n and imgProc have different dimensions and should not, and the same for imgRes and imgRes_n.
Those four variables are allocated to use shared memory, but when I work with imgProc and imgRes (in the opencvProcess function posted), it doesn’t make a difference in the associated variables when accessed from the CPU.

Adding the following after my function:

imgRes.download(imgRes_n);
imgProc.download(imgProc_n);

the output is

########
    AFTER
#########
imgCuda_n dim: 1024 3
imgProc_n dim: 512 1
imgRes_n dim: 512 3
imgCuda dim: 1024 3
imgProc dim: 512 1
imgRes dim: 512 3

So I want this output using unified memory and without using download.

Now I understand. You are expecting the dimensions of the Mats to change when you process the GpuMats? This won’t happen and has nothing to do with unified memory at all.

The procedure is:

  1. You have a Mat (imgCuda_n) and a GpuMat (imgCuda) pointing at the same memory region (note: try this with two Mats).
  2. Internally, because imgRes is not 512x512, imgRes will allocate new memory of the correct size and point to that on the call to
cv::cuda::resize(imgCuda, imgRes, cv::Size(512, 512), cv::INTER_TAB_SIZE2, cv::INTER_CUBIC);
  3. On the return from opencvProcess() you then have imgRes with a new size pointing to new memory, and imgRes_n the same as before.

I suggest you try the same thing with two Mat objects with manually allocated host memory, instead of a GpuMat and a Mat, to understand what’s going on.

Thanks, I’ll check this. So, to use unified memory for this task (simply resizing and changing the color channels), what are the best practices?

First check what happens with Mats. I think you have a fundamental misunderstanding of how objects work. You can’t alter one object and expect it to automatically change another just because they point at the same memory location.

To confirm: if your aim is to have a Mat and a GpuMat point at the same memory location and be able to process the Mat in host functions and the GpuMat in device functions, you will have a hard time. This can only work when the functions modify the memory in place and don’t modify any of the object properties (size, type, etc.).

OK, I’ll check what happens with Mats. This was helpful.