Does using grab() with OpenCV's cudacodec VideoReader not improve efficiency?

It means that you are not doing anything different from just calling nextFrame(), so the grab()/retrieve() combination is redundant.

It's possible that, for the frame size and CPU you are using, CPU decoding is faster. I would check what the GPU decoder utilization is. If it's less than 100%, try increasing minNumDecodeSurfaces further, or just see if

params.minNumDecodeSurfaces=100

makes a difference.

You should also be able to improve the CPU timings by passing the existing img into retrieve() inside the loop, which avoids a reallocation on every iteration, i.e.

...
        cap.grab()
        ret, _ = cap.retrieve(img)
...
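
To make the comparison concrete, here is a minimal sketch of a timing loop that reuses one frame buffer with nextFrame(), which does the same work as the grab()/retrieve() pair. The names count_frames and open_reader are hypothetical helpers, not part of OpenCV. open_reader assumes an OpenCV build with cudacodec support and that createVideoReader accepts the init params through a params keyword; check this against your version.

```python
import time


def count_frames(reader, buffer=None):
    """Decode until the reader reports no more frames.

    `reader` is assumed to behave like cv2.cudacodec.VideoReader:
    nextFrame() returns (ok, frame) and accepts an optional destination.
    Returns (number of frames decoded, elapsed seconds).
    """
    frames = 0
    start = time.perf_counter()
    while True:
        if buffer is None:
            # First call: let the reader allocate the frame.
            ok, buffer = reader.nextFrame()
        else:
            # Later calls: hand the same frame back so it can be reused.
            ok, buffer = reader.nextFrame(buffer)
        if not ok:
            break
        frames += 1
    return frames, time.perf_counter() - start


def open_reader(path, min_surfaces=100):
    """Hypothetical helper: open a GPU reader with more decode surfaces."""
    # Imported here so count_frames stays usable without a CUDA build.
    import cv2

    params = cv2.cudacodec.VideoReaderInitParams()
    params.minNumDecodeSurfaces = min_surfaces
    return cv2.cudacodec.createVideoReader(path, params=params)
```

Calling count_frames(open_reader(path)) with a few different min_surfaces values, and timing the CPU cv2.VideoCapture loop the same way, should show whether the extra decode surfaces or the buffer reuse make any difference on your hardware.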