CUDA FAST detector much slower than normal FAST

Thanks. I am in the process of reading the post. Some of the suggestions (not measuring upload and download, for example) I was already following, so they don't help much, but there is one (pre-allocating) that I want to try. Oh, and I am using C++.

The CUDA version is slower both on the Xavier and on my Ubuntu host, which has a GeForce RTX 2080 Mobile.
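
For anyone comparing numbers, here is a minimal sketch of the measurement pattern those suggestions describe: buffers allocated once outside the loop, a few untimed warm-up calls, and only the call itself inside the timed region. `processFrame` and `medianMs` are hypothetical names, and `processFrame` is a plain CPU stand-in for the detector call, not OpenCV code. With a real GPU detector the timed region would also need a synchronization before the clock is stopped, because kernel launches return asynchronously.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the detector call. It writes into a
// caller-owned buffer, so nothing is allocated inside the timed region.
void processFrame(const std::vector<std::uint8_t>& frame,
                  std::vector<std::uint8_t>& out) {
    for (std::size_t i = 0; i < frame.size(); ++i) {
        out[i] = frame[i] > 128 ? 255 : 0;
    }
}

// Returns the median time per call in milliseconds.
// Warm-up calls run first and are not timed.
double medianMs(const std::vector<std::uint8_t>& frame,
                std::vector<std::uint8_t>& out,
                int warmup, int iterations) {
    assert(iterations > 0);
    assert(out.size() == frame.size());  // output must be pre-allocated

    for (int i = 0; i < warmup; ++i) {
        processFrame(frame, out);  // untimed: absorbs one-time setup cost
    }

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        processFrame(frame, out);
        // A GPU version would synchronize here before reading the clock.
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

Calling `medianMs(frame, out, 5, 100)` with `out` already sized to match `frame` gives a per-call figure that leaves out first-call initialization and allocation, which is the comparison that matters between the two detectors.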