there are too many moving parts here.
you need to take the first steps yourself. swap out or exclude individual parts to narrow the problem space down. you know, standard problem solving.
I have no appreciable experience with Java or CUDA, let alone both combined. do not expect further responses from me on this thread/problem.