matchTemplate() with masks on CUDA GPU

Hi,

Does anyone know how I can run matchTemplate() with a mask on my GPU? It works fine on the CPU, it's just too slow. I don't see any mask support in the CUDA version, and I was wondering what people actually do in that situation. I can draw the mask area directly onto my template, but then it is no longer recognised in the images.

My mask looks like this:

Managed to make it noticeably faster by running my image batch on multiple CPU threads with parallelStream / a fork-join pool:
MY_MAP.entrySet().parallelStream().forEach(entry -> {
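
For anyone wanting to try the same thing, here is a minimal sketch of the multi-threaded batch approach. It is not the exact code from my project: the class name ParallelBatch, the helper runAll, the thread count, and the matcher function are all made up for illustration, and the matcher here is a dummy that stands in for the masked matchTemplate() call. Submitting the parallel stream to a dedicated ForkJoinPool is a commonly used trick to control the number of threads; as far as I know it relies on how the JDK currently implements parallel streams rather than on a documented guarantee.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Function;

class ParallelBatch {

    // Hypothetical helper: applies `matcher` to every value of `batch` on a
    // dedicated ForkJoinPool and collects the results under the same keys.
    // `matcher` must not return null (ConcurrentHashMap rejects null values).
    static <K, V, R> Map<K, R> runAll(Map<K, V> batch, int threads,
                                      Function<V, R> matcher) throws Exception {
        Map<K, R> results = new ConcurrentHashMap<>();
        ForkJoinPool pool = new ForkJoinPool(threads);
        try {
            // Running the parallel stream from inside the pool makes its
            // tasks execute on this pool instead of the common pool
            // (implementation behaviour, see the note above).
            pool.submit(() ->
                batch.entrySet().parallelStream().forEach(entry ->
                    results.put(entry.getKey(), matcher.apply(entry.getValue()))
                )
            ).get(); // wait until the whole batch has been processed
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Dummy batch: file name -> fake "image" (just an int here).
        Map<String, Integer> batch = Map.of("a.png", 1, "b.png", 2, "c.png", 3);
        // Dummy matcher standing in for the real masked template match.
        Map<String, Integer> scores = runAll(batch, 2, image -> image * 10);
        System.out.println(scores.size()); // prints 3
    }
}
```

In real use the matcher would load the image and run the masked match, and the thread count would be tuned to the number of physical cores. Each worker should use its own Mat objects so no mutable state is shared between threads.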

Happy to receive any other hints which can improve the execution time.