OpenCV supported CUDA calcCovarMatrix

What is a good way to make a CUDA/GPU-based version of the calcCovarMatrix function? It doesn't seem to be available yet.
The C++ version in 4.7.# works great, and now we need to go faster. We know how to write our own simple GPU kernels from scratch and mix them with OpenCV, but all the fast reduction and thread synchronization gets a bit much when starting from ground zero. So where do people start when porting a known-working C++ function to a CUDA wrapper? And if we make something useful, should we contribute it back to the OpenCV community? Thx.

I would try something similar to

There may be more efficient, fused ways to calculate the intermediate mean-subtracted matrix, but since most of the computation is in the covariance product itself (gemm), it probably wouldn't make any significant difference.
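
For anyone wanting something to validate a GPU port against, here is a minimal CPU sketch of the structure described above: compute the column means, subtract them, then form the covariance as one matrix product. It is not the snippet from the reply and not an OpenCV API. `covarReference` is a hypothetical helper name, and the layout (one sample per row, result scaled by 1/n, as with `COVAR_ROWS | COVAR_NORMAL | COVAR_SCALE`) is an assumption. On the GPU, the mean step would presumably map to something like `cv::cuda::reduce` and the product to `cv::cuda::gemm`, but that mapping is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of OpenCV): CPU reference for the
// "subtract mean, then gemm" approach.
// samples: row-major, n rows (one sample per row) by d columns (features).
// Returns the d x d covariance matrix, row-major, scaled by 1/n (assumed scaling).
std::vector<double> covarReference(const std::vector<double>& samples,
                                   std::size_t n, std::size_t d)
{
    assert(n > 0 && d > 0 && samples.size() == n * d);

    // Step 1: column means. This is the reduction step on the GPU.
    std::vector<double> mean(d, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < d; ++j)
            mean[j] += samples[i * d + j];
    for (std::size_t j = 0; j < d; ++j)
        mean[j] /= static_cast<double>(n);

    // Step 2: intermediate matrix with the mean subtracted from every row.
    std::vector<double> centered(n * d);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < d; ++j)
            centered[i * d + j] = samples[i * d + j] - mean[j];

    // Step 3: cov = centered^T * centered / n. This is the gemm step,
    // and it dominates the cost for large inputs.
    std::vector<double> cov(d * d, 0.0);
    for (std::size_t a = 0; a < d; ++a)
        for (std::size_t b = 0; b < d; ++b) {
            double sum = 0.0;
            for (std::size_t i = 0; i < n; ++i)
                sum += centered[i * d + a] * centered[i * d + b];
            cov[a * d + b] = sum / static_cast<double>(n);
        }
    return cov;
}
```

Comparing the output of this against both `cv::calcCovarMatrix` and the GPU version on a small input is a cheap way to catch layout or scaling mismatches before worrying about speed.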