Yep, optimisation is a tricky thing in Python, especially on a simple computer like the Raspberry Pi.
First, consider using integer operations instead of float32 where possible. They are often considerably faster.
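As a rough illustration of what "integer instead of float" can look like, here is a minimal fixed-point sketch that scales a pixel value by 0.7 using only an integer multiply and a shift. The names `SHIFT`, `to_fixed` and `scale_pixel` are made up for this example, and the result can be off by one compared to the float version. In pure Python the speed gain may be small; the idea pays off mainly when the same arithmetic runs on integer arrays or in C/C++.

```python
SHIFT = 8  # number of fractional bits; the scale factor is 2**8 = 256


def to_fixed(factor):
    """Convert a float factor (e.g. 0.7) to a fixed-point integer, done once."""
    return int(round(factor * (1 << SHIFT)))


def scale_pixel(value, fixed_factor):
    """Scale an integer pixel value using integer multiply and shift only."""
    # The right shift drops the fractional bits, so the result is truncated.
    return (value * fixed_factor) >> SHIFT


gain = to_fixed(0.7)            # precomputed outside the hot loop
pixels = [0, 64, 128, 200, 255]
scaled = [scale_pixel(p, gain) for p in pixels]
print(scaled)
```

The conversion to fixed point happens once up front, so the inner loop never touches a float.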
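An important drawback of Python is that it's effectively single-threaded for CPU-bound work (because of the global interpreter lock), so unless you split the work across several processes it will use only one of the four cores of the CPU. C++ is much faster and it's easier to parallelize (e.g. using the TBB library).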
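If you want to stay in Python, one way around the one-core limit is the standard `multiprocessing` module, which runs separate processes instead of threads. Below is a minimal sketch, assuming the image is a plain list of rows and the per-pixel work is a simple brightness bump; `brighten_rows`, `split_rows` and `brighten_parallel` are hypothetical names for this example, not part of any library. Keep in mind that sending data between processes has a cost, so this only helps when each chunk carries enough work.

```python
from multiprocessing import Pool


def brighten_rows(rows):
    """Hypothetical per-chunk work: add 10 to each pixel, clamped to 255."""
    return [[min(p + 10, 255) for p in row] for row in rows]


def split_rows(image, parts):
    """Split a list of rows into at most `parts` contiguous chunks."""
    size = max(1, -(-len(image) // parts))  # ceiling division, at least 1
    return [image[i:i + size] for i in range(0, len(image), size)]


def brighten_parallel(image, workers=4):
    """Process the chunks in separate processes, one per core by default."""
    chunks = split_rows(image, workers)
    with Pool(processes=workers) as pool:
        results = pool.map(brighten_rows, chunks)  # keeps chunk order
    # Stitch the processed chunks back into one list of rows.
    return [row for chunk in results for row in chunk]


if __name__ == "__main__":
    image = [[(x + y) % 256 for x in range(8)] for y in range(8)]
    out = brighten_parallel(image)
    print(out[0])
```

The `if __name__ == "__main__":` guard matters here: without it, platforms that start worker processes by re-importing the script would try to create the pool again in every worker.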
These solutions are probably simpler to implement than an OpenGL shader. The Raspberry Pi has no OpenCL support.