As the title says, I want to perform a pixel-wise filter operation.
Suppose, for example, that the filter is M-by-M and is applied to the image column by column for each line: the filter slides along the first row, then moves down to the next row, and so on iteratively.
There are several things that you can do, depending on the application:
Avoid using Python for the inner loops: four nested for loops will take forever to run.
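For reference, this is what the direct four-loop approach looks like in C++ (a minimal sketch; the CV_32F types and the skipped border pixels are assumptions of this sketch, not requirements):

```cpp
#include <opencv2/core.hpp>

// Direct pixel-wise filtering: two loops over the image, two over the kernel,
// i.e. O(width * height * M * M) work. This is the baseline the other suggestions improve on.
// Assumes a CV_32F image and a square CV_32F kernel; border pixels are left at zero.
cv::Mat filterDirect(const cv::Mat& src, const cv::Mat& kernel)
{
    const int r = kernel.rows / 2;                       // kernel radius (M / 2)
    cv::Mat dst = cv::Mat::zeros(src.size(), CV_32F);

    for (int y = r; y < src.rows - r; ++y)               // slide along each row...
        for (int x = r; x < src.cols - r; ++x)           // ...and each column
        {
            float sum = 0.f;
            for (int ky = -r; ky <= r; ++ky)             // loop over the M x M window
                for (int kx = -r; kx <= r; ++kx)
                    sum += kernel.at<float>(ky + r, kx + r) * src.at<float>(y + ky, x + kx);
            dst.at<float>(y, x) = sum;
        }
    return dst;
}
```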
Check whether the filter is separable: if it is, the M x M window reduces to an M-element row filter followed by an M-element column filter, roughly 2M operations per pixel instead of M*M.
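A Gaussian is the classic separable example, and OpenCV will do the two 1-D passes for you; a minimal sketch, with the file names and kernel size as placeholders:

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>

int main()
{
    cv::Mat src = cv::imread("input.png", cv::IMREAD_GRAYSCALE);   // placeholder input
    CV_Assert(!src.empty());

    const int M = 15;                                              // placeholder kernel size
    // A separable M x M Gaussian is an M x 1 column kernel times a 1 x M row kernel.
    cv::Mat k1d = cv::getGaussianKernel(M, -1, CV_32F);            // sigma derived from M

    cv::Mat dst;
    // Two 1-D passes (~2*M multiply-adds per pixel) instead of one M x M pass.
    cv::sepFilter2D(src, dst, -1, k1d.t(), k1d);                   // row kernel, then column kernel

    cv::imwrite("filtered.png", dst);                              // placeholder output
    return 0;
}
```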
Check whether you can do the operation with a pyramidal (coarse-to-fine) approach. This mostly applies to detection tasks: first run the filter on a reduced image (say, downscaled by 4) with an M/4 x M/4 kernel, then continue the operation only on the interesting areas at higher resolutions.
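A rough sketch of such a coarse-to-fine scheme using OpenCV's pyramid functions; the "interesting area" criterion (a simple threshold on the coarse response) and its value are made-up placeholders:

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Cheap pass with the small kernel on a 4x-downscaled image, then the full-size
// kernel only inside the regions that responded strongly at the coarse level.
void pyramidalFilter(const cv::Mat& src, const cv::Mat& kernelFull,
                     const cv::Mat& kernelQuarter, cv::Mat& dst)
{
    cv::Mat small;
    cv::pyrDown(src, small);                 // downscale by 2...
    cv::pyrDown(small, small);               // ...twice, i.e. by 4

    cv::Mat response;
    cv::filter2D(small, response, CV_32F, kernelQuarter);          // cheap M/4 x M/4 pass

    cv::Mat absResp = cv::abs(response), mask;
    cv::threshold(absResp, mask, 10.0, 255.0, cv::THRESH_BINARY);  // placeholder criterion
    mask.convertTo(mask, CV_8U);

    dst = cv::Mat::zeros(src.size(), CV_32F);
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    for (const auto& c : contours)
    {
        cv::Rect roi = cv::boundingRect(c);
        // Scale the coarse ROI back to full resolution and clip it to the image.
        cv::Rect full(roi.x * 4, roi.y * 4, roi.width * 4, roi.height * 4);
        full &= cv::Rect(0, 0, src.cols, src.rows);
        cv::filter2D(src(full), dst(full), CV_32F, kernelFull);    // expensive pass, ROI only
    }
}
```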
Sliding windows are quite easy to parallelize. Check this answer on how to do it simply using the TBB library.
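I can't reproduce the linked answer here, but the basic pattern with TBB's parallel_for looks roughly like this (a sketch reusing the direct filter loop from above; each task gets a disjoint band of output rows, so no locking is needed):

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <opencv2/core.hpp>

// Each TBB task filters a band of rows; reads of src overlap between tasks,
// but every task writes a disjoint set of dst rows, so no synchronization is needed.
void filterParallel(const cv::Mat& src, const cv::Mat& kernel, cv::Mat& dst)
{
    const int r = kernel.rows / 2;                       // kernel radius
    dst = cv::Mat::zeros(src.size(), CV_32F);

    tbb::parallel_for(tbb::blocked_range<int>(r, src.rows - r),
        [&](const tbb::blocked_range<int>& band)
        {
            for (int y = band.begin(); y != band.end(); ++y)
                for (int x = r; x < src.cols - r; ++x)
                {
                    float sum = 0.f;
                    for (int ky = -r; ky <= r; ++ky)
                        for (int kx = -r; kx <= r; ++kx)
                            sum += kernel.at<float>(ky + r, kx + r) * src.at<float>(y + ky, x + kx);
                    dst.at<float>(y, x) = sum;
                }
        });
}
```

OpenCV's own cv::parallel_for_ follows the same pattern if you would rather not add a direct TBB dependency.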
Going a step further, the best solution would be massive parallelization on the GPU; this kind of operation is exactly where GPUs shine.
Fortunately, OpenCV has already implemented this on CUDA (NVIDIA graphics cards). Check this tutorial for how to do it.
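Without the tutorial link, the gist with OpenCV's CUDA module looks roughly like this (assumes OpenCV was built with CUDA support, i.e. the cudafilters module is available; the file names and the box kernel are placeholders):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/cudafilters.hpp>

int main()
{
    cv::Mat src = cv::imread("input.png", cv::IMREAD_GRAYSCALE);      // placeholder input
    cv::Mat kernel = cv::Mat::ones(15, 15, CV_32F) / (15.f * 15.f);   // placeholder 15x15 box kernel

    // Upload the image to the GPU.
    cv::cuda::GpuMat d_src, d_dst;
    d_src.upload(src);

    // Create a linear filter that runs on the GPU and apply it.
    cv::Ptr<cv::cuda::Filter> filter =
        cv::cuda::createLinearFilter(d_src.type(), d_src.type(), kernel);
    filter->apply(d_src, d_dst);

    // Download the result back to the CPU.
    cv::Mat dst;
    d_dst.download(dst);
    cv::imwrite("filtered_gpu.png", dst);                             // placeholder output
    return 0;
}
```

Keep in mind that the upload/download transfers have a fixed cost, so this pays off for large images or when several GPU operations are chained.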