I am applying a 5x5 median filter using the cuda library and getting quite poor results. I wrote my own 5x5 function and reduced the time by a quarter.
It seems that the opencv uses the histogram technique (I think) this is fast but maybe isn’t best for cuda which doesn’t like different code paths (i.e. if statements and the like). My method was to code up a sorting network, which has almost no variation in code paths, maybe opencv should try this?
usually with filters, cuda or not, you’ve got smart “good complexity” (e.g. constant time) but slow implementations for huge kernel sizes, and dumb “bad complexity” (n², n³, or worse…) but fast special cases for small kernel sizes. each will trump the other in their intended kernel sizes. a generic implementation would then pick (dispatch to) one or the other, depending on the actual kernel size,
sometimes there is just one implementation, or the decision thresholds are decided on one system and don’t quite fit another system.
since you were able to reduce “the time by a quarter”, maybe you’d like to contribute a special case implementation… that is, if you can beat the wavelet thing at 5x5
I may tidy the and post it here. I admit sorting networks are not very flexible and fit the “small custom kernel” stereotype. But data heavy and code light techniques seems to be a good fit for Cuda.
(Here small are is probably under 9x9.)
p.s. No doubt the minute I post it, someone will point out how wrong it is. I am too old to expect less.
dropping it here won’t do much good, except to get it “out there” at all.
if you think it’s a good idea to incorporate into the library, you could open an issue for discussion leading to a pull request, or jump right to the PR.
if the code for a 5x5 median filter requires 20k of code, maybe I misunderstood what you did there.
I implemented filters for 2x2 to 9x9 median filters, for the 9x9 that’s 700 swaps for the 3x3 that’s 20 swaps. The time to do the all swaps is very fast under cuda as all the cores are always in parallel (but each one doing lots swaps for each pixel). 5x5 is about 100 swaps. These swaps form a sorting network and it is these which take up the lines of code. In actual space, they are byte table, so to do a 9x9 is about a 1.5k table of bytes.
To find the median of four numbers:-
int data [4] = { what ever }
const uint8_t median_network4[] = {
0,2, 1,3, 0,1, 2,3, 1,2, 255 }; // <-- these explode in size as inputs increases
uint * sort = median_network4; // fixed code for all sizes, but iterations increase
while( *sort != 255 )
{
int tmp = data[ *sort ];
data[ *sort ] = max (data[ *sort ], data[ *(sort+1) ] );
data[ *(sort+1) ] = min(tmp, data[ *(sort+1) ] );
sort +=2;
}
int median = (data[2] + data[1])/2;