Dear all, I am developing an adaptive image preprocessing pipeline for Tesseract OCR. It works quite well across very different documents and lighting conditions.
However, I found that for some images the optimal kernel size used for cv2.morphologyEx() is 3x3, for others 5x5 or even 7x7.
Is anyone aware of an algorithm that determines the optimal kernel size (and iteration count) for cv2.morphologyEx() (or erode() + dilate()) in advance? Or afterwards, but strictly before OCR, since OCR is a very expensive operation.
I am thinking of something similar to cv2.adaptiveThreshold(), or to autoCanny(), which derives the lower and upper thresholds from the np.median() of the image.
What do you consider “optimal”?
By an “optimal” set of parameters (kernel size etc.) I mean the set for which the OCR text output for a given input image contains the fewest errors.
I use Tesseract, which has a dozen quality criteria for the input image: contrast, vertical resolution, margins, low noise, etc. All of these are fine in my case.
However, depending on the input, the strokes of the letters are sometimes just a bit too thin (lines not always continuous) and sometimes too thick (neighbouring letters touch each other).
Erode/dilate can solve these problems, but I am looking for a solution that can choose this set of parameters specifically for each image.
If speed is not an issue, you can run 2n+1 predictions with different kernel sizes and accept the result that is predicted most often. This is a form of test-time augmentation.
Also, if I remember correctly, Tesseract works best with black text on a white background. If that is not the case for you, you can binarize the image with an Otsu threshold and count the black and white pixels; if the black pixels are in the majority, invert the image using cv2.bitwise_not().
Since Tesseract takes several seconds to finish, almost any OpenCV function can be regarded as ‘fast’ by comparison.
Yes, I can run a couple of dilations with different kernel sizes. But then how can I tell whether the strokes of the letters have the ‘proper’ thickness, that is, not too thin (still continuous) and not too thick (not touching each other)? The problem is that all the letters are small, so searching for too many small islands would not work, and erode would erase valuable parts of the letters.