OpenCV Image PreProcessing For Pytesseract


I have a sample image below from which I'm trying to extract the content using pytesseract.

I’ve tried pre-processing it in OpenCV first via:

  1. Convert to grayscale
  2. Apply Gaussian Blurring, then Adaptive Thresholding
  3. Apply Dilation

Using pytesseract, I can extract the text fine except for the entries in the Dividend Period column, because the text there is not pronounced enough.

Are there other pre-processing techniques I can use in OpenCV to increase the font weight of that column so that I can extract its text to a reasonable degree?

I'd appreciate it if someone could point me in the right direction (better still if you can share sample code). Or is this not possible at all?

The very first thing you need to fix is the compression. That text is blurry due to compression.

Hi @crackwitz,

I'll have to see if I can get a better-quality, higher-resolution scan of the image above.

Assuming I can, and the section is no longer as blurry, the font is still relatively light compared to the rest of the text.

What would you recommend in terms of preprocessing for the image above apart from what I’ve already tried?

related: python 3.x - Pytesseract Extract Light Text From Image - Stack Overflow

What I would recommend? Subtract a median blur.