OpenCV Image PreProcessing For Pytesseract

Warren_Lee · February 1, 2022, 11:04am

Hi,

I have a sample image below, and I’m trying to extract its content using pytesseract.

I’ve tried pre-processing it in OpenCV first via:

Convert to grayscale
Apply Gaussian Blurring, then Adaptive Thresholding
Apply Dilation

Using pytesseract, I can extract the text fine apart from the text in the Dividend Period column, because the characters there are too faint.

Are there other pre-processing techniques I can use in OpenCV to increase the font weight of that column, so that I can extract that text to a reasonable degree?

I’d appreciate it if someone could point me in the right direction (even better if you can share sample code). Or is this not possible at all?

crackwitz · February 1, 2022, 12:05pm

the very first thing you need to fix is the compression. that text is blurry due to compression.

Warren_Lee · February 1, 2022, 12:53pm

Hi @crackwitz,

I’ll have to see if I can get a better-quality, higher-resolution version of the scanned image above.

Assuming I can and the section is no longer as blurry, the font is still relatively light compared to others.

What would you recommend in terms of preprocessing for the image above apart from what I’ve already tried?

crackwitz · February 1, 2022, 1:28pm

related: python 3.x - Pytesseract Extract Light Text From Image - Stack Overflow

crackwitz · February 1, 2022, 1:28pm

what I would recommend? subtract a median blur
