OpenCV Python preprocessing strategies for OCR (pytesseract) character recognition

Jokubas_Ziziunas · January 3, 2025, 10:53pm

I would like to ask which pre-processing techniques work best for the characters I need to read. I am using pytesseract for character recognition, but sometimes my characters are not recognized properly.

Here is an image I am using. I work with more images, but I am not allowed to upload more than one:

The most common issues are:

5 gets recognized as S (but not vice versa)
S gets recognized as O (but not vice versa)
/ gets recognized as I
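
One way to keep track of these confusions without rewriting the OCR output is to compare each result against the label text that was supposed to be printed and report where they differ. The following is only a rough sketch under that assumption (that the expected string is known at check time). The function name, the report format, and the `CONFUSIONS` set are hypothetical; only the three confusion pairs come from the list above.

```python
# Pairs are (printed character, what OCR returned), taken from the list above.
# They are one-directional, matching the "but not vice versa" observations.
CONFUSIONS = {("5", "S"), ("S", "O"), ("/", "I")}


def classify_mismatches(expected: str, recognized: str):
    """Compare an expected label with OCR output, position by position.

    Returns a list of tuples. Nothing is corrected: the caller decides
    whether to re-capture the image or reject the label.
    """
    if len(expected) != len(recognized):
        # Lengths differ, so a positional comparison would be misleading.
        return [("length", len(expected), len(recognized))]

    report = []
    for i, (want, got) in enumerate(zip(expected, recognized)):
        if want == got:
            continue
        kind = "known_confusion" if (want, got) in CONFUSIONS else "mismatch"
        report.append((kind, i, want, got))
    return report


report = classify_mismatches("5X/A", "SXIA")
```

A result that contains only `known_confusion` entries could, for example, trigger a second capture instead of an immediate failure, while a plain `mismatch` would still fail the label check.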

I have tried multiple techniques, but when one technique fixes an issue, another issue pops up. The character recognition works most of the time, but it is not consistent; I would say it succeeds ~80% of the time. I can take a picture, process it, and recognition works, then take a new picture under the same conditions and recognition fails. It seems the result is sensitive to noise between otherwise identical captures.

I believe a large part of the issue is that the font is bold. For example, I noticed that the wider the / is, the more likely it is to be recognized as I. I have tried cv2.resize(fx=2, fy=2) + cv2.erode(), but then I noticed that, for some reason, the thicker the 5 is, the less likely it is to be recognized as S. At the same time, if the characters are thicker, or if I reduce the binarization threshold, the hole in the 4 gets filled in, which causes problems of its own.

I cannot change the font. I have tried taking pictures at various exposures, but nothing seems to fix the core of the issue. This is the best focus I am able to obtain. I cannot whitelist certain symbols, because both letters and numbers are possible. I do not want to do .replace('SX', '5X'), because the point of the check is to validate that the label has been printed correctly.

Techniques I have tried:

Regular binarization
Otsu binarization
Adaptive thresholding
Resize + erode()
Upscale image with cv2.dnn_superres, kinda better, but too slow, because I have a lot of images to process
Histogram equalization before any of the above

NOTE: I am able to find a solution that works for the sample images, but I am unable to find one that stays consistent when the images vary slightly. I cannot get it to work 100% of the time.

Can someone explain how you would go about cleaning up these images?

crackwitz · January 4, 2025, 11:18am

none.

and use anything other than tesseract. it’s ancient and obsolete technology. modern OCR wants you to not mess with the picture.

and that’s a crosspost:
