OpenCV Python preprocessing strategies for OCR (pytesseract) character recognition

I would like to ask which pre-processing techniques work best for the characters I need to read. I am using pytesseract for character recognition, but sometimes my characters are not recognized properly.

Here is one of the images I am using (I have more, but I am not allowed to upload them):

The most common issues are:

  • 5 get recognized as S (but not vice versa)
  • S gets recognized as O (but not vice versa)
  • / gets recognized as I

I have tried multiple techniques, but whenever one technique fixes an issue, another issue pops up. The character recognition works most of the time, but it is not consistent: I would say ~80%. I can take a picture, run the processing, and the recognition works; then I take a new picture under the same conditions and the recognition fails. It seems that the recognition result is within the tolerance of the noise.
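To put a number on that ~80% and see which confusions dominate, the recognized strings can be tallied against the expected label text over many captures. This is only a minimal sketch: `tally_confusions` is a hypothetical helper, the sample strings are made up for illustration, and it assumes the expected text of each label is known.

```python
from collections import Counter

def tally_confusions(pairs):
    """Summarize OCR results given (expected, recognized) string pairs.

    Returns (exact_match_rate, confusions), where confusions counts
    (expected_char, recognized_char) substitutions. Results whose length
    differs from the expected text are counted under a separate key,
    because characters cannot be aligned position by position.
    """
    total = 0
    exact = 0
    confusions = Counter()
    for expected, recognized in pairs:
        total += 1
        if expected == recognized:
            exact += 1
            continue
        if len(expected) == len(recognized):
            for e, r in zip(expected, recognized):
                if e != r:
                    confusions[(e, r)] += 1
        else:
            confusions[("<length>", "<length>")] += 1
    rate = exact / total if total else 0.0
    return rate, confusions

# Hypothetical results from repeated captures of labels
results = [("A5/1", "A5/1"), ("A5/1", "AS/1"), ("A5/1", "A5I1")]
rate, confusions = tally_confusions(results)
print(rate, confusions.most_common())
```

Running every pre-processing variant through the same tally over the same set of pictures makes it easier to tell a real improvement from noise.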

I believe that a large part of the issue is that the font is bold. For example, I noticed that the wider the / is, the more likely it is to be recognized as I. I have tried cv2.resize(fx=2, fy=2) + cv2.erode(), but then for some reason I noticed that the thicker the 5 is, the less likely it is to be recognized as S. At the same time, if the characters are thicker, or if I reduce the threshold in binarization, the hole in the 4 gets filled in and causes problems.

I cannot change the font. I have tried taking pictures at various exposures, but nothing seems to fix the core of the issue. This is the best focus I am able to obtain. I cannot whitelist certain symbols, because both letters and numbers are possible. I do not want to do .replace('SX', '5X'), because the point of the check is to validate that the label has been printed correctly.

Techniques I have tried:

  • Regular binarization
  • OTSU binarization
  • Adaptive thresholding
  • Resize + erode()
  • Upscaling the image with cv2.dnn_superres: somewhat better, but too slow, because I have a lot of images to process
  • Histogram equalization before any of the above

NOTE: I am able to find a solution for the sample images, but I am unable to get a consistent solution when the images vary slightly. I cannot get it to work 100% of the time.

Can someone explain how you would go about cleaning up these images?

none.

and use anything other than tesseract. it’s ancient and obsolete technology. modern OCR wants you to not mess with the picture.

and that’s a crosspost: