I would like to ask which pre-processing techniques are best suited to the characters I am trying to read. I am using pytesseract for character recognition, but sometimes my characters are not recognized properly.
Here is one of the images I am using (I have more, but I am not allowed to upload them):
The most common issues are:
- 5 get recognized as S (but not vice versa)
- S gets recognized as O (but not vice versa)
- / gets recognized as I
I have tried multiple techniques, but whenever one technique fixes an issue, another issue pops up. The character recognition works most of the time, but it is not consistent; I would say ~80%. I can take a picture, do the processing, and the recognition works; then I take a new picture under the same conditions and the recognition fails. It seems the result can flip on nothing more than the normal noise between shots.
I believe that a large part of the issue is that the font is bold. For example, I noticed that the wider the / is, the more likely it is to be recognized as I. I have tried cv2.resize(fx=2, fy=2) followed by cv2.erode(), but then, for some reason, I noticed that the thicker the 5 is, the less likely it is to be recognized as S. At the same time, if the characters are thicker, or if I lower the binarization threshold, the hole in the 4 gets filled in, which causes problems of its own.
I cannot change the font. I have tried taking pictures at various exposures, but nothing seems to fix the core of the issue. This is the best focus I am able to obtain. I cannot whitelist certain symbols, because both letters and numbers are possible. I do not want to do .replace('SX', '5X'), because the point of the check is to validate that the label has been printed correctly.
Techniques I have tried:
- Regular binarization
- OTSU binarization
- Adaptive thresholding
- Resize + erode()
- Upscaling the image with cv2.dnn_superres: somewhat better, but too slow, because I have a lot of images to process
- Histogram equalization before any of the above
NOTE: I am able to get a solution that works for the sample images, but not one that stays consistent when the images vary slightly; I cannot get it to work 100% of the time.
Can someone explain how you would go about cleaning up these images?