Improving the success rate of PyTesseract OCR output

It seems the resolution of the image affects whether the OCR output succeeds or not.

Usually the resolution/quality of images from the production line is like test image 1. Instead of changing the camera, is there any way to make the success rate higher? For example, improving the code, or adding a simple AI to help with detection? I need a hand, thanks.

The demo .py code I found in a tutorial:

from PIL import Image
import pytesseract

# Open the test image and run Tesseract's English model on it.
img = Image.open('new_003.png')
text = pytesseract.image_to_string(img, lang='eng')

print("size")
print(img.size)
print(text)

  • (pic) test image 1: the original image from the production line
size
(122, 119)

# the output is:
R carac7

  • (pic) test image 2: I increased the image size/resolution
size
(329, 249)

# the output is:
R1 oun,
2A
R ca7ac7

  • (pic) test image 3: test1 — ImgBB
    This one is just for testing, but it is the only one that comes out 100% correct.
size
(640, 640)

# the output is:
BREAKING THE STATUE

i have always known

i just didn't understand
the inner conflictions
arresting our hands
gravitating close enough
expansive distamce between
i couldn't give you more
but i meant everything
when the day comes

you find your heart

wants something more

than a viece and a part
your life will change
like astatue set free

to walk among us

to created estiny

we didn't break any rules
we didn't make mistakes
making beauty in loving
making lovine for days

SHILOW

I tried to find out/prove whether the only solution is increasing the image resolution, or whether there is some alternative way to solve this issue.
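For example, from what I have read, preprocessing the image before handing it to Tesseract (upscaling plus binarising) is a common suggestion. A minimal sketch of that idea using OpenCV (the scale factor and threshold choice are guesses on my side, not tested values):

import cv2
import pytesseract

# Load in grayscale, upscale, and binarise before running OCR.
img = cv2.imread('new_003.png', cv2.IMREAD_GRAYSCALE)

# Upscale ~4x; cubic interpolation keeps glyph edges reasonably smooth.
img = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)

# Otsu's method picks the black/white threshold automatically.
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# pytesseract accepts numpy arrays directly.
print(pytesseract.image_to_string(img, lang='eng'))

Is something like this the right direction, or is the low resolution a hard limit?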

Welcome.

This is the forum for OpenCV questions. You haven't argued why your problem involves OpenCV; I think your problem is off-topic for this forum because I only see PIL and Tesseract in your question. Perhaps Stack Overflow is a better place to get help.

Thanks for your advice. I read through (this topic): Improving Quality of PyTesseract OCR Output, and then decided to post this discussion here; I thought OCR technology was related to OpenCV.

If you have a known font (or fonts), then you can roll your own non-AI solution, which is what I did to cope with the AutoCAD font and Tesseract being quite poor at it.

Basically, get the bounding boxes for the letters and build a library of TIF or PNG files, one per letter/digit. Extract each letter as its own file, all normalised to a known size.
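In OpenCV terms, that library-building step could look roughly like this. This is a sketch of the idea, not the original code; the 32x32 cell size, the Otsu thresholding, and the file names are my own placeholder choices:

import cv2

CELL = (32, 32)  # normalised glyph size; an arbitrary choice

def extract_glyphs(image_path):
    """Find letter bounding boxes and return each glyph normalised to CELL,
    in left-to-right order."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarise with glyphs white on black so findContours sees them directly.
    _, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    glyphs = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        glyph = bw[y:y + h, x:x + w]
        # Pad to a square so resizing does not distort the aspect ratio.
        side = max(w, h)
        canvas = cv2.copyMakeBorder(glyph,
                                    (side - h) // 2, (side - h + 1) // 2,
                                    (side - w) // 2, (side - w + 1) // 2,
                                    cv2.BORDER_CONSTANT, value=0)
        glyphs.append((x, cv2.resize(canvas, CELL, interpolation=cv2.INTER_NEAREST)))
    glyphs.sort(key=lambda g: g[0])  # reading order
    return [g for _, g in glyphs]

# One reference image per known character builds the library, e.g.
# templates = {'A': extract_glyphs('ref_A.png')[0], ...}  (file names hypothetical)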

Do the same again for the document you want to OCR: for each letter, normalise to the same size as before (you will need to add borders to some to make them the same size for the XOR), convert to black and white, then XOR the letter you are guessing against all the possible characters you saved earlier, and for each one use a function to count the white (or black, depending on how you have done it) pixels.

Generally, the lowest pixel count wins.
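The XOR-and-count step itself is only a few lines. A sketch, assuming the normalised black-and-white glyphs from the previous snippet and a dict mapping each known character to its template:

import cv2

def match_glyph(glyph, templates):
    """Return the character whose template differs from `glyph` in the
    fewest pixels. `templates` maps character -> normalised B+W image."""
    best_char, best_score = None, None
    for char, tmpl in templates.items():
        diff = cv2.bitwise_xor(glyph, tmpl)  # white wherever the two disagree
        score = cv2.countNonZero(diff)       # count the disagreeing pixels
        if best_score is None or score < best_score:
            best_char, best_score = char, score
    return best_char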

If you have noisy (e.g. scanned) images, you might need to hand-craft a few special cases, e.g. “B” and “8”.

Just run the same approach to differentiate these cases, but on a relevant sub-area of the character, e.g. for B vs. 8 the bottom-left ninth or sixteenth of the character image; then you should get enough difference to tell them apart.
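As a concrete illustration of that tie-break, a hypothetical helper reusing match_glyph from above (the exact sub-region is a judgment call, as noted):

def disambiguate_B_8(glyph, templates):
    """Re-score only the bottom-left ninth of the glyph, where 'B' and '8'
    differ most, and let the XOR count decide between the two."""
    h, w = glyph.shape
    region = (slice(2 * h // 3, h), slice(0, w // 3))  # bottom-left ninth
    sub_templates = {c: templates[c][region] for c in ('B', '8')}
    return match_glyph(glyph[region], sub_templates)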

I was applying this to fairly low-quality engineering drawings with the stupid AutoCAD font, which was designed for a type of printer no longer in use (that is why it is stupid: stupid to use, mostly), and I ended up with spectacular results, well above 99.9% correct, and I wasn't limited to a particular document resolution (as you are with some AI approaches), and so on.

Tesseract is the first quick and easy answer to many problems, but I don't think it is ever the best answer for most of them. It's amazing for what it is, but if it doesn't meet your needs after messing with it for a while, there seems to be no path to custom training or similar.

So, as I said, if you have fonts and they are somewhat limited in variation, this is what worked very well for me. It's also not as slow as it sounds: the amount of data is small and it seems to stay in the cache after the first pass. I used only OpenCV for the imported libraries.

If you have handwriting or a bunch of different fonts, then you've probably already more or less maxed out, apart from some fine-tuning.