Improving the success rate of PyTesseract OCR output

It seems the resolution of the image affects whether the OCR output succeeds or not.

Usually the resolution/quality of images from the production line is like test image 1. Instead of changing the camera, is there any way to make the success rate higher? For example, improving the code, or adding a simple AI to help with detection? I need a hand, thanks.

The demo .py code I found in a tutorial:

from PIL import Image
import pytesseract

# Open the test image and run Tesseract's English model on it.
img = Image.open('new_003.png')
text = pytesseract.image_to_string(img, lang='eng')

print("size")
print(img.size)
print(text)

  • (pic) test image 1: the original image from the production line
size
(122, 119)

# the output is:
R carac7

  • (pic) test image 2: I increased the image size/resolution
size
(329, 249)

# the output is:
R1 oun,
2A
R ca7ac7

  • (pic) test image 3: test1 — ImgBB
    This one is just for testing, but it is the only one that comes out 100% correct.
size
(640, 640)

# the output is:
BREAKING THE STATUE

i have always known

i just didn't understand
the inner conflictions
arresting our hands
gravitating close enough
expansive distamce between
i couldn't give you more
but i meant everything
when the day comes

you find your heart

wants something more

than a viece and a part
your life will change
like astatue set free

to walk among us

to created estiny

we didn't break any rules
we didn't make mistakes
making beauty in loving
making lovine for days

SHILOW

I tried to find out/prove whether the only solution is increasing the image resolution, or whether there is some alternative way to solve this issue.
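For example, from what I have read, preprocessing the image before handing it to Tesseract (upscaling plus binarising) is a common suggestion. A minimal sketch of that idea using OpenCV (the scale factor and threshold choice are guesses on my side, not tested values):

import cv2
import pytesseract

# Load in grayscale, upscale, and binarise before running OCR.
img = cv2.imread('new_003.png', cv2.IMREAD_GRAYSCALE)

# Upscale ~4x; cubic interpolation keeps glyph edges reasonably smooth.
img = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)

# Otsu's method picks the black/white threshold automatically.
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# pytesseract accepts numpy arrays directly.
print(pytesseract.image_to_string(img, lang='eng'))

Is something like this the right direction, or is the low resolution a hard limit?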

Welcome.

This is the forum for OpenCV questions. You haven't argued why your problem involves OpenCV; I think your problem is off-topic for this forum because I only see PIL and Tesseract in your question. Perhaps Stack Overflow is a better place to get help.

Thanks for your advice. I read through (this topic): Improving Quality of PyTesseract OCR Output, and then decided to post this discussion here; I thought OCR technology was related to OpenCV.

If you have a known font (or fonts), then you can roll your own non-AI solution, which is what I did to cope with the AutoCAD font and Tesseract being quite poor at it.

Basically, get the bounding boxes for the letters and build a library of TIF or PNG files, one per letter/digit. Extract each letter as its own file, all normalised to a known size.
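In OpenCV terms, that library-building step could look roughly like this. This is a sketch of the idea, not the original code; the 32x32 cell size, the Otsu thresholding, and the file names are my own placeholder choices:

import cv2

CELL = (32, 32)  # normalised glyph size; an arbitrary choice

def extract_glyphs(image_path):
    """Find letter bounding boxes and return each glyph normalised to CELL,
    in left-to-right order."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarise with glyphs white on black so findContours sees them directly.
    _, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    glyphs = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        glyph = bw[y:y + h, x:x + w]
        # Pad to a square so resizing does not distort the aspect ratio.
        side = max(w, h)
        canvas = cv2.copyMakeBorder(glyph,
                                    (side - h) // 2, (side - h + 1) // 2,
                                    (side - w) // 2, (side - w + 1) // 2,
                                    cv2.BORDER_CONSTANT, value=0)
        glyphs.append((x, cv2.resize(canvas, CELL, interpolation=cv2.INTER_NEAREST)))
    glyphs.sort(key=lambda g: g[0])  # reading order
    return [g for _, g in glyphs]

# One reference image per known character builds the library, e.g.
# templates = {'A': extract_glyphs('ref_A.png')[0], ...}  (file names hypothetical)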

Do the same again for the document you want to OCR: for each letter, normalise to the same size as before (you will need to add borders to some to make them the same size for the XOR), convert to black and white, then XOR the letter you are guessing against all the possible characters you saved earlier, and for each one use a function to count the white (or black, depending on how you have done it) pixels.

Generally, the lowest pixel count wins.
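The XOR-and-count step itself is only a few lines. A sketch, assuming the normalised black-and-white glyphs from the previous snippet and a dict mapping each known character to its template:

import cv2

def match_glyph(glyph, templates):
    """Return the character whose template differs from `glyph` in the
    fewest pixels. `templates` maps character -> normalised B+W image."""
    best_char, best_score = None, None
    for char, tmpl in templates.items():
        diff = cv2.bitwise_xor(glyph, tmpl)  # white wherever the two disagree
        score = cv2.countNonZero(diff)       # count the disagreeing pixels
        if best_score is None or score < best_score:
            best_char, best_score = char, score
    return best_char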

If you have noisy (e.g. scanned) images, you might need to hand-craft a few special cases, e.g. “B” and “8”.

Just run the same approach to differentiate these cases, but on a relevant sub-area of the character, e.g. for B vs. 8 the bottom-left ninth or sixteenth of the character image; then you should get enough difference to tell them apart.
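As a concrete illustration of that tie-break, a hypothetical helper reusing match_glyph from above (the exact sub-region is a judgment call, as noted):

def disambiguate_B_8(glyph, templates):
    """Re-score only the bottom-left ninth of the glyph, where 'B' and '8'
    differ most, and let the XOR count decide between the two."""
    h, w = glyph.shape
    region = (slice(2 * h // 3, h), slice(0, w // 3))  # bottom-left ninth
    sub_templates = {c: templates[c][region] for c in ('B', '8')}
    return match_glyph(glyph[region], sub_templates)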

I was applying this to fairly low-quality engineering drawings with the stupid AutoCAD font, which was designed for a type of printer no longer in use (that is why it is stupid: stupid to use, mostly), and I ended up with spectacular results, well above 99.9% correct, and I wasn't limited to a particular document resolution (as you are with some AI approaches), and so on.

Tesseract is the first quick and easy answer to many problems, but I don't think it is ever the best answer for most of them. It's amazing for what it is, but if it doesn't meet your needs after messing with it for a while, there seems to be no path to custom training or similar.

So, as I said, if you have fonts and they are somewhat limited in variation, this is what worked very well for me. It's also not as slow as it sounds: the amount of data is small and it seems to stay in the cache after the first pass. I used only OpenCV for the imported libraries.

If you have handwriting or a bunch of different fonts, then you've probably already more or less maxed out, apart from some fine-tuning.