Improving Quality of PyTesseract OCR Output

Hello,
I am trying to extract and validate numbers from an LCD display using OpenCV and PyTesseract. For my particular use case, I need as close to 100% accuracy and consistency as possible. After attempting multiple different approaches over the past few days, the best I’ve been able to achieve is around 98% accuracy (across roughly 1000 extractions). In the remaining 2% of cases, the extraction either fails or returns an incorrect character (for instance, confusing “8” and “3” with each other).
I will illustrate my entire process below, but my question is simple:
Is it possible to extract values from an LCD display with ~99.5% or higher accuracy? If so, what is the best approach to doing so?

My entire process thus far:
[image: test_setup_image_1_screenshot]

  1. In this picture, the top left, “Frame,” is the image captured by the VideoCapture object (USB webcam).
  2. The first step is locating the LCD display’s screen. A locateDisplayFunction(), which runs prior to the code shown below, produces the “Frame 0” image. A green outline is drawn on “Frame” to visualize which part was cut out.
  3. From here, a specified ROI is chosen. This ROI is the right side of one of the 8 data cells seen in the first two frames. The ROI is decided based on a hardcoded ID set explicitly in each of my test cases.
  4. After cropping to our specific ROI, a series of operations are performed. The image is converted to grayscale, all pixels except near-white RGBA values are converted to black (to make the image a true black/white image instead of grayscale), the image is enlarged roughly 3x, and a slight blur is applied to smooth the edges since the image has been upscaled so heavily. All of these produce what is seen in “Frame 1” at the top right.
  5. I’ve seen in a few different places that PyTesseract prefers black text on a white background. Thus, “Frame 2” is an inverted copy of “Frame 1.”
  6. Finally, in order to search for text, I look at the contours. However, I want the entire number returned as one whole string rather than each character being extracted individually, so I heavily dilate the text to merge adjacent character contours. This can be seen in “Frame 3” on the bottom right. The rectangle seen in “Frame 1” is drawn around the contour found in “Frame 3,” and the PyTesseract text extraction looks for text inside that rectangle.
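
To make steps 4–6 concrete, here is a condensed sketch of that pipeline (the full function with every intermediate step is attached further down; the kernel sizes, scale factor, and --psm value here are placeholders, and it assumes OpenCV 4’s two-value findContours return):

import cv2
import pytesseract

def sketchPipeline(roi_bgr):
    # step 4: grayscale, keep only near-white pixels, upscale ~3x, light blur to smooth the upscaled edges
    gray = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2GRAY)
    _, bw = cv2.threshold(gray, 250, 255, cv2.THRESH_BINARY)
    bw = cv2.resize(bw, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
    bw = cv2.blur(bw, (2, 2))

    # step 5: inverted copy so the digits are black on a white background for tesseract
    inverted = cv2.bitwise_not(bw)

    # step 6: dilate heavily so adjacent characters merge into one contour, then OCR inside its bounding box
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    merged = cv2.dilate(bw, kernel, iterations=8)
    cnts, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    texts = []
    for cnt in cnts:
        x, y, w, h = cv2.boundingRect(cnt)
        texts.append(pytesseract.image_to_string(inverted[y:y + h, x:x + w], config="--psm 7"))
    return texts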

A generic list of solutions I’ve attempted to improve the output quality & consistency:

  • Limited ambient light by placing entire setup (display and camera) inside of a closed, dark container.
  • Tried all --psm modes (0-13) extensively; --psm 6, 7, and 8 work the best.
  • Scaled ROI up/down according to recommendations from other developers
  • Tested various --dpi values via the config argument of PyTesseract’s image_to_string() function.
  • Captured multiple images in succession and combined them with a bitwise OR operation into a single layered image.
  • Extracted a Pandas DataFrame using PyTesseract’s image_to_data() from multiple images taken in rapid succession, compared the confidence value of each result, and threw out anything below a set threshold X (60-80%, perhaps); a sketch of this follows below.
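
To illustrate that last bullet, a simplified sketch of the confidence filtering (the 60% threshold and the --psm value are placeholders; the DataFrame output type requires pandas to be installed):

import pytesseract

def filterByConfidence(frames, min_conf=60):
    # run image_to_data on each rapidly captured frame and keep only words above the confidence threshold
    accepted = []
    for frame in frames:
        data = pytesseract.image_to_data(frame, config="--psm 7",
                                         output_type=pytesseract.Output.DATAFRAME)
        data = data[(data.conf > min_conf) & data.text.notna()]
        accepted.extend(data.text.astype(str).tolist())
    return accepted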

My processing function, as it currently stands, is attached below. It is important to note that this is the result of attempting dozens of different solutions and applying a whole host of blurs, erosions and dilations, and thresholds; consequently, the code is fairly messy and extensive.
There is also a determineROICoords() function whose code I have not included. It determines which region of the image is focused on: the data is laid out as a 4x2 table, and the function simply selects which cell to look at.
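Roughly, determineROICoords() does something like the following (an illustrative sketch only, not my real code; the exact cell-to-coordinate mapping and margin handling differ):

def determineROICoords(dpid, cell_width, cell_height):
    # illustrative only: map a cell ID (0-7) onto the 4x2 data grid, skipping the top margin row of the 6x2 layout
    row = dpid // 2 + 1                 # +1 skips the top margin row
    col = dpid % 2
    left = col * cell_width
    upper = row * cell_height
    # PIL's crop() expects a (left, upper, right, lower) box
    return (left, upper, left + cell_width, upper + cell_height)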

# function that takes in a frame (img) and extracts text from it
def processImage(img, dpid):
    # access tesseract's installed location on Pi
    pytesseract.pytesseract.tesseract_cmd = r"/usr/bin/tesseract"
    
    # creates an empty list to store all extracted text
    extractedText = []
    
    # creates a list to store all frames returned by the function
    allFrames = [img] # frame 0
        
    # convert numpy array image to PIL image in order to crop + scale it easier
    pil_img = Image.fromarray(img)
    width, height = pil_img.size
    
    # filter all colors except white
    pil_img = pil_img.convert("RGBA")
    pixdata = pil_img.load()
    
    # converts every pixel that isn't near-white (any channel below 250) to black
    for y in range(pil_img.size[1]):
        for x in range(pil_img.size[0]):
            r, g, b, a = pixdata[x, y]
            if r < 250 or g < 250 or b < 250:
                pixdata[x, y] = (0, 0, 0, 255)
            # end of inner if statement
        # end of inner for loop
    # end of outer for loop
    
    # update width & height. The display's 8 data cells form a 4x2 grid with margins above and below, so we treat the layout as a 6x2 grid (rows x cols).
    width_scale = (1/2)
    height_scale = (1/6)
    width = int(width * width_scale)
    height = int(height * height_scale)

    # calls function that crops the image depending on what zone (first parameter) we're looking for. This heavily depends on camera position.
    crop_coords = determineROICoords(dpid, width, height)
    pil_cropped = pil_img.crop(crop_coords)
    
    # resize so that it is larger for display/development purposes. Also helps maintain similar aspect ratio when extracting text.
    resize_w_scale = 1
    resize_h_scale = 3
    scaled_coords = (resize_w_scale * width, resize_h_scale * height)
    pil_cropped = pil_cropped.resize(scaled_coords)
    
    # convert cropped, resized PIL image back into grayscale numpy array (uint8)
    img_cropped = numpy.array(pil_cropped)
    img_cropped = cv2.cvtColor(img_cropped, cv2.COLOR_RGBA2GRAY)  # the PIL image is RGBA, not BGR
    img_cropped = numpy.uint8(img_cropped)
    
    # eliminate any extra white/gray "dots" (noise) scattered throughout
    otsu = cv2.threshold(img_cropped, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    kernel = numpy.ones((2, 1), dtype=numpy.uint8)
    erosion = cv2.erode(otsu, kernel, iterations = 1)
    kernel = numpy.ones((1, 2), dtype=numpy.uint8)
    erosion = cv2.bitwise_or(erosion, cv2.erode(otsu, kernel, iterations = 1))  # union of the two erosions (plain + would wrap around in uint8)
    kernel = numpy.ones((3, 3), dtype=numpy.uint8)
    dilated = cv2.dilate(erosion, kernel, iterations = 1)
    mask = dilated / 255
    img_cropped = otsu * mask
    
    # converts back to numpy array (uint8) again and applies a small blur to blend edges
    img_cropped = numpy.uint8(img_cropped)
    img_cropped = cv2.blur(img_cropped, (2, 2))
    
    # adds an initial simple thresh that makes all light gray values white (eliminates some noise)
    _, img_cropped_thresh = cv2.threshold(img_cropped, 10, 255, cv2.THRESH_BINARY)
    img_cropped = cv2.blur(img_cropped_thresh, (2, 2))
    
    # apply simple threshold to img_cropped_thresh (2nd threshold that is purely black/white split evenly)
    _, thresh = cv2.threshold(img_cropped_thresh, 127, 255, cv2.THRESH_BINARY)    
    gray = img_cropped
    
    # dilate the image to combine adjacent text contours (prevents every single individual contour from creating a new rectangle)
    kernel = numpy.ones((2, 2), numpy.uint8)
    thresh = cv2.dilate(gray, kernel, iterations = 2)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    dilate = cv2.dilate(thresh, kernel, iterations = 8)
    
    # find the contours, highlight the text areas, and extract our ROIs
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if len(cnts) == 2:
        cnts = cnts[0]
    else:
        cnts = cnts[1]
    # end of if statement
    
    # inverts the frame that text extraction runs against (so it's black text on a white background)
    _, thresh = cv2.threshold(thresh, 127, 255, cv2.THRESH_BINARY_INV)
    
    # loop through all of the contours
    for cnt in cnts:
        # get the total area of the current contour
        area = cv2.contourArea(cnt)
        
        # ignore contours below a certain size
        minContourSize = 4000
        if (area > minContourSize):
            # get borders of current contour
            x, y, w, h = cv2.boundingRect(cnt)
            
            # draw a rectangle around the current contour on the thresholded frame (this becomes Frame 1)
            rect = cv2.rectangle(img_cropped_thresh, (x, y), (x + w, y + h), (255, 255, 255), 1)
            
            # targets or sets the ROI so we only extract text from this contour
            roi = thresh[y:y + h, x:x + w]
            
            # extract text. --psm 6 works best; 8, 7, and 3 work decently well
            #text = pytesseract.image_to_string(roi, lang='eng', config="--psm 6 -c tessedit_char_whitelist=0123456789.-N", timeout = 5.0)
            text = pytesseract.image_to_string(roi, lang='eng', config="--psm 7 --dpi 500 -c tessedit_char_whitelist=0123456789.-N", timeout = 5.0)  # testing
            text = text[:-1]   # the [:-1] gets rid of newline character that is added automatically
            if not text:   # image_to_string returns a str, never None, so check for an empty result instead
                processFailedFrame(extractedText, roi)  # calls function which processes failed frames
            else:
                # ensures that resulting text contains alpha-numeric characters
                if (any(c.isalpha() for c in text)) or (any(c.isnumeric() for c in text)):
                    extractedText.append(text)
                else:
                    #print("No alpha-numeric characters were extracted.")
                    processFailedFrame(extractedText, thresh)  # calls function which processes failed frames
                # end of innermost if statement
            # end of inner if statement
        # end of outer if statement
    # end of for loop
    
    # add all frames to list
    allFrames.append(img_cropped_thresh)  # frame 1
    allFrames.append(thresh)  # frame 2
    allFrames.append(dilate)  # frame 3
    
    # return all frames and extracted text
    return allFrames, extractedText
# end of processImage() function

Lastly, for some reason I am not able to extract “0”. My tests have the ~98% success rate stated above, with the exception of “0”: whenever a value of “0” is shown, the extraction fails 100% of the time, without exception.

Any help that can be provided is greatly appreciated. If more information is required, please let me know and I will do my best to get it to you.
Thanks

ok so I looked at the photograph of your screen, and nothing further.

first off: for screenshots, use the “Print Screen” key on your keyboard.

the camera appears to be very low resolution, as evidenced by the obvious blocky pixels in the closeups of those numbers. get the maximum resolution from your camera.
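
something like this, if you're grabbing frames with OpenCV (the 1920x1080 numbers are just an example, check what your camera actually supports):

import cv2

cap = cv2.VideoCapture(0)
# request a higher capture resolution (example values; your camera may support more or less)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
# verify what the driver actually applied
print(cap.get(cv2.CAP_PROP_FRAME_WIDTH), cap.get(cv2.CAP_PROP_FRAME_HEIGHT))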

ambient light can be bothersome but that can be subtracted. reflections are really bothersome. your black box is a very good idea.

the picture is taken at an angle. use “homography” and warpPerspective to get a “topdown” view of the screen, like so (this looks awful because it’s a photo of a screen of a photo of a screen).

[image: warped top-down view of the screen]
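
roughly like this, assuming you pick the four screen corners by hand once (the corner coordinates and output size below are made up):

import cv2
import numpy as np

frame = cv2.imread("lcd_photo.png")   # or the frame from your VideoCapture

# the four LCD corners in the camera frame (top-left, top-right, bottom-right, bottom-left), picked by hand; values made up
src = np.float32([[412, 130], [1180, 150], [1195, 620], [400, 600]])
w, h = 800, 480                        # size of the top-down output
dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

H = cv2.getPerspectiveTransform(src, dst)
topdown = cv2.warpPerspective(frame, H, (w, h))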

the screen always has the same layout, right? so you should manually define the regions containing text, and you always extract the same regions, out of the warped-topdown view.

[image: manually defined text regions on the warped view]
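
and since the layout never changes, the value regions can just be hardcoded against that warped view (coordinates here are made up):

import pytesseract

# (x, y, w, h) of each value cell inside the 800x480 warped view; numbers are made up
cells = {
    0: (410, 60, 180, 70),
    1: (410, 160, 180, 70),
    # ... and so on for the remaining cells
}

x, y, w, h = cells[0]
roi = topdown[y:y + h, x:x + w]   # topdown comes from the warpPerspective sketch above
text = pytesseract.image_to_string(roi, config="--psm 7 -c tessedit_char_whitelist=0123456789.-N")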

Yes, obviously you’d need better photos than what I supplied. My apologies; I’m not sure how I didn’t realize that before posting the question. I have gone through and updated the image with a much clearer screenshot of the test setup.
Thank you for the “homography” idea. I recognized that this was an issue, but only put around 20 minutes into testing solutions for this before moving on. I will go back and attempt to adjust the input image to be more rectangular and not skewed.
As for your question about the layout, yes, it will remain the same. The labels and values will change (although I don’t care about the labels, only the values), but their positions should remain consistent within a very small margin of error. Consequently, manually defining the regions is exactly what I do.