Best way to detect containers and OCR in imperfect real-life images?

Hello, this is my first post. I am trying to use OpenCV in Python to prepare images for OCR with Tesseract. I have been struggling for more than two weeks to get something working and must now admit that I am in over my head. I am hoping someone can offer general advice on how to approach this; my whole approach is likely flawed, and that’s probably why I’m having so much trouble.

Here are some details about the application:

  • The images will all be of shipping container doors. The containers could be any color, and the photos could be taken at any time of day or night and at any angle (within reason).
  • There will be other text in the image, and there may be other shipping containers. I only want to OCR text that is on a shipping container, and specifically the container in the center of the frame. It may be next to other containers of a different color or the same color.

Here is the general outline of how I have been trying to do this:

  1. Sample the color at the center of the image and use it to apply a filter that reveals the outline of the container (a rough sketch of this step follows the list).
  2. De-skew & scale the detected outline to the known dimensions of a shipping container door (8.0 ft x 8.5 ft).
  3. Isolate known regions of interest on the de-skewed image, apply filtering operations to produce OCR-able text, and perform OCR with Tesseract.
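
To make step 1 concrete, here is roughly what I have been doing (a simplified sketch; the filename and the HSV tolerances are placeholders, and in practice I keep re-tuning them):

```python
import cv2
import numpy as np

img = cv2.imread('container.jpg')          # placeholder filename
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# sample the color at the center of the frame
h, w = hsv.shape[:2]
center = hsv[h // 2, w // 2].astype(int)

# keep everything within a tolerance of that sampled color
tol = np.array([10, 60, 60])               # H, S, V tolerances (hand-tuned)
lower = np.clip(center - tol, 0, 255).astype(np.uint8)
upper = np.clip(center + tol, 0, 255).astype(np.uint8)
mask = cv2.inRange(hsv, lower, upper)      # white blob ~ the container
```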

De-skewing & scaling is no problem. If I manually specify the corners of the container in the image I get a good, OCR-able result.
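
For reference, this is the de-skew step with hand-picked corners (a sketch; the corner coordinates and output size here are made up):

```python
import cv2
import numpy as np

img = cv2.imread('container.jpg')  # placeholder filename

# corners picked by hand: top-left, top-right, bottom-right, bottom-left
corners = np.float32([[412, 120], [980, 160], [955, 820], [390, 790]])

# map to the door's known 8.0 ft x 8.5 ft aspect ratio, e.g. 800 x 850 px
out_w, out_h = 800, 850
target = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])

M = cv2.getPerspectiveTransform(corners, target)
rectified = cv2.warpPerspective(img, M, (out_w, out_h))
```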

Step 2 is what is giving me so much trouble. Using a color filter I can (usually) get a mask with quite a respectable amorphous blob of white pixels indicating right where the container is. But getting the coordinates of the corners of an amorphous blob has so far confounded every technique I know to attempt; one of those attempts is sketched below.
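
For example, here is the obvious contour-based route I tried: fit a quad to the biggest blob in the mask. On a ragged blob, neither variant lands on the container’s actual corners (a sketch; the mask filename is a placeholder):

```python
import cv2

# "mask" would be the binary mask from the color-filter step
mask = cv2.imread('mask.png', cv2.IMREAD_GRAYSCALE)  # placeholder filename

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
blob = max(contours, key=cv2.contourArea)

# attempt A: rotated bounding box (ignores perspective entirely)
box = cv2.boxPoints(cv2.minAreaRect(blob))

# attempt B: polygon simplification (corner count depends on epsilon)
eps = 0.02 * cv2.arcLength(blob, True)
quad = cv2.approxPolyDP(blob, eps, True)
```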

Step 1 also gives me trouble to an extent. If the container is very close in color to other things around it, my color-mask approach is no good.

I’ve spent hours upon hours watching tutorial videos, following tutorials found in blogs, and so forth, but the common theme among them is that they all use pristine example images from the start:
- Detect a white paper receipt on a black table: easy.
- Detect an airplane silhouetted against a perfect blue sky, using a sample airplane cropped from the very same image: easy.
- OCR black text typed on a white background in MS Paint: easy.
I see no examples of real-world applications like detecting a blue sign at an angle against a blue sky, or OCR-ing spray-stenciled text on a wooden pallet with surface imperfections.

So how far off base am I? Is the color filtering thing a waste of time? What approach would you take for this?

let me be rude here and ask, before I read a single word of that, whether a picture would help get your point across?

okay, OCR… chuck tesseract. that is ancient crusty trash. better stuff existed 10-20 years ago. it’s only popular because it was the first open source OCR and now the name stuck in everyone’s head and nobody even considers anything modern. try “easyocr” or any other library.

secondly, proper OCR, in the age of convolutional neural networks, does not need nor want any preprocessing. any preprocessing you could do will harm recognition results. you’d feed the grayscale/color image to the OCR directly, as is.
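
for example, with easyocr it’s about three lines (minimal sketch; it downloads its models on first run, and the filename is a placeholder):

```python
import easyocr

reader = easyocr.Reader(['en'])              # loads detection + recognition models
results = reader.readtext('container.jpg')   # placeholder filename
for bbox, text, confidence in results:       # one tuple per detected string
    print(text, confidence)
```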

ah, welcome back.

use a neural network. don’t mess around with anything simpler, that’ll never lead to usable results.

depending on the goals, you need something to infer the corners of the door (for getting a rectified view of it), and something to infer the locations and classifications of text glyphs (think “object detection” but for glyphs). this could be approached like an instance segmentation task too, to get words/strings as instance masks, so you don’t have to spend too much work figuring out where spaces belong between glyphs.

crackwitz
let me be rude here and ask, before I read a single word of that, whether a picture would help get your point across?

Yes, it really would, but

strantor
Hello, this is my first post.

And I thought I wasn’t allowed any attachments. But I see now that this site treats pictures differently than attachments, and I get one of those per post, so please forgive the punctuated postings to follow.

I’m trying to get from here:

to here:

(ignore the weird colors; the image is the same as above, but I forgot to convert it from BGR to RGB)

So that I can OCR (with something other than Tesseract, thanks!)

crackwitz
use a neural network. don’t mess around with anything simpler, that’ll never lead to usable results.

I have zero experience with neural networks. I have peeked at some tensorflow tutorials and it seemed to me like a rabbit hole inside a rabbit hole when I’m already down too many rabbit holes. So I’ve been purposefully avoiding it, trying to force simpler methods to work. But I’ll take your advice and dig deeper. Is tensorflow the way to go or do you have a better recommendation?

Note that above is one of my better results with the color filtering.
Below is a poorer result.

Typical result is somewhere in the middle, but closer to the poor example.

I could tweak the parameters to get a better result, but then they wouldn’t work for other images. Using this color filter will mean either automating the tweaking process (which I’ve spent considerable time on, with little success) or getting corner detection to work on a bad mask like the one I just posted (which I’ve also spent considerable time on).

So the neural network solution does feel like the way to go, since (if I understand correctly) it will look at the images more the way a human does, and recognize a container because it’s a container, rather than because it’s a “sorta dark blue-ish color (mostly (or not)).” But I will admit I am intimidated, and already suffering mental wounds from this project. If there is an idiot-friendly way to approach neural networks, I would greatly prefer it.

there’s pytorch, keras, and tensorflow. and then there are some libraries nobody uses anymore. people use whatever’s handy. pytorch seems more favored by people exploring/researching, tensorflow more favored for engineering/deployment/I don’t know.

I think you could get into deep learning with not too much effort. you’ll want to progress to object detection networks.

the “hello world” of deep learning is a few fully connected layers trained on the MNIST data set. there are a few more examples that show more complex networks. you’ll want to explore (fully) convolutional networks.
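
for instance, that whole “hello world” fits in a screenful. a sketch in pytorch (torchvision ships MNIST; two epochs is just to show the training loop):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# MNIST: 60k grayscale 28x28 images of handwritten digits
train_data = datasets.MNIST('data', train=True, download=True,
                            transform=transforms.ToTensor())
loader = DataLoader(train_data, batch_size=64, shuffle=True)

model = nn.Sequential(
    nn.Flatten(),            # 28x28 image -> 784-dim vector
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 10),      # 10 digit classes
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2):
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
```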

object detection networks infer the bounding boxes of things. that’s encoded as the position and size of the box, in various mostly interchangeable ways, but it comes down to four scalars per detection.

what you would want is to have it infer eight scalars, i.e. the four points (8 coordinates) on the corners of the face of a container.

perhaps that will be as jittery as the bounding boxes of usual object detection networks… in that case, you’d want the network to emit a response map instead (kinda like semantic segmentation), and then find local maxima. I’d recommend four channels per detection (one per corner), and only a couple of detections in total, so you don’t have to guess which peak belongs to which detection.
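
a toy version of the eight-scalar idea, as a sketch (pytorch; the tiny backbone here is made up, any real feature extractor would do in its place):

```python
import torch
import torch.nn as nn

# toy corner-regression network: image in -> 8 numbers out
# (x, y for each of the 4 door corners, normalized to [0, 1])
class CornerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(       # stand-in feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 8)         # 4 corners x 2 coordinates

    def forward(self, x):
        f = self.backbone(x).flatten(1)
        return torch.sigmoid(self.head(f))   # keep outputs in [0, 1]

net = CornerNet()
dummy = torch.randn(1, 3, 256, 256)
print(net(dummy).shape)  # torch.Size([1, 8])
```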

you’d need annotated data. you can easily create synthetic/augmented data though. just have a bunch of frontal views and map them to various perspective views… perhaps get someone to write you a little python+pyglet script that renders views of a container (box) with selectable textures, or multiple stacked containers, under a somewhat sensible skybox. for that synthetic data, you know the corners, so no more annotating.
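
the flat-2D version of that is one random warp per sample (a sketch; the filename and jitter amount are made up):

```python
import cv2
import numpy as np

# one frontal door photo in, many labeled perspective views out
img = cv2.imread('frontal.jpg')                     # placeholder filename
h, w = img.shape[:2]
src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])  # frontal corners

def random_view(jitter=0.2):
    # push each corner around by up to `jitter` of the image size,
    # clamped so the corners stay inside the canvas
    d = np.random.uniform(-jitter, jitter, (4, 2)) * np.array([w, h])
    dst = np.float32(np.clip(src + d, 0, [w - 1, h - 1]))
    M = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(img, M, (w, h))
    return warped, dst   # dst = exact ground-truth corner labels, for free

sample, corners = random_view()
```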


@crackwitz Thank you for the pointers. I’m going to take the plunge. I think I’ll start with pytorch since that’s what easyOCR uses.