Recognizing a grid of letters from a phone display

Hello everybody!

I want to build myself a program to solve a HASHTAG riddle. It is like Wordle and Mastermind wrapped into one, and then you solve four of them together in a grid.

This is what it looks like:

Solving the puzzle is done already. If I enter the 16 letters and colors, my code solves the puzzle in a few seconds, referencing a dictionary to find legal words. By the way, the riddle shown is in German, should you try to solve it yourself!

Of course I do not want to enter that information manually. I want to hold my phone in front of my laptop camera and see the magic. This is where OpenCV enters the stage.

The picture above was taken by said laptop camera with these lines of code (this is C#):

private VideoCapture _capture;
_capture = new VideoCapture(0);
_capture.Set(VideoCaptureProperties.FrameWidth, 2048);
_capture.Set(VideoCaptureProperties.FrameHeight, 1536);
and
public Mat frame = new Mat();
_capture.Read(frame);

I then process the Mat object using

Cv2.CvtColor(input, gray, ColorConversionCodes.BGR2GRAY);
Cv2.GaussianBlur(gray, gray, new Size(3, 3), 0);

I build thresholds using

Cv2.AdaptiveThreshold(gray, thresh, 255, AdaptiveThresholdTypes.GaussianC, ThresholdTypes.BinaryInv, 11, 2);

I can provide pictures of the gray blurred image and the thresholded one, yet as a new user I am allowed only one media item per post. I might reply to myself or to your answers with those other pictures.

Finally, I run FindContours and ApproxPolyDP and keep only 4-corner shapes that have an area of more than 500 pixels and an almost square aspect ratio (between 0.75 and 1.25).

        // Find the outermost contours in the thresholded image.
        var contours = new Point[][] { };
        HierarchyIndex[] hierarchy;
        Cv2.FindContours(thresh, out contours, out hierarchy, RetrievalModes.External, ContourApproximationModes.ApproxSimple);

        var candidates = new List<Rect>();
        foreach (var contour in contours) {
            // Simplify the contour to a polygon; the tolerance is 2% of its perimeter.
            var approx = Cv2.ApproxPolyDP(contour, 0.02 * Cv2.ArcLength(contour, true), true);
            // Keep quadrilaterals with an area of more than 500 pixels.
            if (approx.Length == 4 && Cv2.ContourArea(approx) > 500) {
                var rect = Cv2.BoundingRect(approx);
                // Keep only bounding boxes that are roughly square.
                float ratio = (float)rect.Width / rect.Height;
                if (ratio > 0.75 && ratio < 1.25)
                    candidates.Add(rect);
            }
        }

This works somewhat. At best, I got 12 of the 16 boxes recognized. With the example picture it caught only one of them; see the coloured picture with the green rectangle.

Don’t get me wrong - I am amazed at what OpenCV delivers here with me giving it just 10 commands. I am stunned and shocked how powerful this is.

Now I need to point out that the code above was 99.9% created by ChatGPT. I do have a good understanding of what it is doing, yet I am confident there are better options available in OpenCV that give better results.

Can you help me find those? What I thought of so far:

  • Increase resolution of webcam picture (if possible, need to check hardware)
  • Lighting correction and colour signal amplification of the webcam picture
  • Fiddling with the parameters of the Gaussian blur or the threshold detection
  • Rotating or perspective transformation to get a flatter smartphone screen reading
  • Different methods of image processing between the steps I have so far
  • Run the complete image processing workflow continuously rather than once on a single picture, and merge the results: for example the top 8 boxes on the first try, the left 5 on the second, nothing on the third and fourth, the right 8 on the fifth, and so on
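
The last idea in the list, merging results over several frames, could be sketched roughly like this. This is only a minimal sketch in plain C#: `Box` and `BoxAccumulator` are hypothetical names (not part of OpenCV or OpenCvSharp), `Box` stands in for the `Rect` values collected in `candidates`, and it assumes the phone is held still enough that a box keeps roughly the same pixel position from frame to frame.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an OpenCvSharp Rect (x, y, width, height in pixels).
public readonly struct Box
{
    public Box(int x, int y, int width, int height)
    {
        X = x; Y = y; Width = width; Height = height;
    }
    public int X { get; }
    public int Y { get; }
    public int Width { get; }
    public int Height { get; }
    public double CenterX => X + Width / 2.0;
    public double CenterY => Y + Height / 2.0;
}

// Hypothetical helper: collects box candidates over several frames.
public sealed class BoxAccumulator
{
    private readonly double _maxCenterDistance;
    private readonly List<(Box Item, int Hits)> _tracked = new List<(Box Item, int Hits)>();

    public BoxAccumulator(double maxCenterDistance)
    {
        _maxCenterDistance = maxCenterDistance;
    }

    // Add the candidates of one frame. A candidate whose center lies close
    // to an already known box counts as another sighting of that box.
    public void AddFrame(IEnumerable<Box> candidates)
    {
        foreach (var c in candidates)
        {
            int index = _tracked.FindIndex(t => Distance(t.Item, c) <= _maxCenterDistance);
            if (index >= 0)
                _tracked[index] = (_tracked[index].Item, _tracked[index].Hits + 1);
            else
                _tracked.Add((c, 1));
        }
    }

    // Boxes that were seen in at least minHits frames.
    public List<Box> Stable(int minHits)
    {
        return _tracked.Where(t => t.Hits >= minHits).Select(t => t.Item).ToList();
    }

    private static double Distance(Box a, Box b)
    {
        double dx = a.CenterX - b.CenterX;
        double dy = a.CenterY - b.CenterY;
        return Math.Sqrt(dx * dx + dy * dy);
    }
}
```

Calling `AddFrame` once per captured frame and stopping as soon as `Stable(2)` returns 16 boxes would be one way to use it; the distance threshold and the minimum number of sightings are guesses that would need tuning against the real camera.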

The threshold picture looks so promising to me: it clearly shows the 16 squares and the gaps between them. This is so close. I am confident this must be possible, right?

Thanks for any and all input - I love playing around with this toolkit!

In reply to myself, this is what I have after AdaptiveThreshold and before FindContours:

yeah forget all of that.

throw OCR at it directly, or at least a text detection model. opencv comes with a text detection model. they call it “EAST” for some reason.

it should give you bounding boxes, not necessarily recognitions yet. it should give you boxes for individual letters, not trying to group them into words.

then you only have to figure out how the detection boxes match to the expected grid. grid recovery from a set of detections takes a bit of programming. you’d need to determine the grid spacing, then the major axes. I have some ideas for that, didn’t try them out yet. grid recovery has applications in other situations, e.g. camera calibration. one of these days I might be moved to actually get that done as a proof of concept, maybe even contribute it to opencv (I hate C++).
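
As a rough illustration of the grid recovery step, here is a simplified sketch in plain C#. It is not the approach hinted at above, and it skips the "major axes" part entirely: it assumes the grid is already axis-aligned in the image (for example after a perspective correction), that rows and columns have the same spacing, and that at least one detection lies in the top row and one in the left column. `GridRecovery`, `EstimateSpacing` and `AssignCells` are hypothetical names. The inputs are the centers of the detection boxes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of OpenCV.
public static class GridRecovery
{
    // Estimate the grid spacing as the median of the distances from each
    // center to its nearest neighbour. Needs at least two centers.
    public static double EstimateSpacing(IReadOnlyList<(double X, double Y)> centers)
    {
        var nearest = new List<double>();
        for (int i = 0; i < centers.Count; i++)
        {
            double best = double.MaxValue;
            for (int j = 0; j < centers.Count; j++)
            {
                if (i == j) continue;
                double dx = centers[i].X - centers[j].X;
                double dy = centers[i].Y - centers[j].Y;
                best = Math.Min(best, Math.Sqrt(dx * dx + dy * dy));
            }
            nearest.Add(best);
        }
        nearest.Sort();
        return nearest[nearest.Count / 2];
    }

    // Map each center to a (row, column) index by rounding its offset from
    // the smallest x and y values to whole multiples of the spacing.
    // Assumption: the top row and the left column each contain a detection,
    // otherwise all indices are shifted.
    public static List<(int Row, int Col)> AssignCells(IReadOnlyList<(double X, double Y)> centers)
    {
        double spacing = EstimateSpacing(centers);
        double minX = centers.Min(c => c.X);
        double minY = centers.Min(c => c.Y);
        return centers
            .Select(c => ((int)Math.Round((c.Y - minY) / spacing),
                          (int)Math.Round((c.X - minX) / spacing)))
            .ToList();
    }
}
```

With only some of the 16 cells detected, the returned indices show which grid positions are still missing, as long as the detected neighbours are dense enough for the median spacing to be right.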