How to split joined characters in a bitmap image using EmguCV 2.9.0 (C#)?

Hello,

I am working on a preprocessing step for OCR where I need to split joined/connected characters inside a bitmap image.

  • Goal:
    If two characters (like IL, TI, or AL) are joined in the image, I want to separate them correctly and get a clean, final preprocessed image.

  • Current Setup:

    • Language: C#

    • Library: EmguCV 2.9.0 (wrapper of OpenCV)

    • Input: Bitmap image containing characters

  • What I Tried:

    • Used contour detection to segment characters

    • Applied thresholding + connected components

    • Tried to split wide contours into separate parts

But the issue is that some characters still remain joined (e.g., “TI” and “V2” each appear as one contour).
Below I attached a sample image.

teach_2.jpeg
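For reference, here is a simplified sketch of my current segmentation step, using the EmguCV 2.x `Image<Gray, byte>` API (the threshold value and the width/height ratio are placeholder values from my experiments, not tuned parameters):

```csharp
using System;
using System.Drawing;
using Emgu.CV;
using Emgu.CV.CvEnum;
using Emgu.CV.Structure;

class CharacterSegmenter
{
    // Threshold the input and collect one bounding box per outer contour.
    // Joined characters ("TI", "V2", ...) come back as a single box.
    static void Segment(Bitmap bitmap)
    {
        Image<Gray, byte> gray = new Image<Bgr, byte>(bitmap).Convert<Gray, byte>();
        Image<Gray, byte> binary = gray.ThresholdBinaryInv(new Gray(128), new Gray(255));

        for (Contour<Point> c = binary.FindContours(
                 CHAIN_APPROX_METHOD.CV_CHAIN_APPROX_SIMPLE,
                 RETR_TYPE.CV_RETR_EXTERNAL);
             c != null; c = c.HNext)
        {
            Rectangle box = c.BoundingRectangle;
            // Heuristic: boxes much wider than tall are probably two glyphs,
            // but a naive vertical cut often lands in the wrong place.
            if (box.Width > box.Height * 1.5)
                Console.WriteLine("Suspected joined characters at " + box);
        }
    }
}
```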

Question:
What is the recommended way in EmguCV 2.9.0 (C#) to detect and split these joined/connected characters properly?
Do I need to use projection profiles, watershed segmentation, or some other method?
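To make that concrete, this is what I understand a projection-profile split to be: count the foreground pixels per column and cut at the valleys. A minimal sketch (the valley threshold is a made-up parameter):

```csharp
using System.Collections.Generic;
using Emgu.CV;
using Emgu.CV.Structure;

static class ProjectionProfile
{
    // Returns candidate cut columns: x positions where the column-wise
    // count of foreground (non-zero) pixels drops below a small threshold.
    public static List<int> FindCuts(Image<Gray, byte> binary, int valleyThreshold)
    {
        var cuts = new List<int>();
        byte[, ,] data = binary.Data;                 // indexed [row, col, channel]
        for (int x = 0; x < binary.Width; x++)
        {
            int count = 0;
            for (int y = 0; y < binary.Height; y++)
                if (data[y, x, 0] != 0) count++;
            if (count <= valleyThreshold) cuts.Add(x);
        }
        return cuts;
    }
}
```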

Any solution or example code in C# (EmguCV 2.9.0) will be very helpful.

Thank you!

That is what they did 50 years ago. Pardon the frankness: I’d hate to ruin your plan, but no hand-crafted mechanistic ruleset can hack your text to bits while maintaining its readability. You need an ML model that can (and is allowed to!) digest strictly more than tiny fragments of the picture. Give the model a wider field of view: don’t feed it patches, feed it the entire image, and make the network fully convolutional. That’s just my idea of how OCR should work, though.

The 50-year-old approach shouldn’t be abandoned only because of the issue you’re working on. It should also be abandoned because connected components cannot handle “disjoint” letters: letters with dots above or anywhere around them, or whatever other shapes they might take. The entire connected-components approach was a crutch from the beginning; ever since we’ve had effective CNNs, it’s only been getting in the way.

If you want to stick to the shredding approach, with a hobbled model that can only see a keyhole view of the entire input… then at least abandon the needless shredding and give the model the entire connected component. It will have to learn all kinds of accidental “ligatures”, i.e. it will have to infer/emit not just single characters but multiple ones, or else you’ll need postprocessing to turn those symbols/values back into multiple characters. Example: “V2” becomes a “ligature”, “M3” another… and the model learns to recognize them as if they were regular letters.
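The postprocessing step is then trivial. A minimal sketch, assuming a hypothetical classifier that emits string labels (the label names here are made up):

```csharp
using System.Collections.Generic;
using System.Text;

static class LigatureDecoder
{
    // Hypothetical label set: single characters plus "ligature" classes
    // the model emits for blobs that are really two characters.
    static readonly Dictionary<string, string> LabelToText =
        new Dictionary<string, string>
    {
        { "T", "T" },
        { "I", "I" },
        { "LIG_V2", "V2" },  // the joined "V2" blob, learned as one class
        { "LIG_M3", "M3" }   // likewise for "M3"
    };

    public static string Decode(IEnumerable<string> predictedLabels)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string label in predictedLabels)
        {
            string text;
            sb.Append(LabelToText.TryGetValue(label, out text) ? text : "?");
        }
        return sb.ToString();
    }
}
```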

If you have any kind of image resizing in the pipeline that feeds the model, you have to make sure it maintains the aspect ratio. If you gave it an "I", a resize that simply stretches the bounding box to a fixed (square?) size would turn that I into a solid black box. Aspect ratio must be maintained.
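Something like this letterbox resize, for example; a minimal sketch using the EmguCV 2.x `Image<Gray, byte>` API (the target size and the white background are placeholder choices):

```csharp
using System;
using System.Drawing;
using Emgu.CV;
using Emgu.CV.CvEnum;
using Emgu.CV.Structure;

static class LetterboxResize
{
    // Scale to fit inside a target x target square while keeping the
    // aspect ratio, then pad the remainder with the background (white).
    public static Image<Gray, byte> Apply(Image<Gray, byte> src, int target)
    {
        double scale = (double)target / Math.Max(src.Width, src.Height);
        int w = Math.Max(1, (int)Math.Round(src.Width * scale));
        int h = Math.Max(1, (int)Math.Round(src.Height * scale));

        Image<Gray, byte> resized = src.Resize(w, h, INTER.CV_INTER_LINEAR);

        Image<Gray, byte> canvas = new Image<Gray, byte>(target, target, new Gray(255));
        canvas.ROI = new Rectangle((target - w) / 2, (target - h) / 2, w, h);
        resized.CopyTo(canvas);    // paste the scaled glyph into the center
        canvas.ROI = Rectangle.Empty;
        return canvas;
    }
}
```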

Really: if you want to do OCR today, don’t follow decades-old principles.

If you want, you can take my post as an admission that I don’t know any sensible way to split connected characters that doesn’t involve machine learning anyway.

Thanks for sharing your perspective. I agree that traditional connected-components and handcrafted splitting approaches have limitations, especially with ligatures, dots, and joined/disjoint characters.

You mentioned moving away from character-level shredding and instead using a fully convolutional ML model that looks at entire words or connected components, possibly treating joined pairs like “V2” or “M3” as new “ligatures.” That sounds interesting, but I’d like to understand the practical steps better.

Could you clarify a few things?

  1. What kind of CNN/fully convolutional architecture would you suggest for this OCR task?

  2. Should the model be trained directly on full text-line images, word images, or entire connected components?

  3. How would you handle the training data — would I need to explicitly label ligatures like “V2” and “M3”, or can the model learn to split them implicitly?

  4. For aspect ratio preservation during resizing: what preprocessing pipeline would you recommend before feeding images to the network?

Your insights would help me a lot in moving from the classic segmentation approach toward something more robust.
