How to identify and erase dotted lines or convert them to solid lines?

Rohan_Sharma · June 28, 2021, 6:12am

Images sent to OCR often contain dotted lines that the engine misinterprets as text. How can we identify the dotted lines specifically and replace them with either blank space or solid lines?

berak · June 28, 2021, 6:21am

Please show what you have tried so far (code) and an example image. Thank you.

Rohan_Sharma · June 29, 2021, 12:43pm

Hey, thank you. Let me share a link to the exact problem:

github.com/tesseract-ocr/tesseract

Inaccurate OCR results for lines with many dots

opened 05:15PM - 06 May 20 UTC

IdiosApps

### Environment

* **Tesseract Version**: 4.1.1
* **Platform**: MacOS Catalina… 10.15.4

### Current Behavior:

Few issues with `tesseract -l eng --psm 1` on this image:

![ocr_input](https://user-images.githubusercontent.com/12559216/81207359-5bd16400-8fc5-11ea-8c54-d636b9e4eafe.png)

Some dots on lines ignored: `Cheese on Pasta...`

Some dots on lines have strange letters, and are incorrectly capitalised: `SAUCE ON PASTA... cccceces cece cesses ces seeses cesses c`

Numbers on both sides become strange text: `OW AMNAURWNP`, `©M~MURDUNBWNHE`

Here's the full PSM 1 text:

```
MY PASTA RESTAURANT DISHES Cheese on Pasta... Cheesy Spaghetti... OW AMNAURWNP SPAghe ttn... .ccccececcseseeceeseeceeceesoesee ses sessee ses eesseseesaesaesees eeeesees es SAUCE ON PASTA... cccceces cece cesses ces seeses cesses ces seesescaeseeeeseuueeces anes Mega value cheese o on some Spicy Sauce... esate eeeees Fresh and Tasty handmade assortments (FATS) cscscsocne ANTIP ASTI... eee cee cee cee coe coeese ses couse ses cesses aes cesses caeses ses caeeaescaeeees FreSh SIAW......ccescsseecee cesses coecas cesses see see cusses sue ses aecaecas ces case eenaee sees NOOC1@S... 20. .s. cesses co cesses see cuecoe ces cusses cou cesses ace sue seseas cuecaeens ease senses es ©M~MURDUNBWNHE
```

`tesseract -l eng --psm 12` is better:

- no random capitalisation, apart from numbers on both sides turning into capitalised words.
- 3 of 9 lines have a single ellipsis, the rest have no dots.

Here's the full PSM 12 text:

```
MY PASTA RESTAURANT DISHES Spaghetti Sauce on Pasta Cheese on Pasta... Cheesy Spaghetti... Mega value cheese o on some Spicy Sauce... Fresh and Tasty handmade assortments (FATS) Antipasti Fresh slaw WON DUN BWHN PR Noodles OMAN DY BWN PR
```

### Expected Behavior:

Expect dots to be OCR'd as dots, text output to look like text on input image.

This link contains the exact issue.

I have tried the code from the following threads:

How to convert dashed lines to solid? - OpenCV Q&A Forum

Rohan_Sharma · June 29, 2021, 12:44pm

python - Detect dotted (broken) lines only in an image using OpenCV - Stack Overflow

Some code that I don’t understand:

Remove horizontal dashed lines - ImageMagick

Rohan_Sharma · June 29, 2021, 12:45pm

Remove the dotted line in OCR - Programmer Sought
