I have two images, shown below. Tesseract reads A.png perfectly, but its accuracy on B.png is terrible even though the two images look similar. How can I improve the accuracy? I have no idea where to start debugging.

  • A.png

enter image description here

  • B.png

enter image description here

  • Run OCR
# tesseract -v
tesseract 4.1.1-rc2-22-g08899
# tesseract A.png stdout -l jpn --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
第 3 期 決算 公告 令 和 2 年 2 月 7 日
大 阪 市 中 央 区 南 新町 一 丁目 3 番 10 号
株 式 会 社 Link_Mobile

代表 取締 役 佐々 木 勉

貸借 対照 表 の 要旨 (平成 31 年 3 月 31 日 現在 }
# tesseract B.png stdout -l jpn --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
。 人 加計
区 三 6 番 12 号
中 野 駅 前 ビル 5 | 、
am 人 mw
に て
貸借 対照 表 の 要旨 ( 令 和 元 年 11 月 30 日 現在 }

Update 1

Were both scanned using the same scanner, and at the same resolution?

Yes. Both images were cropped from the same PDF.

Are you taking advantage of any APIs which Tesseract exposes for pre-processing the images before doing OCR?

No. I did not know about that; I am looking into it now.

  • Can you tell us more about these two images? Were both scanned using the same scanner, and at the same resolution? Are you taking advantage of any APIs which Tesseract exposes for pre-processing the images before doing OCR? Commented Feb 21, 2020 at 11:46
  • @TimBiegeleisen Hi. "Yes same resolution" and "No, I don't use it". I did not know the API. I am checking now. Commented Feb 21, 2020 at 11:53
  • You should be using it. Some scans won't generate any output unless the image be cleaned up. Commented Feb 21, 2020 at 11:59
  • @TimBiegeleisen Will try it and post the result! Commented Feb 21, 2020 at 12:00
  • @TimBiegeleisen Rescaling image worked on improving! Commented Feb 21, 2020 at 15:14

1 Answer

Accuracy improved after I read the Tesseract documentation and rescaled the image. From its "Rescaling" section:

Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images. For more information see the FAQ.
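The answer does not say which tool or scale factor was used, so the following is only a minimal sketch of one way to do the rescaling, using Pillow (a third-party library). The helper names `scale_factor`, `scaled_size`, and `rescale_image` are hypothetical. The 70 dpi starting point is an assumption taken from Tesseract's fallback warning above, and the image's true resolution may differ.

```python
def scale_factor(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Return the multiplier that takes an image from current_dpi to target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return target_dpi / current_dpi


def scaled_size(width: int, height: int, factor: float) -> tuple:
    """Return the (width, height) in pixels after scaling, never below 1 px."""
    return (max(1, round(width * factor)), max(1, round(height * factor)))


def rescale_image(src_path: str, dst_path: str, factor: float,
                  target_dpi: int = 300) -> None:
    """Resize the image at src_path by factor and save it to dst_path.

    Assumes Pillow is installed (pip install Pillow). The DPI written to
    the output file is metadata only; the resize is what adds pixels.
    """
    from PIL import Image  # imported here so the helpers above need no dependency

    with Image.open(src_path) as img:
        new_size = scaled_size(img.width, img.height, factor)
        # LANCZOS is a high-quality filter; other filters may suit some scans better.
        resized = img.resize(new_size, resample=Image.LANCZOS)
        resized.save(dst_path, dpi=(target_dpi, target_dpi))
```

Under these assumptions, `rescale_image("B.png", "B2.png", scale_factor(70))` would write an enlarged B2.png that can be passed to `tesseract` as before. The best factor is likely to depend on the size of the characters in the image, so it may be worth trying a few values.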

  • Rescaled image

enter image description here

  • Run OCR
# tesseract B2.png stdout -l jpn --psm 6
第 54 期 決 算 公 告 _ 令 和 2 年 1 月 29 日
東京 都 中 野 区 中 野 三 丁目 36 番 12 号
中 野 駅 前 ビル 5 F
株 式 会 社 コ ー エ ー テ クニ カ
代表 取締 役 小 空 _ 修
貸借 対照 表 の 要旨 ( 令 和 元 年 11 月 30 日 現在 )