pytesseract using tesseract 4.0 numbers only not working

Question

Has anyone managed to get numbers-only output when calling the latest version of Tesseract (4.0) from Python?

The code below worked in 3.05 but still returns letters in 4.0. I tried removing all config files except the digits file, and it still didn't work. Any help would be great.

im is an image of a date, black text on a white background:

import pytesseract

im = imageOfDate  # placeholder for the loaded image of the date
text = pytesseract.image_to_string(im, config='outputbase digits')
print(text)

Add an image to the question so answerers can see your problem.
– thewaywewere, Commented Oct 8, 2017 at 13:03
I went with stackoverflow.com/questions/9413216/… instead.
– Cees Timmerman, Commented Jun 7, 2019 at 10:01
@CuriousGeorge: Did you find a solution to your upgrade problem?
– Jarl, Commented Sep 30, 2019 at 15:52
Upgrading to v4.1.1 did not fully fix this for me. I also had to download the tessdata_fast version of the traineddata files. I am attaching a detailed shell script to install 4.1.1 from source.
– Aritra Roy Gosthipaty, Commented Jun 16, 2021 at 13:11

thewaywewere · Accepted Answer · 2018-06-27 16:24:45Z

17

You can pass the digits via tessedit_char_whitelist as a config option, as shown below.

# boxes=False is omitted: newer pytesseract releases no longer accept that argument
ocr_result = pytesseract.image_to_string(
    image, lang='eng',
    config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

Hope this helps.
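For experimenting with this option, the config string can be assembled by a small helper. This is only a sketch: build_ocr_config and ocr_with_whitelist are hypothetical names, not part of pytesseract. Note that --psm 10 tells Tesseract to treat the image as a single character; for a one-line date, --psm 7 (a single text line) may fit better, which is an assumption about the image rather than something stated in the question. As the comments on this answer point out, the whitelist is reported to be ignored by Tesseract 4.0.

```python
def build_ocr_config(whitelist="0123456789", psm=10, oem=3):
    """Build a Tesseract config string that restricts output to `whitelist`.

    Hypothetical helper. The whitelist must not contain whitespace or quotes,
    because the config string is split into command-line arguments.
    """
    if any(ch.isspace() or ch in "'\"" for ch in whitelist):
        raise ValueError("whitelist must not contain whitespace or quotes")
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={whitelist}"


def ocr_with_whitelist(image, whitelist="0123456789", psm=7):
    """Run OCR on `image` (PIL image or NumPy array) with a character whitelist."""
    import pytesseract  # imported lazily so build_ocr_config works without it
    return pytesseract.image_to_string(
        image, config=build_ocr_config(whitelist, psm=psm))
```

Calling build_ocr_config() with no arguments reproduces the config string used in the answer above.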

edited Jun 27, 2018 at 16:24

answered Oct 5, 2017 at 15:38

thewaywewere


3 Comments

Jakub Mendyk Over a year ago

This solution doesn't work for tesseract 4.0+. There's an open issue related to this on GitHub: github.com/tesseract-ocr/tesseract/issues/751.

Dmitrii Z. Over a year ago

As Jakub mentioned, it won't work with 4.0. Instead, there is a separate tessdata file for digits.

Alaa M. Over a year ago

I'm looking for OCR that recognizes times, e.g. 11:25. Adding a colon (:) to the whitelist didn't work. Any ideas?

Robert Harris · Accepted Answer · 2019-03-06 19:31:27Z

11

Using the tessedit_char_whitelist flag with pytesseract did not work for me. One workaround is to use a flag that does work, config='digits':

import pytesseract
text = pytesseract.image_to_string(pixels, config='digits')

where pixels is a NumPy array of your image (a PIL image should also work). This should force pytesseract to return only digits. To customize what it returns, find your digits configuration file. On Windows, mine was located here:

C:\Program Files (x86)\Tesseract-OCR\tessdata\configs

Open the digits file and add whatever characters you want. After saving and running pytesseract, it should return only those customized characters.
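If you would rather script that edit, something like the following could work. It is a minimal sketch that assumes the digits file contains a line of the form `tessedit_char_whitelist 0123456789-.` (parameter name, a space, then the allowed characters), so check your own file first. add_to_whitelist is a hypothetical helper name, and the path in the usage comment combines the directory quoted above with the digits file name.

```python
def add_to_whitelist(config_text, extra_chars):
    """Return config_text with extra_chars appended to the whitelist line.

    Assumes the whitelist line looks like 'tessedit_char_whitelist <chars>'.
    Characters that are already whitelisted are not added a second time.
    """
    key = "tessedit_char_whitelist"
    out_lines = []
    found = False
    for line in config_text.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0] == key:
            current = parts[1].strip() if len(parts) > 1 else ""
            # dict.fromkeys removes duplicates while keeping the given order
            new_chars = "".join(
                ch for ch in dict.fromkeys(extra_chars) if ch not in current)
            line = f"{key} {current}{new_chars}"
            found = True
        out_lines.append(line)
    if not found:
        raise ValueError(f"no {key} line found")
    return "\n".join(out_lines) + "\n"


# Usage (back up the file first):
# path = r"C:\Program Files (x86)\Tesseract-OCR\tessdata\configs\digits"
# with open(path) as f:
#     text = f.read()
# with open(path, "w") as f:
#     f.write(add_to_whitelist(text, "abcdefg"))
```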

answered Mar 6, 2019 at 19:31

Robert Harris


5 Comments

Yaroslav Dukal Over a year ago

What if I need text and digits?

Robert Harris Over a year ago

You can put both text and digits in the digits config file. For example, you could put '1234567890abcdefg...' and it will only return those alphanumeric characters.

Ganesh Kharad Over a year ago

Which version are you using? The config='digits' method doesn't work for me; I'm using pytesseract==0.3.0.

Ammar H Sufyan Over a year ago

Works with the latest tesseract as of 2020

ircham Over a year ago

config='digits' only whitelists the numeric characters out of alphanumeric input. How can I make it treat the image as purely numeric instead of alphanumeric? For example, read l as 1 instead of L. Any ideas?

Jason Aller · Accepted Answer · 2020-06-02 22:34:48Z

5

You can specify the numbers in the tessedit_char_whitelist as below as a config option.

ocr_result = pytesseract.image_to_string(image, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

edited Jun 2, 2020 at 22:34

Jason Aller


answered Jun 2, 2020 at 21:35

Tejesh Teju



mhellmeier · Accepted Answer · 2020-03-29 21:24:52Z

2

As you can see in this GitHub issue, the blacklist and whitelist options don't work with Tesseract version 4.0.

There are 3 possible solutions for this problem, as I described in this blog article:

1. Update Tesseract to version > 4.1.
2. Use the legacy mode as described in the answer from @thewaywewere.
3. Create a Python function which uses a simple regex to extract all numbers:

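import re
import pytesseract

def replace_chars(text):
    """Keep only the digits found in the OCR output."""
    list_of_numbers = re.findall(r'\d+', text)
    result_number = ''.join(list_of_numbers)
    return result_number

# im is the input image, as in the question
result_number = replace_chars(pytesseract.image_to_string(im))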
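To choose between these options at runtime, one could inspect the installed Tesseract version first. The sketch below relies only on what this thread reports (the whitelist worked in 3.05, is ignored in 4.0, and works again after upgrading to 4.1.1), so treat the version boundary as an assumption. whitelist_expected_to_work is a hypothetical helper; pytesseract.get_tesseract_version() is the pytesseract call that reports the installed version.

```python
import re


def whitelist_expected_to_work(version):
    """Guess whether tessedit_char_whitelist is honoured by this Tesseract version.

    `version` is a string (or object whose str() is) such as '4.1.1'.
    Assumption taken from this thread: only the 4.0.x releases ignore it.
    """
    match = re.match(r"(\d+)\.(\d+)", str(version))
    if match is None:
        raise ValueError(f"unrecognised version: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) != (4, 0)


# Usage (requires pytesseract and a Tesseract install):
# import pytesseract
# if not whitelist_expected_to_work(pytesseract.get_tesseract_version()):
#     ...  # fall back to the regex filter, or upgrade Tesseract
```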

answered Mar 29, 2020 at 21:24

mhellmeier


2 Comments

Doğuş Over a year ago

Thanks! Updating to version 4.1.1 from source has solved the problem. github.com/tesseract-ocr/tesseract/releases

Phyo Arkar Lwin Over a year ago

Bad solution. This filters out text after it has been detected, which is totally the wrong way to do it.
