
How to Extract Text from Images with Python?

Last Updated : 04 Oct, 2025

OCR (Optical Character Recognition) is a technique used to convert text from images into editable and searchable digital text. For example, you can scan a printed page and turn it into editable text on your computer. In this article, we’ll use Python and the pytesseract library to extract text from images.

Installation

To enable OCR in Python, we use the pytesseract library:

pip install pytesseract

Note: pytesseract is only a Python wrapper, so the Tesseract OCR engine itself must also be installed. On Windows, this means installing the tesseract.exe binary. During installation, you’ll choose (or be given) an install path. Commonly it’s:

C:\Program Files\Tesseract-OCR\tesseract.exe

or

C:\Users\<username>\AppData\Local\Programs\Tesseract-OCR\tesseract.exe

Make sure to update your code with the correct path based on your system.
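
Rather than hard-coding the path, one option is to look for the executable at run time. The snippet below is a minimal sketch, not part of pytesseract: the helper name find_tesseract and the DEFAULT_CANDIDATES list are hypothetical, and the second candidate assumes that %LOCALAPPDATA% points to your AppData\Local folder. It checks the system PATH first and then falls back to the common Windows locations listed above.

```python
import os
import shutil
from typing import Iterable, Optional

# Common Windows install locations (assumed; adjust for your system).
DEFAULT_CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    os.path.expandvars(r"%LOCALAPPDATA%\Programs\Tesseract-OCR\tesseract.exe"),
]


def find_tesseract(candidates: Iterable[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return a path to the Tesseract executable, or None if it is not found."""
    # Prefer whatever is already available on the system PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise try each candidate location in order.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found; install it or set the path manually.")
else:
    print(f"Using Tesseract at: {tesseract_path}")
    # With pytesseract imported, the result could then be assigned:
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

If the helper returns None, the path still has to be set by hand as described above.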

Steps to Extract Text from Images

1. Import required libraries

from PIL import Image
import pytesseract

2. Set the path to the Tesseract executable (only needed if it is not on your system PATH)

pytesseract.pytesseract.tesseract_cmd = r"C:\Users\<username>\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"

3. Open the image using PIL:

image = Image.open("example_image.png")

4. Convert the image to grayscale, which can improve OCR accuracy on some images:

gray_image = image.convert("L")

5. Extract text using pytesseract:

text = pytesseract.image_to_string(gray_image)

6. Clean the extracted text by removing the form feed character (\x0c) that Tesseract appends as a page separator, along with surrounding whitespace:

clean_text = text.replace("\x0c", "").strip()
print(clean_text)
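
If the output needs more tidying than a single replace(), the clean-up step can be wrapped in a small function. This is a minimal sketch under stated assumptions: the function name clean_ocr_text is hypothetical, and collapsing runs of blank lines is an extra choice that the steps above do not make. It uses only the standard library, so it can be tried on any string.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR output: drop form feeds, trailing spaces, extra blank lines."""
    # Remove the form feed character used as a page separator.
    text = raw.replace("\x0c", "")
    # Strip trailing whitespace from every line.
    lines = [line.rstrip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Collapse three or more consecutive newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim leading and trailing whitespace from the whole result.
    return text.strip()


sample = "Hello  \n\n\n\nWorld\n\x0c"
print(repr(clean_ocr_text(sample)))  # 'Hello\n\nWorld'
```

In the steps above, this function would take the place of text.replace("\x0c", "").strip().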

Examples

Example 1:

Image for demonstration:

An image of white text with black background

Code:

Python
from PIL import Image
import pytesseract

# Path to tesseract.exe (update if different on your computer)
pytesseract.pytesseract.tesseract_cmd = r"C:\Users\gfg0753\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"

# Open the image
img = Image.open("sample_text.png")

# Convert to grayscale (makes it easier for OCR)
img = img.convert("L")

# Extract text from the image
text = pytesseract.image_to_string(img)

# Remove extra characters and print the text
print(text.replace("\x0c", "").strip())

Output

now children state should after above same long made such
point run take call together few being would walk give
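
The demonstration image in this example is white text on a black background. Tesseract's own guidance suggests that dark text on a light background tends to be recognized more reliably, so inverting such an image before OCR may help when results are poor. The sketch below is an assumption-laden illustration, not part of the example above: prepare_for_ocr is a hypothetical helper, and whether inversion actually improves the result depends on the image and the Tesseract version. It uses Pillow's ImageOps.invert on the grayscale image.

```python
from PIL import Image, ImageOps


def prepare_for_ocr(image: Image.Image, invert: bool = False) -> Image.Image:
    """Convert an image to grayscale and optionally invert light and dark."""
    # Grayscale ("L" mode) keeps one brightness channel per pixel.
    gray = image.convert("L")
    if invert:
        # Swap light and dark so white-on-black text becomes black-on-white.
        gray = ImageOps.invert(gray)
    return gray


# Small self-contained check using a solid black image instead of a real file.
black = Image.new("RGB", (4, 4), (0, 0, 0))
prepared = prepare_for_ocr(black, invert=True)
print(prepared.mode, prepared.getpixel((0, 0)))  # L 255
```

With a real file, the call would look like prepare_for_ocr(Image.open("sample_text.png"), invert=True), and the result can be passed to pytesseract.image_to_string in place of the grayscale image.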

Example 2:

Image for demonstration:

Code:

Python
from PIL import Image
import pytesseract

# Correct path to tesseract.exe on your computer
pytesseract.pytesseract.tesseract_cmd = r"C:\Users\gfg0753\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"

# Path to the image
image_path = r"d.jpg"

# Open the image and convert it to grayscale
img = Image.open(image_path).convert("L")

# Extract text from the image
text = pytesseract.image_to_string(img)

# Clean up unwanted characters and print result
print(text.replace("\x0c", "").strip())

Output

Geeksforgeeks

