Enjoying some AI: Tackling Text Removal at Scale

As a senior manager, finding time for hands-on experimentation is rare. But last week, I took on a unique challenge that pushed me back into the trenches of machine learning and image processing. The task? Building software that removes specific text objects from thousands of images, similar to watermark removal.


Problem Statement

At first glance, the task appeared straightforward: detect and eliminate specific text objects from images. However, as I delved deeper, it became evident that scaling this solution to thousands of images introduced a multitude of intricate challenges. Ensuring consistency and quality across diverse image contexts required a more nuanced approach than initially anticipated.


First Attempt: Leveraging OCR and OpenCV

My initial strategy involved using Optical Character Recognition (OCR, via Tesseract) to detect the text objects, followed by OpenCV’s inpainting capabilities to remove them. The process was straightforward (a minimal sketch follows the list):

  1. Text Detection with OCR: Use OCR to identify and locate text within each image.
  2. Inpainting with OpenCV: Apply inpainting techniques to fill in the detected text regions, aiming to blend them seamlessly with the surrounding background.
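A minimal sketch of this first attempt, assuming pytesseract and OpenCV are installed (the image path and inpainting parameters are illustrative):

import cv2
import numpy as np
import pytesseract

image = cv2.imread('/path/to/image.jpg')

# Step 1: detect text with Tesseract, returning word-level bounding boxes
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# Build a single-channel mask covering every detected word
mask = np.zeros(image.shape[:2], dtype=np.uint8)
for i, word in enumerate(data['text']):
    if word.strip():  # skip empty detections
        x, y, w, h = (data['left'][i], data['top'][i],
                      data['width'][i], data['height'][i])
        mask[y:y + h, x:x + w] = 255

# Step 2: fill the masked regions with OpenCV's Telea inpainting
result = cv2.inpaint(image, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
cv2.imwrite('/path/to/output.jpg', result)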

However, after several iterations, it became clear that this approach had significant limitations:

  • OCR Limitations: OCR struggled with the variability in text placements, fonts, sizes, and orientations across the vast dataset. This inconsistency led to unreliable text detection, especially in images with complex backgrounds or unconventional text styles.
  • Inpainting Challenges: OpenCV’s inpainting, while effective in controlled scenarios, often failed to produce realistic results. The filled regions sometimes appeared blurred or mismatched with the original background textures, detracting from the overall image quality.

These setbacks highlighted the need for a more robust and scalable solution.

Breaking Down the Problem

Through extensive research and iterative experimentation, I distilled the problem into two primary challenges:

Accurate Text Object Detection:

  1. Diversity in Image Contexts: Images varied widely in terms of background complexity, lighting conditions, and text orientations.
  2. Variability in Text Characteristics: Differences in font styles, sizes, colors, and placements made consistent detection difficult.

Realistic Inpainting Post-Removal:

  1. Seamless Blending: Ensuring that the areas from which text was removed blended naturally with the surrounding pixels.
  2. Avoiding Artifacts: Preventing the emergence of visible traces or a "washed-out" effect that would betray the editing process.

Recognizing these challenges, I opted for the following two-stage approach (a schematic sketch follows the list):

  • Custom AI-Based Detection: Developing a tailored model to accurately detect and localize text objects across the diverse image set.
  • Advanced Inpainting Techniques: Employing a state-of-the-art inpainting model to ensure that removed text regions were filled in with high realism, maintaining the integrity of the original images.
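In code terms, the plan reduces to two stages. A schematic sketch, where the function names are hypothetical placeholders fleshed out in the steps below:

# Schematic two-stage pipeline; detect_text_boxes, boxes_to_mask and inpaint
# are placeholders detailed in Steps 1 and 2 below
def remove_text(image):
    boxes = detect_text_boxes(image)    # Stage 1: custom-trained detector
    mask = boxes_to_mask(image, boxes)  # binary mask over detected text
    return inpaint(image, mask)         # Stage 2: realistic inpainting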


Solution Strategy

Step 1: Selecting YOLOv11 for Precise Object Detection

To achieve accurate text object detection, I selected YOLOv11 (You Only Look Once version 11)—a state-of-the-art object detection model renowned for its speed and precision. YOLOv11 excels in real-time detection scenarios, making it ideal for processing large batches of images efficiently.

Key Components of This Step:

Object Detection Fundamentals:

  • Definition: Object detection involves identifying and locating objects within an image. In this project, the goal was to detect specific text objects that needed removal.
  • Why YOLOv11: Its ability to balance speed and accuracy, coupled with advancements in its latest iteration, made YOLOv11 a suitable choice for handling the diverse and extensive image dataset.

Annotation Process:

  • Purpose: Training YOLOv11 requires a labeled dataset where the target objects (in this case, text markers) are manually outlined.
  • Annotation Tool: I employed Label Studio, a popular annotation tool, to annotate approximately 250 images. Each annotation involved marking the exact position and boundary of the text objects, providing the model with clear examples to learn from (a sample exported label is shown below).
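Label Studio can export these annotations in YOLO format: one .txt file per image, with one line per box in the form class x_center y_center width height, all values normalized to the image dimensions. A hypothetical label file for a single 'marker' box might look like this (the numbers are purely illustrative):

0 0.512 0.087 0.310 0.058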

Dataset Preparation:

  • Splitting Data: The annotated images were divided into training, validation, and testing sets (70% / 15% / 15%) to ensure the model could generalize well to unseen data.
  • Configuration File (data.yaml):

train: /path/to/labels/train/images
val: /path/to/labels/valid/images
test: /path/to/labels/test/images

nc: 1
names: ['marker']  # Annotation label for the text objects
        

This is how my annotation dataset folder looks (the original screenshot isn’t reproduced here):
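A layout consistent with the data.yaml paths above (reconstructed here as an assumption, since the screenshot is omitted):

labels/
├── data.yaml
├── train/
│   ├── images/
│   └── labels/
├── valid/
│   ├── images/
│   └── labels/
└── test/
    ├── images/
    └── labels/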

Model Training

Simply put, here is the training code:

!pip3 install ultralytics

import torch
from ultralytics import YOLO

# Loading a pretrained model
model = YOLO('yolo11m.pt')

# Free up GPU memory before training
torch.cuda.empty_cache()

# Training the model
model.train(data = '/path/to/labels/data.yaml',
            optimizer = 'auto',
            epochs = 20,
            imgsz = 640,
            batch = 8,
            workers = 4)        

The output of this step is the best-performing model checkpoint, located at:

runs/detect/train/weights/best.pt        
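Before moving on, it is worth sanity-checking these weights on the held-out test split. A minimal sketch using Ultralytics' built-in validation (the data.yaml path is the same placeholder as above):

from ultralytics import YOLO

# Load the best checkpoint produced by training
model = YOLO('runs/detect/train/weights/best.pt')

# Evaluate on the test split declared in data.yaml
metrics = model.val(data='/path/to/labels/data.yaml', split='test')
print(metrics.box.map50)  # mAP@0.50 for the single 'marker' class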

Detecting Text Objects and Creating Inpaint Masks

Once the model was trained, the next phase involved using YOLOv11 to detect text objects in new images and create corresponding masks for inpainting. Here's how it was accomplished:

import cv2
import numpy as np
from ultralytics import YOLO

# Loading the best performing model
model = YOLO('runs/detect/train/weights/best.pt')

image = cv2.imread("/path/to/image.jpg")

results = model(image)
detections = results[0].boxes  # Get the bounding boxes from the results

# Create a mask for inpainting
mask = np.zeros(image.shape[:2], dtype=np.uint8)  # Same height/width, single channel

# Loop through detected objects and draw the mask for inpainting
for box in detections:
    # Get the bounding box coordinates
    x1, y1, x2, y2 = map(int, box.xyxy[0])  # Convert coordinates to integers

    # Draw a filled rectangle (mask) over the detected object
    mask[y1:y2, x1:x2] = 255  # Set the bounding box region to white in the mask

This code effectively identifies the regions containing text and creates a binary mask highlighting these areas for subsequent inpainting.
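To sanity-check the detections before inpainting, the mask can be overlaid on the source image. A small illustrative snippet (the output path is a placeholder):

# Paint the masked regions red (BGR) for a quick visual check
overlay = image.copy()
overlay[mask == 255] = (0, 0, 255)
cv2.imwrite('/path/to/mask_preview.jpg', overlay)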


Step 2: Seamless Inpainting with LaMa

With the bounding boxes accurately identifying the text regions, the next critical step was to remove these texts in a manner that maintained the natural appearance of the images. For this, I turned to LaMa (Large Mask Inpainting)—a pre-trained inpainting model renowned for its ability to generate highly realistic background textures, effectively eliminating any visible signs of editing.

Why LaMa?

While OpenCV’s inpainting offered basic removal capabilities, LaMa provided superior results by intelligently filling in the masked regions with contextually appropriate textures and colors. This advanced inpainting ensures that the removed areas blend seamlessly with their surroundings, preserving the integrity and aesthetics of the original images.

Challenges Faced:

Implementing LaMa wasn’t without its hurdles. Integrating it into the workflow required meticulous setup and configuration, consuming nearly two days of dedicated effort. The key challenges included:

  • Repository Integration: Cloning and integrating the LaMa repository into the existing codebase.
  • Model Configuration: Downloading the latest LaMa model and ensuring compatibility with the project’s requirements.
  • Dependency Management: Installing necessary dependencies to facilitate smooth operation.
  • API Integration: Developing a Flask API to handle image processing requests efficiently.


Detailed Implementation Steps

  • Cloning the LaMa Repository

git clone git@github.com:advimman/lama.git        

  • Appending LaMa to the codebase

import sys
sys.path.append('/path/to/lama')
Explanation: This ensures that Python can locate and import LaMa’s modules, integrating its functionality into the existing project.

  • Downloading the latest LaMa model

curl -LJO https://huggingface.co/smartywu/big-lama/resolve/main/big-lama.zip
unzip big-lama.zip
Purpose: Retrieves the most recent LaMa release and extracts it; the checkpoint inside is essential for accurate inpainting.
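After extraction, the big-lama directory should contain the configuration and checkpoint referenced later in the code (structure as published in the Hugging Face release):

big-lama/
├── config.yaml
└── models/
    └── best.ckpt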

  • Installing LaMa dependencies

pip3 install -r requirements.txt  # run from the cloned lama directory

  • Integrating LaMa with Flask API

Purpose: To create a scalable and accessible endpoint for processing images, a Flask API was developed. This API handles incoming image URLs, processes them through YOLOv11 and LaMa, and returns the inpainted images.

Below is the complete code that ties together YOLOv11’s detection capabilities with LaMa’s inpainting prowess within a Flask API framework:

import io
import sys
import os
import yaml
import requests
import hashlib
from pathlib import Path

import torch
import cv2
import numpy as np
from PIL import Image
from omegaconf import OmegaConf
from flask import Flask, request, jsonify, send_file
from ultralytics import YOLO

# Add LaMa directory to Python path
sys.path.append('/path/to/lama')

from saicinpainting.training.trainers import load_checkpoint
from saicinpainting.evaluation.data import pad_tensor_to_modulo
from saicinpainting.evaluation.utils import move_to_device

app = Flask(__name__)

# Load LaMa configuration and checkpoint
lama_config = "/path/to/big-lama/config.yaml"
lama_ckpt = "/path/to/big-lama"  # Model directory; 'models/best.ckpt' is joined below

# Load LaMa configuration
predict_config = OmegaConf.load(lama_config)
predict_config.model.path = lama_ckpt
predict_config.model.checkpoint = 'best.ckpt'

# Load training configuration
with open(lama_config, 'r') as f:
    train_config = OmegaConf.create(yaml.safe_load(f))

train_config.training_model.predict_only = True
train_config.visualizer.kind = 'noop'

# Load LaMa model checkpoint
checkpoint_path = os.path.join(
    predict_config.model.path, 'models',
    predict_config.model.checkpoint
)
lama_model = load_checkpoint(
    train_config, checkpoint_path, strict=False, map_location='cpu')

lama_model.freeze()

# Move the model to GPU if available; inpainting batches are sent to the same device
device = "cuda" if torch.cuda.is_available() else "cpu"
lama_model.to(device)

# Load the trained YOLOv11 model
yolo_model = YOLO('runs/detect/train/weights/best.pt')

# Helper function to dilate mask
def dilate_mask(mask, dilate_factor=15):
    mask = mask.astype(np.uint8)
    mask = cv2.dilate(
        mask,
        np.ones((dilate_factor, dilate_factor), np.uint8),
        iterations=1
    )
    return mask

# Helper function to download image from URL
def download_image(url):
    response = requests.get(url)
    if response.status_code == 200:
        nparr = np.frombuffer(response.content, np.uint8)
        img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
        return img
    else:
        raise ValueError("Image download failed")

# Process image with YOLOv11 and LaMa
def process_image_lama(image):
    results = yolo_model(image)
    detections = results[0].boxes

    # Create mask for inpainting
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for box in detections:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        mask[y1:y2, x1:x2] = 255

    # Perform inpainting using LaMa
    return replace_masked_region_v2(image, mask)

# Replace masked regions using LaMa
# Note: OpenCV images are BGR while LaMa was trained on RGB; converting with
# cv2.cvtColor before and after inpainting may improve texture fidelity.
def replace_masked_region_v2(img, mask, mod=8, dilate_kernel_size=None):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    assert len(mask.shape) == 2
    if np.max(mask) == 1:
        mask = mask * 255

    # Optional: dilate the mask (while it is still a NumPy array) to avoid edge effects
    if dilate_kernel_size is not None:
        mask = dilate_mask(mask, dilate_kernel_size)

    img = torch.from_numpy(img).float().div(255.)
    mask = torch.from_numpy(mask).float()

    batch = {}
    batch['image'] = img.permute(2, 0, 1).unsqueeze(0)
    batch['mask'] = mask[None, None]
    unpad_to_size = [batch['image'].shape[2], batch['image'].shape[3]]
    batch['image'] = pad_tensor_to_modulo(batch['image'], mod)
    batch['mask'] = pad_tensor_to_modulo(batch['mask'], mod)
    batch = move_to_device(batch, device)
    batch['mask'] = (batch['mask'] > 0) * 1

    batch = lama_model(batch)
    cur_res = batch[predict_config.out_key][0].permute(1, 2, 0)
    cur_res = cur_res.detach().cpu().numpy()

    if unpad_to_size is not None:
        orig_height, orig_width = unpad_to_size
        cur_res = cur_res[:orig_height, :orig_width]

    cur_res = np.clip(cur_res * 255, 0, 255).astype('uint8')
    return cur_res

# Flask route to process images
@app.route('/process', methods=['GET'])
def process():
    image_url = request.args.get('image_url')
    if not image_url:
        return jsonify({"error": "No image URL provided"}), 400

    try:
        # Download image from URL
        image = download_image(image_url)
        
        # Process image with YOLOv11 and LaMa
        processed_image = process_image_lama(image)
        
        # Encode processed image to WebP format
        _, buffer = cv2.imencode('.webp', processed_image, [cv2.IMWRITE_WEBP_QUALITY, 90])
        
        # Send the processed image as a response
        return send_file(io.BytesIO(buffer), mimetype='image/webp')
    
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(debug=True)
        

With both YOLOv11 and LaMa integrated, the Flask API serves as a robust interface for processing images. Users can submit image URLs, and the API will return images with the specified text objects seamlessly removed.
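As a quick end-to-end check, the endpoint can be exercised with a few lines of Python, assuming the API is running locally on Flask's default port 5000 (the image URL is a placeholder):

import requests

resp = requests.get(
    "http://127.0.0.1:5000/process",
    params={"image_url": "https://example.com/sample.jpg"},
)
resp.raise_for_status()

# Save the inpainted WebP returned by the API
with open("cleaned.webp", "wb") as f:
    f.write(resp.content)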


Conclusion: Embracing Complexity and Continuous Learning

This endeavor served as a profound reminder that tasks that initially appear "simple" can unravel into complex challenges when scaled. It underscored the importance of leveraging cutting-edge advancements in AI to navigate and overcome such obstacles effectively.

Moreover, this project reaffirmed the immense value of engaging in hands-on work, even within a leadership role. By immersing myself directly in the technical aspects, I not only addressed a specific problem but also stayed attuned to the rapid advancements in AI and machine learning. This dual perspective—balancing managerial oversight with technical engagement—is crucial for fostering innovation and maintaining a forward-thinking approach in the ever-evolving tech landscape.
