Enjoying some AI: Tackling Text Removal at Scale
As a senior manager, finding time for hands-on experimentation is rare. But last week, I took on a unique challenge that pushed me back into the trenches of machine learning and image processing. The task? Building software that removes specific text objects from thousands of images—a task similar to watermark removal.
Problem Statement
At first glance, the task appeared straightforward: detect and eliminate specific text objects from images. However, as I delved deeper, it became evident that scaling this solution to thousands of images introduced a multitude of intricate challenges. Ensuring consistency and quality across diverse image contexts required a more nuanced approach than initially anticipated.
First Attempt: Leveraging OCR and OpenCV
My initial strategy used Optical Character Recognition (OCR) with Tesseract to detect the text objects, followed by OpenCV’s inpainting to remove them. The process was straightforward:
However, after several iterations, it became clear that this approach had significant limitations:
These setbacks highlighted the need for a more robust and scalable solution.
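For reference, here is a minimal sketch of what an OCR-plus-inpainting pass like the one described above could look like. It is not my original script: the function names, the `target` word filter, the confidence threshold of 60, and the Telea inpainting radius of 3 are all illustrative assumptions. It relies on `pytesseract.image_to_data` for word boxes and `cv2.inpaint` for removal.

```python
def find_text_boxes(ocr_data, target, min_conf=60.0):
    """Return (x, y, w, h) boxes for OCR words containing `target`.

    `ocr_data` is the dict produced by
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).
    """
    boxes = []
    for i, word in enumerate(ocr_data["text"]):
        conf = float(ocr_data["conf"][i])  # Tesseract reports -1 for non-word rows
        if conf >= min_conf and target.lower() in word.lower():
            boxes.append((ocr_data["left"][i], ocr_data["top"][i],
                          ocr_data["width"][i], ocr_data["height"][i]))
    return boxes


def remove_text_with_opencv(image_path, target):
    """OCR the image, mask the matching words, and inpaint them with OpenCV."""
    # Imported here so find_text_boxes stays usable without these packages
    import cv2
    import numpy as np
    import pytesseract

    image = cv2.imread(image_path)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    data = pytesseract.image_to_data(rgb, output_type=pytesseract.Output.DICT)
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for x, y, w, h in find_text_boxes(data, target):
        mask[y:y + h, x:x + w] = 255
    # Telea inpainting with a 3-pixel radius; both choices are illustrative
    return cv2.inpaint(image, mask, 3, cv2.INPAINT_TELEA)
```

A pipeline of this shape depends entirely on the OCR engine finding the text in the first place, which is one reason to treat detection as its own problem.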
Breaking Down the Problem
Through extensive research and iterative experimentation, I distilled the problem into two primary challenges:
Accurate Text Object Detection:
Realistic Inpainting Post-Removal:
Recognizing these challenges, I opted for the following approach:
Solution Strategy:
Step 1: Selecting YOLOv11 for Precise Object Detection
To achieve accurate text object detection, I selected YOLOv11 (You Only Look Once, version 11; published by Ultralytics under the name YOLO11), an object detection model known for its speed and precision. Its real-time detection performance makes it well suited to processing large batches of images efficiently.
Key Components of This Step:
Object Detection Fundamentals:
Annotation Process:
Dataset Preparation
train: /path/to/labels/train/images
val: /path/to/labels/valid/images
test: /path/to/labels/test/images
nc: 1
names: ['marker'] # Annotation label for the text objects
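Each annotated image is paired with a text file holding one line per box in the Ultralytics YOLO label format: the class index followed by the box center, width, and height, all normalized to the 0-1 range. Annotation tools normally export this for you, so the helper below is only an illustrative sketch of the arithmetic. The function name and the sample numbers are made up, and class `0` corresponds to the single `marker` class declared above.

```python
def to_yolo_label(class_id, box, image_width, image_height):
    """Convert a pixel box (x1, y1, x2, y2) into one YOLO label line."""
    x1, y1, x2, y2 = box
    x_center = (x1 + x2) / 2 / image_width
    y_center = (y1 + y2) / 2 / image_height
    width = (x2 - x1) / image_width
    height = (y2 - y1) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 200x60 px box inside a 640x480 image
print(to_yolo_label(0, (100, 200, 300, 260), 640, 480))
# 0 0.312500 0.479167 0.312500 0.125000
```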
This is how my annotation dataset folder looks
Model Training
Here is the training code:
!pip3 install ultralytics
import torch
from ultralytics import YOLO
# Load a pretrained model (medium-sized YOLO11 checkpoint)
model = YOLO('yolo11m.pt')
# Free up cached GPU memory before training
torch.cuda.empty_cache()
# Train the model on the annotated dataset
model.train(
    data='/path/to/labels/data.yaml',
    optimizer='auto',
    epochs=20,
    imgsz=640,
    batch=8,
    workers=4,
)
The output of this step is the best-performing checkpoint, saved here:
runs/detect/train/weights/best.pt
Detecting Text Objects and Creating Inpaint Masks
Once the model was trained, the next phase involved using YOLOv11 to detect text objects in new images and create corresponding masks for inpainting. Here's how it was accomplished:
%matplotlib inline
import cv2
import numpy as np
from ultralytics import YOLO
# Load the best-performing checkpoint from training
model = YOLO('runs/detect/train/weights/best.pt')
image = cv2.imread("/path/to/image.jpg")
results = model(image)
detections = results[0].boxes  # Bounding boxes for the single input image
# Create an empty single-channel mask with the image's height and width
mask = np.zeros(image.shape[:2], dtype=np.uint8)
# Paint each detected box white in the mask
for box in detections:
    # box.xyxy[0] holds (x1, y1, x2, y2) in pixel coordinates
    x1, y1, x2, y2 = map(int, box.xyxy[0])
    mask[y1:y2, x1:x2] = 255
This code effectively identifies the regions containing text and creates a binary mask highlighting these areas for subsequent inpainting.
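The mask-building step can also be pulled into a small reusable function. The sketch below is a variation rather than the exact code above: the `pad` parameter (growing each box by a few pixels so the mask also covers soft text edges) and the clamping to the image bounds are my own illustrative additions, and the function name is hypothetical.

```python
import numpy as np


def boxes_to_mask(image_shape, boxes, pad=0):
    """Build a binary inpainting mask from (x1, y1, x2, y2) pixel boxes."""
    height, width = image_shape[:2]
    mask = np.zeros((height, width), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        # Grow the box by `pad` pixels, then clamp it to the image bounds
        left, top = max(int(x1) - pad, 0), max(int(y1) - pad, 0)
        right, bottom = min(int(x2) + pad, width), min(int(y2) + pad, height)
        mask[top:bottom, left:right] = 255
    return mask


# Two boxes on a 100x200 image; the second one spills past the border
mask = boxes_to_mask((100, 200, 3), [(10, 20, 50, 40), (190, 90, 230, 120)], pad=5)
```

With detections from the model, the boxes could be passed as `[b.xyxy[0].tolist() for b in detections]`.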
Step 2: Seamless Inpainting with LaMa
With the bounding boxes accurately identifying the text regions, the next critical step was to remove these texts in a manner that maintained the natural appearance of the images. For this, I turned to LaMa (Large Mask Inpainting)—a pre-trained inpainting model renowned for its ability to generate highly realistic background textures, effectively eliminating any visible signs of editing.
Why LaMa?
While OpenCV’s inpainting offered basic removal capabilities, LaMa provided superior results by intelligently filling in the masked regions with contextually appropriate textures and colors. This advanced inpainting ensures that the removed areas blend seamlessly with their surroundings, preserving the integrity and aesthetics of the original images.
Challenges Faced:
Implementing LaMa wasn’t without its hurdles. Integrating it into the workflow required meticulous setup and configuration, consuming nearly two days of dedicated effort. The key challenges included:
Detailed Implementation Steps
git clone git@github.com:advimman/lama.git
sys.path.append('/path/to/lama')
Explanation: This ensures that Python can locate and import LaMa’s modules, integrating its functionality into the existing project.
curl -LJO https://huggingface.co/smartywu/big-lama/resolve/main/big-lama.zip
Purpose: Downloads the pretrained big-lama model archive, which the inpainting step requires. Unzip it before use.
pip3 install -r requirements.txt
Purpose: To create a scalable and accessible endpoint for processing images, a Flask API was developed. This API handles incoming image URLs, processes them through YOLOv11 and LaMa, and returns the inpainted images.
Below is the complete code that ties together YOLOv11’s detection capabilities with LaMa’s inpainting prowess within a Flask API framework:
import io
import os
import sys
import yaml
import requests
import torch
import cv2
import numpy as np
from omegaconf import OmegaConf
from flask import Flask, request, jsonify, send_file
from ultralytics import YOLO
# Add the cloned LaMa repository to the Python path
sys.path.append('/path/to/lama')
from saicinpainting.training.trainers import load_checkpoint
from saicinpainting.evaluation.data import pad_tensor_to_modulo
from saicinpainting.evaluation.utils import move_to_device
app = Flask(__name__)
# Assumed layout: the unzipped big-lama folder holds config.yaml and models/best.ckpt,
# while the prediction config (model.checkpoint, out_key) ships with the LaMa repository
lama_dir = "/path/to/big-lama"
lama_predict_config = "/path/to/lama/configs/prediction/default.yaml"
# Load the prediction configuration and point it at the model folder
predict_config = OmegaConf.load(lama_predict_config)
predict_config.model.path = lama_dir
# Load the training configuration stored next to the checkpoint
with open(os.path.join(lama_dir, 'config.yaml'), 'r') as f:
    train_config = OmegaConf.create(yaml.safe_load(f))
train_config.training_model.predict_only = True
train_config.visualizer.kind = 'noop'
# Load the LaMa checkpoint from <lama_dir>/models/<checkpoint name>
checkpoint_path = os.path.join(
    predict_config.model.path, 'models',
    predict_config.model.checkpoint
)
lama_model = load_checkpoint(
    train_config, checkpoint_path, strict=False, map_location='cpu')
lama_model.freeze()
# Load the trained YOLOv11 model
yolo_model = YOLO('runs/detect/train/weights/best.pt')
# Helper function to dilate a binary mask so it fully covers the text edges
def dilate_mask(mask, dilate_factor=15):
    mask = mask.astype(np.uint8)
    mask = cv2.dilate(
        mask,
        np.ones((dilate_factor, dilate_factor), np.uint8),
        iterations=1
    )
    return mask
# Helper function to download an image from a URL and decode it (BGR order)
def download_image(url):
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        raise ValueError("Image download failed")
    nparr = np.frombuffer(response.content, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    if img is None:
        raise ValueError("Downloaded content is not a decodable image")
    return img
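# Process image with YOLOv11 and LaMa
def process_image_lama(image):
    results = yolo_model(image)
    detections = results[0].boxes
    # Nothing detected: return the image untouched
    if len(detections) == 0:
        return image
    # Create mask for inpainting
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for box in detections:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        mask[y1:y2, x1:x2] = 255
    # LaMa works on RGB images, while OpenCV images are BGR
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Perform inpainting using LaMa, then convert back for OpenCV encoding
    inpainted = replace_masked_region_v2(rgb, mask)
    return cv2.cvtColor(inpainted, cv2.COLOR_RGB2BGR)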
# Replace masked regions using LaMa
def replace_masked_region_v2(img, mask, mod=8, device="cuda", dilate_kernel_size=None):
    # Fall back to CPU when CUDA is unavailable
    device = device if torch.cuda.is_available() else "cpu"
    assert len(mask.shape) == 2
    if np.max(mask) == 1:
        mask = mask * 255
    # Optional: dilate the mask (while it is still a NumPy array) to avoid edge effects
    if dilate_kernel_size is not None:
        mask = dilate_mask(mask, dilate_kernel_size)
    img = torch.from_numpy(img).float().div(255.)
    mask = torch.from_numpy(mask).float()
    batch = {}
    batch['image'] = img.permute(2, 0, 1).unsqueeze(0)
    batch['mask'] = mask[None, None]
    unpad_to_size = [batch['image'].shape[2], batch['image'].shape[3]]
    # Pad height and width up to multiples of `mod`
    batch['image'] = pad_tensor_to_modulo(batch['image'], mod)
    batch['mask'] = pad_tensor_to_modulo(batch['mask'], mod)
    # The checkpoint was loaded on CPU, so move the model to the same device as the batch
    lama_model.to(device)
    batch = move_to_device(batch, device)
    batch['mask'] = (batch['mask'] > 0) * 1
    with torch.no_grad():
        batch = lama_model(batch)
    cur_res = batch[predict_config.out_key][0].permute(1, 2, 0)
    cur_res = cur_res.detach().cpu().numpy()
    # Crop away the padding added above
    orig_height, orig_width = unpad_to_size
    cur_res = cur_res[:orig_height, :orig_width]
    cur_res = np.clip(cur_res * 255, 0, 255).astype('uint8')
    return cur_res
# Flask route to process images
@app.route('/process', methods=['GET'])
def process():
    image_url = request.args.get('image_url')
    if not image_url:
        return jsonify({"error": "No image URL provided"}), 400
    try:
        # Download image from URL
        image = download_image(image_url)
        # Process image with YOLOv11 and LaMa
        processed_image = process_image_lama(image)
        # Encode processed image to WebP format
        ok, buffer = cv2.imencode('.webp', processed_image, [cv2.IMWRITE_WEBP_QUALITY, 90])
        if not ok:
            raise RuntimeError("WebP encoding failed")
        # Send the processed image as a response
        return send_file(io.BytesIO(buffer.tobytes()), mimetype='image/webp')
    except Exception as e:
        return jsonify({"error": str(e)}), 500
if __name__ == '__main__':
    # debug=True is meant for local experimentation only
    app.run(debug=True)
With both YOLOv11 and LaMa integrated, the Flask API serves as a robust interface for processing images. Users can submit image URLs, and the API will return images with the specified text objects seamlessly removed.
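To call the endpoint from a script, the image URL has to be passed as a properly encoded `image_url` query parameter. The client below is a small sketch using only the standard library. The helper names are hypothetical, and the base URL assumes the Flask development server is running locally on its default port 5000.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_process_url(image_url, base_url="http://127.0.0.1:5000"):
    """Return the /process URL with the image URL safely query-encoded."""
    return f"{base_url}/process?{urlencode({'image_url': image_url})}"


def fetch_clean_image(image_url, out_path, base_url="http://127.0.0.1:5000"):
    """Call the API and save the returned WebP bytes to `out_path`."""
    # urlopen raises HTTPError for the API's 400/500 responses
    with urlopen(build_process_url(image_url, base_url), timeout=120) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return out_path
```

For example, `fetch_clean_image("https://example.com/photo.jpg", "photo_clean.webp")` would save the processed image next to the script, and looping over a list of URLs gives a simple batch run.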
Conclusion: Embracing Complexity and Continuous Learning
This endeavor served as a profound reminder that tasks which may initially appear "simple" can unravel into complex challenges when scaled. It underscored the importance of leveraging cutting-edge advancements in AI to navigate and overcome such obstacles effectively.
Moreover, this project reaffirmed the immense value of engaging in hands-on work, even within a leadership role. By immersing myself directly in the technical aspects, I not only addressed a specific problem but also stayed attuned to the rapid advancements in AI and machine learning. This dual perspective—balancing managerial oversight with technical engagement—is crucial for fostering innovation and maintaining a forward-thinking approach in the ever-evolving tech landscape.