Building an AI-Powered Image Captioning App with React and Flask

In this tutorial, we'll walk through building a full-stack application that generates descriptive captions for uploaded images using AI. The application combines a React frontend with a Flask backend and leverages Salesforce's BLIP (Bootstrapping Language-Image Pre-training) model via Hugging Face's transformers library.

What We'll Build

We'll create an application that allows users to:

  1. Upload an image from their device
  2. Send the image to a Flask backend
  3. Process the image with the BLIP AI model
  4. Display the generated caption

System Architecture

Here's a high-level diagram of our application architecture:

[Architecture diagram: React frontend → Flask REST API → BLIP model, with the generated caption returned to the browser]

Tech Stack Overview

Frontend

  • React: For building the user interface
  • Axios: For making HTTP requests to the backend
  • Vite: For fast development and bundling

Backend

  • Flask: For creating the REST API
  • Flask-CORS: For handling cross-origin requests
  • Transformers: Hugging Face's library for using pre-trained models
  • Pillow: For image processing

AI Model

  • BLIP (Bootstrapping Language-Image Pre-training): Salesforce's model for generating image captions

Step 1: Setting Up the Backend

Let's start by creating our Flask backend, which will handle image processing and caption generation.

First, we need to install the necessary dependencies:

pip install flask flask-cors transformers torch torchvision pillow

Next, create a file called app.py:

import logging
from flask import Flask, request, jsonify
from flask_cors import CORS
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import base64
import io

# Initialize the Flask application
app = Flask(__name__)
# Enable Cross-Origin Resource Sharing (CORS) for the app
CORS(app)

# Configure logging to display information level logs
logging.basicConfig(level=logging.INFO)

# Configuration for the model name
MODEL_NAME = "Salesforce/blip-image-captioning-base"

# Load the BLIP model and processor using the specified model name
captioning_model = BlipForConditionalGeneration.from_pretrained(MODEL_NAME)
image_processor = BlipProcessor.from_pretrained(MODEL_NAME)

def decode_image(base64_image):
    """
    Decode a base64 encoded string to a PIL image.
    """
    try:
        # Decode the base64 string to bytes
        image_bytes = base64.b64decode(base64_image)
        # Convert bytes to a PIL Image, forcing RGB since BLIP expects 3-channel input
        return Image.open(io.BytesIO(image_bytes)).convert("RGB")
    except Exception as e:
        logging.error("Failed to decode image: %s", e)
        raise ValueError("Invalid image data")

def generate_caption(image):
    """
    Generate a caption for the given image using the BLIP model.
    """
    try:
        # Process the image and prepare it for the model
        model_inputs = image_processor(image, return_tensors="pt")
        # Generate a caption using the model
        model_output = captioning_model.generate(**model_inputs)
        # Decode the model output to a human-readable string
        return image_processor.decode(model_output[0], skip_special_tokens=True)
    except Exception as e:
        logging.error("Failed to generate caption: %s", e)
        raise RuntimeError("Caption generation failed")

@app.route('/caption', methods=['POST'])
def caption_image():
    """
    Endpoint to generate a caption for a given image.
    """
    try:
        # Retrieve JSON data from the request (silent=True returns None instead of raising on bad JSON)
        request_data = request.get_json(silent=True) or {}
        # Extract base64 encoded image data
        base64_image = request_data.get("image", "")
        if not base64_image:
            return jsonify({"error": "No image data provided"}), 400

        # Decode the image and generate a caption
        image = decode_image(base64_image)
        generated_caption = generate_caption(image)
        # Return the generated caption as a JSON response
        return jsonify({"caption": generated_caption})

    except ValueError as ve:
        # Handle invalid image data
        return jsonify({"error": str(ve)}), 400
    except RuntimeError as re:
        # Handle caption generation failure
        return jsonify({"error": str(re)}), 500
    except Exception as error:
        # Handle unexpected errors
        logging.error("Unexpected error: %s", error)
        return jsonify({"error": "An unexpected error occurred"}), 500

if __name__ == '__main__':
    # Run the Flask application in debug mode
    app.run(debug=True)

This backend performs three main functions:

  1. Decodes base64-encoded image data received from the frontend
  2. Processes the image with the BLIP model
  3. Returns the generated caption as a JSON response
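
Before wiring up the frontend, you can sanity-check the endpoint with a short Python script that base64-encodes a local image and posts it to /caption. This is a minimal sketch: it assumes the Flask server is running locally on its default port, test.jpg is a placeholder path, and the requests package is installed (pip install requests).

import base64
import requests

# Read a local image and encode it as a base64 string (test.jpg is just an example path)
with open("test.jpg", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

# POST the encoded image to the /caption endpoint defined in app.py
response = requests.post(
    "http://127.0.0.1:5000/caption",
    json={"image": encoded_image},
)
print(response.json())  # e.g. {"caption": "a dog sitting on a couch"}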

Step 2: Creating the React Frontend

Now, let's build our React frontend with Vite. First, set up a new React project:

npm create vite@latest frontend -- --template react
cd frontend
npm install
npm install axios

Now, let's create our main App component in src/App.jsx:

import React, { useState } from "react";
import axios from "axios";

/**
 * App component for the Image Captioning application.
 * Allows users to upload an image and generate a caption using a backend service.
 */
function App() {
    // State to store the selected image as a base64 string
    const [selectedImage, setSelectedImage] = useState(null);
    // State to store the generated caption for the image
    const [generatedCaption, setGeneratedCaption] = useState("");
    // State to store any error messages
    const [errorMessage, setErrorMessage] = useState("");

    // Styles object to manage inline styles for the component
    const styles = {
        container: { padding: "20px", maxWidth: "600px", margin: "0 auto" },
        imagePreview: { width: "100%", maxHeight: "300px" },
        button: { padding: "10px", marginTop: "20px", cursor: "pointer" },
        errorText: { marginTop: "20px", color: "red" },
        captionText: { marginTop: "20px" }
    };

    /**
     * Handles the image upload event.
     * Reads the uploaded file and converts it to a base64 string.
     * @param {Object} event - The file input change event.
     */
    const handleImageUpload = (event) => {
        const [uploadedFile] = event.target.files; // Destructure to get the first file
        if (uploadedFile) {
            const fileReader = new FileReader();
            // Set the selected image state when file reading is complete
            fileReader.onloadend = () => setSelectedImage(fileReader.result);
            // Set an error message if file reading fails
            fileReader.onerror = () => setErrorMessage("Failed to read file.");
            // Read the file as a data URL (base64 string)
            fileReader.readAsDataURL(uploadedFile);
        }
    };

    /**
     * Sends the selected image to the backend to generate a caption.
     * Updates the generated caption or error message based on the response.
     */
    const handleGenerateCaption = async () => {
        // Guard against requests with no image selected
        if (!selectedImage) {
            setErrorMessage("Please select an image first.");
            return;
        }
        try {
            setErrorMessage(""); // Clear any previous error messages
            setGeneratedCaption("Generating caption..."); // Indicate caption generation in progress

            // Extract the base64 part of the image data
            const base64ImageData = selectedImage?.split(",")[1];
            // Send a POST request to the backend with the image data
            const response = await axios.post("http://127.0.0.1:5000/caption", { image: base64ImageData });

            // Update the generated caption with the response or a default message
            setGeneratedCaption(response.data?.caption || "No caption generated.");
        } catch (err) {
            // Set an error message if the request fails
            setErrorMessage("Failed to generate caption. Please try again.");
        }
    };

    return (
        <div style={styles.container}>
            <h1>Image Captioning App</h1>
            {/* File input for uploading images */}
            <input type="file" accept="image/*" onChange={handleImageUpload} />
            {/* Display the selected image if available */}
            {selectedImage && (
                <div style={{ marginTop: "20px" }}>
                    <img src={selectedImage} alt="Preview" style={styles.imagePreview} />
                </div>
            )}
            {/* Button to trigger caption generation */}
            <button onClick={handleGenerateCaption} style={styles.button}>
                Generate Caption
            </button>
            {/* Display the generated caption if available */}
            {generatedCaption && <p style={styles.captionText}>Caption: {generatedCaption}</p>}
            {/* Display an error message if available */}
            {errorMessage && <p style={styles.errorText}>{errorMessage}</p>}
        </div>
    );
}

export default App;

This frontend provides:

  1. An input for uploading images
  2. A preview of the selected image
  3. A button to trigger caption generation
  4. Display areas for the generated caption and any error messages
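
To try it end to end, start the backend with python app.py in one terminal and the frontend with npm run dev (Vite's dev command) in another, then open the URL Vite prints in your browser.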

How It Works: The Data Flow

Here's a detailed flowchart of how data moves through our application:

[Flowchart: user selects an image → FileReader encodes it as a base64 data URL → Axios POSTs the base64 payload to /caption → Flask decodes the image → BLIP generates a caption → the caption is returned as JSON and rendered in React]

Understanding the BLIP Model

The BLIP (Bootstrapping Language-Image Pre-training) model from Salesforce is a powerful vision-language model that can perform various tasks, including image captioning.

Key Features of BLIP

  1. Multimodal Learning: BLIP understands both images and text, allowing it to generate coherent captions that describe the content of images.

  2. Bootstrapped Learning: It uses a bootstrapped approach that helps clean noisy image-text pairs from the web, resulting in better performance.

  3. Versatility: Beyond image captioning, BLIP can also perform visual question answering, image-text retrieval, and more.

BLIP was introduced in the paper "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" by Li et al. (2022) [1].
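
As a quick illustration of that versatility, here is BLIP used standalone, outside of Flask. The second call shows conditional captioning, where generation continues from a text prompt (a minimal sketch; example.jpg is a placeholder path):

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")  # placeholder image path

# Unconditional captioning: the model describes the image freely
inputs = processor(image, return_tensors="pt")
print(processor.decode(model.generate(**inputs)[0], skip_special_tokens=True))

# Conditional captioning: generation is steered by a text prefix
inputs = processor(image, "a photo of", return_tensors="pt")
print(processor.decode(model.generate(**inputs)[0], skip_special_tokens=True))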

Error Handling and Optimization

Our application includes several error-handling measures:

  1. Frontend Error Handling:

    • Checks for valid image uploads
    • Displays user-friendly error messages
    • Shows loading states during caption generation
  2. Backend Error Handling:

    • Validates input data
    • Catches and logs exceptions
    • Returns appropriate HTTP status codes

Potential Enhancements

Here are some ways to extend this application:

  1. Multiple Caption Generation: Generate multiple captions with different parameters (see the sketch after this list).

  2. User Feedback Loop: Allow users to rate captions and use this feedback to fine-tune the model.

  3. Style Transfer: Add image filters or style transfer options before captioning.

  4. Progressive Web App (PWA): Convert to a PWA for offline capabilities.

  5. Advanced UI: Implement drag-and-drop functionality and animations.
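
For the first idea, Hugging Face's generate() can return several beam-search candidates via num_return_sequences. A multi-caption variant of generate_caption() might look like this (generate_captions and num_captions are hypothetical names, not part of the app above):

def generate_captions(image, num_captions=3):
    """Generate several candidate captions using beam search."""
    model_inputs = image_processor(image, return_tensors="pt")
    # num_beams must be >= num_return_sequences when using beam search
    model_output = captioning_model.generate(
        **model_inputs,
        num_beams=num_captions,
        num_return_sequences=num_captions,
        max_new_tokens=30,
    )
    return [
        image_processor.decode(sequence, skip_special_tokens=True)
        for sequence in model_output
    ]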

Performance Considerations

When working with ML models like BLIP, consider the following:

  1. Model Size: The BLIP base checkpoint weighs several hundred MB and is loaded into memory at startup. Consider loading it once per process (as app.py does), lazy-loading it on first request, or serving it from a dedicated inference service to keep startup time and memory use predictable.

  2. Caching: Implement caching for repeated requests with the same images (a sketch follows this list).

  3. Batching: If supporting multiple users, implement request batching to increase throughput.
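
To illustrate point 2, one simple approach keys an in-memory cache on a hash of the raw image bytes, building on the helpers in app.py (a sketch only; caption_cache and caption_with_cache are hypothetical names, and a production setup might prefer Redis or a TTL-bounded cache):

import hashlib

# Maps the SHA-256 of the raw image bytes to a previously generated caption
caption_cache = {}

def caption_with_cache(image_bytes):
    """Return a cached caption if this exact image has been seen before."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in caption_cache:
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        caption_cache[key] = generate_caption(image)
    return caption_cache[key]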

Conclusion

In this tutorial, we've built a complete image captioning application using React, Flask, and the BLIP model. This project demonstrates how to:

  1. Set up a Flask backend with a machine learning model
  2. Create a React frontend for image upload and display
  3. Implement communication between frontend and backend
  4. Process and transform data for AI model consumption

The combination of modern web technologies with powerful AI models opens up endless possibilities for creative applications. The techniques shown here can be extended to other vision-language tasks like visual question answering, image generation, and more.

Resources and References


GitHub Repository: Image Captioning App


Open for Projects

I'm currently available to take on new projects in the following areas:

  • Artificial Intelligence solutions (both no-code and custom development)
  • No-code automation with n8n (and open to other automation platforms)
  • React.js frontend development
  • Node.js backend/API development
  • WooCommerce development and customization
  • Stripe payment integration and automation
  • PHP applications and frameworks
  • Python development
  • Supabase, Vercel & GitHub integration

My Expertise

I'm a Senior Web Developer with growing expertise in AI/ML solutions, passionate about creating practical applications that leverage artificial intelligence to solve real-world problems. While relatively new to AI/ML development (less than a year of focused experience), I've quickly built a portfolio of functional projects that demonstrate my ability to integrate AI capabilities into useful applications. My specialized skills include:

  • AI Integration: Connecting pre-trained AI models with web applications through APIs and direct implementation
  • Computer Vision & NLP: Implementing image captioning, sentiment analysis, text summarization, chatbots, and language translation applications
  • Agentic AI Workflows: Creating intelligent autonomous agents that can execute complex tasks through multi-step reasoning
  • Full-Stack Development: Crafting seamless experiences with React.js frontends and Python/Flask or Node.js backends
  • E-commerce Solutions: Expert in WooCommerce/Stripe integrations with subscription management and payment processing
  • Automation Tools: Python scripts and n8n workflows for business-critical processes and data pipelines
  • Content Automation: Creating AI-powered systems that generate complete content packages from blog posts to social media updates

Featured Projects

Personal AI Chatbot - A complete conversational AI application built with React and Flask, powered by Microsoft's DialoGPT-medium model from Hugging Face. This project demonstrates how to create an interactive chatbot with a clean, responsive interface that understands and generates human-like text responses.

Image Captioning App - A full-stack application that generates descriptive captions for uploaded images using AI. Built with React for the frontend and Flask for the backend, this app leverages Salesforce's BLIP model via Hugging Face's transformers library to analyze images and create natural language descriptions of their content.

Sentiment Analysis App - A lightweight full-stack application that performs sentiment analysis on user-provided text using React.js for the frontend and Flask with Hugging Face Transformers for the backend. This project demonstrates how easily powerful pre-trained NLP models can be integrated into modern web applications.

Agentic AI Workflow - A Python-based framework for building intelligent AI agents that can break down complex tasks into manageable steps and execute them sequentially. This project demonstrates how to leverage OpenRouter API to access multiple AI models (OpenAI, Anthropic, Google, etc.) through a unified interface, enabling more sophisticated problem-solving capabilities and better reasoning in AI applications.

WiseCashAI - A revolutionary privacy-first financial management platform that operates primarily in your browser, ensuring your sensitive financial data never leaves your control. Unlike cloud-based alternatives that collect and monetize your information, WiseCashAI offers AI-powered features like intelligent transaction categorization, envelope-based budgeting, and goal tracking while keeping your data local. Optional Google Drive integration with end-to-end encryption provides cross-device access without compromising privacy.

Content Automation Workflow Pro - AI-powered content generation system that transforms content creation with a single command. This Python-based workflow leverages OpenRouter and Replicate to generate SEO-optimized blog posts, custom thumbnail images, and platform-specific social media posts across 7+ platforms, reducing content creation time from hours to minutes.

Stripe/WooCommerce Integration Tools:

  • Stripe Validator Tool - Cross-references WooCommerce subscription data with the Stripe API to prevent payment failures (78% reduction in failures)
  • Invoice Notifier System - Automatically identifies overdue invoices and sends strategic payment reminders (64% reduction in payment delays)
  • WooCommerce Bulk Refunder - Python script for efficiently processing bulk refunds with direct payment gateway API integration

Open-Source AI Mini Projects

I'm actively developing open-source AI applications that solve real-world problems:

  • Image Captioning App - Generates descriptive captions for images using Hugging Face's BLIP model
  • AI Resume Analyzer - Extracts key details from resumes using BERT-based NER models
  • Document Summarizer - Creates concise summaries from lengthy documents using BART models
  • Multilingual Translator - Real-time translation tool supporting multiple language pairs
  • Toxic Comment Detector - Identifies harmful or offensive language in real-time
  • Recipe Finder - AI-powered tool that recommends recipes based on available ingredients
  • Personal AI Chatbot - Customizable chat application built with DialoGPT

All these projects are available on my GitHub with full source code.

Development Philosophy

I believe in creating technology that empowers users without compromising their privacy or control. My projects focus on:

  • Privacy-First Design: Keeping sensitive data under user control by default
  • Practical AI Applications: Leveraging AI capabilities to solve real-world problems
  • Modular Architecture: Building systems with clear separation of concerns for better maintainability
  • Accessibility: Making powerful tools available to everyone regardless of technical expertise
  • Open Source: Contributing to the community and ensuring transparency

Technical Articles & Tutorials

I regularly share detailed tutorials on AI development, automation, and integration solutions:

I specialize in developing practical solutions that leverage AI and automation to solve real business problems and deliver measurable results. Find my tutorials on DEV.to and premium tools in my Gumroad store.

If you have a project involving e-commerce, content automation, financial tools, or custom AI applications, feel free to reach out directly at [email protected].


  [1] Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of ICML 2022. arXiv:2201.12086
