In this tutorial, we'll walk through building a full-stack application that generates descriptive captions for uploaded images using AI. The application combines a React frontend with a Flask backend and leverages Salesforce's BLIP (Bootstrapped Language-Image Pretraining) model via Hugging Face's transformers library.
What We'll Build
We'll create an application that allows users to:
- Upload an image from their device
- Send the image to a Flask backend
- Process the image with the BLIP AI model
- Display the generated caption
System Architecture
At a high level, the React frontend captures an image, encodes it as a base64 string, and sends it to the Flask backend; the backend decodes the image, runs it through the BLIP model, and returns the generated caption as JSON.
Tech Stack Overview
Frontend
- React: For building the user interface
- Axios: For making HTTP requests to the backend
- Vite: For fast development and bundling
Backend
- Flask: For creating the REST API
- Flask-CORS: For handling cross-origin requests
- Transformers: Hugging Face's library for using pre-trained models
- Pillow: For image processing
AI Model
- BLIP (Bootstrapped Language-Image Pretraining): Salesforce's model for generating image captions
Step 1: Setting Up the Backend
Let's start by creating our Flask backend, which will handle image decoding and caption generation.
First, install the necessary dependencies:
```bash
pip install flask flask-cors transformers torch torchvision pillow
```
Next, create a file called `app.py`:
```python
import logging
import base64
import io

from flask import Flask, request, jsonify
from flask_cors import CORS
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Initialize the Flask application
app = Flask(__name__)

# Enable Cross-Origin Resource Sharing (CORS) for the app
CORS(app)

# Configure logging to display information-level logs
logging.basicConfig(level=logging.INFO)

# Configuration for the model name
MODEL_NAME = "Salesforce/blip-image-captioning-base"

# Load the BLIP model and processor using the specified model name
captioning_model = BlipForConditionalGeneration.from_pretrained(MODEL_NAME)
image_processor = BlipProcessor.from_pretrained(MODEL_NAME)


def decode_image(base64_image):
    """Decode a base64-encoded string to a PIL image."""
    try:
        # Decode the base64 string to bytes
        image_bytes = base64.b64decode(base64_image)
        # Convert the bytes to a PIL Image
        return Image.open(io.BytesIO(image_bytes))
    except Exception as e:
        logging.error("Failed to decode image: %s", e)
        raise ValueError("Invalid image data")


def generate_caption(image):
    """Generate a caption for the given image using the BLIP model."""
    try:
        # Process the image and prepare it for the model
        model_inputs = image_processor(image, return_tensors="pt")
        # Generate a caption using the model
        model_output = captioning_model.generate(**model_inputs)
        # Decode the model output to a human-readable string
        return image_processor.decode(model_output[0], skip_special_tokens=True)
    except Exception as e:
        logging.error("Failed to generate caption: %s", e)
        raise RuntimeError("Caption generation failed")


@app.route('/caption', methods=['POST'])
def caption_image():
    """Endpoint to generate a caption for a given image."""
    try:
        # Retrieve JSON data from the request
        request_data = request.json
        # Extract the base64-encoded image data
        base64_image = request_data.get("image", "")
        if not base64_image:
            return jsonify({"error": "No image data provided"}), 400
        # Decode the image and generate a caption
        image = decode_image(base64_image)
        generated_caption = generate_caption(image)
        # Return the generated caption as a JSON response
        return jsonify({"caption": generated_caption})
    except ValueError as ve:
        # Handle invalid image data
        return jsonify({"error": str(ve)}), 400
    except RuntimeError as re:
        # Handle caption generation failure
        return jsonify({"error": str(re)}), 500
    except Exception as error:
        # Handle unexpected errors
        logging.error("Unexpected error: %s", error)
        return jsonify({"error": "An unexpected error occurred"}), 500


if __name__ == '__main__':
    # Run the Flask application in debug mode
    app.run(debug=True)
```
This backend performs three main functions:
- Decodes base64-encoded image data received from the frontend
- Processes the image with the BLIP model
- Returns the generated caption as a JSON response
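Before wiring up the React frontend, you can exercise the endpoint from a short Python script. This is a quick sanity check, not part of the tutorial's code: it assumes the Flask server above is running locally on port 5000, and `test.jpg` is a placeholder path for any image on disk.

```python
# Minimal test client for the /caption endpoint, using only the standard library.
import base64
import json
import urllib.request


def build_payload(image_path):
    """Read an image file and wrap it in the JSON shape /caption expects."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {"image": encoded}


def request_caption(image_path, url="http://127.0.0.1:5000/caption"):
    """POST the encoded image and return the caption from the JSON response."""
    data = json.dumps(build_payload(image_path)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["caption"]


if __name__ == "__main__":
    print(request_caption("test.jpg"))
```

This mirrors exactly what the frontend will do: base64-encode the bytes, wrap them under an `"image"` key, and read `"caption"` out of the response.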
Step 2: Creating the React Frontend
Now, let's build our React frontend with Vite. First, set up a new React project:
```bash
npm create vite@latest frontend -- --template react
cd frontend
npm install
npm install axios
```
Now, let's create our main App component in `src/App.jsx`:
```jsx
import React, { useState } from "react";
import axios from "axios";

/**
 * App component for the Image Captioning application.
 * Allows users to upload an image and generate a caption using a backend service.
 */
function App() {
  // State to store the selected image as a base64 string
  const [selectedImage, setSelectedImage] = useState(null);
  // State to store the generated caption for the image
  const [generatedCaption, setGeneratedCaption] = useState("");
  // State to store any error messages
  const [errorMessage, setErrorMessage] = useState("");

  // Styles object to manage inline styles for the component
  const styles = {
    container: { padding: "20px", maxWidth: "600px", margin: "0 auto" },
    imagePreview: { width: "100%", maxHeight: "300px" },
    button: { padding: "10px", marginTop: "20px", cursor: "pointer" },
    errorText: { marginTop: "20px", color: "red" },
    captionText: { marginTop: "20px" }
  };

  /**
   * Handles the image upload event.
   * Reads the uploaded file and converts it to a base64 string.
   * @param {Object} event - The file input change event.
   */
  const handleImageUpload = (event) => {
    const [uploadedFile] = event.target.files; // Destructure to get the first file
    if (uploadedFile) {
      const fileReader = new FileReader();
      // Set the selected image state when file reading is complete
      fileReader.onloadend = () => setSelectedImage(fileReader.result);
      // Set an error message if file reading fails
      fileReader.onerror = () => setErrorMessage("Failed to read file.");
      // Read the file as a data URL (base64 string)
      fileReader.readAsDataURL(uploadedFile);
    }
  };

  /**
   * Sends the selected image to the backend to generate a caption.
   * Updates the generated caption or error message based on the response.
   */
  const handleGenerateCaption = async () => {
    try {
      setErrorMessage(""); // Clear any previous error messages
      setGeneratedCaption("Generating caption..."); // Indicate generation in progress
      // Extract the base64 part of the image data URL
      const base64ImageData = selectedImage?.split(",")[1];
      // Send a POST request to the backend with the image data
      const response = await axios.post("http://127.0.0.1:5000/caption", { image: base64ImageData });
      // Update the generated caption with the response or a default message
      setGeneratedCaption(response.data?.caption || "No caption generated.");
    } catch (err) {
      // Set an error message if the request fails
      setErrorMessage("Failed to generate caption. Please try again.");
    }
  };

  return (
    <div style={styles.container}>
      <h1>Image Captioning App</h1>
      {/* File input for uploading images */}
      <input type="file" accept="image/*" onChange={handleImageUpload} />
      {/* Display the selected image if available */}
      {selectedImage && (
        <div style={{ marginTop: "20px" }}>
          <img src={selectedImage} alt="Preview" style={styles.imagePreview} />
        </div>
      )}
      {/* Button to trigger caption generation */}
      <button onClick={handleGenerateCaption} style={styles.button}>
        Generate Caption
      </button>
      {/* Display the generated caption if available */}
      {generatedCaption && <p style={styles.captionText}>Caption: {generatedCaption}</p>}
      {/* Display an error message if available */}
      {errorMessage && <p style={styles.errorText}>{errorMessage}</p>}
    </div>
  );
}

export default App;
```
This frontend provides:
- An input for uploading images
- A preview of the selected image
- A button to trigger caption generation
- Display areas for the generated caption and any error messages
How It Works: The Data Flow
The flow works like this: the user selects an image, the frontend reads it with `FileReader` and converts it to a base64 data URL, axios POSTs the base64 payload to the Flask `/caption` endpoint, the backend decodes the bytes into a PIL image, the BLIP model generates a caption, and the caption travels back as JSON for display in the UI.
Understanding the BLIP Model
The BLIP (Bootstrapped Language-Image Pretraining) model from Salesforce is a powerful vision-language model that can perform a variety of tasks, including image captioning.
Key Features of BLIP
Multimodal Learning: BLIP understands both images and text, allowing it to generate coherent captions that describe the content of images.
Bootstrapped Learning: It uses a bootstrapped approach that helps clean noisy image-text pairs from the web, resulting in better performance.
Versatility: Beyond image captioning, BLIP can also perform visual question answering, image-text retrieval, and more.
BLIP was introduced in the paper "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" by Li et al. (2022) [1].
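If you want to experiment with BLIP outside the Flask app, the transformers library also exposes a high-level `pipeline` API that wraps preprocessing, generation, and decoding in a single call. This is a minimal sketch, not part of the tutorial's backend; the import is kept inside the function so the (large) model only downloads when you actually call it.

```python
# Standalone BLIP captioning via the high-level "image-to-text" pipeline.
def caption_file(path, model_name="Salesforce/blip-image-captioning-base"):
    """Caption a single image file with the same checkpoint the backend uses."""
    from transformers import pipeline  # lazy import; model downloads on first use

    captioner = pipeline("image-to-text", model=model_name)
    # The pipeline returns a list of dicts like {"generated_text": "..."}
    return captioner(path)[0]["generated_text"]
```

This is handy for quick experiments in a notebook before committing to the explicit `BlipProcessor`/`BlipForConditionalGeneration` setup used in `app.py`.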
Error Handling and Optimization
Our application includes several error-handling measures:
Frontend error handling:

- Checks for valid image uploads
- Displays user-friendly error messages
- Shows a loading state during caption generation

Backend error handling:

- Validates input data
- Catches and logs exceptions
- Returns appropriate HTTP status codes
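One extra guard worth considering on the backend (this is an assumption, not part of the `app.py` above) is validating the payload before it ever reaches the model: reject malformed base64 and oversized uploads early with a clear 400-style message. The 5 MB cap here is an arbitrary example value.

```python
# Hypothetical pre-model validation for the /caption payload.
import base64
import binascii

MAX_IMAGE_BYTES = 5 * 1024 * 1024  # illustrative 5 MB cap


def validate_base64_image(b64_string):
    """Return the decoded image bytes, or raise ValueError with a client-friendly message."""
    if not b64_string:
        raise ValueError("No image data provided")
    try:
        # validate=True makes non-alphabet characters an error instead of being ignored
        image_bytes = base64.b64decode(b64_string, validate=True)
    except binascii.Error:
        raise ValueError("Invalid base64 image data")
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise ValueError("Image too large")
    return image_bytes
```

Because the endpoint already maps `ValueError` to a 400 response, a helper like this slots in ahead of `decode_image` without changing the error-handling shape.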
Potential Enhancements
Here are some ways to extend this application:
Multiple Caption Generation: Generate multiple captions with different parameters.
User Feedback Loop: Allow users to rate captions and use this feedback to fine-tune the model.
Style Transfer: Add image filters or style transfer options before captioning.
Progressive Web App (PWA): Convert to a PWA for offline capabilities.
Advanced UI: Implement drag-and-drop functionality and animations.
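The first enhancement can be sketched with beam search: `model.generate` accepts `num_beams` and `num_return_sequences`, so a single call can return several candidate captions. Here `model` and `processor` stand in for the objects loaded in `app.py`, and the exact generation settings are illustrative, not tuned values.

```python
# Hypothetical sketch of the "multiple caption generation" enhancement.
def generate_captions(model, processor, image, n=3):
    """Return up to n candidate captions for one image using beam search."""
    inputs = processor(image, return_tensors="pt")
    # num_return_sequences must not exceed num_beams, hence the max()
    outputs = model.generate(
        **inputs,
        num_beams=max(n, 3),
        num_return_sequences=n,
        max_new_tokens=30,
    )
    return [processor.decode(seq, skip_special_tokens=True) for seq in outputs]
```

The UI could then show all candidates and let the user pick the best one, which also feeds naturally into the feedback-loop idea above.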
Performance Considerations
When working with ML models like BLIP, consider the following:
Model Size: The BLIP model is large (several hundred MB). Consider loading strategies or serving options to optimize initial load time.
Caching: Implement caching for repeated requests with the same images.
Batching: If supporting multiple users, implement request batching to increase throughput.
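The caching idea can be sketched in a few lines: key a dictionary on a hash of the raw image bytes, so identical uploads skip the model entirely. `caption_fn` stands in for the `generate_caption` path in `app.py`; in production you would likely swap the dict for an LRU or Redis-backed cache.

```python
# Sketch of request-level caption caching keyed on an image-content hash.
import hashlib

_caption_cache = {}


def cached_caption(image_bytes, caption_fn):
    """Look up a caption by SHA-256 of the image bytes; call caption_fn on a miss."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _caption_cache:
        _caption_cache[key] = caption_fn(image_bytes)
    return _caption_cache[key]
```

Since BLIP inference is by far the most expensive step in the request, even this naive cache removes the model call for any repeated upload.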
Conclusion
In this tutorial, we've built a complete image captioning application using React, Flask, and the BLIP model. This project demonstrates how to:
- Set up a Flask backend with a machine learning model
- Create a React frontend for image upload and display
- Implement communication between frontend and backend
- Process and transform data for AI model consumption
The combination of modern web technologies with powerful AI models opens up endless possibilities for creative applications. The techniques shown here can be extended to other vision-language tasks like visual question answering, image generation, and more.
Resources and References
- Hugging Face Transformers Documentation
- Flask Documentation
- React Documentation
- Salesforce BLIP Model Card
GitHub Repository: Image Captioning App
Open for Projects
I'm currently available to take on new projects in the following areas:
- Artificial Intelligence solutions (both no-code and custom development)
- No-code automation with n8n (and open to other automation platforms)
- React.js frontend development
- Node.js backend/API development
- WooCommerce development and customization
- Stripe payment integration and automation
- PHP applications and frameworks
- Python development
- Supabase, Vercel & GitHub integration
My Expertise
I'm a Senior Web Developer with growing expertise in AI/ML solutions, passionate about creating practical applications that leverage artificial intelligence to solve real-world problems. While relatively new to AI/ML development (less than a year of focused experience), I've quickly built a portfolio of functional projects that demonstrate my ability to integrate AI capabilities into useful applications. My specialized skills include:
- AI Integration: Connecting pre-trained AI models with web applications through APIs and direct implementation
- Computer Vision & NLP: Implementing image captioning, sentiment analysis, text summarization, chatbots, and language translation applications
- Agentic AI Workflows: Creating intelligent autonomous agents that can execute complex tasks through multi-step reasoning
- Full-Stack Development: Crafting seamless experiences with React.js frontends and Python/Flask or Node.js backends
- E-commerce Solutions: Expert in WooCommerce/Stripe integrations with subscription management and payment processing
- Automation Tools: Python scripts and n8n workflows for business-critical processes and data pipelines
- Content Automation: Creating AI-powered systems that generate complete content packages from blog posts to social media updates
Featured Projects
Personal AI Chatbot - A complete conversational AI application built with React and Flask, powered by Microsoft's DialoGPT-medium model from Hugging Face. This project demonstrates how to create an interactive chatbot with a clean, responsive interface that understands and generates human-like text responses.
Image Captioning App - A full-stack application that generates descriptive captions for uploaded images using AI. Built with React for the frontend and Flask for the backend, this app leverages Salesforce's BLIP model via Hugging Face's transformers library to analyze images and create natural language descriptions of their content.
Sentiment Analysis App - A lightweight full-stack application that performs sentiment analysis on user-provided text using React.js for the frontend and Flask with Hugging Face Transformers for the backend. This project demonstrates how easily powerful pre-trained NLP models can be integrated into modern web applications.
Agentic AI Workflow - A Python-based framework for building intelligent AI agents that can break down complex tasks into manageable steps and execute them sequentially. This project demonstrates how to leverage OpenRouter API to access multiple AI models (OpenAI, Anthropic, Google, etc.) through a unified interface, enabling more sophisticated problem-solving capabilities and better reasoning in AI applications.
WiseCashAI - A revolutionary privacy-first financial management platform that operates primarily in your browser, ensuring your sensitive financial data never leaves your control. Unlike cloud-based alternatives that collect and monetize your information, WiseCashAI offers AI-powered features like intelligent transaction categorization, envelope-based budgeting, and goal tracking while keeping your data local. Optional Google Drive integration with end-to-end encryption provides cross-device access without compromising privacy.
Content Automation Workflow Pro - AI-powered content generation system that transforms content creation with a single command. This Python-based workflow leverages OpenRouter and Replicate to generate SEO-optimized blog posts, custom thumbnail images, and platform-specific social media posts across 7+ platforms, reducing content creation time from hours to minutes.
Stripe/WooCommerce Integration Tools:
- Stripe Validator Tool - Cross-references WooCommerce subscription data with the Stripe API to prevent payment failures (78% reduction in failures)
- Invoice Notifier System - Automatically identifies overdue invoices and sends strategic payment reminders (64% reduction in payment delays)
- WooCommerce Bulk Refunder - Python script for efficiently processing bulk refunds with direct payment gateway API integration
Open-Source AI Mini Projects
I'm actively developing open-source AI applications that solve real-world problems:
- Image Captioning App - Generates descriptive captions for images using Hugging Face's BLIP model
- AI Resume Analyzer - Extracts key details from resumes using BERT-based NER models
- Document Summarizer - Creates concise summaries from lengthy documents using BART models
- Multilingual Translator - Real-time translation tool supporting multiple language pairs
- Toxic Comment Detector - Identifies harmful or offensive language in real-time
- Recipe Finder - AI-powered tool that recommends recipes based on available ingredients
- Personal AI Chatbot - Customizable chat application built with DialoGPT
All these projects are available on my GitHub with full source code.
Development Philosophy
I believe in creating technology that empowers users without compromising their privacy or control. My projects focus on:
- Privacy-First Design: Keeping sensitive data under user control by default
- Practical AI Applications: Leveraging AI capabilities to solve real-world problems
- Modular Architecture: Building systems with clear separation of concerns for better maintainability
- Accessibility: Making powerful tools available to everyone regardless of technical expertise
- Open Source: Contributing to the community and ensuring transparency
Technical Articles & Tutorials
I regularly share detailed tutorials on AI development, automation, and integration solutions:
- Building a Personal AI Chatbot with React and Flask - Complete guide to creating a conversational AI application
- Building an Image Captioning App with React, Flask and BLIP - Learn how to create a computer vision application that generates natural language descriptions of images
- Building a Sentiment Analysis App with React and Flask - Step-by-step guide to creating a full-stack NLP application
- Creating an Agentic AI Workflow with OpenRouter - Tutorial on building intelligent AI agents
- Getting Started with Content Automation Workflow Pro - Comprehensive guide to automated content creation
- Building Privacy-First AI Applications - Techniques for implementing AI features while respecting user privacy
I specialize in developing practical solutions that leverage AI and automation to solve real business problems and deliver measurable results. Find my tutorials on DEV.to and premium tools in my Gumroad store.
If you have a project involving e-commerce, content automation, financial tools, or custom AI applications, feel free to reach out directly at [email protected].
[1] Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. ICML 2022. arXiv:2201.12086