Evaluating Google Gemini for Document OCR Using Hugging Face Invoice Dataset

Mayank Gupta

In the digital age, invoices are the lifeblood of businesses, but processing them manually is a monumental, error-prone, and inefficient task. This is where Optical Character Recognition (OCR) shines, transforming scanned documents into structured, usable data. And with the rise of advanced AI models like Google's Gemini, highly accurate, intelligent OCR is closer than ever.

But how well does Gemini actually perform on real-world documents like invoices? And how can we systematically evaluate its accuracy? This blog post dives into just that, demonstrating a practical approach to benchmark Gemini's OCR capabilities using the widely accessible Hugging Face invoices-donut-data-v1 dataset.

The Challenge of Invoice OCR: More Than Just Reading Text

Imagine an invoice. It's not just a block of text; it contains crucial, structured information: invoice numbers, dates, vendor details, line items with descriptions, quantities, and prices, and of course, the grand total. A truly effective OCR solution for invoices needs to do more than just extract raw text; it needs to understand the meaning of that text within the document's context, identify these specific fields, and present them in a structured format, typically JSON.

Traditional OCR might give you a jumbled string of all the words on the page. Advanced, intelligent OCR, like what Gemini aims to provide, should be able to tell you, "This is the invoice number," "This is the total amount," and so on.

Our Battlefield: The Hugging Face invoices-donut-data-v1 Dataset

For our evaluation, we turn to a fantastic resource: the katanaml-org/invoices-donut-data-v1 dataset available on Hugging Face. This dataset is specifically designed for document understanding tasks, offering a collection of invoice images paired with their "ground truth" – the perfect, manually extracted JSON representation of the invoice data. This "ground truth" is our gold standard against which we'll compare Gemini's output.

Each sample in this dataset provides:

  • An image: The invoice document itself.
  • ground_truth: A JSON string containing the accurately extracted fields, often with a nested gt_parse key holding the structured data we care about.
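
To get a feel for the data before running the full benchmark, it helps to load a single test sample and inspect it. Here is a minimal sketch using the datasets library; the gt_parse key follows the ground-truth format described above:

from datasets import load_dataset
import json

# Load only the test split and peek at the first sample.
sample = load_dataset("katanaml-org/invoices-donut-data-v1")["test"][0]

print(type(sample["image"]))             # a PIL image of the invoice page
gt = json.loads(sample["ground_truth"])  # the ground truth arrives as a JSON string
print(list(gt.get("gt_parse", {})))      # the structured fields we will score against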

The Gemini Advantage: Multimodal Power for Document Understanding

Gemini models, especially versions like Gemini 1.5 Pro and Flash, are inherently multimodal. This means they can process and understand information from various modalities simultaneously – text, images, and even audio or video. For OCR, this is a game-changer. Instead of just "seeing" pixels, Gemini can leverage its understanding of visual layout, textual patterns, and even common invoice structures to more accurately extract and interpret information.

While the exact API call for Gemini's specialized document parsing might vary, the core principle remains: you send an image, and you receive a structured response. For this demonstration, we'll assume an API endpoint (API_URL) that takes an image and returns a JSON object containing the OCR'd data. Your API_KEY will, of course, be required for authentication.

Setting Up the Evaluation Pipeline (Code Walkthrough)

Let's break down the Python code used for this evaluation.

First, we install necessary libraries:

!pip install --upgrade datasets fsspec huggingface_hub jiwer

# git-lfs is only needed if you plan to clone dataset repositories via git;
# the `datasets` library downloads invoices-donut-data-v1 for us.
!apt install git-lfs
!git lfs install

Next, we load the invoices-donut-data-v1 dataset:

from datasets import load_dataset
import requests
import io
from PIL import Image
import json

# Placeholder endpoint and key for the OCR service being benchmarked.
# Replace these with your own values (or swap in the google-generativeai
# client shown later).
API_URL = "https://your-ocr-endpoint.example.com/ocr"  # assumed placeholder
API_KEY = "YOUR_API_KEY"                               # assumed placeholder

# Load the test split of the invoice dataset
dataset = load_dataset("katanaml-org/invoices-donut-data-v1")["test"]

results = []

for i, sample in enumerate(dataset):
    image = sample["image"]
    ground_truth_json_str = sample["ground_truth"]  # JSON string with the gold-standard fields

    # Convert PIL image to byte stream
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    buffer.seek(0)

    # Prepare request for Gemini OCR API
    # The actual API call for Gemini might look different,
    # often involving `google.generativeai.GenerativeModel.generate_content`
    # and structuring your prompt to ask for structured data extraction.
    # For this example, we're simulating a generic OCR API call.
    files = {
        "files": ("image.png", buffer, "image/png"),
    }
    data = {
        "template": "benchmark" # This could be a prompt for Gemini to extract invoice data
    }
    headers = {
        "X-API-Key": API_KEY
    }

    # Send to your OCR API (simulating Gemini API call)
    # In a real Gemini integration, you'd use the `google.generativeai` client
    # and craft a prompt like:
    # response = model.generate_content([image, "Extract all invoice details as a JSON object, including invoice_number, total_amount, date, and line_items with description, quantity, and price."])
    # ocr_output = response.text or response.parts[0].text if it's text-based output
    response = requests.post(API_URL, headers=headers, files=files, data=data)

    try:
        response_json = response.json()
        # Adjust 'result' based on your actual Gemini API response structure
        ocr_output = response_json.get("result", "")
    except Exception as e:
        ocr_output = f"Error: {str(e)}"

    # Use the loop index as a simple, unique ID for each sample.
    image_id = f"sample_{i}"

    results.append({
        "id": image_id,
        "ground_truth": ground_truth_json_str, # Keep as string for initial storage
        "prediction": ocr_output,
    })

Key modification for Gemini: The requests.post call is a placeholder. In a real-world scenario, you would use the google-generativeai library. Your prompt to Gemini would be crucial, guiding it to extract the specific invoice fields in a structured (e.g., JSON) format.

For example, a conceptual Gemini integration might look like this:

import google.generativeai as genai
from PIL import Image

# Configure your Gemini API key
genai.configure(api_key=API_KEY)

# Initialize the model
model = genai.GenerativeModel('gemini-1.5-flash')  # or 'gemini-1.5-pro'

# Inside your loop:
# image is a PIL Image object
# Craft a detailed prompt for invoice extraction
prompt = (
    "Extract the following details from this invoice and provide them in a JSON format:\n"
    "{\n"
    "  \"gt_parse\": {\n"
    "    \"invoice_number\": \"\",\n"
    "    \"date\": \"\",\n"
    "    \"total_amount\": \"\",\n"
    "    \"vendor_name\": \"\",\n"
    "    \"line_items\": [\n"
    "      {\n"
    "        \"description\": \"\",\n"
    "        \"quantity\": \"\",\n"
    "        \"unit_price\": \"\",\n"
    "        \"amount\": \"\"\n"
    "      }\n"
    "    ]\n"
    "  }\n"
    "}"
    "Ensure all values are extracted as strings. If a field is not present, leave its value as an empty string."
)

try:
    response = model.generate_content([prompt, image])
    # Gemini's response.text contains the extracted JSON string
    ocr_output = response.text
except Exception as e:
    ocr_output = f"Error during Gemini processing: {str(e)}"

This conceptual integration highlights how Gemini's multi-modal capabilities allow you to provide both the image and a specific instruction (the prompt) to guide its OCR and information extraction process.
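
One practical wrinkle: model responses often wrap the returned JSON in Markdown code fences, which would break the json.loads call used in the evaluation step below. A small helper like this sketch (a hypothetical parse_model_json, not part of the script above) can clean the raw text before parsing:

import json

def parse_model_json(raw_text: str) -> dict:
    """Strip optional Markdown code fences from model output, then parse it as JSON."""
    text = raw_text.strip()
    if text.startswith("```"):
        # Drop the opening fence line (e.g. "```json") ...
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # ... and the closing fence, if present.
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    return json.loads(text)

raw = '```json\n{"gt_parse": {"invoice_number": "INV-2025-001"}}\n```'
print(parse_model_json(raw))  # {'gt_parse': {'invoice_number': 'INV-2025-001'}}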

Measuring Success: Beyond Simple Text Comparison

Evaluating OCR for structured documents requires more than just a simple string match. We need to assess how accurately individual fields are extracted. For this, we'll use the Character Error Rate (CER) and field-level accuracy.

The jiwer library is excellent for calculating CER, which measures the minimum number of edits (insertions, deletions, substitutions) needed to change one string into another, divided by the length of the ground truth string. A lower CER indicates higher accuracy.
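
To make that concrete, here is a quick sanity check with jiwer on invoice-style strings:

from jiwer import cer

# One missing character out of a 12-character ground truth: CER of about 0.08
print(cer("INV-2025-001", "INV-2025-01"))

# An exact match scores 0.0
print(cer("150.75", "150.75"))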

We'll also calculate "accuracy" as the proportion of fields that are exactly matched between the ground truth and the prediction.

import json
from jiwer import cer
from collections.abc import Mapping

# Utility: flatten nested dicts with compound keys
def flatten_dict(d, parent_key='', sep='.'):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, Mapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        elif isinstance(v, list):
            for i, item in enumerate(v):
                items.extend(flatten_dict(item, f"{new_key}[{i}]", sep=sep).items())
        else:
            items.append((new_key, str(v)))
    return dict(items)

# Compare
for r in results:
    try:
        # Parse the ground-truth JSON string
        gt_json = json.loads(r["ground_truth"])
    except Exception as e:
        print(f"Invalid GT JSON in ID {r['id']}: {e}")
        continue

    pred_json = r["prediction"]
    if isinstance(pred_json, str):
        try:
            pred_json = json.loads(pred_json)
        except json.JSONDecodeError:
            print(f"Invalid Prediction JSON in ID {r['id']}. Prediction: {pred_json}")
            continue
    elif not isinstance(pred_json, Mapping): # Ensure it's a dictionary for flattening
        print(f"Prediction for ID {r['id']} is not a valid JSON object or dict. Prediction: {pred_json}")
        continue

    # Extract nested gt_parse only for both ground truth and prediction
    gt_flat = flatten_dict(gt_json.get("gt_parse", {}))
    pred_flat = flatten_dict(pred_json.get("gt_parse", {}))

    print(f"\n--- ID: {r['id']} ---")
    total_fields = len(gt_flat)
    correct_matches = 0
    total_cer = 0.0

    for key in gt_flat:
        gt_val = gt_flat[key]
        pred_val = pred_flat.get(key, "") # Get predicted value, default to empty string if not found

        field_cer = cer(gt_val, pred_val)
        total_cer += field_cer

        if gt_val.strip() == pred_val.strip():
            correct_matches += 1

        print(f"{key}: CER={field_cer:.2f} | GT='{gt_val}' | Pred='{pred_val}'")

    avg_cer = total_cer / total_fields if total_fields else 1.0
    acc = correct_matches / total_fields if total_fields else 0.0

    print(f"\nAccuracy (Exact Match): {acc:.2%} | Avg CER: {avg_cer:.2f}")


Explanation of Evaluation Metrics:

  • flatten_dict: This helper function is crucial for comparing nested JSON structures. It converts dictionaries like {"gt_parse": {"invoice_number": "123", "line_items": [{"description": "Item A"}]}} into a flat dictionary with compound keys: {"gt_parse.invoice_number": "123", "gt_parse.line_items[0].description": "Item A"}. This allows for straightforward field-by-field comparison.
  • Character Error Rate (CER): Calculated for each field, it tells us how "close" the predicted text is to the ground truth at a character level. A CER of 0.00 means a perfect match.
  • Accuracy (Exact Match): This metric specifically counts how many fields were extracted perfectly, meaning the predicted value exactly matches the ground truth value after stripping whitespace. This is particularly important for critical fields like invoice numbers or total amounts where even a single character error can invalidate the data.
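
To make the compound keys concrete, here is flatten_dict (defined in the evaluation code above) applied to the nested example from the first bullet:

nested = {
    "gt_parse": {
        "invoice_number": "123",
        "line_items": [{"description": "Item A"}],
    }
}

# flatten_dict is the helper defined in the evaluation code above.
print(flatten_dict(nested))
# {'gt_parse.invoice_number': '123', 'gt_parse.line_items[0].description': 'Item A'}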

Expected Outcomes and Why This Matters

When running this evaluation with a robust OCR model like Gemini, you would ideally observe:

  • Low Average CER: Indicating that Gemini is highly accurate at recognizing individual characters and words across the invoice.
  • High Accuracy (Exact Match): Especially for key fields like invoice_number, date, and total_amount, which are critical for automated processing and downstream systems. For example, if the ground truth invoice number is "12345" and Gemini returns exactly "12345", that field counts as a perfect exact match with a CER of 0.
  • Intelligent Extraction: Beyond just character accuracy, Gemini's multimodal understanding should enable it to correctly map extracted text to the right fields, even if the layout varies across invoices. For instance, correctly identifying the total amount even if it's styled differently on different invoices.

Let's consider an example for a single invoice:

Ground Truth (gt_parse):

{
  "invoice_number": "INV-2025-001",
  "date": "2025-06-15",
  "total_amount": "150.75",
  "line_items": [
    {
      "description": "Consulting Services",
      "quantity": "1",
      "unit_price": "100.00",
      "amount": "100.00"
    },
    {
      "description": "Travel Expenses",
      "quantity": "1",
      "unit_price": "50.75",
      "amount": "50.75"
    }
  ]
}

Gemini Prediction (gt_parse):

{
  "invoice_number": "INV-2025-001",
  "date": "2025-06-15",
  "total_amount": "150.75",
  "line_items": [
    {
      "description": "Consulting Services",
      "quantity": "1",
      "unit_price": "100.00",
      "amount": "100.00"
    },
    {
      "description": "Travel Expenses",
      "quantity": "1",
      "unit_price": "50.75",
      "amount": "50.75"
    }
  ]
}

In this ideal scenario, all fields would have a CER of 0.00 and contribute to 100% exact match accuracy.

Now consider a less ideal scenario:

Gemini Prediction (gt_parse):

{
  "invoice_number": "INV-2025-01", // Missing a '0'
  "date": "2025-06-15",
  "total_amount": "150.75",
  "line_items": [
    {
      "description": "Consulting Sercices", // Typo
      "quantity": "1",
      "unit_price": "100.00",
      "amount": "100.00"
    }
  ]
}

Here, "invoice_number" and "line_items[0].description" would show a non-zero CER, and would not count towards exact match accuracy. The "total_amount" and "date" fields, if correctly extracted, would still contribute to exact match accuracy and have a CER of 0.00. This granular evaluation helps pinpoint areas where the OCR model might need further refinement or where certain document layouts pose greater challenges.

Conclusion: Unlocking Automation with Intelligent OCR

Evaluating OCR models like Gemini against structured datasets such as invoices-donut-data-v1 is not just an academic exercise. It's a critical step in building robust, automated document processing workflows. By systematically measuring performance using metrics like CER and exact match accuracy, we can:

  • Validate Model Performance: Objectively determine how well Gemini handles invoice OCR.
  • Identify Strengths and Weaknesses: Pinpoint specific fields or document variations where Gemini excels or struggles.
  • Drive Improvement: Use the insights to refine prompts, fine-tune models, or implement post-processing steps to achieve even higher accuracy.

The ability of multimodal AI models like Gemini to not just "read" text but to "understand" documents is transformative for business automation. By rigorously testing and evaluating these capabilities, we move closer to a future where manual data entry from invoices becomes a relic of the past, freeing up human potential for more strategic and creative endeavors.
