Integrating large language models (LLMs) into production systems often reveals a fundamental challenge: their outputs are inherently unstructured and unpredictable.
Whether it's missing fields, malformed formats, or incorrect data types, these inconsistencies hinder reliability and scalability.
The solution? Leverage Pydantic, a Python library that enables runtime data validation using type annotations.
With LLMs like MistralAI (and many others) supporting structured outputs via JSON schemas, combining these tools ensures AI-generated data adheres to strict schemas.
In this guide, we'll walk through a simple but practical real-world example that:
- Uses the Mistral API to generate structured JSON from a CSV input.
- Validates that output using a Pydantic model.
- Implements a retry mechanism for failed validation attempts with an improved prompt.
Full source code available at: https://github.com/nunombispo/PydanticLLMs-Article
Understanding Pydantic
Pydantic is a powerful data validation and parsing library in Python, built around the concept of using standard Python type hints to define data models.
At its core, Pydantic enforces that incoming data matches the specified schema, automatically converting types and raising errors when expectations aren't met.
Originally developed for use with web frameworks like FastAPI, Pydantic has found widespread adoption in domains where data integrity and clarity are critical, including AI and machine learning workflows.
By turning Python classes into data contracts, Pydantic helps eliminate the guesswork often associated with dynamic or external inputs.
Key Features
- Runtime Type Checking: Pydantic enforces type annotations at runtime, ensuring that all incoming data adheres to the expected types. If a mismatch occurs, it raises detailed validation errors that are easy to debug.
- Automatic Data Parsing and Serialization: Whether you receive input as strings, dictionaries, or nested structures, Pydantic will automatically parse and coerce data into the appropriate Python objects. It can also serialize models back to JSON or dictionaries for API responses or storage.
- Integration with Python Type Hints: Models are defined using familiar Python syntax with type annotations, making it intuitive for developers to describe complex data shapes. This also enables static analysis tools and IDEs to provide better support and autocomplete suggestions.
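To make these features concrete, here is a minimal sketch (not part of the article's main example) showing coercion, serialization, and a runtime validation error:

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int

# Automatic parsing: the string "30" is coerced into an int.
user = User(name="Alice", age="30")
print(user.age)           # 30

# Serialization back to a plain dict (or JSON via model_dump_json()).
print(user.model_dump())  # {'name': 'Alice', 'age': 30}

# Runtime type checking: a mismatch raises a detailed, debuggable error.
try:
    User(name="Bob", age="not a number")
except ValidationError as e:
    print(e.error_count(), "validation error")  # 1 validation error
```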
The Importance of Structured Outputs in AI
There are three main points when defining the importance of structured outputs for AI responses.
Consistency
Structured outputs provide a consistent format for AI-generated data, which is essential for seamless downstream processing.
When data adheres to a predefined schema, it becomes straightforward to parse, transform, and integrate into various systems such as databases, APIs, or analytic pipelines.
Consistency eliminates guesswork, reduces the need for custom error-prone parsing logic, and enables automation at scale.
Reliability
AI models, especially LLMs, can generate diverse and unpredictable outputs.
This variability can lead to failures if systems expect data in a specific format but receive something unexpected instead.
By enforcing structure through validation, the risk of runtime errors, crashes, or corrupted data is significantly reduced.
Reliable data outputs increase confidence in the AI system's behavior, making it safer to deploy in production environments.
Security
Unvalidated or poorly structured inputs and outputs can expose applications to security vulnerabilities such as injection attacks, malformed data exploitation, or denial-of-service scenarios.
Structured data validation acts as a safeguard, ensuring that only well-formed, type-safe data is accepted and processed.
This reduces the attack surface and helps maintain the integrity and confidentiality of AI-driven systems.
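As a small illustration of validation acting as a safeguard (the model and field names here are invented for the example, not part of the article's code), a constrained Pydantic model can reject oversized or obviously malicious input before it reaches storage or rendering:

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class Comment(BaseModel):
    author: str = Field(max_length=50)
    body: str = Field(max_length=2000)

    @field_validator("body")
    @classmethod
    def no_script_tags(cls, v: str) -> str:
        # Reject embedded script tags before they reach downstream systems.
        if "<script" in v.lower():
            raise ValueError("script tags are not allowed")
        return v

# Well-formed data passes through unchanged.
ok = Comment(author="bob", body="hello world")

# Malformed or malicious data is rejected at the boundary.
rejected = False
try:
    Comment(author="eve", body="<script>alert('x')</script>")
except ValidationError:
    rejected = True
```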
Practical Example: From CSV to Validated JSON
Let's consider the example of processing a CSV file of user data, which may contain incomplete rows, into structured JSON representing user profiles.
In terms of flow, we will implement this logic:
Define Pydantic Model
import os
import json
from pydantic import BaseModel, ValidationError
from mistralai import Mistral
# -----------------------------
# Pydantic Model for Validation
# -----------------------------
class Person(BaseModel):
    name: str
    age: int
    email: str
Here we define a simple Pydantic model to ensure the schema of our intended JSON.
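The model also doubles as a machine-readable contract: its model_json_schema() method (which the prompt construction below relies on) produces a JSON Schema describing the expected shape. For illustration:

```python
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int
    email: str

# The generated JSON Schema can be embedded directly into an LLM prompt.
schema = Person.model_json_schema()
print(schema["required"])             # ['name', 'age', 'email']
print(schema["properties"]["age"])    # {'title': 'Age', 'type': 'integer'}
```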
Function to Call MistralAI API with JSON Mode
# --------------------------------------
# Function to Call Mistral in JSON Mode
# --------------------------------------
def call_mistral_json_mode(prompt: str, system_message: str = "") -> str:
    """Call the Mistral API with a prompt and optional system message, expecting a JSON object response."""
    api_key = os.environ.get("MISTRAL_API_KEY")
    if not api_key:
        raise RuntimeError("Please set the MISTRAL_API_KEY environment variable.")
    model = "mistral-large-latest"
    client = Mistral(api_key=api_key)
    # Place the system message (when provided) first, so it frames the user prompt.
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": prompt})
    chat_response = client.chat.complete(
        model=model,
        messages=messages,
        response_format={"type": "json_object"},
    )
    return chat_response.choices[0].message.content
Here, we are calling the Mistral API with the model mistral-large-latest and enforcing the response to be a json_object.
Note: Mistral AI provides a chat.parse method that receives a Pydantic model directly as the response_format. For this example, I kept the logic generic so it can be used with other LLMs.
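For reference, a sketch of that model-specific alternative might look like the following. This is an assumption-laden illustration: check your mistralai SDK version's documentation for the exact chat.parse signature, and note that the wrapper model is needed because structured-output endpoints generally expect a single top-level object rather than a bare JSON array.

```python
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int
    email: str

class People(BaseModel):
    # Wrapper model: holds the list of rows as one top-level object.
    people: list[Person]

def call_mistral_parse(client, prompt: str) -> People:
    """Hypothetical helper using Mistral's native structured-output support.

    `client` is assumed to be a mistralai.Mistral instance; chat.parse
    validates the response against the Pydantic model for us.
    """
    response = client.chat.parse(
        model="mistral-large-latest",
        messages=[{"role": "user", "content": prompt}],
        response_format=People,
    )
    return response.choices[0].message.parsed
```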
Read CSV file
# -----------------------------
# Read CSV Input from File
# -----------------------------
with open("example_incomplete.csv", "r", encoding="utf-8") as f:
    csv_input = f.read().strip()
Here, we simply read the CSV into a variable.
The CSV used in this example is:
name,age,email
Alice,30,[email protected]
Bob,,[email protected]
Charlie,40,
Diana,25,[email protected]
Initial Prompt and AI Response
# -----------------------------
# Initial Prompt Construction
# -----------------------------
model_json_schema = Person.model_json_schema()
prompt = f"""
Given the following CSV data, return a JSON array of objects with fields: {model_json_schema}
CSV:
{csv_input}
Example output:
[
{{"name": "Alice", "age": 30, "email": "[email protected]"}},
{{"name": "Bob", "age": 25, "email": "[email protected]"}}
]
"""
print("\n" + "="*50)
print("Mistral CSV to Structured Example: Attempt 1")
print("="*50 + "\n")
response = call_mistral_json_mode(prompt)
print("Mistral response:\n", response)
Here, we are defining the initial prompt and calling the Mistral API with the helper function call_mistral_json_mode.
Validation and Retry Loop
# -----------------------------
# Validation and Retry Loop
# -----------------------------
try:
    data = json.loads(response)
    people = [Person(**item) for item in data if isinstance(item, dict)]
    print("\nValidated people:")
    for person in people:
        print(" ", person)
    skipped = [item for item in data if not isinstance(item, dict)]
    if skipped:
        print("\nWarning: Skipped non-dict items:", skipped)
except (json.JSONDecodeError, ValidationError, TypeError) as e:
    print("\nValidation error on initial attempt:", e)
    attempt = 2
    max_attempts = 10
    last_response = response
    last_error = e
    while attempt <= max_attempts:
        # Improved system message for retries
        system_message = (
            "You are a data cleaning and structuring assistant. "
            f"Your job is to convert CSV data into a JSON array of objects with the fields: {model_json_schema}. "
            "If any data is missing or invalid, infer reasonable values or skip the row. "
            "Always return valid JSON. Do not include any explanation, only the JSON array."
        )
        # Improved prompt with actionable instructions
        improved_prompt = f"""
Given the following CSV data, return a JSON array of objects with fields: {model_json_schema}

CSV:
{csv_input}

Instructions:
1. For each row, create an object with {model_json_schema}.
2. If a field is missing or invalid, infer a reasonable value, but do not skip the row.
3. Ensure the output is a valid JSON array, with no extra text.
4. Use the last error and response to determine how to fix the error:
Last error: {str(last_error)}
Last response: {last_response}

Example output:
[
    {{"name": "Alice", "age": 30, "email": "[email protected]"}},
    {{"name": "Bob", "age": 25, "email": "[email protected]"}}
]
"""
        print("\n" + "="*50)
        print(f"Attempt {attempt}: Improved Prompt & System Message")
        print("="*50 + "\n")
        last_response = call_mistral_json_mode(improved_prompt, system_message=system_message)
        print("Mistral response:\n", last_response)
        try:
            improved_data = json.loads(last_response)
            people = [Person(**item) for item in improved_data if isinstance(item, dict)]
            print("\nValidated people:")
            for person in people:
                print(" ", person)
            skipped = [item for item in improved_data if not isinstance(item, dict)]
            if skipped:
                print("\nWarning: Skipped non-dict items:", skipped)
            break
        except (json.JSONDecodeError, ValidationError, TypeError) as e2:
            print(f"\nValidation error on attempt {attempt}:", e2)
            last_error = e2
            # Keep the model's response (not the prompt) as context for the next retry.
            attempt += 1
    else:
        print("\nFailed to get valid structured data after multiple attempts.")
        print("Last error:", last_error)
After retrieving the initial response from the Mistral API, this section tries to parse and validate the LLM's output against the Person Pydantic schema.
If the data is invalid, it automatically retries up to 10 times, each time enhancing the prompt and system message based on previous errors.
This loop adds robust fault tolerance to LLM-based pipelines:
- Automatically retries with better instructions.
- Incorporates LLM feedback to correct itself.
- Fails gracefully with a clear explanation.
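The pattern generalizes beyond this script. As a hedged sketch of a reusable helper (the function and parameter names here are illustrative, not taken from the article's repository), the retry loop can be factored into a function that works with any LLM caller and any Pydantic model:

```python
import json
from typing import Callable

from pydantic import BaseModel, ValidationError

def generate_validated(
    call_llm: Callable[[str], str],             # e.g. call_mistral_json_mode
    build_prompt: Callable[[str, str], str],    # receives (last_error, last_response)
    model: type[BaseModel],
    max_attempts: int = 5,
) -> list[BaseModel]:
    """Retry an LLM call until its JSON output validates against `model`."""
    last_error, last_response = "", ""
    for _ in range(max_attempts):
        # Rebuild the prompt each attempt, feeding back the previous failure.
        last_response = call_llm(build_prompt(last_error, last_response))
        try:
            data = json.loads(last_response)
            return [model(**item) for item in data]
        except (json.JSONDecodeError, ValidationError, TypeError) as e:
            last_error = str(e)
    raise RuntimeError(f"No valid output after {max_attempts} attempts: {last_error}")
```

In the article's setup, you would pass call_mistral_json_mode as call_llm and a prompt builder that embeds the schema plus the previous error, keeping the feedback loop but removing the duplicated parsing code.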
Running the Example
Before running the example, make sure to define the API key to access the Mistral API. You can do that in a .env file:
MISTRAL_API_KEY=<YOUR_MISTRAL_API_KEY>
You can get an API key from the Mistral Console.
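Note that os.environ does not read .env files automatically: either export the variable in your shell, or load the file yourself (the python-dotenv package is the usual choice). For illustration, a minimal stdlib-only loader might look like this:

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: puts KEY=VALUE lines into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip comments, blank lines, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault: a variable already exported in the shell wins.
        os.environ.setdefault(key.strip(), value.strip())
```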
Then you can install the necessary requirements:
pip install mistralai pydantic
Now run the code, assuming you saved your code in a main.py file:
python main.py
You should see an output similar to this:
==================================================
Mistral CSV to Structured Example: Attempt 1
==================================================
Mistral response:
[
{"name": "Alice", "age": 30, "email": "[email protected]"},
{"name": "Bob", "age": null, "email": "[email protected]"},
{"name": "Charlie", "age": 40, "email": null},
{"name": "Diana", "age": 25, "email": "[email protected]"}
]
Validation error on initial attempt: 1 validation error for Person
age
Input should be a valid integer [type=int_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.11/v/int_type
==================================================
Attempt 2: Improved Prompt & System Message
==================================================
Mistral response:
[
{"name": "Alice", "age": 30, "email": "[email protected]"},
{"name": "Bob", "age": 0, "email": "[email protected]"},
{"name": "Charlie", "age": 40, "email": ""},
{"name": "Diana", "age": 25, "email": "[email protected]"}
]
Validated people:
name='Alice' age=30 email='[email protected]'
name='Bob' age=0 email='[email protected]'
name='Charlie' age=40 email=''
name='Diana' age=25 email='[email protected]'
As you can see, on the second attempt the AI recognizes the error and fills in placeholder values so the JSON validates against the schema.
In my testing, the number of attempts needed varies, from 2 up to sometimes 7 or 8, but the model does eventually produce valid JSON.
Benefits and Limitations
Let's see some benefits of this approach:
- Enforces a strict structure for unpredictable outputs.
- Reduces debugging effort through automated validation.
- Improves user trust by guaranteeing consistent data.
However, as always, there are limitations:
- Prompt engineering requires care to match the schema structure.
- Multiple retries may impact performance in real-time applications.
- JSON generation can occasionally fail with ambiguous or edge-case inputs.
Conclusion
As LLMs become key components in production systems, structured validation is non-negotiable.
Tools like Pydantic make this not only feasible but elegant.
By guiding models with well-crafted prompts and enforcing structure via validation layers, you gain:
- Predictable outputs.
- Stronger data pipelines.
- Easier debugging and recovery.
If you're working with LLMs and haven't added validation yet, now's the time.
Full source code available at: https://github.com/nunombispo/PydanticLLMs-Article
Follow me on Twitter: https://twitter.com/DevAsService
Follow me on Instagram: https://www.instagram.com/devasservice/
Follow me on TikTok: https://www.tiktok.com/@devasservice
Follow me on YouTube: https://www.youtube.com/@DevAsService