Gowsiya Syednoor Shek

Build a Local GenAI API with Docker Model Runner and FastAPI (Part 3)

In Part 2, I ran an LLM locally using Docker Model Runner and connected to it through a Python script. Now in Part 3, we are wrapping that logic inside a FastAPI REST API - giving us a real, local GenAI backend we can use from Postman, web apps, or CLI tools.

Let’s dive in.


Goal

  • Build a FastAPI server that sends prompts to a locally running LLM (ai/mistral)
  • Expose a /generate endpoint
  • Run the API container and Docker Model Runner side-by-side

What We Built

A REST API (running in Docker) that talks to Docker Model Runner via an OpenAI-compatible endpoint. You send a prompt like:

{
  "prompt": ""Explain what is docker model runner in 3 points"
}

…and it responds with an AI-generated answer from a model running 100% on your machine.


Project Structure

docker-llm-fastapi-app/
├── app/
│   └── main.py         ← FastAPI logic
├── Dockerfile          ← API container
├── docker-compose.yml  ← Orchestration
└── README.md

Check out the code here: part3-code
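
Here is a minimal sketch of app/main.py (a sketch of the idea, not a verbatim copy of the repo code). It forwards the prompt to Docker Model Runner's OpenAI-compatible chat completions endpoint. The base URL below assumes the API container reaches Model Runner at model-runner.docker.internal; adjust it to http://localhost:12434/engines/v1/chat/completions if you run the server outside Docker.

# app/main.py (sketch)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import requests

app = FastAPI()

# OpenAI-compatible endpoint exposed by Docker Model Runner.
# From inside a container: model-runner.docker.internal
# From the host (TCP access enabled): http://localhost:12434/engines/v1/chat/completions
MODEL_RUNNER_URL = "http://model-runner.docker.internal/engines/v1/chat/completions"
MODEL_NAME = "ai/mistral"

class PromptRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: PromptRequest):
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": req.prompt}],
    }
    try:
        # Generous timeout: the first call can take minutes while the model loads
        resp = requests.post(MODEL_RUNNER_URL, json=payload, timeout=300)
        resp.raise_for_status()
    except requests.RequestException as exc:
        raise HTTPException(status_code=502, detail=str(exc))
    # The reply comes back in the standard OpenAI chat completion shape
    return {"response": resp.json()["choices"][0]["message"]["content"]}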


How to Run It

1. Pull and Start the Model

docker model pull ai/mistral
docker model run ai/mistral

If you have already pulled the model from the previous tutorial, running pull again is not necessary.
You may see: Interactive chat mode started. Type '/bye' to exit.

That’s okay — the API is still active behind the scenes if TCP access is enabled.
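
If you want to confirm the runner is up before starting the API, Docker Model Runner ships a couple of handy commands (assuming a recent Docker Desktop release):

docker model status   # checks whether the Model Runner backend is running
docker model ls       # lists the models pulled locally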

2. Start the FastAPI Server

docker compose up --build

You’ll see:

Uvicorn running on http://0.0.0.0:8000
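
For reference, here is roughly what the two Docker files contain, as a sketch assuming the Dockerfile starts Uvicorn on port 8000 (the real repo files may differ):

# Dockerfile (sketch)
FROM python:3.11-slim
WORKDIR /app
RUN pip install fastapi "uvicorn[standard]" requests
COPY app/ app/
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

# docker-compose.yml (sketch)
services:
  api:
    build: .
    ports:
      - "8000:8000"

Note that nothing model-related appears in the compose file: on Docker Desktop, containers can reach Model Runner at model-runner.docker.internal without extra configuration.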

Call the Endpoint

Send a request from Postman or curl:

POST http://localhost:8000/generate
Content-Type: application/json

{
  "prompt": "What is MLOps in simple terms?"
}
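
The same request as a curl one-liner:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is MLOps in simple terms?"}'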

Output:

{
  "response": "MLOps, short for Machine Learning Operations, is a practice for collaboration and..."
}

From Postman (screenshot of the same request and response in the Postman UI)


Things I Learned

1. Interactive Mode Still Enables API

Even though Docker says:

Interactive chat mode started. Type '/bye' to exit.

…the HTTP API is still available on localhost:12434. As long as TCP support is enabled in Docker Desktop, it works fine.
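
A quick way to verify this from the host is to hit the Model Runner endpoint directly. The /engines/v1/models path follows the OpenAI API convention, so this should list ai/mistral if everything is wired up:

curl http://localhost:12434/engines/v1/models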

2. First Call Is Slow

The first request took ~2 minutes. Why?

  • The model is loaded into memory
  • Runtime warmup takes time

But after that, subsequent prompts respond noticeably faster.
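
One way to soften that first hit (my own idea, not part of the repo code) is a throwaway warm-up request when the API starts, reusing the MODEL_RUNNER_URL and MODEL_NAME constants from the main.py sketch above:

# optional warm-up for app/main.py (hypothetical addition)
@app.on_event("startup")
def warm_up_model():
    try:
        # Send one tiny prompt so the model is loaded before real traffic arrives
        requests.post(
            MODEL_RUNNER_URL,
            json={"model": MODEL_NAME,
                  "messages": [{"role": "user", "content": "Hi"}]},
            timeout=300,
        )
    except requests.RequestException:
        pass  # the model may still be starting; real requests will surface errors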


What’s Next

In Part 4, I plan to build Prompt Templates + Role Options, which adds a practical layer of prompt engineering to your GenAI app.

Stay tuned!
