Install in 5 minutes

Install the operator, deploy a model from the catalog, and hit an OpenAI-compatible endpoint. Works on a fresh kind, Minikube, or Docker Desktop cluster, and on real GPU clusters.

The recording below is the full happy path on a fresh kind cluster: deploy, wait for ready, query.

Recorded on a kind cluster: about 50 seconds of wall time, with idle waits compressed.

Prerequisites

A Kubernetes cluster v1.27 or later (Minikube, kind, GKE, EKS, AKS, or bare-metal)
kubectl installed and pointed at the cluster
Helm 3.0+ if installing via Helm (recommended)
Cluster admin permissions to install CRDs

1. Install the CLI

The llmkube CLI deploys models with one command and wraps the most-used kubectl operations.

Install script (macOS, Linux)

curl -sSL https://raw.githubusercontent.com/defilantech/llmkube/main/install.sh | bash

Detects OS and architecture; installs the latest release.

macOS via Homebrew

brew tap defilantech/tap
brew install llmkube

Windows

Download the Windows binary from the latest release, extract it, and add its directory to your PATH.

Verify

llmkube version

2. Install the operator

Install the operator into the cluster using Helm (recommended) or Kustomize.

Helm (recommended)

helm repo add llmkube https://defilantech.github.io/LLMKube
helm repo update

helm install llmkube llmkube/llmkube \
  --namespace llmkube-system \
  --create-namespace

kubectl get pods -n llmkube-system

See the Helm chart README.

Kustomize

git clone https://github.com/defilantech/llmkube.git
cd llmkube
kubectl apply -k config/default

kubectl get pods -n llmkube-system

3. Deploy a model

Pick something from the built-in catalog or point at any GGUF on Hugging Face.

From the catalog

llmkube catalog list

llmkube catalog info phi-4-mini

llmkube deploy phi-4-mini

From a Hugging Face URL

llmkube deploy tinyllama \
  --source https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  --cpu 500m \
  --memory 1Gi

Watch the rollout

llmkube status phi-4-mini

kubectl wait --for=condition=available --timeout=300s inferenceservice/phi-4-mini

The operator downloads the model into the per-namespace cache, creates an init container to load it, and exposes an OpenAI-compatible HTTP endpoint behind a Service.
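For scripted setups such as CI, the deploy and wait steps above can be chained from Python. This is a minimal sketch that only shells out to the two commands already shown (llmkube deploy and kubectl wait); the function names are hypothetical helpers, and it assumes both binaries are on PATH and pointed at the right cluster.

```python
import subprocess


def deploy_command(model: str) -> list[str]:
    """Argv for the catalog deploy command shown above."""
    return ["llmkube", "deploy", model]


def wait_command(model: str, timeout_s: int = 300) -> list[str]:
    """Argv for the kubectl wait command shown above."""
    return [
        "kubectl", "wait",
        "--for=condition=available",
        f"--timeout={timeout_s}s",
        f"inferenceservice/{model}",
    ]


def deploy_and_wait(model: str, timeout_s: int = 300) -> bool:
    """Deploy a catalog model, then block until kubectl wait succeeds.

    Returns True only if both commands exit with status 0.
    """
    for argv in (deploy_command(model), wait_command(model, timeout_s)):
        # check=False: report failure through the return value instead of raising.
        result = subprocess.run(argv, check=False)
        if result.returncode != 0:
            return False
    return True
```

Calling deploy_and_wait("phi-4-mini") runs the same two commands as the manual steps and returns False as soon as either one fails.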

4. Test the API

Port-forward and send a chat request.

kubectl port-forward svc/phi-4-mini 8080:8080

In another terminal:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain Kubernetes in one sentence"}],
    "max_tokens": 50
  }'
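The same request can be sent without curl or any third-party package. The sketch below mirrors the curl call above using only the Python standard library; the helper names are hypothetical, and it assumes the port-forward from this step is still running on localhost:8080 and that the response follows the standard Chat Completions layout.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8080",
                       max_tokens: int = 50) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of a Chat Completions response body."""
    payload = json.loads(response_body)
    return payload["choices"][0]["message"]["content"]


def ask(prompt: str, timeout_s: float = 60.0) -> str:
    """Send the prompt to the port-forwarded service and return the reply."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return extract_reply(resp.read())
```

With the port-forward active, print(ask("Explain Kubernetes in one sentence")) should print the model's answer. The 60-second timeout is an arbitrary default, not an LLMKube setting.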

5. Use with the OpenAI SDK

LLMKube exposes an OpenAI-compatible Chat Completions API, so any OpenAI SDK works.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # LLMKube does not require API keys
)

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "What is Kubernetes?"}],
)

print(response.choices[0].message.content)

Works with LangChain, LlamaIndex, the OpenAI SDKs (Python, Node.js, Go), and any tool that speaks OpenAI's API.

GPU acceleration

On clusters with NVIDIA GPUs, deploy with --gpu; the operator schedules onto a GPU node and selects a CUDA-built runtime image. See the GPU setup guide for prerequisites.

llmkube deploy llama-3.1-8b --gpu --gpu-count 1

llmkube catalog info llama-3.1-8b   # see GPU requirements

Benchmark numbers are in the README.

Troubleshooting

Model stuck in "Downloading" state

Check the init container logs for download progress:

kubectl logs <pod-name> -c model-downloader

Confirm that the cluster has internet access and that the source URL is reachable. Multi-gigabyte models can take several minutes to download.
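One quick way to rule out a bad source URL is to probe it before digging into the cluster. The following is a rough sketch using only the Python standard library, with hypothetical helper names. Run from a workstation it only shows that the URL itself is valid; it says nothing about the cluster's own egress, so treat a True result as a hint rather than proof.

```python
import urllib.request


def head_request(url: str) -> urllib.request.Request:
    """Build a HEAD request so the initial probe asks for headers only."""
    return urllib.request.Request(url, method="HEAD")


def is_reachable(url: str, timeout_s: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status.

    The response body is never read, so the model file is not downloaded here.
    """
    try:
        with urllib.request.urlopen(head_request(url), timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and timeouts;
        # ValueError covers malformed URLs.
        return False
```

Pass the same URL that was given to --source, for example is_reachable("https://huggingface.co/...") with the full model path filled in.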

Pod crashes with OOMKilled

Bump the memory request:

llmkube deploy <model> --memory 4Gi

A safe rule of thumb is at least 1.2× the GGUF file size, plus headroom for the KV cache.
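The rule of thumb above is simple arithmetic, sketched here as a small helper. The 1.2 factor comes from the text; the function name and the 1 GiB default for KV-cache headroom are placeholder assumptions, since the real headroom depends on context length and model architecture.

```python
import math

GIB = 1024 ** 3


def suggest_memory_gi(gguf_bytes: int, kv_headroom_bytes: int = GIB,
                      factor: float = 1.2) -> str:
    """Return a value for --memory, rounded up to whole Gi.

    needed = factor * GGUF file size + KV-cache headroom
    """
    needed = gguf_bytes * factor + kv_headroom_bytes
    return f"{math.ceil(needed / GIB)}Gi"
```

For a 2 GiB GGUF file with the default headroom, suggest_memory_gi(2 * GIB) gives "4Gi" (2.4 GiB plus 1 GiB, rounded up). For a file on disk, os.path.getsize(path) supplies the byte count.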

GPU not detected

Verify the NVIDIA GPU operator is running (adjust the namespace if the operator was installed elsewhere, for example gpu-operator):

kubectl get pods -n gpu-operator-resources

Check the GPU-node label (GKE example):

kubectl get nodes -l cloud.google.com/gke-accelerator

API requests time out

Confirm the pod is running and inspect server logs:

kubectl get pods -l app=<model-name>
kubectl logs <pod-name> -c llama-server

For larger models or longer prompts, raise resource requests and probe timeouts on the InferenceService.

More edge cases live in the Minikube quickstart troubleshooting section.

Related documentation

Architecture: how the controller and metal-agent split responsibility
CRD reference: fields on Model and InferenceService
Memory-pressure protection: the metal-agent watchdog and eviction model
GPU setup: installing the NVIDIA stack on a fresh cluster