Install in 5 minutes
Install the operator, deploy a model from the catalog, and hit an OpenAI-compatible endpoint. Works on a fresh kind, Minikube, or Docker Desktop cluster, and on real GPU clusters.
The recording below is the full happy path on a fresh kind cluster: deploy, wait for ready, query.
Prerequisites
- A Kubernetes cluster v1.27 or later (Minikube, kind, GKE, EKS, AKS, or bare-metal)
- kubectl installed and pointed at the cluster
- Helm 3.0+ if installing via Helm (recommended)
- Cluster admin permissions to install CRDs
1. Install the CLI
The llmkube CLI deploys models with one command and wraps the most-used kubectl operations.
Install script (macOS, Linux)
curl -sSL https://raw.githubusercontent.com/defilantech/llmkube/main/install.sh | bash
The script detects the OS and architecture and installs the latest release.
macOS via Homebrew
brew tap defilantech/tap
brew install llmkube
Windows
Download the Windows binary from the latest release, extract, and add to PATH.
Verify
llmkube version
2. Install the operator
Install the operator into the cluster using Helm (recommended) or Kustomize.
Helm (recommended)
helm repo add llmkube https://defilantech.github.io/LLMKube
helm repo update
helm install llmkube llmkube/llmkube \
  --namespace llmkube-system \
  --create-namespace
kubectl get pods -n llmkube-system
Kustomize
git clone https://github.com/defilantech/llmkube.git
cd llmkube
kubectl apply -k config/default
kubectl get pods -n llmkube-system
3. Deploy a model
Pick something from the built-in catalog or point at any GGUF on Hugging Face.
From the catalog
llmkube catalog list
llmkube catalog info phi-4-mini
llmkube deploy phi-4-mini
From a Hugging Face URL
llmkube deploy tinyllama \
  --source https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  --cpu 500m \
  --memory 1Gi
Watch the rollout
llmkube status phi-4-mini
kubectl wait --for=condition=available --timeout=300s inferenceservice/phi-4-mini
The operator downloads the model into the per-namespace cache, creates an init container to load it, and exposes an OpenAI-compatible HTTP endpoint behind a Service.
4. Test the API
Port-forward and send a chat request.
kubectl port-forward svc/phi-4-mini 8080:8080
In another terminal:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain Kubernetes in one sentence"}],
    "max_tokens": 50
  }'
5. Use with the OpenAI SDK
LLMKube exposes the OpenAI Chat Completions shape; any OpenAI SDK works.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # LLMKube does not require API keys
)
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "What is Kubernetes?"}],
)
print(response.choices[0].message.content)
Works with LangChain, LlamaIndex, the OpenAI SDKs (Python, Node.js, Go), and any tool that speaks OpenAI's API.
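Because the endpoint follows the Chat Completions shape, no SDK is strictly required. The following is a minimal sketch using only the Python standard library. It assumes the port-forward from step 4 is still running on localhost:8080 and that the deployed model is phi-4-mini. The helper names (build_chat_request, extract_reply, chat) are illustrative and not part of LLMKube.

```python
import json
import urllib.request

# Assumption: the port-forward from step 4 is still active on this address.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt: str, model: str = "phi-4-mini",
                       max_tokens: int = 50) -> urllib.request.Request:
    """Build a POST request in the OpenAI Chat Completions shape."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body: dict) -> str:
    """Pull the assistant text out of a Chat Completions response body."""
    return body["choices"][0]["message"]["content"]


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send one prompt and return the reply text (needs a reachable endpoint)."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))

# Example, with the port-forward running:
#   print(chat("Explain Kubernetes in one sentence"))
```

Keeping the request builder separate from the network call makes the payload easy to inspect or unit-test without a cluster.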
GPU acceleration
On clusters with NVIDIA GPUs, deploy with --gpu; the operator schedules onto a GPU node and selects a CUDA-built runtime image. See the GPU setup guide for prerequisites.
llmkube deploy llama-3.1-8b --gpu --gpu-count 1
llmkube catalog info llama-3.1-8b # see GPU requirements
Troubleshooting
Model stuck in "Downloading" state
Check the init container logs for download progress:
kubectl logs <pod-name> -c model-downloader
Confirm that the cluster has internet access and that the source URL is reachable. Multi-gigabyte models can take several minutes.
Pod crashes with OOMKilled
Bump the memory request:
llmkube deploy <model> --memory 4Gi
A safe rule of thumb is at least 1.2× the GGUF file size, plus headroom for the KV cache.
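The rule of thumb above can be turned into a quick calculation. This is a rough sketch, not an LLMKube feature: the function name is hypothetical, and the 1 GiB default for KV-cache headroom is an assumed placeholder that should be adjusted for the context length in use.

```python
import math


def suggest_memory_gi(gguf_bytes: int, kv_headroom_gib: float = 1.0,
                      factor: float = 1.2) -> str:
    """Return a Kubernetes-style memory quantity such as '4Gi'.

    factor is the 1.2x multiplier from the rule of thumb.
    kv_headroom_gib is an assumed allowance for the KV cache.
    """
    size_gib = gguf_bytes / 2**30
    # Round up so the request never undershoots the estimate.
    return f"{math.ceil(size_gib * factor + kv_headroom_gib)}Gi"


# A 2 GiB GGUF file: 2 * 1.2 + 1 = 3.4, rounded up to 4.
print(suggest_memory_gi(2 * 2**30))  # prints 4Gi
```

The result can be passed straight to the --memory flag shown above.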
GPU not detected
Verify the NVIDIA GPU operator is running:
kubectl get pods -n gpu-operator-resources
Check the GPU-node label (GKE example):
kubectl get nodes -l cloud.google.com/gke-accelerator
API requests time out
Confirm the pod is running and inspect server logs:
kubectl get pods -l app=<model-name>
kubectl logs <pod-name> -c llama-server
For larger models or longer prompts, raise resource requests and probe timeouts on the InferenceService.
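While those server-side values are being tuned, a client can also tolerate slow responses by using a longer timeout and retrying a few times. The helper below is a generic sketch of retry with exponential backoff. It is not part of LLMKube, and the name call_with_retries and its defaults are illustrative assumptions.

```python
import time


def call_with_retries(send, attempts: int = 3, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Call send() until it succeeds, doubling the wait after each failure.

    Catches OSError, which covers socket timeouts and urllib's URLError.
    This is deliberately broad; narrow it for production use.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return send()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

# Example, with the port-forward running and a prepared urllib request `req`:
#   body = call_with_retries(
#       lambda: urllib.request.urlopen(req, timeout=120).read())
```

Retries only mask slowness on the client side. If requests time out consistently, the resource and probe settings on the InferenceService remain the actual fix.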
More edge cases live in the Minikube quickstart troubleshooting section.
Next
- Architecture — how the controller and metal-agent split responsibility
- CRD reference — fields on Model and InferenceService
- Memory-pressure protection — the metal-agent watchdog and eviction model
- GPU setup — installing the NVIDIA stack on a fresh cluster