Blog

Insights, tutorials, and updates from the LLMKube team

Releases 8 min read

What we shipped in LLMKube 0.7.9: a new mlx-server runtime for Apple Silicon, four bugs the autoscaling tutorial flushed out, and kubectl scale support

0.7.9 adds mlx-server as a first-class runtime on the metal-agent: an OpenAI-compatible MLX inference server you select with --runtime mlx-server. We dogfooded it serving Qwen3.6-35B-A3B-8bit to opencode on an M5 Max, fixed four real bugs that surfaced while building toward a metrics-driven autoscaling tutorial (a dead PodMonitor selector, the operator fighting the HPA, the Metal-path InferenceService never going Ready, and a skipped memory pre-flight), and landed a Kubernetes scale subresource so kubectl scale works on InferenceService. Here's what landed.

Christopher Maher

Releases 8 min read

What we shipped in LLMKube 0.7.8: ModelRouter Phase 1, fail-closed PII routing, and a hybrid local + cloud agentic story

0.7.8 lands ModelRouter Phase 1: a single OpenAI-compatible endpoint that dispatches across local InferenceServices and external providers (Anthropic, OpenAI, LiteLLM, Bedrock, Vertex), with fail-closed semantics for regulated data, per-rule and per-backend timeouts, half-open circuit breaker, streaming SSE passthrough, and a structured audit log per request. Plus the supporting fixes that made this release ship-ready, three new docs guides, and an honest list of Phase 1 limitations. Here's what landed.

Christopher Maher

Releases 7 min read

What we shipped in LLMKube 0.7.7: OpenShift first-class, vllm-swift + TurboQuant, and a community-shipped Longhorn fix

0.7.7 makes OpenShift a first-class deploy target, lands the vllm-swift runtime with TurboQuant KV cache passthrough on Apple Silicon, picks up two community-driven changes (vLLM tuning fields from an engineer in France, plus a Longhorn FSGroup fix from a user who filed the cleanest bug report of the year), and adds enough observability glue to make multi-runtime fleets legible. Here's what landed and the story behind it.

Christopher Maher

Releases 8 min read

What we shipped in LLMKube 0.7.6: memory-pressure protection, mutable modelRef, and a community PR worth celebrating

0.7.6 is the biggest LLMKube release since multi-GPU sharding landed. Memory-pressure protection on the metal-agent (priority-based eviction with a friendly-fire guard), modelRef finally mutable, ParallelSlots extended to vLLM thanks to a polished community PR from @Faylixe, three new K8s-native pod fields (runtimeClassName, podAnnotations, podLabels), a real CNCF-style docs site, plus a quickstart-killer caught and fixed Saturday night. Here's what landed.

Christopher Maher

Benchmarks 16 min read

vllm-swift on M5 Max: A/B'ing TurboQuant+ against the llama.cpp data

TheTom asked us to run his vllm-swift TurboQuant+ work through the same kind of sweep we did on the llama.cpp fork. 36 cells, then a deep-context follow-up out to 192K. fp16 wins per-seq decode at every cell where it runs, but hits the memory ceiling at d=128K B=32 and d=192K B=32. turbo4v2 runs both: 1,360 tok/s and 1,024 tok/s aggregate. That is the value-prop confirmation: TurboQuant+ on this engine on this hardware is a memory-ceiling tool, not a throughput accelerator. Honest numbers below.

Christopher Maher

Benchmarks 11 min read

TurboQuant on a MacBook Pro, part 2: perplexity, KL divergence, and asymmetric K/V on M5 Max

Follow-up to the M5 Max long-context post. Comments asked for perplexity, KL divergence, asymmetric K/V combos, and a 64K data point. The overnight bench delivered all four. q8_0 KV is essentially free at 4k context (KL 0.0016, top-1 token agreement 98.6%). -ctk q8_0 -ctv turbo4 matches symmetric q8_0 throughput and fits 512K where symmetric q8_0 OOM'd. -ctk f16 -ctv turbo4 hits a Metal kernel fallback and craters 78x at 128K.

Christopher Maher

Benchmarks 10 min read

TurboQuant on a MacBook Pro: two findings the upstream discussion missed

Built TheTom's TurboQuant fork of llama.cpp for Metal, ran the bench overnight on M5 Max, and surfaced two findings the upstream community thread didn't have. First: at 128K+ context, turbo3 (3-bit KV) beats q8_0 (8-bit KV) on prompt processing. Second: turbo3 and turbo4 split by phase, with turbo3 winning prefill and turbo4 winning decode at long context. Plus 1M context for batch coding workloads on a MacBook, and two PRs back to LLMKube to make TurboQuant first-class on the InferenceService CRD.

Christopher Maher

Benchmarks 12 min read

62.2% on Aider Polyglot from a MacBook Pro. Then the other model we tried scored 4%. Here's what actually happened, with a working cost loop attached.

Qwen3.6-35B-A3B Q8 on a MacBook Pro M5 Max scored 62.2% on Aider Polyglot (n=225/225), beating Claude Sonnet 4 with 32k thinking, o1-high, and DeepSeek R1 on the official leaderboard. Then Devstral 2 scored 4% on the same harness but 81.7% on HumanEval+: same model, 20× swing, benchmark numbers don't transfer. Plus the InferCost Apple Silicon collector that landed today, validating live cost-per-token attribution end to end with sub-watt agreement to the agent gauge.

Christopher Maher

Engineering 15 min read

We ran Qwen3.6-27B on $800 of consumer GPUs, day one: llama.cpp vs vLLM

A Kubernetes-native bake-off on 2× RTX 5060 Ti, published 48 hours after Tongyi Lab dropped the model. vLLM wins throughput by 3-4× at high concurrency; llama.cpp + TurboQuant serves a 43K-token prompt where vLLM caps at 16K. Plus a live InferCost UsageReport turning tokens into dollars so the "cheaper than the cloud" question has an honest answer.

Christopher Maher

Engineering 10 min read

I Sent the Agents Loose on My Kubernetes Operator. Here's What They Shipped.

I pointed a fleet of coding agents at LLMKube and told them to audit the repo and close what they found. Six hours later, 17 PRs had landed on main: the 1,567-line god controller was down to 356 lines, the install.sh that had been silently broken for eight months actually worked, and govulncheck was catching Go stdlib CVEs in CI.

Christopher Maher

Engineering 8 min read

Why Qwen 3.6 Doesn't Need --cpu-moe (and Why Qwen3-Coder Does) on Dual 16GB

The --cpu-moe flag trades VRAM savings for CPU compute cost per token. On dual RTX 5060 Ti cards that trade is required to run Qwen3-Coder-30B at all, but pure overhead for Qwen 3.6-35B-A3B, whose DeltaNet attention keeps the KV cache small enough that the model already fits in VRAM. Same hardware, same flag, opposite correct answers. Plus what shipped in LLMKube 0.7.0 because of the thread that surfaced this.

Christopher Maher

Engineering 9 min read

The Model I Deployed Wrote My Operator's Next Feature

Yesterday I shipped Phase 1 of hybrid GPU/CPU offloading for LLMKube. By this morning the model I deployed with that feature had written Phase 2 for me. Qwen 3.6, dual RTX 5060 Ti, 56 tool calls, 100% success, merged as PR #283.

Christopher Maher

Engineering 12 min read

Your Local LLM Can Write Code While You Sleep. Here's What Ours Built.

I left a coding agent running on consumer GPUs overnight. It built a Go REST API from scratch: 2,323 lines, 22 tests, zero API cost. Here's the full setup and what it means for private, off-hours development.

Christopher Maher

Engineering 7 min read

How We Got Native Metal GPU Performance in Kubernetes (Without Containers)

Containers on macOS can't access Metal GPUs. Instead of fighting that limitation, we inverted the architecture: Kubernetes orchestrates, Metal runs natively. Here's how the LLMKube Metal Agent works.

Christopher Maher

February 20, 2026

Community 6 min read

I Built a Text-to-SVG Pipeline Over a Weekend (And You Can Too)

I needed SVG illustrations for a side project and didn't want to pay per-image. So I chained a local LLM, Flux, and vtracer on my homelab to build Vecsmith, a self-hosted text-to-SVG pipeline. Here's how it works and how you can build your own.

Christopher Maher

February 8, 2026

Engineering 7 min read

Introducing CLI Benchmarks: Test Your LLM Deployments Like a Platform Engineer

LLMKube v0.4.9 ships with comprehensive CLI benchmarking: five predefined test suites, automated sweeps, and markdown reports. We ran it on ShadowStack and discovered when multi-GPU helps (and when it doesn't).

Christopher Maher

December 6, 2025

Benchmarks 6 min read

ShadowStack Stress Test: Running Production 32B Models on Consumer Hardware

We pushed ShadowStack to its limits with 32B parameter models. Here's how dual RTX 5060 Ti GPUs handled Qwen 2.5, Qwen Coder, and Qwen 3 at production scale with zero failures.

Christopher Maher

December 2, 2025

Engineering 7 min read

Why Ollama Breaks at Scale (And What to Do About It)

Ollama is great for local development, but it wasn't designed for production. We analyzed 200 GitHub issues to understand why multi-GPU setups fail and what alternatives exist.

Christopher Maher

November 28, 2025

Community 5 min read

Thanksgiving 2025: Gratitude, Benchmarks, and Building in the Open

On this Thanksgiving eve, we reflect on the journey of building LLMKube in the open, share our latest benchmark results (68.7 tok/s on Llama 3.2 3B!), and express gratitude to the community.

Christopher Maher

November 26, 2025

Engineering 6 min read

Multi-GPU Support Ships: First Run on ShadowStack

LLMKube v0.4.0 brings multi-GPU support with layer-based sharding. We tested it on ShadowStack and hit 44 tok/s on Llama 13B across dual RTX 5060 Ti GPUs.

Christopher Maher

November 25, 2025

Engineering 8 min read

Building ShadowStack: Our On-Prem LLM Testing Lab

Behind the scenes of building our bare-metal testing environment for air-gapped LLM deployments. Real hardware, real constraints, real testing.

Christopher Maher

November 19, 2025

Announcement 5 min read

Introducing LLMKube: Kubernetes for Local LLMs

Learn why we built LLMKube and how it brings production-grade orchestration to local AI workloads.

Christopher Maher

November 17, 2025