Blog

Insights, tutorials, and updates from the LLMKube team

Releases 8 min read

What we shipped in LLMKube 0.7.9: a new mlx-server runtime for Apple Silicon, four bugs the autoscaling tutorial flushed out, and kubectl scale support

0.7.9 adds mlx-server as a first-class runtime on the metal-agent: an OpenAI-compatible MLX inference server you select with --runtime mlx-server. We dogfooded it serving Qwen3.6-35B-A3B-8bit to opencode on an M5 Max, fixed four real bugs that surfaced while building toward a metrics-driven autoscaling tutorial (a dead PodMonitor selector, the operator fighting the HPA, the Metal-path InferenceService never going Ready, and a skipped memory pre-flight), and landed a Kubernetes scale subresource so kubectl scale works on InferenceService. Here's what landed.

Christopher Maher
Releases 8 min read

What we shipped in LLMKube 0.7.8: ModelRouter Phase 1, fail-closed PII routing, and a hybrid local + cloud agentic story

0.7.8 lands ModelRouter Phase 1: a single OpenAI-compatible endpoint that dispatches across local InferenceServices and external providers (Anthropic, OpenAI, LiteLLM, Bedrock, Vertex), with fail-closed semantics for regulated data, per-rule and per-backend timeouts, half-open circuit breaker, streaming SSE passthrough, and a structured audit log per request. Plus the supporting fixes that made this release ship-ready, three new docs guides, and an honest list of Phase 1 limitations. Here's what landed.

Christopher Maher
Releases 7 min read

What we shipped in LLMKube 0.7.7: OpenShift first-class, vllm-swift + TurboQuant, and a community-shipped Longhorn fix

0.7.7 makes OpenShift a first-class deploy target, lands the vllm-swift runtime with TurboQuant KV cache passthrough on Apple Silicon, picks up two community-driven changes (vLLM tuning fields from an engineer in France, plus a Longhorn FSGroup fix from a user who filed the cleanest bug report of the year), and adds enough observability glue to make multi-runtime fleets legible. Here's what landed and the story behind it.

Christopher Maher
Releases 8 min read

What we shipped in LLMKube 0.7.6: memory-pressure protection, mutable modelRef, and a community PR worth celebrating

0.7.6 is the biggest LLMKube release since multi-GPU sharding landed. Memory-pressure protection on the metal-agent (priority-based eviction with a friendly-fire guard), modelRef finally mutable, ParallelSlots extended to vLLM thanks to a polished community PR from @Faylixe, three new K8s-native pod fields (runtimeClassName, podAnnotations, podLabels), a real CNCF-style docs site, plus a quickstart-killer caught and fixed Saturday night. Here's what landed.

Christopher Maher
Benchmarks 16 min read

vllm-swift on M5 Max: A/B'ing TurboQuant+ against the llama.cpp data

TheTom asked us to run his vllm-swift TurboQuant+ work through the same kind of sweep we did on the llama.cpp fork. 36 cells, then a deep-context follow-up out to 192K. fp16 wins per-seq decode at every cell where it runs, but hits the memory ceiling at d=128K B=32 and d=192K B=32. turbo4v2 runs both: 1,360 tok/s and 1,024 tok/s aggregate. That is the value-prop confirmation: TurboQuant+ on this engine on this hardware is a memory-ceiling tool, not a throughput accelerator. Honest numbers below.

Christopher Maher
Benchmarks 11 min read

TurboQuant on a MacBook Pro, part 2: perplexity, KL divergence, and asymmetric K/V on M5 Max

Follow-up to the M5 Max long-context post. Comments asked for perplexity, KL divergence, asymmetric K/V combos, and a 64K data point. The overnight bench delivered all four. q8_0 KV is essentially free at 4k context (KL 0.0016, top-1 token agreement 98.6%). -ctk q8_0 -ctv turbo4 matches symmetric q8_0 throughput and fits 512K where symmetric q8_0 OOM'd. -ctk f16 -ctv turbo4 hits a Metal kernel fallback and craters 78x at 128K.

Christopher Maher
Benchmarks 10 min read

TurboQuant on a MacBook Pro: two findings the upstream discussion missed

Built TheTom's TurboQuant fork of llama.cpp for Metal, ran the bench overnight on M5 Max, and surfaced two findings the upstream community thread didn't have. First: at 128K+ context, turbo3 (3-bit KV) beats q8_0 (8-bit KV) on prompt processing. Second: turbo3 and turbo4 split by phase at long context: turbo3 wins prefill, turbo4 wins decode. Plus 1M context for batch coding workloads on a MacBook, and two PRs back to LLMKube to make TurboQuant first-class on the InferenceService CRD.

Christopher Maher
Benchmarks 12 min read

62.2% on Aider Polyglot from a MacBook Pro. Then the other model we tried scored 4%. Here's what actually happened, with a working cost loop attached.

Qwen3.6-35B-A3B Q8 on a MacBook Pro M5 Max scored 62.2% on Aider Polyglot (n=225/225), beating Claude Sonnet 4 with 32k thinking, o1-high, and DeepSeek R1 on the official leaderboard. Then Devstral 2 scored 4% on the same harness but 81.7% on HumanEval+. Same model, a 20× swing: benchmark numbers don't transfer. Plus the InferCost Apple Silicon collector that landed today, validating live cost-per-token attribution end to end with sub-watt agreement to the agent gauge.

Christopher Maher
Engineering 8 min read

Why Qwen 3.6 Doesn't Need --cpu-moe (and Why Qwen3-Coder Does) on Dual 16GB

The --cpu-moe flag saves VRAM at the cost of extra CPU compute per token. On dual RTX 5060 Ti cards that trade is required to run Qwen3-Coder-30B at all, but it is pure overhead for Qwen3.6-35B-A3B, whose DeltaNet attention keeps the KV cache small enough that the model already fits in VRAM. Same hardware, same flag, opposite correct answers. Plus what shipped in LLMKube 0.7.0 because of the thread that surfaced this.

Christopher Maher
LLMKube

Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC

Community

Built for the Kubernetes and AI communities

LLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes® is a registered trademark of The Linux Foundation.