
Why serving large language models is hard ... and how vLLM and KServe can help

This article was originally published on IBM Developer.

My title, "Fast inference and furious scaling," obviously inspired by the movies, is not just catchy: it also captures the pace of generative AI technology. With new optimization techniques and tools emerging daily, and not enough "good first issues" to go around, generative AI is a rapidly evolving landscape that often leaves beginners struggling to find their footing.

In this article, beginners who are new to the world of LLM inferencing and serving can learn why it is complicated and how to get started with two open source tools: vLLM and KServe. Rather than getting bogged down in technical details, this article focuses on the 'why' and 'how' of LLM inferencing and serving to give background context for those who want to participate, with resources for deeper dives linked along the way.

What does it mean to "serve" an LLM?

Model serving boils down to making a pre-trained model usable. When you use a cloud-based service like ChatGPT, models have already been made available for you to send prompts (that is, inference requests) and receive responses (that is, outputs). Those models are being served for you to consume. Behind the scenes, the model has been made available through an API.
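To make that concrete, here is a minimal sketch of what "consuming a served model" looks like from the client side. It assumes a vLLM OpenAI-compatible server is already running locally (for example, started with `vllm serve facebook/opt-125m`); the model name, host, and port are placeholder assumptions, not part of the original article.

```python
# Minimal sketch: send an inference request to a served model over HTTP.
# Assumes a vLLM OpenAI-compatible server is already running locally, e.g.:
#   vllm serve facebook/opt-125m
# The model name, host, and port below are placeholders.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",  # must match the model the server loaded
        "prompt": "Explain what it means to serve an LLM in one sentence.",
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()

# The server returns the generated text (the "output") for our prompt
# (the "inference request").
print(response.json()["choices"][0]["text"])
```

The point of the sketch is that, from the consumer's perspective, a served model is just an API endpoint; everything else (loading weights, batching requests, managing GPU memory, scaling replicas) is the serving layer's problem.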

Sounds simple, right? Well, not quite.

Serving LLMs isn't just about wrapping them in an API. Like any memory-intensive software, large models bring performance trade-offs, infrastructure constraints, and cost considerations.

Here are just a few reasons that model serving is not simple:

  • Massive LLMs require massive resources
  • LLMs are complex
  • Scaling is hard

Continue reading on IBM Developer...
