Alain Airom

Your Desktop AI: Seamlessly Running Granite with LiteLLM and Ollama

This tutorial offers an introductory hands-on experience using LiteLLM, Ollama, and Granite, perfect for beginning your journey with open LLMs.

What is LiteLLM?

The world of Large Language Models (LLMs) is rapidly expanding, with an ever-growing array of powerful models and providers. Navigating this diverse landscape can be a challenge, often requiring different APIs and data formats for each. This is where LiteLLM steps in as a game-changer, offering a unified and simplified approach to interacting with 100+ LLMs.

LiteLLM streamlines your LLM integration by allowing you to call over 100 LLMs using the familiar OpenAI Input/Output Format. It intelligently translates your inputs to the specific completion, embedding, and image_generation endpoints of each provider, ensuring a consistent experience. No more wrestling with varied API specifications; LiteLLM guarantees a consistent output, with text responses always available at ['choices'][0]['message']['content'].

Beyond this seamless interaction, LiteLLM enhances reliability with retry/fallback logic across multiple deployments, managed efficiently through its Router. For those mindful of costs, the LiteLLM Proxy Server provides robust capabilities to track spend and set budgets per project.
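
To make that unified format concrete, here is a minimal sketch of the same call going to a hosted provider and to a local Ollama model. The hosted model name and the API-key handling are illustrative assumptions; any provider LiteLLM supports follows the same pattern:

from litellm import completion

# A minimal sketch: the hosted model below is illustrative and assumes an
# OPENAI_API_KEY in your environment; the Ollama call assumes a local server
# with the Granite model already pulled.
messages = [{"role": "user", "content": "Summarize what LiteLLM does in one sentence."}]

# Hosted provider, e.g. OpenAI
hosted = completion(model="gpt-4o-mini", messages=messages)

# Local model served by Ollama -- no API key needed
local = completion(
    model="ollama/granite3.2:latest",
    messages=messages,
    api_base="http://localhost:11434",
)

# Whatever the provider, the text is always in the same place
for response in (hosted, local):
    print(response["choices"][0]["message"]["content"])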

In this blog post, we’ll dive into an introductory hands-on experience using LiteLLM, Ollama, and Granite to demonstrate how easily you can leverage the power of local LLMs.

Using Ollama

Ollama is an open-source framework that simplifies the process of running large language models (LLMs) on your local machine. Think of it as a user-friendly wrapper that handles the complexities of model weights and configurations, and even provides a clean API layer on top of them. It’s essentially like Docker for LLMs, allowing you to “pull” pre-built model packages and run them with minimal hassle.

The appeal of using Ollama locally is multifaceted: first and foremost, it offers unparalleled privacy and security, as your data never leaves your machine, a critical advantage for sensitive applications or personal use. Secondly, it eliminates cloud API costs and latency, providing faster, more cost-efficient inference directly on your hardware. This local control also grants immense flexibility, allowing you to experiment with various open-source models, customize their behavior, and even fine-tune them to your specific needs, all within your own environment. Whether you’re a developer prototyping an AI application, a researcher exploring new model capabilities, or simply someone who values data privacy, Ollama makes the power of local LLMs accessible and practical.
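
Because Ollama exposes that API layer on a local port (11434 by default), you can talk to it directly over HTTP before introducing any framework. A minimal sketch, assuming the Granite model used in this tutorial has already been pulled and the requests package is installed:

import requests  # pip install requests

# Assumes a local Ollama server on its default port with granite3.2 already pulled
payload = {
    "model": "granite3.2:latest",
    "prompt": "In one sentence, what is Ollama?",
    "stream": False,  # return a single JSON object instead of a token stream
}

resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])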

The Granite LLM family and using it with Ollama

Beyond popular open-source models, the landscape of LLMs also features robust, enterprise-grade offerings like the IBM Granite open foundation models. These models, developed by IBM Research, are designed with business applications in mind, prioritizing performance, trust, and scalability. What makes them particularly exciting for local development is their availability directly on the Ollama website and through the Ollama library. This means that developers can easily pull and run various Granite models — ranging from general-purpose language models (like granite3.2:latest) to specialized versions for code intelligence (granite-code) or even embedding tasks (granite-embedding)—directly on their local machines using the familiar ollama pull command. This accessibility democratizes access to sophisticated, business-ready AI capabilities, enabling rapid prototyping, secure data processing, and cost-effective experimentation without relying solely on cloud-based APIs.
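
As a rough sketch of how those variants plug into the same calling pattern through LiteLLM (this assumes the corresponding models have already been pulled with ollama pull, and that LiteLLM routes Ollama embeddings through its OpenAI-style embedding() helper; exact model tags and response shapes may vary):

from litellm import completion, embedding

OLLAMA_BASE = "http://localhost:11434"

# Code-focused Granite variant: same OpenAI-style call, just a different model tag
code_reply = completion(
    model="ollama/granite-code",
    messages=[{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    api_base=OLLAMA_BASE,
)
print(code_reply["choices"][0]["message"]["content"])

# Embedding variant (an assumption: LiteLLM's embedding() helper routed to Ollama)
vectors = embedding(
    model="ollama/granite-embedding",
    input=["Granite models run locally with Ollama."],
    api_base=OLLAMA_BASE,
)
print(len(vectors.data[0]["embedding"]))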

Putting it all together with a simple application

The simplicity of the code provided in this guide serves a clear purpose: to vividly demonstrate the remarkable ease with which you can build powerful LLM-based applications locally. By bringing together the versatility of litellm for unified API calls, the robust capabilities of IBM Granite models, and the streamlined local deployment offered by Ollama, developers can quickly set up a functional environment. This hands-on example is designed to remove initial barriers, allowing you to focus on logic and interaction rather than complex infrastructure. It's an ideal starting point for anyone looking to experiment, prototype, or develop AI solutions securely and cost-effectively on their own machine, before scaling to a more advanced or production-ready level.

  • Prepare your environment.
python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install litellm

# assuming Ollama is already installed and its server is running
ollama pull granite3.2:latest
  • Run the code.
from litellm import completion

def run_chat_application():
    """
    Runs a console-based chat application that interacts with an LLM.
    The application quits when the user types 'quit' or 'exit'.
    """
    print("Welcome to the LLM Chat Application!")
    print("Type 'quit' or 'exit' to end the conversation.")
    print("-" * 50)

    while True:
        user_input = input("You: ").strip()

        # Compare case-insensitively, but send the user's original text to the model
        if user_input.lower() in ["quit", "exit"]:
            print("Exiting chat. Goodbye!")
            break

        try:
            # Send the user's message to the LLM
            response = completion(
                model="ollama/granite3.2:latest",
                messages=[{"content": user_input, "role": "user"}],
                api_base="http://localhost:11434",
            )

            if response and response.choices and response.choices[0].message:
                llm_response = response.choices[0].message.content
                print(f"LLM: {llm_response}")
            else:
                print("LLM: No response received or an unexpected response format.")

        except Exception as e:
            print(f"An error occurred: {e}")
            print("Please ensure your Ollama server is running and accessible at http://localhost:11434.")

if __name__ == "__main__":
    run_chat_application()
  • Here is the output ⬇️
> python main2.py
Welcome to the LLM Chat Application!
Type 'quit' or 'exit' to end the conversation.
--------------------------------------------------
You: hello
LLM: Hello! How can I assist you today?
You: what can you do?
LLM: As an advanced AI model named Granite, I can assist with a wide range of tasks and provide information on virtually any topic. Here are some examples:

1. **General Knowledge**: I can answer questions about history, science, literature, and more. 

2. **Language Translation**: I can translate text from one language to another.

3. **Text Summarization**: I can summarize lengthy texts into concise versions.

4. **Creative Writing**: I can help generate ideas for stories, poems, or other creative works.

5. **Educational Support**: I can explain complex concepts in simpler terms to aid learning.

6. **Coding Assistance**: While not a code executor, I can help debug simple code snippets and explain programming concepts.

7. **Recommendation Systems**: Based on user preferences, I can suggest books, movies, music, etc.

8. **Trivia and Fun Facts**: I can provide interesting trivia or fun facts about almost any topic.

9. **Grammar and Punctuation Checks**: I can help identify errors in sentences for grammar and punctuation.

10. **Mental Health Support (within limits)**: I can offer basic mental health support, such as helping to recognize signs of certain disorders or suggesting coping strategies, but should not replace professional help.

Please note that while I strive for accuracy, I'm capable of errors and my knowledge is limited to the data I was trained on (up to 2021). For personal advice, medical queries, or legal matters, it's always best to consult with a relevant professional.
You: 

Conclusion

This guide has showcased a powerful and accessible integration: running enterprise-grade IBM Granite models locally using Ollama, seamlessly managed by LiteLLM. We’ve seen how this combination allows you to develop LLM applications with privacy, cost-efficiency, and unparalleled flexibility, all while leveraging the familiar OpenAI API format. This setup provides an excellent foundation for building and experimenting with AI on your own terms. We’re excited about the potential this integration unlocks and plan to delve deeper into more advanced topics in future blog posts, exploring fine-tuning, RAG implementations, and deploying these local LLM solutions at scale. Stay tuned for more insights into the evolving world of local AI!
