
Samuel Oladejo

Democratizing AI: Building a RAG System with Gemma and LangChain

Presented at DevFest 2024, Johannesburg

Artificial Intelligence is no longer the exclusive domain of massive corporations or high-end research labs. With the rise of open-source models like Gemma and modular frameworks like LangChain, it is now possible to build robust, domain-specific AI systems using publicly available tools.

At DevFest Johannesburg 2024, I walked through how to build a Retrieval-Augmented Generation (RAG) system using Gemma, LangChain, and Vertex AI — with a working chatbot demo and live deployment.


What is Retrieval-Augmented Generation (RAG)?

RAG is an AI architecture that combines:

  • Information retrieval (search, databases, document indexes)
  • Generative language models (like Gemma or Gemini)

This enables systems to access external knowledge, retrieve relevant content, and use that content to ground the generation of responses, improving both factual accuracy and relevance.

Diagram: RAG Architecture


Why Use RAG?

Traditional language models are limited to what they were trained on. RAG addresses this limitation by adding real-time context from external sources. Some of the key advantages of RAG systems include:

  • Access to the latest information
  • Factual grounding and domain control
  • Semantic search with vector databases and relevance-based re-ranking
  • Enhanced response quality and trustworthiness

How It Works

RAG systems typically follow this pipeline:

1. Retrieval and Preprocessing

  • External documents (e.g., knowledge bases, PDFs, databases) are searched using vector similarity or keyword search.
  • Retrieved results are cleaned, tokenized, and filtered to prepare for use in generation.

Diagram: Retrieval and Preprocessing


2. Grounded Generation

  • Retrieved and preprocessed content is injected into the input context of the LLM.
  • The model then uses this grounded context to generate responses that are both accurate and contextually rich.

Diagram: Grounded Generation


Core Concepts in RAG

When building a RAG system, several key processes are involved:

Chunking

Splitting large documents into smaller chunks for indexing and retrieval.
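
A minimal sketch of this step using LangChain's text splitters. The chunk size, overlap, and source file here are illustrative, not the exact values from the talk:

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical source file; in practice this comes from PDF/web/database loaders.
docs = [Document(page_content=open("knowledge_base.txt").read())]

# Split into overlapping chunks so retrieval can return focused passages.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
```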

Slide: Chunking


Embedding

Transforming text chunks into high-dimensional vectors using an embedding model, making them searchable by meaning.
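
A sketch using Vertex AI embeddings through LangChain; the model name is an assumption, and any embedding model with a LangChain integration would slot in the same way:

```python
from langchain_google_vertexai import VertexAIEmbeddings

# Model name is an assumption; any Vertex AI embedding model works here.
embeddings = VertexAIEmbeddings(model_name="text-embedding-004")

vector = embeddings.embed_query("How do I deploy Gemma on Cloud Run?")
print(len(vector))  # dimensionality of the embedding space
```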

Slide: Embedding


Indexing

Storing embeddings in a vector database like PostgreSQL with pgvector for fast similarity search.
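
Continuing the sketch, here is how those embeddings might be stored with the langchain-postgres integration; the connection string and collection name are placeholders for your own Cloud SQL instance:

```python
from langchain_postgres import PGVector

store = PGVector(
    embeddings=embeddings,  # the embedding model from the previous step
    collection_name="devfest_docs",  # placeholder name
    connection="postgresql+psycopg://user:pass@localhost:5432/ragdb",  # placeholder
)
store.add_documents(chunks)  # index the chunks produced during chunking
```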

Slide: Indexing


Retrieval

Searching for relevant chunks based on user input.
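
Retrieval then becomes a similarity search against that store (continuing the names from the sketches above; k=4 is an arbitrary choice):

```python
# Top-k similarity search over the indexed chunks.
results = store.similarity_search("How does RAG improve factual accuracy?", k=4)
for doc in results:
    print(doc.page_content[:100])
```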

Slide: Retrieval


Grounding

Combining retrieved context with the user query to provide better background for the language model.
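
One simple way to do this is a plain "stuffing" prompt; the template wording below is illustrative:

```python
from langchain_core.prompts import ChatPromptTemplate

# Retrieved chunks are stitched into the context slot of the prompt.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
grounded = prompt.invoke({
    "context": "\n\n".join(doc.page_content for doc in results),
    "question": "How does RAG improve factual accuracy?",
})
```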

Slide: Grounding


Generation

Using the LLM (in our case, Gemma) to generate an answer based on the grounded context.
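
A sketch of the final step, assuming Gemma is served locally through Ollama; in a cloud setup the model could equally be served from Vertex AI:

```python
from langchain_ollama import ChatOllama

# Assumes a local Ollama server with a Gemma model pulled; swap in your own endpoint.
llm = ChatOllama(model="gemma2", temperature=0)
answer = llm.invoke(grounded)  # the grounded prompt built in the previous step
print(answer.content)
```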

Slide: Generation


Building with LangChain and Gemma

We built the system using the following stack:

Component    Technology
LLM          Gemma (open source)
Framework    LangChain + LangGraph
Storage      PostgreSQL + pgvector
UI / API     FastAPI + LangServe
Deployment   Cloud Run (serverless)

Diagram: LangChain Cloud Deployment

LangChain

LangChain provides modular tools for every step of the RAG pipeline, from data ingestion and retrieval to chaining the final generation step.
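
As an illustration of that modularity, the pieces sketched earlier compose into one runnable pipeline with LangChain's expression language (the names reuse the earlier sketches):

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = store.as_retriever(search_kwargs={"k": 4})

def format_docs(docs):
    # Join retrieved chunks into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke("What is grounded generation?"))
```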

LangGraph

LangGraph allows building orchestrated workflows and agentic behaviors, essential for more complex RAG pipelines.
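
A minimal sketch of the same pipeline as an explicit two-node LangGraph workflow (again reusing `store`, `prompt`, and `llm` from above); real agentic pipelines would add branching, tool calls, and retries:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class RAGState(TypedDict):
    question: str
    context: str
    answer: str

def retrieve(state: RAGState) -> dict:
    docs = store.similarity_search(state["question"], k=4)
    return {"context": "\n\n".join(d.page_content for d in docs)}

def generate(state: RAGState) -> dict:
    msg = llm.invoke(prompt.invoke({"context": state["context"],
                                    "question": state["question"]}))
    return {"answer": msg.content}

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)
rag_app = graph.compile()
```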

LangSmith

We used LangSmith to debug, monitor, and test our RAG applications throughout development.
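
Enabling LangSmith tracing only takes a few environment variables; the project name below is a hypothetical one:

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "gemma-rag-devfest"  # hypothetical project name
# Any LangChain/LangGraph invocation after this point is traced automatically.
```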


Live Demo: Local RAG Chatbot with Gemma and Vertex AI

During the presentation, we built a fully functional RAG chatbot that:

  • Accepts user queries
  • Searches documents using vector search
  • Grounds the query with retrieved chunks
  • Generates an answer using Gemma

Screenshot: RAG Chatbot UI


Services Used

  • LangChain – Framework for building the RAG app
  • Cloud Run – Serverless deployment of both the indexing and inference pipelines
  • PostgreSQL on Cloud SQL – Vector database using pgvector
  • FastAPI + LangServe – Simple and scalable REST API interface (see the serving sketch below)
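
A sketch of how the chain built earlier could be exposed over REST with FastAPI and LangServe; the route path and port are placeholders:

```python
from fastapi import FastAPI
from langserve import add_routes

# Expose the rag_chain from the LangChain section as a REST endpoint.
app = FastAPI(title="Gemma RAG API")
add_routes(app, rag_chain, path="/rag")

if __name__ == "__main__":
    import uvicorn
    # Cloud Run routes traffic to the port your container listens on; 8080 is its default.
    uvicorn.run(app, host="0.0.0.0", port=8080)
```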

Conclusion

With open models like Gemma and flexible frameworks like LangChain, building a RAG system has become far more accessible. You can now create AI applications that are not only intelligent but also grounded in real, trustworthy data — whether you're working on a chatbot, documentation assistant, or enterprise search tool.

AI is no longer locked behind proprietary walls. With the right tools and frameworks, you can build, test, and deploy contextual LLM-powered apps in a weekend.


Feel free to reach out if you're building with LangChain or deploying RAG systems — always happy to connect and share ideas!
