A Simple Overview of The Modern RAG Developer’s Stack

Building or scaling AI-powered systems?

The Retrieval-Augmented Generation (RAG) approach is at the heart of many cutting-edge apps today. Here’s a concise yet detailed breakdown of the modern RAG developer’s stack: everything you need to glue together LLMs, knowledge bases, and pipelines that actually work in production.

1. LLMs (Large Language Models)

You need a high-quality “brain” for your RAG system. Choose between:

  • Open models (e.g., Llama 3.3, Mistral)
    • Pros: No per-call API fees, full control over fine-tuning, on-prem deployment for data privacy.
    • Cons: You’re responsible for hosting, scaling, and updates.
  • API-driven models (OpenAI’s GPT-4, Anthropic’s Claude, Google’s Gemini)
    • Pros: Serverless, always up-to-date, SLA-backed.
    • Cons: Costs add up with scale; data residency concerns.

Tip: Start with an open model locally (e.g., Llama 3.3 on Ollama) and switch to an API for production as traffic grows.
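
One way to keep that switch painless: Ollama exposes an OpenAI-compatible endpoint, so the same client code can target either backend. A minimal sketch, assuming Ollama is running locally and llama3.3 has been pulled:

    from openai import OpenAI

    # Point the standard OpenAI client at a local Ollama server
    # (Ollama serves an OpenAI-compatible API on port 11434)
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    response = client.chat.completions.create(
        model="llama3.3",  # assumes `ollama pull llama3.3` has been run
        messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
    )
    print(response.choices[0].message.content)

Swap base_url and model for a hosted provider when traffic grows; the rest of the code stays the same.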

2. Frameworks

Glue your components quickly—don’t reinvent the wheel:

  • LangChain

    • Provides chains (pipelines of prompts + logic), agents (LLM-driven decision makers), and built-in tools (search, calculators).
    • Example:
    from langchain_core.prompts import PromptTemplate
    from langchain_openai import OpenAI

    # Compose prompt and model with the pipe operator (LCEL);
    # the legacy LLMChain class is deprecated in recent LangChain releases
    template = PromptTemplate.from_template("Summarize: {text}")
    chain = template | OpenAI()
    print(chain.invoke({"text": "LangChain makes RAG easy!"}))
    
    
  • LlamaIndex (formerly GPT Index)

    • Builds document indices for fast retrieval and supports custom embeddings and query modes (minimal example after this list).
  • Haystack

    • An end-to-end RAG solution with Pipelines, Document Stores, and Inference APIs—great for multi-modal search (text, PDF, images).
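
To make the LlamaIndex bullet concrete, here is a minimal sketch (assuming llama-index is installed, an OPENAI_API_KEY is set for the default LLM and embeddings, and a local ./docs folder exists):

    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    # Load files from a local folder and build an in-memory vector index
    documents = SimpleDirectoryReader("./docs").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # Ask questions against the indexed documents
    query_engine = index.as_query_engine()
    print(query_engine.query("What are the key points in these docs?"))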

Pro tip: Mix & match—use Haystack’s document stores with LangChain’s chains for ultimate flexibility.

[Figure: RAG Developer's Stack]

3. Vector Databases

Your chunked knowledge needs a home with lightning-fast similarity search. Top contenders:

Database  Highlights
Chroma    Simple Python API, great for prototyping
Qdrant    Rust-based, gRPC & REST APIs, geo filtering
Weaviate  GraphQL & REST APIs, modular indexing plugins
Milvus    High performance, GPU acceleration

Selection criteria: query throughput, indexing speed, storage cost, and multi-tenant support. Always benchmark with your own data!
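
Since Chroma is the easiest to prototype with, here is a quick sketch of the add-then-query loop (Chroma embeds the documents with its default embedding function unless you supply your own):

    import chromadb

    client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to persist
    collection = client.create_collection("knowledge")

    # Chroma embeds these documents automatically on insert
    collection.add(
        documents=["LLMs generate answers.", "Vector databases store embeddings."],
        ids=["doc1", "doc2"],
    )

    results = collection.query(query_texts=["Where do embeddings live?"], n_results=1)
    print(results["documents"])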

4. Data Extraction

Feeding RAG means ingesting knowledge from diverse sources:

  • Web scraping: FireCrawl or MegaParser for JavaScript-rendered sites.
  • Document parsing: Docling, Apache Tika, or PDFMiner to extract text from PDFs, DOCX, and more.
  • APIs & databases: Custom connectors—GraphQL, SQL, NoSQL—to pull in structured data.

Workflow: crawl → clean → chunk → embed. Automate each step in your ETL pipeline (e.g., Airflow, Dagster).
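
The chunking step is the one most teams hand-roll first. A minimal sketch of a sliding-window chunker (chunk_text is a hypothetical helper; the size and overlap values are illustrative):

    def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        """Split text into overlapping character windows before embedding."""
        chunks = []
        start = 0
        step = chunk_size - overlap
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            start += step
        return chunks

Overlapping windows keep sentences that straddle a boundary retrievable from at least one chunk.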

5. LLM Access Layers

Decouple your code from specific providers:

  • Open LLM Hosts: Hugging Face (Inference API & Hub), Ollama (local model serving), Together AI (community models).
  • Cloud Providers: OpenAI, Google Vertex AI (Gemini), Anthropic (Claude).

Why it matters: swapping providers should be as easy as changing one config file.
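
A sketch of what that one-config swap can look like in practice, using OpenAI-compatible endpoints (the PROVIDERS mapping and environment variable names are illustrative, not a standard):

    import os
    from openai import OpenAI

    # Hypothetical provider registry: one env var flips the backend
    PROVIDERS = {
        "openai": {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"},
        "ollama": {"base_url": "http://localhost:11434/v1", "model": "llama3.3"},
    }

    cfg = PROVIDERS[os.environ.get("LLM_PROVIDER", "ollama")]
    client = OpenAI(
        base_url=cfg["base_url"],
        api_key=os.environ.get("LLM_API_KEY", "ollama"),
    )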

6. Text Embeddings

Quality of retrieval hinges on embeddings. Popular models:

  • Sentence-BERT (SBERT): fast, widely used for semantic similarity.
  • BGE (BAAI General Embedding): optimized for large-scale corpora.
  • OpenAI Embeddings: strong accuracy, but paid.
  • Google’s Embedding API: balanced cost/performance.
  • Cohere Embeddings: competitive pricing, simple SDK.
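
Generating embeddings with SBERT takes only a few lines via the sentence-transformers library (all-MiniLM-L6-v2 is a common default checkpoint; any SBERT model works the same way):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode([
        "RAG retrieves context before generation.",
        "Embeddings map text to vectors.",
    ])
    print(embeddings.shape)  # (2, 384) for this model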

Best practice: evaluate embedding models by measuring Recall@k and MRR (Mean Reciprocal Rank) on your own retrieval tasks.
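
Both metrics are a few lines to implement yourself. A sketch (the doc IDs and relevant set are whatever your eval data defines):

    def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        """Fraction of relevant documents that appear in the top-k results."""
        hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
        return hits / len(relevant) if relevant else 0.0

    def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
        """1/rank of the first relevant hit; average over queries to get MRR."""
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                return 1.0 / rank
        return 0.0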

7. Evaluation

You can’t improve what you don’t measure. Key tools & metrics:

  • Tools
    • RAGas: end-to-end RAG evaluation pipelines (example after this list).
    • Giskard: model testing with explainability & bias detection.
    • TruLens: LLM observability—track prompts, tokens, and outcomes.
  • Metrics
    • Relevance: Precision@k, Recall@k
    • Accuracy: Exact match, ROUGE, BLEU
    • Latency & Cost: Avg response time, tokens per request
    • Quality: Human evaluations, coherence, hallucination rate
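
As an example of the tooling, a minimal Ragas run looks roughly like this (API as of Ragas 0.1.x; it calls an LLM under the hood, so an OPENAI_API_KEY is expected by default):

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, faithfulness

    # One toy sample: question, generated answer, and the retrieved contexts
    data = Dataset.from_dict({
        "question": ["What is RAG?"],
        "answer": ["RAG augments an LLM with retrieved context."],
        "contexts": [["Retrieval-Augmented Generation grounds answers in retrieved documents."]],
    })

    print(evaluate(data, metrics=[faithfulness, answer_relevancy]))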

Dashboard idea: log eval metrics to Grafana/Prometheus for continuous monitoring.
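
A sketch of the export side with prometheus_client (the metric names are illustrative; Grafana then charts whatever Prometheus scrapes):

    from prometheus_client import Gauge, start_http_server

    # Expose a /metrics endpoint for Prometheus to scrape
    start_http_server(8000)

    recall_gauge = Gauge("rag_recall_at_5", "Recall@5 on the nightly eval set")
    recall_gauge.set(0.82)  # update after each evaluation run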

Visual Overview

+--------------+     +-----------------+     +--------------+
|   LLM / API  |<--->|    Framework    |<--->|  Vector DB   |
+--------------+     +-----------------+     +--------------+
  Access Layer        Chains & Embeds           Agents
  (OpenAI, HF)         (SBERT, BGE)
        ↓                    ↓
+-----------------------------------------------+
|        Data Extraction → ETL → Chunking       |
+-----------------------------------------------+
                        ↓
                   Evaluation
      (RAGas, Giskard, TruLens / Metrics)

Whether you’re prototyping or scaling, this modern RAG stack ensures you have the right building blocks for high-performance, reliable AI applications.

Ready to spin up your next RAG project? Drop a comment or share your favorite tool!
