🚀 How I Aced My LLM Interview: Building a RAG Chatbot

Mai Chi Bao

I recently aced an LLM engineer interview with a 9/10 score, and one of the tasks was designing a customer support chatbot using Retrieval-Augmented Generation (RAG). Here’s how I tackled it, building a scalable, privacy-focused system that delivers fast, accurate responses. Let’s dive into creating a RAG-based chatbot, step by step, with open-source tools and a user-friendly interface.

Table of Contents

  • Introduction
  • Requirements
  • System Architecture
  • Model Selection
  • RAG Implementation
  • Deployment Strategy
  • Monitoring and Analytics
  • Cost Estimation
  • Challenges and Mitigations
  • Conclusion
  • Resources

Introduction

Chatbots are game-changers for customer support, automating answers and saving time. Unlike old-school chatbots with rigid responses, RAG combines a retriever to fetch relevant data and a generator to craft natural replies. In this guide, we’ll build a chatbot that answers queries using company documents, supports 100+ users, keeps latency under 2 seconds, and stays cost-effective. We’ll use open-source models from Hugging Face and a slick Gradio interface.

Requirements

Task Description: Design a production-ready system for a customer support chatbot that uses fine-tuned large language models (LLMs) to answer queries based on company documentation. The specific requirements and deliverables were as follows.

Requirements:

  • The system should use open-source models to maintain data privacy.
  • It must handle 100+ concurrent users.
  • Responses should be grounded in company documentation (no hallucinations).
  • Response latency should be under 2 seconds.
  • Usage analytics should track query types and satisfaction.
  • The system should be cost-effective and scalable.

Deliverables: Provide a detailed design document that includes:

  • Architecture diagram showing all components and data flow.
  • Model selection justification (which models and why).
  • RAG implementation details for retrieving relevant documentation.
  • Deployment strategy including infrastructure and scaling approach.
  • Monitoring and evaluation plan.
  • Cost estimation and optimization strategies.
  • Potential challenges and mitigation approaches.

System Architecture

The chatbot’s architecture is modular and built for scale. Here’s how it works:

  • Frontend (React): Users interact via a web interface, uploading documents or asking questions.
  • API Gateway (NGINX): Handles HTTP requests and WebSocket connections for real-time chats.
  • Kafka: Streams queries and logs for processing and analytics.
  • Backend Service (FastAPI): Manages queries, document uploads, and coordination.
  • LLM Service: Runs fine-tuned Llama-2 models (7B and 13B) on AWS GPUs.
  • RAG System: Uses Milvus for vector search and BGE-M3 for embeddings.
  • Databases: MinIO for documents, Milvus for embeddings, MongoDB for metadata and logs.
  • Analytics & Monitoring: Tracks queries and performance with Prometheus and Grafana.

How Data Flows:

  1. Users upload documents (up to 5, max 10 MB each) via the frontend.
  2. The backend stores files in MinIO, extracts text with Apache Tika, generates embeddings with BGE-M3, and saves them in Milvus.
  3. For queries, the backend retrieves relevant documents from Milvus, picks the right LLM (Llama-7B or 13B) based on query complexity, and generates a response.
  4. Responses stream back to users via WebSockets, while Kafka logs queries for analytics.

Diagram (simplified):

[React Frontend] --> [NGINX Gateway] --> [Kafka] --> [FastAPI Backend]
                                              |--> [Milvus (RAG)]
                                              |--> [Llama-2 (LLM)]
                                              |--> [MinIO, MongoDB]
                                              |--> [Prometheus, Grafana]
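To make the backend piece concrete, here's a minimal sketch of the FastAPI service accepting a query over a WebSocket and streaming the answer back. `retrieve_context` and `generate_stream` are hypothetical placeholders for the RAG and LLM calls covered later, not the exact production code.

```python
# Minimal sketch, not the full production service: a FastAPI WebSocket
# endpoint that receives a query, calls hypothetical RAG/LLM helpers,
# and streams the answer back chunk by chunk.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def retrieve_context(query: str) -> str:
    """Placeholder for the Milvus retrieval step described below."""
    return "...relevant document chunks..."

async def generate_stream(query: str, context: str):
    """Placeholder for the Llama-2 generation step; yields text chunks."""
    for chunk in ["This ", "is ", "a ", "streamed ", "answer."]:
        yield chunk

@app.websocket("/ws/chat")
async def chat(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            query = await ws.receive_text()
            context = await retrieve_context(query)
            async for chunk in generate_stream(query, context):
                await ws.send_text(chunk)      # stream partial output to the UI
            await ws.send_text("[DONE]")       # end-of-response marker
    except WebSocketDisconnect:
        pass
```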

Model Selection

We chose open-source models for privacy and performance:

  • Llama-2 (7B and 13B): Great for natural language tasks, multilingual support, and fine-tuning with tools like Unsloth. The 7B model handles simple queries; 13B tackles complex ones.
  • BGE-M3: A multilingual embedding model for accurate document retrieval across 100+ languages.
  • sentence-transformers/all-MiniLM-L6-v2: A lightweight model to gauge query complexity, routing easy queries to Llama-7B and tough ones to Llama-13B.

Why These Models?

  • Llama-2 is open-source, ensuring data privacy, and performs well for RAG tasks.
  • BGE-M3 excels in multilingual embeddings, perfect for diverse documents.
  • MiniLM-L6-v2 is fast and efficient for query routing.
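Here's a rough sketch of the routing idea mentioned above: use MiniLM to score how well the retrieved context matches the query and send low-similarity (harder) queries to the bigger model. The 0.6 threshold and the returned model-name strings are illustrative assumptions, not values from the interview answer.

```python
# Sketch: route a query to Llama-2-7B or 13B based on how well the
# retrieved context matches the query (low similarity ~ harder query).
# The threshold below is an assumption for illustration.
from sentence_transformers import SentenceTransformer, util

router_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def pick_llm(query: str, context: str, threshold: float = 0.6) -> str:
    q_emb = router_model.encode(query, convert_to_tensor=True)
    c_emb = router_model.encode(context, convert_to_tensor=True)
    similarity = util.cos_sim(q_emb, c_emb).item()
    # High similarity: the context clearly covers the query -> cheaper 7B model.
    return "llama-2-7b" if similarity >= threshold else "llama-2-13b"
```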

RAG Implementation

RAG ensures responses stick to company documents, avoiding made-up answers:

  1. Document Processing:
    • Users upload files (doc, pdf, text) via the frontend.
    • FastAPI stores files in MinIO, extracts text with Apache Tika, and splits it into chunks.
    • BGE-M3 creates embeddings, stored in Milvus with metadata in MongoDB (see the ingestion sketch after this list).
  2. Query Handling:
    • Queries are converted to embeddings with BGE-M3.
    • Milvus retrieves top-k documents, filtered by user-selected files if specified.
    • MiniLM-L6-v2 checks query-document similarity to pick the LLM.
    • The LLM generates a response using an anti-hallucination prompt: “Answer only from the provided context or say ‘I don’t know.’”
    • If no relevant documents are found, the user gets: “No relevant information found.” (see the query-handling sketch after this list).
  3. Optimizations:
    • Cache frequent queries to reduce LLM calls.
    • Use two-step retrieval for precision.
    • Update Milvus regularly for new documents.
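A minimal sketch of the document-processing step (step 1 above), assuming the `tika` Python bindings for text extraction, a Milvus instance reachable via `MilvusClient`, a hypothetical `support_docs` collection, and naive fixed-size chunking:

```python
# Sketch of the ingestion path: extract text with Apache Tika, chunk it,
# embed the chunks with BGE-M3, and store them in Milvus. Collection name,
# chunk size, and the naive id scheme are assumptions for illustration.
from tika import parser                          # requires a local Tika server/JVM
from sentence_transformers import SentenceTransformer
from pymilvus import MilvusClient

embedder = SentenceTransformer("BAAI/bge-m3")    # 1024-dim dense embeddings
milvus = MilvusClient(uri="http://localhost:19530")
milvus.create_collection(collection_name="support_docs", dimension=1024)

def chunk_text(text: str, size: int = 500, overlap: int = 50):
    """Naive fixed-size character chunking with a small overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step) if text[i:i + size].strip()]

def ingest_document(path: str, doc_id: str) -> int:
    text = parser.from_file(path).get("content") or ""
    chunks = chunk_text(text)
    vectors = embedder.encode(chunks)            # one embedding per chunk
    rows = [
        {
            "id": abs(hash(f"{doc_id}-{i}")) % (2**62),  # naive id scheme for the sketch
            "vector": vec.tolist(),
            "text": chunk,
            "doc_id": doc_id,                    # lets queries filter by selected files
        }
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ]
    milvus.insert(collection_name="support_docs", data=rows)
    return len(rows)
```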
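And a matching sketch of query handling (step 2), reusing the `embedder` and `milvus` objects from the ingestion sketch. The prompt wording, `TOP_K`, and the `llm_generate` callable are placeholders standing in for the fine-tuned Llama-2 service:

```python
# Sketch of query handling, continuing from the ingestion sketch above
# (reuses `embedder` and `milvus`). Prompt wording and TOP_K are illustrative.
TOP_K = 3

ANTI_HALLUCINATION_PROMPT = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say \"I don't know.\"\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def answer_query(question: str, llm_generate) -> str:
    """`llm_generate` is a placeholder for the fine-tuned Llama-2 call."""
    q_vec = embedder.encode([question])[0].tolist()
    hits = milvus.search(
        collection_name="support_docs",
        data=[q_vec],
        limit=TOP_K,
        output_fields=["text"],
    )[0]
    if not hits:
        return "No relevant information found."
    context = "\n\n".join(hit["entity"]["text"] for hit in hits)
    prompt = ANTI_HALLUCINATION_PROMPT.format(context=context, question=question)
    return llm_generate(prompt)
```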

Deployment Strategy

We deploy on AWS for scalability:

  • LLM Service: Runs on g4dn.12xlarge instances (4 T4 GPUs) with 4-bit quantization for efficiency.
  • Other Services: Use EC2 instances (e.g., m5.large for FastAPI, t3.medium for databases).
  • Tools: Docker for containerization, Kubernetes for orchestration, and NGINX for load balancing.
  • Scalability: Five g4dn instances handle 100+ users, with Kubernetes auto-scaling based on load.
  • Latency: Milvus retrieval (~0.1s) + LLM generation (~1.5s) keeps responses under 2 seconds.
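For the quantization point, here's a minimal sketch of loading a Llama-2 checkpoint in 4-bit with Hugging Face transformers and bitsandbytes. The model path and generation settings are placeholders; in production this would sit behind a batching server such as vLLM rather than a bare `generate` call.

```python
# Sketch: loading a fine-tuned Llama-2 checkpoint with 4-bit quantization
# so it fits on T4 GPUs. Model path and generation settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "meta-llama/Llama-2-7b-chat-hf"   # or your fine-tuned checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",              # spread layers across available GPUs
)

def llm_generate(prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```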

Monitoring and Analytics

We track performance and user satisfaction:

  • Monitoring: Prometheus and Grafana monitor latency, CPU, and GPU usage.
  • Analytics: Kafka logs queries, and a custom pipeline categorizes query types and collects feedback (e.g., rating buttons).
  • Evaluation: Periodic audits ensure response accuracy as documents change.
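As a sketch of the metrics side, the backend could expose Prometheus counters and histograms like the following (metric names and the port are assumptions):

```python
# Sketch: exposing latency, query-type, and feedback metrics for Prometheus
# to scrape. Metric names and the port are assumptions.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("chatbot_request_latency_seconds", "End-to-end response latency")
QUERY_COUNT = Counter("chatbot_queries_total", "Queries by category", ["category"])
FEEDBACK = Counter("chatbot_feedback_total", "Ratings from the feedback buttons", ["rating"])

start_http_server(9100)                      # serves /metrics on port 9100

def handle_query(question: str, category: str = "general") -> str:
    QUERY_COUNT.labels(category=category).inc()
    start = time.perf_counter()
    answer = "..."                           # call the RAG pipeline here
    REQUEST_LATENCY.observe(time.perf_counter() - start)
    return answer

def record_feedback(rating: str) -> None:    # e.g. "up" or "down"
    FEEDBACK.labels(rating=rating).inc()
```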

Cost Estimation

Here’s the monthly cost breakdown:

| Component | Description | Cost (USD/month, approx.) |
| --- | --- | --- |
| LLM Service | 5 x g4dn.12xlarge | $1,417.00 |
| Backend Service | 2 x m5.large | $201.00 |
| MongoDB | t3.medium | $16.52 |
| MinIO | t3.medium | $16.52 |
| Milvus | m5.large | $100.80 |
| Kafka | 3 x m5.large | $301.44 |
| Analytics | t3.medium | $16.52 |
| **Total** | | **$2,069.80** |

Cost-Saving Tips:

  • Use 4-bit quantization to fit models on T4 GPUs.
  • Cache queries and batch requests with vLLM.
  • Consider AWS spot instances for non-critical tasks.
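The caching tip can be as simple as the sketch below: answer repeated questions from a small in-process cache instead of hitting the LLM again. The TTL and size limits are illustrative; across multiple replicas you'd want a shared store such as Redis instead.

```python
# Sketch of the query-caching idea: serve repeated questions from a small
# in-process cache instead of calling the LLM again. TTL and size limits
# are illustrative assumptions.
import time

_CACHE: dict[str, tuple[float, str]] = {}
CACHE_TTL_SECONDS = 3600
CACHE_MAX_ENTRIES = 10_000

def cached_answer(question: str, answer_fn) -> str:
    key = " ".join(question.lower().split())           # normalize case and whitespace
    hit = _CACHE.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                                   # cache hit: skip the LLM call
    answer = answer_fn(question)
    if len(_CACHE) < CACHE_MAX_ENTRIES:
        _CACHE[key] = (time.time(), answer)
    return answer
```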

Challenges and Mitigations

| Challenge | Mitigation |
| --- | --- |
| Irrelevant retrieval | Fine-tune BGE-M3; allow document selection. |
| No relevant documents | Notify users and offer a human-support fallback. |
| Outdated models | Automate fine-tuning with new documents. |
| Scalability issues | Use Kubernetes auto-scaling and load testing. |
| Security/privacy | Encrypt data and restrict access to internal networks. |
| Cost overruns | Monitor with Cast AI and optimize resources. |

Conclusion

This RAG-based chatbot delivers fast, accurate, and scalable customer support by combining Llama-2’s language prowess with BGE-M3’s retrieval accuracy. It handles 100+ users with sub-2-second responses, stays grounded in company documents, and keeps costs low with open-source models and AWS optimization. Whether you’re automating support or exploring RAG, this setup is a solid starting point, adaptable to evolving needs.

Resources

finetune-tinyllama-lora/system_design
