I recently aced an LLM engineer interview with a 9/10 score, and one of the tasks was designing a customer support chatbot using Retrieval-Augmented Generation (RAG). Here’s how I tackled it, building a scalable, privacy-focused system that delivers fast, accurate responses. Let’s dive into creating a RAG-based chatbot, step by step, with open-source tools and a user-friendly interface.
Table of Contents
- Introduction
- Requirements
- System Architecture
- Model Selection
- RAG Implementation
- Deployment Strategy
- Monitoring and Analytics
- Cost Estimation
- Challenges and Mitigations
- Conclusion
Introduction
Chatbots are game-changers for customer support, automating answers and saving time. Unlike old-school chatbots with rigid, scripted responses, RAG combines a retriever that fetches relevant data with a generator that crafts natural replies. In this guide, we’ll build a chatbot that answers queries from company documents, supports 100+ concurrent users, keeps latency under 2 seconds, and stays cost-effective, using open-source models from Hugging Face behind a simple web frontend.
Requirements
The interview task, requirements, and deliverables are reproduced below.
Task Description: Design a production-ready system for a customer support chatbot that uses fine-tuned LLMs to answer queries based on company documentation.
Requirements:
- The system should use open-source models to maintain data privacy.
- It must handle 100+ concurrent users.
- Responses should be grounded in company documentation (no hallucinations).
- Response latency should be under 2 seconds.
- Usage analytics should track query types and satisfaction.
- The system should be cost-effective and scalable.
Deliverables: Provide a detailed design document that includes:
- Architecture diagram showing all components and data flow.
- Model selection justification (which models and why).
- RAG implementation details for retrieving relevant documentation.
- Deployment strategy including infrastructure and scaling approach.
- Monitoring and evaluation plan.
- Cost estimation and optimization strategies.
- Potential challenges and mitigation approaches.
System Architecture
The chatbot’s architecture is modular and built for scale. Here’s how it works:
- Frontend (React): Users interact via a web interface, uploading documents or asking questions.
- API Gateway (NGINX): Handles HTTP requests and WebSocket connections for real-time chats.
- Kafka: Streams queries and logs for processing and analytics.
- Backend Service (FastAPI): Manages queries, document uploads, and coordination.
- LLM Service: Runs fine-tuned Llama-2 models (7B and 13B) on AWS GPUs.
- RAG System: Uses Milvus for vector search and BGE-M3 for embeddings.
- Databases: MinIO for documents, Milvus for embeddings, MongoDB for metadata and logs.
- Analytics & Monitoring: Tracks queries and performance with Prometheus and Grafana.
How Data Flows:
- Users upload documents (up to 5, max 10 MB each) via the frontend.
- The backend stores files in MinIO, extracts text with Apache Tika, generates embeddings with BGE-M3, and saves them in Milvus.
- For queries, the backend retrieves relevant documents from Milvus, picks the right LLM (Llama-7B or 13B) based on query complexity, and generates a response.
- Responses stream back to users via WebSockets, while Kafka logs queries for analytics.
Diagram (simplified):
[React Frontend] --> [NGINX Gateway] --> [Kafka] --> [FastAPI Backend]
                                                         |--> [Milvus (RAG)]
                                                         |--> [Llama-2 (LLM)]
                                                         |--> [MinIO, MongoDB]
                                                         |--> [Prometheus, Grafana]
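To make the flow concrete, here’s a minimal FastAPI sketch of the query path. The three helpers are stubs standing in for the Milvus retriever, the model router, and the LLM service; the endpoint and field names are my own placeholders, not part of the original design.

```python
# Minimal sketch of the query path (illustrative only).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str
    document_ids: list[str] | None = None  # optional user-selected files

def retrieve_context(question: str, doc_ids) -> list[str]:
    return []  # stub: would fetch top-k chunks from Milvus

def route_model(question: str, context: list[str]) -> str:
    return "llama-2-7b"  # stub: would use MiniLM similarity to pick 7B vs 13B

def generate_answer(model: str, question: str, context: list[str]) -> str:
    return "..."  # stub: would call the LLM service with a grounded prompt

@app.post("/chat")
async def chat(query: Query):
    # 1. Retrieve relevant chunks from the vector store.
    context = retrieve_context(query.question, query.document_ids)
    if not context:
        return {"answer": "No relevant information found."}
    # 2. Pick Llama-2 7B or 13B based on estimated query complexity.
    model = route_model(query.question, context)
    # 3. Generate a grounded answer and stream/return it.
    return {"answer": generate_answer(model, query.question, context), "model": model}
```

In production the response would stream over the WebSocket connection, with the query logged to Kafka on the way out.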
Model Selection
We chose open-source models for privacy and performance:
- Llama-2 (7B and 13B): Great for natural language tasks, multilingual support, and fine-tuning with tools like Unsloth. The 7B model handles simple queries; 13B tackles complex ones.
- BGE-M3: A multilingual embedding model for accurate document retrieval across 100+ languages.
- sentence-transformers/all-MiniLM-L6-v2: A lightweight model to gauge query complexity, routing easy queries to Llama-7B and tough ones to Llama-13B.
Why These Models?
- Llama-2 is open-source, ensuring data privacy, and performs well for RAG tasks.
- BGE-M3 excels in multilingual embeddings, perfect for diverse documents.
- MiniLM-L6-v2 is fast and efficient for query routing.
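Here’s a rough sketch of how the MiniLM-based routing could look with sentence-transformers: score how well the best retrieved chunk covers the query and send high-overlap (easy) queries to Llama-2 7B. The 0.6 threshold is an assumption, not a tuned value from the original design.

```python
# Route queries between Llama-2 7B and 13B using a small embedding model.
from sentence_transformers import SentenceTransformer, util

router = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def pick_llm(query: str, top_chunk: str, threshold: float = 0.6) -> str:
    q_emb, c_emb = router.encode([query, top_chunk], convert_to_tensor=True)
    similarity = util.cos_sim(q_emb, c_emb).item()
    # High overlap with the best chunk -> likely a simple lookup -> 7B is enough.
    return "llama-2-7b" if similarity >= threshold else "llama-2-13b"

print(pick_llm("How do I reset my password?",
               "To reset your password, open Settings and choose Reset Password."))
```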
RAG Implementation
RAG ensures responses stick to company documents, avoiding made-up answers:
Document Processing:
- Users upload files (doc, pdf, text) via the frontend.
- FastAPI stores files in MinIO, extracts text with Apache Tika, and splits it into chunks.
- BGE-M3 creates embeddings, stored in Milvus with metadata in MongoDB.
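Here’s a minimal sketch of that ingestion step, assuming BGE-M3 is loaded via the FlagEmbedding package and Milvus is reached through pymilvus’ MilvusClient. The character-based chunking and the "support_docs" collection (with a dynamic schema) are illustrative assumptions.

```python
# Chunk extracted text, embed it with BGE-M3, and store vectors in Milvus.
from FlagEmbedding import BGEM3FlagModel
from pymilvus import MilvusClient

embedder = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
milvus = MilvusClient(uri="http://localhost:19530")

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunks with a small overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def index_document(doc_id: str, text: str, collection: str = "support_docs") -> None:
    chunks = chunk_text(text)
    vectors = embedder.encode(chunks)["dense_vecs"]
    rows = [
        {"doc_id": doc_id, "chunk": chunk, "vector": vec.tolist()}
        for chunk, vec in zip(chunks, vectors)
    ]
    milvus.insert(collection_name=collection, data=rows)
```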
Query Handling:
- Queries are converted to embeddings with BGE-M3.
- Milvus retrieves top-k documents, filtered by user-selected files if specified.
- MiniLM-L6-v2 checks query-document similarity to pick the LLM.
- The LLM generates a response using an anti-hallucination prompt: “Answer only from the provided context or say ‘I don’t know.’”
- If no relevant documents are found, the user gets: “No relevant information found.”
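Continuing the ingestion sketch above (it reuses `embedder` and `milvus`), a query could be handled roughly like this. `call_llm` is a stub for the LLM service, and the collection and field names are assumptions.

```python
# Embed the question, search Milvus, and build an anti-hallucination prompt.
def answer_query(question: str, top_k: int = 5) -> str:
    q_vec = embedder.encode([question])["dense_vecs"][0].tolist()
    hits = milvus.search(
        collection_name="support_docs",
        data=[q_vec],
        limit=top_k,
        output_fields=["chunk"],
        # a filter expression could restrict the search to user-selected doc_ids
    )[0]
    if not hits:
        return "No relevant information found."
    context = "\n\n".join(hit["entity"]["chunk"] for hit in hits)
    prompt = (
        "Answer only from the provided context or say 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    model = pick_llm(question, hits[0]["entity"]["chunk"])  # router from Model Selection
    return call_llm(model, prompt)

def call_llm(model: str, prompt: str) -> str:
    return "..."  # stub: would hit the Llama-2 serving endpoint
```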
Optimizations:
- Cache frequent queries to reduce LLM calls.
- Use two-step retrieval for precision.
- Update Milvus regularly for new documents.
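The caching idea can be as simple as hashing the normalized question and reusing a recent answer. A production setup would more likely use Redis; the in-memory dict and one-hour TTL below are just assumptions for illustration.

```python
# Reuse answers for repeated questions to avoid redundant LLM calls.
import hashlib
import time

_cache: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 3600  # assumed time-to-live for cached answers

def cached_answer(question: str, compute) -> str:
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: skip retrieval and generation
    answer = compute(question)
    _cache[key] = (time.time(), answer)
    return answer
```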
Deployment Strategy
We deploy on AWS for scalability:
- LLM Service: Runs on g4dn.12xlarge instances (4 T4 GPUs) with 4-bit quantization for efficiency.
- Other Services: Use EC2 instances (e.g., m5.large for FastAPI, t3.medium for databases).
- Tools: Docker for containerization, Kubernetes for orchestration, and NGINX for load balancing.
- Scalability: Five g4dn instances handle 100+ users, with Kubernetes auto-scaling based on load.
- Latency: Milvus retrieval (~0.1s) + LLM generation (~1.5s) keeps responses under 2 seconds.
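For reference, loading Llama-2 7B in 4-bit with bitsandbytes so it fits comfortably on a T4 looks roughly like this (it assumes access to the gated meta-llama weights on Hugging Face); serving through vLLM with AWQ/GPTQ quantization is an alternative path.

```python
# Load Llama-2 7B with 4-bit NF4 quantization via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)
```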
Monitoring and Analytics
We track performance and user satisfaction:
- Monitoring: Prometheus and Grafana monitor latency, CPU, and GPU usage.
- Analytics: Kafka logs queries, and a custom pipeline categorizes query types and collects feedback (e.g., rating buttons).
- Evaluation: Periodic audits ensure response accuracy as documents change.
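On the metrics side, the backend can expose counters and histograms with prometheus_client for Prometheus to scrape and Grafana to visualize. The metric names, labels, and port below are illustrative assumptions.

```python
# Expose latency and query-type metrics from the backend process.
import time
from prometheus_client import Counter, Histogram, start_http_server

QUERIES = Counter("chatbot_queries_total", "Total queries", ["query_type", "model"])
LATENCY = Histogram("chatbot_response_seconds", "End-to-end response latency")

start_http_server(9100)  # serves /metrics for Prometheus to scrape

def handle_query(question: str) -> str:
    start = time.time()
    answer = "..."  # stub: retrieval + generation would happen here
    LATENCY.observe(time.time() - start)
    QUERIES.labels(query_type="billing", model="llama-2-7b").inc()
    return answer
```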
Cost Estimation
Here’s the monthly cost breakdown:
| Component | Instances | Cost (~USD/month) |
| --- | --- | --- |
| LLM Service | 5 x g4dn.12xlarge | $1,417.00 |
| Backend Service | 2 x m5.large | $201.00 |
| MongoDB | t3.medium | $16.52 |
| MinIO | t3.medium | $16.52 |
| Milvus | m5.large | $100.80 |
| Kafka | 3 x m5.large | $301.44 |
| Analytics | t3.medium | $16.52 |
| Total | | $2,069.80 |
Cost-Saving Tips:
- Use 4-bit quantization to fit models on T4 GPUs.
- Cache queries and batch requests with vLLM.
- Consider AWS spot instances for non-critical tasks.
Challenges and Mitigations
| Challenge | Mitigation |
| --- | --- |
| Irrelevant retrieval | Fine-tune BGE-M3; allow document selection. |
| No relevant documents | Notify users and offer a human-support fallback. |
| Outdated models | Automate fine-tuning as new documents arrive. |
| Scalability issues | Use Kubernetes auto-scaling and load testing. |
| Security/privacy | Encrypt data and restrict access to internal networks. |
| Cost overruns | Monitor with Cast AI and optimize resources. |
Conclusion
This RAG-based chatbot delivers fast, accurate, and scalable customer support by combining Llama-2’s language prowess with BGE-M3’s retrieval accuracy. It handles 100+ users with sub-2-second responses, stays grounded in company documents, and keeps costs low with open-source models and AWS optimization. Whether you’re automating support or exploring RAG, this setup is a solid starting point, adaptable to evolving needs.