I recently aced an LLM engineer interview with a 9/10 score, and one of the tasks was designing a customer support chatbot using Retrieval-Augmented Generation (RAG). Here’s how I tackled it, building a scalable, privacy-focused system that delivers fast, accurate responses. Let’s dive into creating a RAG-based chatbot, step by step, with open-source tools and a user-friendly interface.
Table of Contents
- Introduction
- Requirements
- System Architecture
- Model Selection
- RAG Implementation
- Deployment Strategy
- Monitoring and Analytics
- Cost Estimation
- Challenges and Mitigations
- Conclusion
Introduction
Chatbots are game-changers for customer support, automating answers and saving time. Unlike old-school chatbots with rigid, scripted responses, RAG combines a retriever that fetches relevant data with a generator that crafts natural replies. In this guide, we’ll build a chatbot that answers queries from company documents, supports 100+ concurrent users, keeps latency under 2 seconds, and stays cost-effective, using open-source models from Hugging Face behind a simple web frontend.
Requirements
The interview task, requirements, and deliverables are reproduced below.
Task Description: Design a production-ready system for a customer support chatbot that uses fine-tuned LLMs to answer queries based on company documentation.
Requirements:
- The system should use open-source models to maintain data privacy.
- It must handle 100+ concurrent users.
- Responses should be grounded in company documentation (no hallucinations).
- Response latency should be under 2 seconds.
- Usage analytics should track query types and satisfaction.
- The system should be cost-effective and scalable.
Deliverables: Provide a detailed design document that includes:
- Architecture diagram showing all components and data flow.
- Model selection justification (which models and why).
- RAG implementation details for retrieving relevant documentation.
- Deployment strategy including infrastructure and scaling approach.
- Monitoring and evaluation plan.
- Cost estimation and optimization strategies.
- Potential challenges and mitigation approaches.
System Architecture
The chatbot’s architecture is modular and built for scale. Here’s how it works:
- Frontend (React): Users interact via a web interface, uploading documents or asking questions.
- API Gateway (NGINX): Handles HTTP requests and WebSocket connections for real-time chats.
- Kafka: Streams queries and logs for processing and analytics.
- Backend Service (FastAPI): Manages queries, document uploads, and coordination.
- LLM Service: Runs fine-tuned Llama-2 models (7B and 13B) on AWS GPUs.
- RAG System: Uses Milvus for vector search and BGE-M3 for embeddings.
- Databases: MinIO for documents, Milvus for embeddings, MongoDB for metadata and logs.
- Analytics & Monitoring: Tracks queries and performance with Prometheus and Grafana.
How Data Flows:
- Users upload documents (up to 5, max 10 MB each) via the frontend.
- The backend stores files in MinIO, extracts text with Apache Tika, generates embeddings with BGE-M3, and saves them in Milvus.
- For queries, the backend retrieves relevant documents from Milvus, picks the right LLM (Llama-7B or 13B) based on query complexity, and generates a response.
- Responses stream back to users via WebSockets, while Kafka logs queries for analytics.
Diagram (simplified):
[React Frontend] --> [NGINX Gateway] --> [Kafka] --> [FastAPI Backend]
                                                         |--> [Milvus (RAG)]
                                                         |--> [Llama-2 (LLM)]
                                                         |--> [MinIO, MongoDB]
                                                         |--> [Prometheus, Grafana]
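To make the flow concrete, here’s a minimal FastAPI sketch of the query path. The three helpers are stubs standing in for the Milvus retriever, the model router, and the LLM service; the endpoint and field names are my own placeholders, not part of the original design.

```python
# Minimal sketch of the query path (illustrative only).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str
    document_ids: list[str] | None = None  # optional user-selected files

def retrieve_context(question: str, doc_ids) -> list[str]:
    return []  # stub: would fetch top-k chunks from Milvus

def route_model(question: str, context: list[str]) -> str:
    return "llama-2-7b"  # stub: would use MiniLM similarity to pick 7B vs 13B

def generate_answer(model: str, question: str, context: list[str]) -> str:
    return "..."  # stub: would call the LLM service with a grounded prompt

@app.post("/chat")
async def chat(query: Query):
    # 1. Retrieve relevant chunks from the vector store.
    context = retrieve_context(query.question, query.document_ids)
    if not context:
        return {"answer": "No relevant information found."}
    # 2. Pick Llama-2 7B or 13B based on estimated query complexity.
    model = route_model(query.question, context)
    # 3. Generate a grounded answer and stream/return it.
    return {"answer": generate_answer(model, query.question, context), "model": model}
```

In production the response would stream over the WebSocket connection, with the query logged to Kafka on the way out.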
Model Selection
We chose open-source models for privacy and performance:
- Llama-2 (7B and 13B): Great for natural language tasks, multilingual support, and fine-tuning with tools like Unsloth. The 7B model handles simple queries; 13B tackles complex ones.
- BGE-M3: A multilingual embedding model for accurate document retrieval across 100+ languages.
- sentence-transformers/all-MiniLM-L6-v2: A lightweight model to gauge query complexity, routing easy queries to Llama-7B and tough ones to Llama-13B.
Why These Models?
- Llama-2 is open-source, ensuring data privacy, and performs well for RAG tasks.
- BGE-M3 excels in multilingual embeddings, perfect for diverse documents.
- MiniLM-L6-v2 is fast and efficient for query routing.
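Here’s a rough sketch of how the MiniLM-based routing could look with sentence-transformers: score how well the best retrieved chunk covers the query and send high-overlap (easy) queries to Llama-2 7B. The 0.6 threshold is an assumption, not a tuned value from the original design.

```python
# Route queries between Llama-2 7B and 13B using a small embedding model.
from sentence_transformers import SentenceTransformer, util

router = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def pick_llm(query: str, top_chunk: str, threshold: float = 0.6) -> str:
    q_emb, c_emb = router.encode([query, top_chunk], convert_to_tensor=True)
    similarity = util.cos_sim(q_emb, c_emb).item()
    # High overlap with the best chunk -> likely a simple lookup -> 7B is enough.
    return "llama-2-7b" if similarity >= threshold else "llama-2-13b"

print(pick_llm("How do I reset my password?",
               "To reset your password, open Settings and choose Reset Password."))
```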
RAG Implementation
RAG ensures responses stick to company documents, avoiding made-up answers:
Document Processing:
- Users upload files (doc, pdf, text) via the frontend.
- FastAPI stores files in MinIO, extracts text with Apache Tika, and splits it into chunks.
- BGE-M3 creates embeddings, stored in Milvus with metadata in MongoDB.
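Here’s a minimal sketch of that ingestion step, assuming BGE-M3 is loaded via the FlagEmbedding package and Milvus is reached through pymilvus’ MilvusClient. The character-based chunking and the "support_docs" collection (with a dynamic schema) are illustrative assumptions.

```python
# Chunk extracted text, embed it with BGE-M3, and store vectors in Milvus.
from FlagEmbedding import BGEM3FlagModel
from pymilvus import MilvusClient

embedder = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
milvus = MilvusClient(uri="http://localhost:19530")

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunks with a small overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def index_document(doc_id: str, text: str, collection: str = "support_docs") -> None:
    chunks = chunk_text(text)
    vectors = embedder.encode(chunks)["dense_vecs"]
    rows = [
        {"doc_id": doc_id, "chunk": chunk, "vector": vec.tolist()}
        for chunk, vec in zip(chunks, vectors)
    ]
    milvus.insert(collection_name=collection, data=rows)
```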
Query Handling:
- Queries are converted to embeddings with BGE-M3.
- Milvus retrieves top-k documents, filtered by user-selected files if specified.
- MiniLM-L6-v2 checks query-document similarity to pick the LLM.
- The LLM generates a response using an anti-hallucination prompt: “Answer only from the provided context or say ‘I don’t know.’”
- If no relevant documents are found, the user gets: “No relevant information found.”
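Continuing the ingestion sketch above (it reuses `embedder` and `milvus`), a query could be handled roughly like this. `call_llm` is a stub for the LLM service, and the collection and field names are assumptions.

```python
# Embed the question, search Milvus, and build an anti-hallucination prompt.
def answer_query(question: str, top_k: int = 5) -> str:
    q_vec = embedder.encode([question])["dense_vecs"][0].tolist()
    hits = milvus.search(
        collection_name="support_docs",
        data=[q_vec],
        limit=top_k,
        output_fields=["chunk"],
        # a filter expression could restrict the search to user-selected doc_ids
    )[0]
    if not hits:
        return "No relevant information found."
    context = "\n\n".join(hit["entity"]["chunk"] for hit in hits)
    prompt = (
        "Answer only from the provided context or say 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    model = pick_llm(question, hits[0]["entity"]["chunk"])  # router from Model Selection
    return call_llm(model, prompt)

def call_llm(model: str, prompt: str) -> str:
    return "..."  # stub: would hit the Llama-2 serving endpoint
```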
Optimizations:
- Cache frequent queries to reduce LLM calls.
- Use two-step retrieval for precision.
- Update Milvus regularly for new documents.
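The caching idea can be as simple as hashing the normalized question and reusing a recent answer. A production setup would more likely use Redis; the in-memory dict and one-hour TTL below are just assumptions for illustration.

```python
# Reuse answers for repeated questions to avoid redundant LLM calls.
import hashlib
import time

_cache: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 3600  # assumed time-to-live for cached answers

def cached_answer(question: str, compute) -> str:
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: skip retrieval and generation
    answer = compute(question)
    _cache[key] = (time.time(), answer)
    return answer
```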
Deployment Strategy
We deploy on AWS for scalability:
- LLM Service: Runs on g4dn.12xlarge instances (4 T4 GPUs) with 4-bit quantization for efficiency.
- Other Services: Use EC2 instances (e.g., m5.large for FastAPI, t3.medium for databases).
- Tools: Docker for containerization, Kubernetes for orchestration, and NGINX for load balancing.
- Scalability: Five g4dn instances handle 100+ users, with Kubernetes auto-scaling based on load.
- Latency: Milvus retrieval (~0.1s) + LLM generation (~1.5s) keeps responses under 2 seconds.
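For reference, loading Llama-2 7B in 4-bit with bitsandbytes so it fits comfortably on a T4 looks roughly like this (it assumes access to the gated meta-llama weights on Hugging Face); serving through vLLM with AWQ/GPTQ quantization is an alternative path.

```python
# Load Llama-2 7B with 4-bit NF4 quantization via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)
```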
Monitoring and Analytics
We track performance and user satisfaction:
- Monitoring: Prometheus and Grafana monitor latency, CPU, and GPU usage.
- Analytics: Kafka logs queries, and a custom pipeline categorizes query types and collects feedback (e.g., rating buttons).
- Evaluation: Periodic audits ensure response accuracy as documents change.
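On the metrics side, the backend can expose counters and histograms with prometheus_client for Prometheus to scrape and Grafana to visualize. The metric names, labels, and port below are illustrative assumptions.

```python
# Expose latency and query-type metrics from the backend process.
import time
from prometheus_client import Counter, Histogram, start_http_server

QUERIES = Counter("chatbot_queries_total", "Total queries", ["query_type", "model"])
LATENCY = Histogram("chatbot_response_seconds", "End-to-end response latency")

start_http_server(9100)  # serves /metrics for Prometheus to scrape

def handle_query(question: str) -> str:
    start = time.time()
    answer = "..."  # stub: retrieval + generation would happen here
    LATENCY.observe(time.time() - start)
    QUERIES.labels(query_type="billing", model="llama-2-7b").inc()
    return answer
```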
Cost Estimation
Here’s the monthly cost breakdown:
| Component | Instances | Cost (~USD/month) |
| --- | --- | --- |
| LLM Service | 5 x g4dn.12xlarge | $1,417.00 |
| Backend Service | 2 x m5.large | $201.00 |
| MongoDB | t3.medium | $16.52 |
| MinIO | t3.medium | $16.52 |
| Milvus | m5.large | $100.80 |
| Kafka | 3 x m5.large | $301.44 |
| Analytics | t3.medium | $16.52 |
| Total | | $2,069.80 |
Cost-Saving Tips:
- Use 4-bit quantization to fit models on T4 GPUs.
- Cache queries and batch requests with vLLM.
- Consider AWS spot instances for non-critical tasks.
Challenges and Mitigations
| Challenge | Mitigation |
| --- | --- |
| Irrelevant retrieval | Fine-tune BGE-M3; allow document selection. |
| No relevant documents | Notify users and offer a human-support fallback. |
| Outdated models | Automate fine-tuning as new documents arrive. |
| Scalability issues | Use Kubernetes auto-scaling and load testing. |
| Security/privacy | Encrypt data and restrict access to internal networks. |
| Cost overruns | Monitor with Cast AI and optimize resources. |
Conclusion
This RAG-based chatbot delivers fast, accurate, and scalable customer support by combining Llama-2’s language prowess with BGE-M3’s retrieval accuracy. It handles 100+ users with sub-2-second responses, stays grounded in company documents, and keeps costs low with open-source models and AWS optimization. Whether you’re automating support or exploring RAG, this setup is a solid starting point, adaptable to evolving needs.