Mohammad Ehsan Ansari

Build a Website Knowledge Chatbot Using Streamlit, ChromaDB, Olostep, and OpenAI

Have you ever wanted a smart AI assistant that understands your entire website and can answer questions like ChatGPT? In this tutorial, we’ll show you how to build it — without training your own LLM or managing any backend.

We’ll use:

Olostep to crawl and extract website content

ChromaDB to store and search content embeddings with metadata

OpenAI (v1.7.6) for embeddings and GPT-4 summarization

Streamlit to build a live chatbot UI

Perfect for product sites, documentation portals, and landing pages.


🔧 What You'll Need

```bash
pip install streamlit openai==1.7.6 chromadb requests
```

🧠 How It Works

  1. Crawl website pages using Olostep’s API
  2. Clean content and extract Markdown
  3. Embed each page with OpenAI embeddings
  4. Store everything in ChromaDB (including metadata)
  5. Let users ask questions via Streamlit
  6. Query top matches and summarize answers with GPT

🧩 Step-by-Step Implementation

1. Crawl Website

```python
import requests

def start_crawl(url):
    """Start an Olostep crawl and return its crawl ID."""
    payload = {
        "start_url": url,
        "include_urls": ["/**"],  # crawl every path under the start URL
        "max_pages": 10,
        "max_depth": 3
    }
    headers = {"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"}
    res = requests.post("https://api.olostep.com/v1/crawls", headers=headers, json=payload)
    res.raise_for_status()
    return res.json()["id"]
```

2. Wait and Retrieve Pages

```python
import time
import requests

HEADERS = {"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"}

def wait_for_crawl(crawl_id):
    """Poll the crawl until Olostep reports it is completed."""
    while True:
        res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}", headers=HEADERS)
        if res.json()["status"] == "completed":
            break
        time.sleep(30)

def get_pages(crawl_id):
    """Return the list of pages discovered by the crawl."""
    res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}/pages", headers=HEADERS)
    return res.json()["pages"]
```

3. Clean Markdown Content

```python
import re

def clean_markdown(markdown):
    """Strip Markdown syntax so only plain text gets embedded."""
    markdown = re.sub(r'#+ |\* |> ', '', markdown)           # headings, bullets, quotes
    markdown = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', markdown)  # links -> link text
    markdown = re.sub(r'`|\*\*|_', '', markdown)             # inline code, bold, italics
    markdown = re.sub(r'\n{2,}', '\n', markdown)             # collapse blank lines
    return markdown.strip()
```
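As a quick sanity check, here's the cleaner applied to a small snippet (the function is repeated so the example runs standalone):

```python
import re

def clean_markdown(markdown):
    markdown = re.sub(r'#+ |\* |> ', '', markdown)
    markdown = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', markdown)
    markdown = re.sub(r'`|\*\*|_', '', markdown)
    markdown = re.sub(r'\n{2,}', '\n', markdown)
    return markdown.strip()

sample = "## Pricing\n\nSee [our plans](https://example.com/plans) for **details**."
print(clean_markdown(sample))
# Pricing
# See our plans for details.
```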

4. Initialize ChromaDB

```python
import chromadb
from chromadb.utils import embedding_functions

# In-memory client; data is lost when the process exits
chroma_client = chromadb.Client()
openai_embed_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_API_KEY",
    model_name="text-embedding-ada-002"
)

collection = chroma_client.get_or_create_collection(
    name="website-content",
    embedding_function=openai_embed_fn
)
```

5. Index Content

```python
def retrieve_markdown(retrieve_id):
    """Fetch the Markdown content Olostep extracted for a page."""
    res = requests.get(
        "https://api.olostep.com/v1/retrieve",
        headers={"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"},
        params={"retrieve_id": retrieve_id, "formats": ["markdown"]},
    )
    return res.json().get("markdown_content", "")

def index_content(pages):
    """Clean each page and add it to the ChromaDB collection."""
    for page in pages:
        try:
            markdown = retrieve_markdown(page["retrieve_id"])
            text = clean_markdown(markdown)
            if len(text) > 20:  # skip near-empty pages
                collection.add(
                    documents=[text],
                    metadatas=[{"url": page["url"]}],
                    ids=[page["retrieve_id"]]
                )
                print(f"✅ Indexed: {page['url']}")
        except Exception as e:
            print(f"⚠️ Error indexing {page.get('url', '?')}: {e}")
```

6. Summarize with GPT

```python
import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

def summarize_with_gpt(question, chunks):
    """Answer the question using only the retrieved website content."""
    if not chunks:
        return "Sorry, I couldn't find enough information."

    context = "\n\n".join(chunks)  # separate chunks so they don't run together
    prompt = f'''
Use the following website content to answer this question:

{context}

Q: {question}
A:
'''

    res = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.4,
    )
    return res.choices[0].message.content
```

7. Streamlit Frontend

```python
import streamlit as st

st.set_page_config(page_title="Website Chatbot", page_icon="💬")
st.title("💬 Website Chatbot")
st.caption("Ask anything based on your website content.")

question = st.chat_input("Ask your question...")

if question:
    with st.chat_message("user"):
        st.markdown(question)

    with st.spinner("Thinking..."):
        chunks = query_website(question)  # top matches from ChromaDB
        final_answer = summarize_with_gpt(question, chunks)

    with st.chat_message("assistant"):
        st.markdown(final_answer)
```

✅ Live Demo Preview

  • Ask: What services do you offer?
  • Ask: Where is your pricing page?
  • Ask: How can I contact support?

The assistant will generate answers using real indexed content from your website.


🧠 Next Steps

  • Save/load your ChromaDB collection
  • Split large documents into smaller chunks
  • Include source URLs in GPT responses
  • Add memory to handle multi-turn chat
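For chunking, a simple character-based splitter with overlap is enough to start — each chunk then gets its own embedding in ChromaDB. This is a sketch; `chunk_text` and its default sizes are arbitrary choices, not library APIs:

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into overlapping chunks so context isn't lost at boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

For persistence, swapping `chromadb.Client()` in step 4 for `chromadb.PersistentClient(path="./chroma_db")` keeps the index on disk between runs, so you don't re-crawl and re-embed on every restart.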

🎯 Conclusion

Congratulations! You've now built a fully functional AI-powered chatbot that can answer questions from your website using ChromaDB, Olostep, and OpenAI — all wrapped in a beautiful Streamlit app.

Whether it's for internal docs, customer support, or a public knowledge base, this setup gives you ChatGPT-style answers over your own content without hosting or training a single model.

Happy building! 🚀
https://gist.github.com/mdehsan873/f69481997f487e23b1d1282c82ce00f5
