Mohammad Ehsan Ansari

Build a Website Knowledge Chatbot Using Streamlit, ChromaDB, Olostep, and OpenAI

Have you ever wanted a smart AI assistant that understands your entire website and can answer questions like ChatGPT? In this tutorial, we’ll show you how to build it — without training your own LLM or managing any backend.

We’ll use:

Olostep to crawl and extract website content

ChromaDB to store and search content embeddings with metadata

OpenAI (v1.7.6) for embeddings and GPT-4 summarization

Streamlit to build a live chatbot UI

Perfect for product sites, documentation portals, and landing pages.


🔧 What You'll Need

```bash
pip install streamlit openai==1.7.6 chromadb requests
```

🧠 How It Works

  1. Crawl website pages using Olostep’s API
  2. Clean content and extract Markdown
  3. Embed each page with OpenAI embeddings
  4. Store everything in ChromaDB (including metadata)
  5. Let users ask questions via Streamlit
  6. Query top matches and summarize answers with GPT

🧩 Step-by-Step Implementation

1. Crawl Website

```python
import requests

def start_crawl(url):
    """Start an Olostep crawl and return its crawl ID."""
    payload = {
        "start_url": url,
        "include_urls": ["/**"],  # crawl every path under the start URL
        "max_pages": 10,
        "max_depth": 3
    }
    headers = {"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"}
    res = requests.post("https://api.olostep.com/v1/crawls", headers=headers, json=payload)
    res.raise_for_status()
    return res.json()["id"]
```

2. Wait and Retrieve Pages

```python
import time
import requests

HEADERS = {"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"}

def wait_for_crawl(crawl_id):
    """Poll the crawl until Olostep reports it is completed."""
    while True:
        res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}", headers=HEADERS)
        if res.json()["status"] == "completed":
            break
        time.sleep(30)

def get_pages(crawl_id):
    """Return the list of pages discovered by the crawl."""
    res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}/pages", headers=HEADERS)
    return res.json()["pages"]
```

3. Clean Markdown Content

```python
import re

def clean_markdown(markdown):
    """Strip Markdown syntax so only plain text gets embedded."""
    markdown = re.sub(r'#+ |\* |> ', '', markdown)           # headings, bullets, quotes
    markdown = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', markdown)  # links -> link text
    markdown = re.sub(r'`|\*\*|_', '', markdown)             # inline code, bold, italics
    markdown = re.sub(r'\n{2,}', '\n', markdown)             # collapse blank lines
    return markdown.strip()
```
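As a quick sanity check, here's the cleaner applied to a small snippet (the function is repeated so the example runs standalone):

```python
import re

def clean_markdown(markdown):
    markdown = re.sub(r'#+ |\* |> ', '', markdown)
    markdown = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', markdown)
    markdown = re.sub(r'`|\*\*|_', '', markdown)
    markdown = re.sub(r'\n{2,}', '\n', markdown)
    return markdown.strip()

sample = "## Pricing\n\nSee [our plans](https://example.com/plans) for **details**."
print(clean_markdown(sample))
# Pricing
# See our plans for details.
```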

4. Initialize ChromaDB

```python
import chromadb
from chromadb.utils import embedding_functions

# In-memory client; data is lost when the process exits
chroma_client = chromadb.Client()
openai_embed_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_API_KEY",
    model_name="text-embedding-ada-002"
)

collection = chroma_client.get_or_create_collection(
    name="website-content",
    embedding_function=openai_embed_fn
)
```

5. Index Content

```python
def retrieve_markdown(retrieve_id):
    """Fetch the Markdown content Olostep extracted for a page."""
    res = requests.get(
        "https://api.olostep.com/v1/retrieve",
        headers={"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"},
        params={"retrieve_id": retrieve_id, "formats": ["markdown"]},
    )
    return res.json().get("markdown_content", "")

def index_content(pages):
    """Clean each page and add it to the ChromaDB collection."""
    for page in pages:
        try:
            markdown = retrieve_markdown(page["retrieve_id"])
            text = clean_markdown(markdown)
            if len(text) > 20:  # skip near-empty pages
                collection.add(
                    documents=[text],
                    metadatas=[{"url": page["url"]}],
                    ids=[page["retrieve_id"]]
                )
                print(f"✅ Indexed: {page['url']}")
        except Exception as e:
            print(f"⚠️ Error indexing {page.get('url', '?')}: {e}")
```

6. Summarize with GPT

```python
import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

def summarize_with_gpt(question, chunks):
    """Answer the question using only the retrieved website content."""
    if not chunks:
        return "Sorry, I couldn't find enough information."

    context = "\n\n".join(chunks)  # separate chunks so they don't run together
    prompt = f'''
Use the following website content to answer this question:

{context}

Q: {question}
A:
'''

    res = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.4,
    )
    return res.choices[0].message.content
```

7. Streamlit Frontend

```python
import streamlit as st

st.set_page_config(page_title="Website Chatbot", page_icon="💬")
st.title("💬 Website Chatbot")
st.caption("Ask anything based on your website content.")

question = st.chat_input("Ask your question...")

if question:
    with st.chat_message("user"):
        st.markdown(question)

    with st.spinner("Thinking..."):
        chunks = query_website(question)  # top matches from ChromaDB
        final_answer = summarize_with_gpt(question, chunks)

    with st.chat_message("assistant"):
        st.markdown(final_answer)
```

✅ Live Demo Preview

  • Ask: What services do you offer?
  • Ask: Where is your pricing page?
  • Ask: How can I contact support?

The assistant will generate answers using real indexed content from your website.


🧠 Next Steps

  • Save/load your ChromaDB collection
  • Split large documents into smaller chunks
  • Include source URLs in GPT responses
  • Add memory to handle multi-turn chat
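For chunking, a simple character-based splitter with overlap is enough to start — each chunk then gets its own embedding in ChromaDB. This is a sketch; `chunk_text` and its default sizes are arbitrary choices, not library APIs:

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into overlapping chunks so context isn't lost at boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

For persistence, swapping `chromadb.Client()` in step 4 for `chromadb.PersistentClient(path="./chroma_db")` keeps the index on disk between runs, so you don't re-crawl and re-embed on every restart.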

🎯 Conclusion

Congratulations! You've now built a fully functional AI-powered chatbot that can answer questions from your website using ChromaDB, Olostep, and OpenAI — all wrapped in a beautiful Streamlit app.

Whether it's for internal docs, customer support, or a public knowledge base, this setup gives you ChatGPT-style answers over your own content without hosting or training a single model.

Happy building! 🚀
https://gist.github.com/mdehsan873/f69481997f487e23b1d1282c82ce00f5
