Gao Dalie (高達烈)
MemoryOS MCP + RAG Agent That Can Remember Anything

In this story, I'll give you a super quick tutorial showing how to create a multi-agent chatbot using the Memory Operating System (MemoryOS) MCP and RAG to build a powerful agentic chatbot for business or personal use. This combo gives your chatbot supercharged memory and intelligence.

If this is your first time here, I'd really recommend checking out some of my earlier videos on MCP. One of them ended up blowing up in the AI community.

Let's face it: LLMs have become an essential part of building artificial intelligence applications. However, when it comes to interacting with complex environments effectively, most of today's AIs have a "little trouble".

They are very smart and respond quickly, but, like a goldfish with a seven-second memory, the longer the conversation and the more complicated the matter, the easier it is for them to "black out".

This is not because they are stupid, but because their innate way of handling “remembering things” (that is, “context”) is a bit limited.

Imagine chatting with a friend who has forgotten everything you ever said. Every conversation has to start from the beginning, with no memory, no context, and no progression.

It feels awkward, exhausting, and impersonal. These models struggle to maintain dialogue coherence, recall user-specific preferences, and sustain a continuous, personalised interaction across multiple sessions.

Unfortunately, this is how most AI systems behave today. They are smart, but they lack one crucial thing: memory

That's where MemoryOS MCP comes in. MemoryOS systematically organizes, updates, and retrieves conversation data across multiple memory tiers: short-term, mid-term, and long-term personal memory.

This lets AI agents effectively retain and access relevant information over extended interactions, overcoming the limitations of fixed context windows and improving long-term coherence, personalisation, and user experience in dialogue systems.

Let me show you a quick demo of a live chatbot so you can see how it works behind the scenes.

Check a Video

I will ask the chatbot a question: ‘What do you remember about my job and hobbies?’ If you take a look at how the chatbot generates the output, you’ll see that the MemoryOS MCP agent is initialized with user and assistant IDs, API keys, data storage paths, and various capacity and threshold settings, creating dedicated storage for each user and assistant.

User inputs and agent responses are added as QA pairs and initially stored in short-term memory. Once short-term memory reaches its limit, the Updater module consolidates these interactions into meaningful segments and moves them to mid-term memory.

Over time, mid-term memory segments accumulate ‘heat’ based on visit frequency and interaction length. When a segment’s heat exceeds a set threshold, its content is analyzed to extract user profile insights, update the user’s long-term knowledge, and enhance the assistant’s long-term knowledge base.

During response generation, the Retriever module gathers relevant context from short-term history, mid-term segments, the user’s profile and knowledge, and the assistant’s knowledge base. This full context, combined with the user’s query, is then passed to a large language model to generate a coherent and informed response.

So, by the end of this story, you will understand what MemoryOS is, how MemoryOS uses MCP, how it works, and even how we're going to use MemoryOS MCP and RAG to create a powerful agentic chatbot.

What is MemoryOS


MemoryOS is a memory management system designed to extend the capabilities of AI agents (LLMs), addressing their limited context windows.

It introduces a hierarchical memory structure with Short-Term Memory (STM) for real-time interactions, Mid-Term Memory (MTM) for topic-based grouping, and Long-Term Personal Memory (LPM) for storing user traits and preferences.

Memory is updated dynamically using first-in-first-out (FIFO) and heat-based strategies, ensuring important information is retained over time. Semantic retrieval helps fetch relevant context across all memory layers, enabling the AI to generate more coherent, personalized, and context-aware responses across sessions.
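To make the FIFO and heat-based update concrete, here is a minimal sketch of the mechanism. The names, the heat formula, and its weights are my own illustrative assumptions, not MemoryOS's actual internals; the capacity and threshold values mirror the ones used in the demo later in this post.

```python
from collections import deque

SHORT_TERM_CAPACITY = 7   # matches the demo's short_term_capacity
HEAT_THRESHOLD = 5.0      # matches the demo's mid_term_heat_threshold

short_term = deque(maxlen=SHORT_TERM_CAPACITY)  # FIFO: oldest QA pair leaves first
mid_term = []                                   # consolidated segments with heat

def add_qa_pair(user_input, agent_response):
    """New interactions enter short-term memory; overflow moves to mid-term."""
    if len(short_term) == short_term.maxlen:
        # STM is full: evict the oldest pair into a mid-term segment
        evicted = short_term.popleft()
        mid_term.append({"pages": [evicted], "visits": 0, "heat": 0.0})
    short_term.append({"user": user_input, "assistant": agent_response})

def touch_segment(segment, interaction_length):
    """Visiting a segment raises its heat; hot segments get promoted."""
    segment["visits"] += 1
    # Illustrative heat formula: weighted visit count plus interaction length
    segment["heat"] = 1.0 * segment["visits"] + 0.1 * interaction_length
    if segment["heat"] > HEAT_THRESHOLD:
        segment["heat"] = 0.0  # reset after promotion, as described above
        return "promote_to_long_term"  # the real system runs LLM profile extraction here
    return "keep"
```

Repeated visits push a segment over the threshold, at which point its content would be handed to the LLM for profile and knowledge extraction, and its heat resets.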

How MemoryOS uses MCP

MemoryOS exposes its memory management and retrieval through the Model Context Protocol (MCP), so any MCP-compatible client can call them as tools. Internally, MemoryOS divides large conversation histories or knowledge into smaller, manageable chunks that align with its hierarchical architecture, where dialogues are stored as pages within segments.

This also lets it plan and optimize which memory chunks to retrieve based on relevance, recency, or importance, supporting efficient semantic search and maintaining contextual continuity.

This chunking approach enables dynamic memory updates, allowing MemoryOS to retrieve or replace specific segments while discarding less relevant data using heat-based eviction strategies.
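The chunk-selection idea above can be sketched as a scoring function that blends semantic relevance with recency. This is a hypothetical illustration: the weights, the decay constant, and the plain cosine similarity over precomputed embeddings are my assumptions, not MemoryOS's actual retrieval code.

```python
import math
import time

def score_chunk(query_vec, chunk, now, w_rel=0.7, w_rec=0.3):
    """Rank a memory chunk by semantic relevance and recency."""
    # Cosine similarity between the query embedding and the chunk embedding
    dot = sum(q * c for q, c in zip(query_vec, chunk["embedding"]))
    norm = math.sqrt(sum(q * q for q in query_vec)) * math.sqrt(
        sum(c * c for c in chunk["embedding"]))
    relevance = dot / norm if norm else 0.0
    # Exponential decay: chunks older than about an hour matter much less
    recency = math.exp(-(now - chunk["timestamp"]) / 3600.0)
    return w_rel * relevance + w_rec * recency

def plan_retrieval(query_vec, chunks, k=3):
    """Pick the top-k chunks to place in the LLM's context window."""
    now = time.time()
    ranked = sorted(chunks, key=lambda c: score_chunk(query_vec, c, now),
                    reverse=True)
    return ranked[:k]
```

A low combined score is also a reasonable eviction signal, which is how heat-based eviction and retrieval planning can share one ranking.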

How it works


MemoryOS structures an AI’s memory using a hierarchical, OS-inspired architecture with four main modules: Storage, Updating, Retrieval, and Generation. It follows a three-tier system — short-term QA pairs, mid-term sessions with heat tracking, and long-term profiles and knowledge, while separating user-specific data from assistant knowledge.

On initialization, it creates file-based storage for each tier and sets up Updater and Retriever modules. New interactions are added to short-term memory, then moved to mid-term when full. Sessions marked as “hot” (based on visit frequency, interaction length, and recency) trigger LLM analysis.

If heat exceeds a threshold, unanalyzed content is extracted for LLM-based user profiling and knowledge updates, which are stored in the long-term memory. The session heat is then reset.

For response generation, MemoryOS retrieves relevant context from all tiers, builds prompts, queries the LLM, and stores the new interaction. It uses lazy evaluation for profile updates, similarity-based retrieval, heat decay to avoid redundant processing, and maintains separate user and assistant knowledge to learn effectively over time.
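The response-generation loop described above can be sketched end-to-end. Everything here is a toy stand-in for illustration: `TinyMemory`, its naive keyword search, and the prompt layout are my assumptions, not the real Storage/Retriever/Generation modules.

```python
class TinyMemory:
    """Toy stand-in for MemoryOS's three memory tiers (illustration only)."""
    def __init__(self):
        self.short_term = []          # recent QA pairs
        self.mid_term = []            # consolidated topic segments
        self.long_term_profile = {}   # accumulated user traits

    def search_mid_term(self, query):
        # Naive keyword match in place of real similarity-based retrieval
        words = query.lower().replace("?", "").split()
        return [seg for seg in self.mid_term if any(w in seg for w in words)]

    def add_qa_pair(self, user_input, agent_response):
        self.short_term.append((user_input, agent_response))

def generate_response(memory, llm, user_query):
    """Retrieve context from all tiers, prompt the LLM, store the new turn."""
    # 1. Gather context from every memory tier
    recent = memory.short_term
    segments = memory.search_mid_term(user_query)
    profile = memory.long_term_profile

    # 2. Build one prompt combining all tiers with the user's query
    prompt = (
        f"User profile: {profile}\n"
        f"Relevant past topics: {segments}\n"
        f"Recent conversation: {recent}\n"
        f"User: {user_query}\nAssistant:"
    )

    # 3. Query the LLM, then store the new interaction back into memory
    answer = llm(prompt)
    memory.add_qa_pair(user_query, answer)
    return answer
```

Note how step 3 writes the fresh turn back into short-term memory, which is what keeps the add → consolidate → retrieve cycle going across sessions.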

Path to Code: Link

Let’s code

Before we dive into our application, we will set up an environment for the code to work in. Let's create the MCP tool server, where I will show how you can use the same concept in your chatbot.

MCP Tools
They create a FastMCP server and define a tool function, add_memory, using the @mcp.tool() decorator, which means this function can now be called as a tool inside the FastMCP framework.

This function takes in the user’s input (user_input), the assistant’s reply (agent_response), and optionally, a timestamp and some meta_data. Inside the function, we check if the memoryos_instance is initialized—if not, we return an error. It also checks if the required fields are empty and returns an error if either is missing. If everything looks good, we use the add_memory() method from memoryos_instance to save the memory.

Finally, it returns a result that includes the status, a success message, and a timestamp (either the one passed in or generated using get_timestamp()), and some extra info like the length of both messages and whether metadata was included. If something goes wrong during the process, we catch the error and return an error message explaining what happened.

# Create a FastMCP server instance
mcp = FastMCP("MemoryOS")

@mcp.tool()
def add_memory(user_input: str, agent_response: str, timestamp: Optional[str] = None, meta_data: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """
    Add a user/assistant interaction (QA pair) to MemoryOS

    Args:
        user_input: The user's input or question
        agent_response: The agent's response
        timestamp: Optional timestamp (format: YYYY-MM-DD HH:MM:SS)
        meta_data: Optional metadata (JSON object)

    Returns:
        A dictionary containing the operation result
    """
    global memoryos_instance

    if memoryos_instance is None:
        return {
            "status": "error",
            "message": "MemoryOS is not initialized. Please check the configuration file."
        }

    try:
        if not user_input or not agent_response:
            return {
                "status": "error",
                "message": "user_input and agent_response are required"
            }

        memoryos_instance.add_memory(
            user_input=user_input,
            agent_response=agent_response,
            timestamp=timestamp,
            meta_data=meta_data
        )

        result = {
            "status": "success",
            "message": "Memory has been successfully added to MemoryOS",
            "timestamp": timestamp or get_timestamp(),
            "details": {
                "user_input_length": len(user_input),
                "agent_response_length": len(agent_response),
                "has_meta_data": meta_data is not None
            }
        }

        return result

    except Exception as e:
        return {
            "status": "error",
            "message": f"Error adding memory: {str(e)}"
        }


So, they create retrieve_memory to fetch memories from different layers of MemoryOS based on a user's question.

It takes in a few inputs: the main query string (which is required), a description of your relationship with the user (like "friend" or "assistant"), a style_hint parameter to guide the tone of the answer, and a number, max_results, that controls how many results to bring back. First, it checks whether memoryos_instance exists—if not, it returns an error. Then, it checks that a query is provided—if not, it also returns an error.

They use the retriever inside memoryos_instance to find relevant information using the query and the user ID. It also pulls in the short-term memory (recent QA pairs) and the user's profile summary.

Then, it builds a dictionary to send back as the result, which includes the status, original query, timestamp, user profile, short-term memory, and the most relevant content pulled from mid-term and long-term storage. It neatly formats the returned data to show only what matters—user input, agent replies, timestamps, and any extra info.

@mcp.tool()
def retrieve_memory(query: str, relationship_with_user: str = "friend", style_hint: str = "", max_results: int = 10) -> Dict[str, Any]:
   """
    Retrieve relevant memories and contextual information from MemoryOS based on a query, 
    including short-term memory, mid-term memory, and long-term knowledge

    Args:
        query: The search query describing the information to find
        relationship_with_user: Type of relationship with user (e.g., friend, assistant, colleague)
        style_hint: Response style hint
        max_results: Maximum number of results to return

    Returns:
        A dictionary containing retrieval results including:
        - short_term_memory: All QA pairs from current short-term memory
        - retrieved_pages: Relevant pages retrieved from mid-term memory
        - retrieved_user_knowledge: Relevant entries from user's long-term knowledge base
        - retrieved_assistant_knowledge: Relevant entries from assistant's knowledge base
    """
    global memoryos_instance

    if memoryos_instance is None:
        return {
            "status": "error",
            "message": "MemoryOS is not initialized. Please check the configuration file."
        }

    try:
        if not query:
            return {
                "status": "error",
                "message": "query parameter is required"
            }

        # Use retriever to get relevant context
        retrieval_results = memoryos_instance.retriever.retrieve_context(
            user_query=query,
            user_id=memoryos_instance.user_id
        )

        # Get short-term memory content
        short_term_history = memoryos_instance.short_term_memory.get_all()

        # Get user profile
        user_profile = memoryos_instance.get_user_profile_summary()

        # Organize return results
        result = {
            "status": "success",
            "query": query,
            "timestamp": get_timestamp(),
            "user_profile": user_profile if user_profile and user_profile.lower() != "none" else "No detailed user profile",
            "short_term_memory": short_term_history,
            "short_term_count": len(short_term_history),
            # "retrieved_pages": retrieval_results["retrieved_pages"][:max_results],
            # "retrieved_user_knowledge": retrieval_results["retrieved_user_knowledge"][:max_results],
            # "retrieved_assistant_knowledge": retrieval_results["retrieved_assistant_knowledge"][:max_results],
            "retrieved_pages": [{
                'user_input': page['user_input'],
                'agent_response': page['agent_response'],
                'timestamp': page['timestamp'],
                'meta_info': page['meta_info']
            } for page in retrieval_results["retrieved_pages"][:max_results]],

            "retrieved_user_knowledge": [{
                    'knowledge': k['knowledge'],
                    'timestamp': k['timestamp']
                } for k in retrieval_results["retrieved_user_knowledge"][:max_results]],

            "retrieved_assistant_knowledge": [{
                'knowledge': k['knowledge'],
                'timestamp': k['timestamp']
            } for k in retrieval_results["retrieved_assistant_knowledge"][:max_results]],
            # "total_pages_found": len(retrieval_results["retrieved_pages"]),
            # "total_user_knowledge_found": len(retrieval_results["retrieved_user_knowledge"]),
            # "total_assistant_knowledge_found": len(retrieval_results["retrieved_assistant_knowledge"])
        }

        return result

    except Exception as e:
        return {
            "status": "error",
            "message": f"Error retrieving memory: {str(e)}"
        }

After that, they create get_user_profile to pull together detailed profile information about the user from MemoryOS. It includes two things:

the user’s long-term knowledge entries (if include_knowledge is set to True), and the assistant’s knowledge base (if include_assistant_knowledge is True). First, it checks whether the memoryos_instance is initialised—if not, it sends back an error. If everything is in place, it uses get_user_profile_summary() to fetch personality traits, preferences, or a general profile summary of the user.

That info gets stored in a dictionary with a success status, timestamp, user and assistant IDs, and the actual profile info — unless the profile is missing, in which case it returns a placeholder message. If we asked for user knowledge, it loops through the user’s long-term memory and adds those entries, each with the knowledge text and its timestamp.

@mcp.tool()
def get_user_profile(include_knowledge: bool = True, include_assistant_knowledge: bool = False) -> Dict[str, Any]:
    """
    Get the user's profile information including personality traits, preferences, and related knowledge

    Args:
        include_knowledge: Whether to include user-related knowledge entries
        include_assistant_knowledge: Whether to include assistant knowledge base

    Returns:
        A dictionary containing user profile information
    """
    global memoryos_instance

    if memoryos_instance is None:
        return {
            "status": "error",
            "message": "MemoryOS is not initialized. Please check the configuration file."
        }

    try:
        # Get the user profile summary
        user_profile = memoryos_instance.get_user_profile_summary()

        result = {
            "status": "success",
            "timestamp": get_timestamp(),
            "user_id": memoryos_instance.user_id,
            "assistant_id": memoryos_instance.assistant_id,
            "user_profile": user_profile if user_profile and user_profile.lower() != "none" else "No detailed user profile"
        }

        if include_knowledge:
            user_knowledge = memoryos_instance.user_long_term_memory.get_user_knowledge()
            result["user_knowledge"] = [
                {
                    "knowledge": item["knowledge"],
                    "timestamp": item["timestamp"]
                }
                for item in user_knowledge
            ]
            result["user_knowledge_count"] = len(user_knowledge)

        if include_assistant_knowledge:
            assistant_knowledge = memoryos_instance.get_assistant_knowledge_summary()
            result["assistant_knowledge"] = [
                {
                    "knowledge": item["knowledge"],
                    "timestamp": item["timestamp"]
                }
                for item in assistant_knowledge
            ]
            result["assistant_knowledge_count"] = len(assistant_knowledge)

        return result

    except Exception as e:
        return {
            "status": "error",
            "message": f"Error getting user profile: {str(e)}"
        }

MemoryOS

Then, they create the memory operating system demo. First, it defines some configuration details like the user and assistant IDs, the API key, the data storage path, and which language model to use (gpt-4o-mini). Inside the simple_demo() function, it initializes the Memoryos instance, setting up its storage and memory thresholds. If everything goes well, it prints a success message.

Then, it adds three example memory entries—basically short conversations between the user and the assistant, where Tom shares personal details like his job, location, hobbies, and musical interests. After storing this context, the script simulates a query: “What do you remember about my job and hobbies?” It calls get_response() on the memo instance to generate a reply based on the stored memories.

import os
from memoryos import Memoryos

# --- Basic Configuration ---
USER_ID = "demo_user"
ASSISTANT_ID = "demo_assistant"
OPENAI_API_KEY = "Your_key"
OPENAI_BASE_URL = ""
DATA_STORAGE_PATH = "./simple_demo_data"
LLM_MODEL = "gpt-4o-mini"

def simple_demo():
    print("🚀 MemoryOS Simple Demo")

    # 1.Initialize MemoryOS
    print("📦 Initializing MemoryOS...")
    try:
        memo = Memoryos(
            user_id=USER_ID,
            openai_api_key=OPENAI_API_KEY,
            openai_base_url=OPENAI_BASE_URL,
            data_storage_path=DATA_STORAGE_PATH,
            llm_model=LLM_MODEL,
            assistant_id=ASSISTANT_ID,
            short_term_capacity=7,  
            mid_term_heat_threshold=5,  
        )
        print("✅ MemoryOS initialized successfully!\n")
    except Exception as e:
        print(f"❌ Error: {e}")
        return

    # 2. Add some basic memories
    print("💾 Adding some memories...")

    memo.add_memory(
        user_input="Hi! I'm Tom, I work as a data scientist in San Francisco.",
        agent_response="Hello Tom! Nice to meet you. Data science is such an exciting field. What kind of data do you work with?"
    )

    memo.add_memory(
        user_input="I mainly work with e-commerce data. I also love playing guitar in my free time.",
        agent_response="That's a great combination! E-commerce analytics must provide fascinating insights into consumer behavior. How long have you been playing guitar?"
    )

    memo.add_memory(
        user_input="I've been playing for about 5 years. I really enjoy blues and rock music.",
        agent_response="Five years is a solid foundation! Blues and rock are fantastic genres for guitar. Do you have a favorite artist or song you like to play?"
    )



    test_query = "What do you remember about my job and hobbies?"
    print(f" User: {test_query}")

    response = memo.get_response(
        query=test_query,
    )

    print(f"Assistant: {response}")

if __name__ == "__main__":
    simple_demo() 

Conclusion :

Say goodbye to goldfish brains, and hello to smarter AI companions

MemoryOS MCP is like giving AI agents a powerful "memory plug-in". It solves the common "forgetfulness" problem of AI, allowing them to better understand context, maintain state, and handle complex tasks.

With the support of this "super memory", the AI agents of the future will no longer be just a flash of intelligence, but truly reliable intelligent partners that can continuously learn, remember us, and establish deeper interactions with us.


🧙‍♂️ I am an AI Generative expert! If you want to collaborate on a project, drop an inquiry here or Book a 1-on-1 Consulting Call With Me.

I would highly appreciate your support:

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI
