Memory Systems in AI Agents: Short-Term Context vs Long-Term Vector Storage

Imagine an AI agent that follows every detail within a conversation but has forgotten your name by the next session. Maddening, right? Yet this is exactly where many agents fail. The problem isn’t the model itself; it’s the memory architecture built around it.

In this article, you’ll learn the difference between short-term conversational context and long-term vector storage. We’ll cover Memory Architectures, Conversational Context, Long-Term Interaction History, Episodic Recall, Vector Databases, PgVector, and Milvus. By the end, you’ll know exactly how to give your agent a memory that lasts beyond one chat.

What Even Is Memory Architecture?

Memory Architecture refers to the structural design that determines how an agent stores, retrieves, and forgets information.

Think of it like your own brain. You don’t remember every coffee you’ve ever had, but you remember that Monday morning double-shot espresso that saved you from a meeting. An AI memory architecture must make similar choices—what stays in immediate awareness and what gets archived for later.

Under the hood, most agents use a sliding window for short-term context and a database for long-term storage. The key insight: without an explicit memory architecture, your agent is essentially a goldfish. It forgets everything as soon as the conversation ends.

Here’s a simple example of a memory architecture sketch in Python:

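class AgentMemory:
    def __init__(self, max_turns=10):
        self.max_turns = max_turns  # Size of the short-term window
        self.short_term = []  # Current conversation
        self.long_term = []   # Past interactions (conceptual)

    def add_turn(self, user_input, agent_reply):
        self.short_term.append({"user": user_input, "agent": agent_reply})
        if len(self.short_term) > self.max_turns:
            self.archive_to_long_term()

    def archive_to_long_term(self):
        # Move the oldest turns out of the window until it fits again
        while len(self.short_term) > self.max_turns:
            self.long_term.append(self.short_term.pop(0))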

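This shows the basic split. short_term holds the recent turns, while long_term stands in for a store that outlives the session. As an in-memory list it vanishes when the process exits, so a real agent would back it with a file or a database.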

Conversational Context: The Agent’s Scratchpad

Conversational Context is the information the agent actively holds during a single chat session. It’s the “working memory” of your agent.

Imagine you’re telling a story to a friend. You don’t need to repeat the first sentence every time—they remember the thread. That’s conversational context. The agent keeps the last few exchanges available so it can refer to them without re-querying a database.

Most frameworks implement this as a message list. Here’s a concrete example using the role/content message format common to chat APIs:

conversation_context = [
    {"role": "user", "content": "What's the weather like in Paris?"},
    {"role": "assistant", "content": "It's currently 15°C and cloudy."},
    {"role": "user", "content": "And in London?"}
]
# The agent knows you're asking about weather because of the context.

A non-obvious gotcha: context windows have a cost. Every token in the window takes up memory and compute on each model call. If your agent holds 50,000 tokens of context, it’s paying for every one of them on every turn, even the irrelevant ones. Be ruthless about trimming.
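
One way to trim is to keep only the newest messages that fit a token budget. The sketch below assumes a crude word-count estimate; estimate_tokens and trim_context are illustrative names, and a real agent would count tokens with its model’s own tokenizer.

```python
def estimate_tokens(text):
    # Hypothetical stand-in: a real agent would use the model's own tokenizer.
    return len(text.split())


def trim_context(messages, max_tokens):
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept = []
    used = 0
    # Walk backwards so the newest turns are kept first.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # Restore chronological order.
    return kept


messages = [
    {"role": "user", "content": "What is the weather like in Paris?"},
    {"role": "assistant", "content": "Cloudy, around 15 degrees."},
    {"role": "user", "content": "And in London?"},
]

# With a budget of 7 estimated tokens, only the last two messages survive.
trimmed = trim_context(messages, max_tokens=7)
```

Dropping whole messages from the oldest end is the simplest policy. Summarizing the dropped turns instead is a common refinement, at the price of an extra model call.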

Long-Term Interaction History: Beyond the Session

Long-Term Interaction History stores every conversation your agent has ever had with a user, spanning multiple sessions and days.

Think of this like a diary. You don’t read it cover to cover every time, but it’s there when you need to remember what happened last month. For an agent, this history is stored separately from the current chat.

The mechanism is straightforward: each interaction gets saved with a timestamp and user ID. When the agent needs context from a previous session, it queries this store. Here’s a minimal example:

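import json
from datetime import datetime, timezone

def save_interaction(user_id, user_input, agent_reply):
    entry = {
        "user_id": user_id,
        "input": user_input,
        "reply": agent_reply,
        # Store UTC so entries from different machines sort consistently
        "timestamp": datetime.now(timezone.utc).isoformat()
    }
    # One JSON object per line (JSON Lines), appended to the user's log
    with open(f"history_{user_id}.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")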

The tricky part: you can’t feed the entire history into every prompt. That would be like reading War and Peace before answering a question about the weather. We need a smarter retrieval method—that’s where episodic recall comes in.
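
As a baseline before any smarter retrieval, the simplest query against this log is “the last few entries.” Here is a minimal sketch, assuming the one-JSON-object-per-line format that save_interaction writes; load_recent_interactions and its parameters are illustrative names, not part of any library.

```python
import json
from collections import deque


def load_recent_interactions(path, limit=5):
    """Return the last `limit` entries from a JSON Lines history file."""
    with open(path, encoding="utf-8") as f:
        # A bounded deque keeps only the newest `limit` entries as it reads.
        recent = deque(
            (json.loads(line) for line in f if line.strip()),
            maxlen=limit,
        )
    return list(recent)
```

Recency alone misses older exchanges that are still relevant, which is the gap similarity-based retrieval is meant to fill.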

Episodic Recall: Finding the Right Memory at the Right Time

Episodic Recall is the ability to retrieve specific, relevant past interactions from long-term memory based on the current conversation.

It’s like having a filing cabinet where every document is tagged with a timestamp and a summary. When you need a specific case, you don’t flip through every file—you search for the right tags.

Here’s how it works under the hood: the agent takes the current user query, converts it to a mathematical representation (an embedding), and then searches the long-term store for similar embeddings. Only the closest matches get injected into the prompt.

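import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def episodic_recall(current_query, user_history, get_embedding, top_k=3):
    # get_embedding is whatever embedding function you use (text -> list of floats)
    query_embedding = get_embedding(current_query)
    scored_memories = []
    for memory in user_history:
        # In practice, embed each memory once at write time and store the vector
        mem_embedding = get_embedding(memory["text"])
        score = cosine_similarity(query_embedding, mem_embedding)
        scored_memories.append((score, memory))
    # Sort on the score only; comparing the dicts on a tie would raise TypeError
    scored_memories.sort(key=lambda pair: pair[0], reverse=True)
    return scored_memories[:top_k]  # Most relevant memories first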

A practical insight: episodic recall degrades when the embedding model handles your domain’s vocabulary poorly. If your agent talks about “Python” (the snake), it might retrieve memories about “Python” (the language) instead. Know your data.
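
Once the closest matches come back, they still have to be injected into the prompt. The sketch below shows one possible layout, assuming the recall step returns (score, memory) pairs whose memories carry a "text" key and that the model accepts role/content messages. build_messages is an illustrative name, and putting memories in a second system message is a design assumption, not a requirement of any API.

```python
def build_messages(system_prompt, recalled_memories, conversation_context):
    """Assemble the message list sent to the model.

    recalled_memories is a list of (score, memory) pairs, where each
    memory is a dict with a "text" key, as returned by a recall step.
    """
    messages = [{"role": "system", "content": system_prompt}]
    if recalled_memories:
        memory_lines = "\n".join(
            f"- {memory['text']}" for _score, memory in recalled_memories
        )
        messages.append({
            "role": "system",
            "content": "Relevant notes from earlier sessions:\n" + memory_lines,
        })
    # The live conversation goes last so the newest turn sits closest to the reply.
    messages.extend(conversation_context)
    return messages


recalled = [(0.91, {"text": "User prefers answers in Celsius."})]
context = [{"role": "user", "content": "And in London?"}]
prompt_messages = build_messages("You are a helpful assistant.", recalled, context)
```

Keeping recalled memories in their own clearly labelled message makes it easier to see, when debugging, which facts came from long-term storage and which from the current chat.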

Vector Databases: The Index for Meaning

Vector Databases are specialized databases designed to store and search high-dimensional embeddings—mathematical vectors that represent the meaning of text, images, or any data.

Think of a vector database as a librarian who can find books not just by title or author, but by what the book means. You ask for “stories about resilience,” and it returns a biography, a fantasy novel, and a self-help guide—all because their semantic content is similar.

The mechanism: each piece of data is converted into a vector (a list of numbers) by an embedding model. The database indexes these vectors using algorithms like HNSW (Hierarchical Navigable Small World). When you query, your query text is converted into a vector by the same model, and the index finds its approximate nearest neighbors without comparing against every stored vector.
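
To see what an index is speeding up, it helps to look at the exact search it approximates: score every stored vector against the query and keep the best. The sketch below is a toy in-memory version in plain Python; TinyVectorIndex is a hypothetical name, and the two-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math


class TinyVectorIndex:
    """Exact nearest-neighbor search by scanning every stored vector."""

    def __init__(self):
        self._items = []  # (unit-length vector, payload) pairs

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot index a zero vector")
        return [x / norm for x in vector]

    def add(self, vector, payload):
        # Normalizing once at insert time turns cosine similarity into a dot product.
        self._items.append((self._normalize(vector), payload))

    def search(self, query, k=3):
        q = self._normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), payload)
            for vec, payload in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = TinyVectorIndex()
index.add([1.0, 0.0], "likes espresso")
index.add([0.0, 1.0], "lives in Paris")
index.add([0.9, 0.1], "prefers double shots")

results = index.search([1.0, 0.0], k=2)
```

This scan touches every vector, so its cost grows linearly with the collection. Index structures such as HNSW give up a little accuracy to avoid that full scan.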

PgVector: SQL on Steroids

PgVector is a PostgreSQL extension that adds vector storage and search capabilities directly to your relational database.

This is the best of both worlds. You keep all your structured data (user IDs, timestamps) in one place, and you get vector search without managing a second system. If you already use PostgreSQL, PgVector is a drop-in solution.

Here’s how you set it up and use it:

-- Enable the extension (safe to re-run)
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a vector column
CREATE TABLE memories (
    id SERIAL PRIMARY KEY,
    user_id INT,
    content TEXT,
    embedding VECTOR(1536)  -- 1536-dimensional vector
);

-- Search for similar memories
SELECT content FROM memories
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector  -- cosine distance
LIMIT 5;

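The gotcha: PgVector is generally not as fast as dedicated vector databases once a table grows into many millions of vectors, and without an index (HNSW or IVFFlat) every query is an exact scan over all rows. It’s a good fit for small-to-medium applications. Don’t assume it scales infinitely.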
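
From application code, the same search is usually issued with parameters rather than a hand-written literal. The sketch below only builds the SQL and its parameters, assuming the memories table above and a driver that uses %s placeholders (psycopg does). to_vector_literal and build_search are illustrative helper names, and the user_id filter is an addition that scopes results to one user.

```python
def to_vector_literal(embedding):
    """Format a list of floats in pgvector's text form, e.g. "[0.1,0.2,0.3]"."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


# Placeholders follow the %s style used by drivers such as psycopg.
SEARCH_SQL = """
SELECT content
FROM memories
WHERE user_id = %s
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def build_search(user_id, query_embedding, limit=5):
    """Return (sql, params) ready to hand to a driver's cursor.execute()."""
    return SEARCH_SQL, (user_id, to_vector_literal(query_embedding), limit)


sql, params = build_search(42, [0.1, 0.2, 0.3], limit=3)
```

Passing the vector as a parameter keeps the query text constant and avoids string-formatting user data into SQL.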

Milvus: Storage at Scale

Milvus is an open-source vector database built specifically for large-scale similarity search. It’s designed to scale to billions of vectors while keeping query latency low.

Imagine PgVector as a well-organized home filing cabinet. Milvus is the Library of Congress with robotic retrieval arms. It’s overkill for a personal project but essential for enterprise-level agent systems handling millions of users.

Milvus uses a distributed architecture with multiple workers. You define a “collection” (like a table), insert vectors, and then search using the SDK. Here’s a minimal Python example:

from pymilvus import Collection, connections

# Connect to Milvus server
connections.connect(host='localhost', port='19530')

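# Open an existing collection (it must already have an "embedding" vector field)
collection = Collection("agent_memories")
collection.load()  # Load the collection into memory before searching

# Search
query_vector = [0.1, 0.2, 0.3, ...]  # Your embedding (same dimension as the field)
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "COSINE"},  # Must match the metric the index was built with
    limit=5
)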

A key operational consideration: Milvus brings more overhead than the other options (a dedicated server, configuration management). Don’t jump to Milvus if PgVector or even a simple JSON file works for your use case.

Comparison Table: Which Memory Tool When?

Concept	Purpose	When to Use	Scale
Conversational Context	Short-term session memory	Every agent, always	Low (hundreds to thousands of tokens)
Long-Term Interaction History	Full log of past sessions	Agents that serve returning users	Medium (thousands of records)
Episodic Recall	Smart retrieval of relevant past	When history is too large for full context	Medium
PgVector	Vector + relational search in one DB	You already use PostgreSQL, small-to-medium scale	Low to medium (up to millions of vectors)
Milvus	Dedicated large-scale vector search	Enterprise agents, billions of vectors	High (billions of vectors)

Key Takeaways

Memory Architecture is the blueprint for how an agent stores and retrieves information. Build it intentionally.
Conversational Context is the agent’s short-term scratchpad. Keep it lean to avoid token waste.
Long-Term Interaction History persists across sessions. Store it in a structured format.
Episodic Recall retrieves relevant past interactions using embedding similarity. It’s your smart retrieval system.
Vector Databases (PgVector, Milvus) store embeddings for fast semantic search. Choose based on your scale.
PgVector works great for small-to-medium projects with existing PostgreSQL infrastructure.
Milvus is the heavy lifter for large-scale, distributed vector search.
The non-obvious truth: many agents don’t need a vector database at all. Start simple—a JSON file or a SQL table—and graduate to vectors only when retrieval quality becomes a bottleneck.