
Feb 18, 2026

Published · 8 min read

Agent memory as infrastructure, not chat history

Treating agent memory as persistent infrastructure, with typed writes, provenance, retrieval policy, and decay, changes what agents can do across sessions and tools. Chat history is a symptom, not a system.

  • agent-memory
  • infrastructure
  • relay

The default story for "agent memory" today is some version of: the model sees the recent conversation, plus maybe a vector store of older messages it can pull from on demand. That works for a while. It stops working at exactly the point you start needing memory to be useful: when the agent crosses a boundary.

A new chat session. A different tool. A new project. A handoff between two agents. At every one of those boundaries, the implicit contract of "memory = chat history" breaks, and what you actually need is something more like a database.

What chat history can't do

Concretely, four things:

  1. Cross-tool continuity. Your editor agent decided last Tuesday that the team uses snake_case for new modules. Your CLI agent doesn't know that, because it's a different process with a different conversation log.
  2. Structured retrieval. "What does the agent know about authentication in this codebase?" is a question chat history can't answer cleanly. It has to scan a wall of dialogue and hope the relevant context resurfaces.
  3. Trust differentiation. A fact the user explicitly told the agent should outweigh a fact the agent inferred from a single file. Chat history flattens both into "something that was said in a conversation."
  4. Decay. A decision made six months ago about an API that's since been refactored is actively wrong. Chat history has no concept of stale.

Vector stores fronting chat history fix retrieval but inherit every other problem. The store is full of dialogue snippets with no contract about what they assert, no provenance, and no decay policy. It looks like memory; it's a search index over things that were said.

What memory-as-infrastructure looks like

A different shape: treat memory as a typed, contracted, queryable surface that lives outside any one agent or session.


Four pieces, each doing real work:

Typed writes. When an agent records something, it has to declare what kind of thing it is: a fact, a decision, a convention, an ownership edge. Along with that, it declares the scope the entry applies to (project, module, file) and where it came from (which agent, which tool, which commit). This is the part that makes memory queryable months later. An untyped write is a junk-drawer entry; a typed write is a row.

Provenance on every write. Every entry carries a small bundle: which agent wrote it, in which tool, at which commit, prompted by which user action. This is what lets you answer "why does the agent think this?" six weeks later when something is wrong. Without it, you have a model that's confidently citing things it can't defend.
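The two pieces above can be sketched together as a record shape. This is a minimal illustration, not a real schema; every field name here is an assumption.

```python
# Sketch of a typed memory write carrying a provenance bundle.
# All names (Provenance, MemoryWrite, field names) are illustrative.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Provenance:
    agent: str      # which agent wrote it
    tool: str       # which tool it was running in
    commit: str     # repo commit at write time
    trigger: str    # user action that prompted the write

@dataclass(frozen=True)
class MemoryWrite:
    kind: str       # "fact" | "decision" | "convention" | "ownership"
    claim: str      # one specific assertion
    scope: str      # project, module, or file the claim applies to
    provenance: Provenance
    written_at: datetime

w = MemoryWrite(
    kind="convention",
    claim="new modules use snake_case",
    scope="project:acme",
    provenance=Provenance(agent="editor-agent", tool="vscode",
                          commit="abc1234", trigger="user stated in review"),
    written_at=datetime(2026, 2, 10, tzinfo=timezone.utc),
)
# "Why does the agent think this?" becomes a field lookup,
# not a hunt through a transcript.
```

The point of the shape is that nothing here is free text from the store's perspective: kind, scope, and provenance are all things retrieval can filter on later.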

Intent-scoped retrieval. Reads aren't "here's a query embedding, give me the top 10 nearest neighbors." Reads are "the agent is about to refactor module X. What facts, decisions, conventions, and ownership edges scope to that module?" Vector similarity is one input. The graph structure of the memory does most of the work.
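A toy version of that read path, assuming a hypothetical `context_for` function and an in-memory store; scopes here are simple path prefixes, where a real system would walk a graph:

```python
# Sketch of intent-scoped retrieval: group every entry whose scope
# covers the target of the agent's intent. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Entry:
    kind: str    # "fact" | "decision" | "convention" | "ownership"
    claim: str
    scope: str   # path prefix the entry applies to; "" = project-wide

STORE = [
    Entry("fact", "session tokens are 32-byte random", "auth/"),
    Entry("decision", "picked argon2 over scrypt (ADR-014)", "auth/"),
    Entry("convention", "module-private functions prefixed with _", ""),
    Entry("fact", "retry budget is 3 attempts", "net/"),
]

def context_for(intent_scope: str) -> dict[str, list[str]]:
    """Return claims grouped by kind, filtered to the intent's scope."""
    out: dict[str, list[str]] = {}
    for e in STORE:
        if intent_scope.startswith(e.scope):  # "" matches everything
            out.setdefault(e.kind, []).append(e.claim)
    return out

ctx = context_for("auth/")
# auth/ facts and decisions plus the project-wide convention come back;
# the net/ fact is filtered out before similarity ever enters the picture.
```

Similarity ranking can still order results within each group; the structural filter is what keeps irrelevant-but-similar entries out entirely.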

Decay policy. Beliefs go stale. Without explicit decay, memory becomes a fossil record of every wrong guess the agent ever made. Per-fact decay timestamps let retrieval favor recent, reinforced knowledge, and let memory honestly say "I used to think X, but I last wrote that 90 days ago and nothing has reinforced it since."

The contract is the part that matters

Of those four pieces, the typed-write contract is the one that earns its keep over time, and the one most projects skip because it's tedious.

A contract on writes does three things: it forces the agent to be specific about what it's claiming (which improves the writes), it makes the memory legible to humans who weren't in the conversation that produced it (which is what makes it useful), and it gives you a rejection point for low-quality writes (which is what stops the store from filling up with garbage).

Without it, you get a memory that grows without limit, contradicts itself, and can't be queried except by similarity search. Eventually you stop trusting it, and at that point you may as well not have it.
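The rejection point can be as plain as a validator that runs before anything reaches the store. The rules below are illustrative assumptions, not a proposed standard:

```python
# Sketch of a write contract as a rejection point.
# Every rule here (kinds, length cap, required fields) is illustrative.
ALLOWED_KINDS = {"fact", "decision", "convention", "ownership"}

def validate_write(entry: dict) -> list[str]:
    """Return contract violations; an empty list means the write is accepted."""
    errors = []
    if entry.get("kind") not in ALLOWED_KINDS:
        errors.append(f"unknown kind: {entry.get('kind')!r}")
    claim = entry.get("claim", "")
    if not claim.strip():
        errors.append("claim is empty")
    elif len(claim) > 280:
        errors.append("claim too long; write one specific assertion")
    if not entry.get("scope"):
        errors.append("missing scope (project/module/file)")
    if not entry.get("source_agent"):
        errors.append("missing provenance: source_agent")
    return errors

ok = validate_write({"kind": "fact",
                     "claim": "bcrypt rounds raised to 12",
                     "scope": "auth/", "source_agent": "cli-001"})
bad = validate_write({"kind": "vibe", "claim": ""})
# ok == []; bad lists every violated rule, so the agent can retry
# with a corrected write instead of silently polluting the store.
```

Returning the full list of violations, rather than failing on the first, matters in practice: the agent gets one round trip to fix its write.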

What "query memory" should mean

Not this:

search("authentication module") → top 10 chunks of past conversations

This:

context_for(intent="refactor module auth/", agent="cli-001") →
  facts:        ["bcrypt rounds raised to 12 on 2024-03-04",
                 "session tokens are 32-byte random, base64-urlencoded"]
  decisions:    ["picked argon2 over scrypt for new hashes (ADR-014)"]
  ownership:    ["module owned by @sagar, on-call @priya"]
  conventions:  ["module-private functions prefixed with _"]
  warnings:     ["last refactor here in May 2024 broke prod;
                  see incident-2024-05-08"]

The first is a search. The second is what an experienced colleague would tell you in two minutes if you asked. That's the gap the infrastructure has to close.


Why this matters now

Every agent product I've looked at has, at some level, recognized that the chat-history-as-memory model isn't quite working. The responses are usually one of: longer context windows, better RAG over conversation logs, or a side-channel notes file the agent maintains.

All three are workarounds for the underlying issue, which is that "a wall of past dialogue" isn't the right primitive. The right primitive is closer to a typed graph with provenance, decay, and intent-scoped retrieval. That's a database. It belongs outside any one agent, behind a contract that's the same regardless of which tool is reading from or writing to it.

What I'm building toward

This is the thesis behind one of my current product explorations for AI-assisted engineering teams: make shared context and verification visible enough that humans can rely on the workflow. The companion post on why enforcement has to be in the request path covers the broader idea without tying it to product internals.

Email me if you're working on something adjacent. I'd like to compare notes, especially with anyone running multi-tool agentic coding workflows in earnest.