Benchmarks

Brain Memory ships with a controlled benchmark suite that measures whether persistent, retrieval-based memory actually helps a coding agent — under realistic noise, across multiple sessions, and against honest baselines.

The system itself is a direct implementation of the CoALA agent-memory model (Sumers et al., 2023): the Pinned Tier maps to CoALA's semantic memory, procedural skills to procedural memory, and the session-start aggregator to CoALA's working-memory channel. The benchmark methodology follows the 2025-2026 state of the art: distractor haystacks (LongMemEval), cross-family LLM judges with rubrics (preference-leakage mitigation, arXiv:2502.01534), continual-learning metrics on a repo-scoped task sequence (SWE-Bench-CL), and tokens per successful task as the headline efficiency axis (Mem0 / BEAM). The full reference list is on the Methodology page.

Note: This is a redesign of the original benchmark. The legacy 5-scenario suite (still on disk under scenarios/scenario-1-* through scenario-5-*) used regex evaluators and prepended every seeded memory verbatim, which tested long context rather than memory architecture. The new suite shells out to the real brain CLI for retrieval and uses an LLM judge with explicit rubrics.
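As a rough illustration of the cross-family judging used by the new suite, the sketch below routes each generator to a judge from a different model family. The function name `pick_judge`, the mapping `JUDGE_FOR`, and the lowercase family labels are hypothetical; only the Claude/Gemini pairing comes from this page.

```python
# Hypothetical mapping: generator family -> judge family.
# Only the Claude <-> Gemini pairing is stated in the docs.
JUDGE_FOR = {"claude": "gemini", "gemini": "claude"}


def pick_judge(generator_family: str) -> str:
    """Return a judge family that differs from the generator's family."""
    family = generator_family.lower()
    if family not in JUDGE_FOR:
        raise ValueError(f"no cross-family judge configured for {generator_family!r}")
    judge = JUDGE_FOR[family]
    if judge == family:
        # Guard against a misconfigured mapping that would let a model grade itself.
        raise ValueError(f"judge and generator share family {family!r}")
    return judge
```

Keeping the pairing in a single table makes it easy to audit that no model family ever grades its own output.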

What the benchmark measures

Four headline axes, every scenario, every run:

| Metric | What it answers |
| --- | --- |
| Task pass rate | Did the produced code satisfy the rubric? Scored by a cross-family LLM judge (Claude judges Gemini, Gemini judges Claude) to mitigate preference leakage. |
| Tokens per successful task | (input + output tokens) / passes. The honest efficiency measure, following the Mem0/BEAM standard. |
| Recall@k | For arms that use the real brain CLI, did the right memory IDs surface in the top-k against the scenario's oracle set? |
| Judge rationale | Captured verbatim per run for spot-checks. |

Plus, where applicable: per-task pass rate (continual scenarios), forward-transfer Δ tokens, confabulation rate, and write-side cost (memorize + sleep + skill distillation).
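The two numeric headline metrics can be sketched as follows. This is a minimal illustration, not the suite's actual scoring code: the `Run` record and both function names are hypothetical, and it assumes that tokens from failed runs count toward the numerator and that Recall@k is the fraction of oracle IDs found in the top k.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical per-run record; field names are assumptions."""
    passed: bool
    input_tokens: int
    output_tokens: int


def tokens_per_successful_task(runs: list[Run]) -> float:
    """(input + output tokens) over all runs, divided by the number of passes."""
    total = sum(r.input_tokens + r.output_tokens for r in runs)
    passes = sum(1 for r in runs if r.passed)
    # Assumption: with zero passes the cost per success is treated as unbounded.
    return total / passes if passes else float("inf")


def recall_at_k(retrieved_ids: list[str], oracle_ids: set[str], k: int) -> float:
    """Fraction of oracle memory IDs that appear among the first k retrieved IDs."""
    if not oracle_ids:
        raise ValueError("oracle set must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & oracle_ids) / len(oracle_ids)


runs = [Run(True, 1000, 200), Run(False, 800, 100), Run(True, 900, 300)]
print(tokens_per_successful_task(runs))  # 3300 tokens / 2 passes = 1650.0
print(recall_at_k(["m7", "m2", "m9", "m4"], {"m2", "m4", "m5"}, k=3))  # 1 of 3 oracle IDs
```

Charging failed runs to the numerator penalizes an arm that burns tokens without passing, which is one reasonable reading of "(input + output tokens) / passes".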

The 6 scenarios

Each scenario has a one-sentence pitch you can use directly:

| Id | Pitch | What it tests |
| --- | --- | --- |
| A Noisy Project Folder | "Your brain has 200 memories from 6 projects. I ask you to add a feature to project X. Do you find the 3 relevant memories?" | Retrieval under distractors (LongMemEval-S analog) |
| B Three Sessions, One Decision | "On Monday we picked Postgres. On Wednesday I rewrote the API. On Friday I add a new resource — does it still use Postgres?" | Multi-session continuity + pinned tier ablation |
| C The Contradiction Test | "Three weeks ago I told you tabs. Two weeks ago, spaces. Last week, tabs again. New file — which do you use?" | Decay-weighted recency + contradiction handling |
| D Skill Progressive Disclosure | "You have a pg-migration skill. I ask you to add a migration. Did you load the full SKILL.md, or just see the index entry and ignore it?" | Procedural skills: three-rung L0/L1/L2 token efficiency |
| E Continual Coding | "Five async bugs in the same repo, in order. Does session 5 finish faster because of sessions 1-4?" | Forward transfer + tokens per resolved task. The agent writes its own memories between bugs, which exercises the write side end-to-end. |
| F Abstention | "I never told you my deployment target. Where do you deploy this?" | Confabulation resistance: does the agent invent an answer or notice the gap? |

Read more: Methodology · Scenario details · Latest results
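To make Scenario C's "decay-weighted recency" concrete, here is one possible scoring scheme. It is purely illustrative: this page does not specify the decay function Brain Memory uses, so the exponential half-life, the 7-day default, and the name `resolve_preference` are all assumptions.

```python
from collections import defaultdict


def resolve_preference(observations, half_life_days=7.0):
    """Pick the value with the highest summed, exponentially decayed weight.

    observations: iterable of (age_days, value) pairs.
    half_life_days: assumed half-life; a statement this old counts half as much.
    """
    scores = defaultdict(float)
    for age_days, value in observations:
        scores[value] += 0.5 ** (age_days / half_life_days)
    winner = max(scores, key=scores.get)
    return winner, dict(scores)


# Scenario C's pitch: tabs three weeks ago, spaces two weeks ago, tabs last week.
history = [(21, "tabs"), (14, "spaces"), (7, "tabs")]
winner, scores = resolve_preference(history)
print(winner, scores)  # tabs {'tabs': 0.625, 'spaces': 0.25}
```

Under this assumed scheme the most recent statement dominates, and repeated older statements add only a small amount of extra weight.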

Arm matrix

Every scenario runs across multiple arms. The arms vary independently so we can attribute gains to specific features (Pinned Tier, Skills, recall, distractors), rather than just saying "with vs. without memory."

| Arm | What it does | What it isolates |
| --- | --- | --- |
| bare | No memory, no fixtures | Floor |
| fixture-only | Realistic project files, no brain (= old "without-brain") | Honest baseline |
| brain-real | Full brain via brain session-start, distractor haystack, pin + skills on | What ships in production |
| brain-no-recall | All oracle memories prepended verbatim | Quantifies long-context vs retrieval value |
| brain-no-pin | brain-real with pinned tier disabled | CoALA Phase-1 attribution |
| brain-no-skills | brain-real with skills disabled | CoALA Phase-2 attribution |
| brain-skills-L0 / brain-skills-loaded / brain-skills-all-loaded | Scenario D's progressive-disclosure ablation | Per-tier skill cost |
| dump-all-chrono / context-dump | Full memory contents concatenated | Upper bound: shows memory ≠ long-context |
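One way to picture the arm matrix is as a set of feature flags, where comparing two arms reveals exactly which feature a pass-rate difference can be attributed to. The sketch below is an assumption-laden illustration: the `Arm` dataclass, its field names, and the flag values (for example, that brain-real also includes the project fixtures) are inferred from the table, not taken from the benchmark's code.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Arm:
    """Hypothetical feature flags for one benchmark arm."""
    fixtures: bool = False  # realistic project files present
    recall: bool = False    # retrieval through the real brain CLI
    pinned: bool = False    # Pinned Tier enabled
    skills: bool = False    # procedural skills enabled


# Flag values are assumptions inferred from the arm descriptions.
ARMS = {
    "bare": Arm(),
    "fixture-only": Arm(fixtures=True),
    "brain-real": Arm(fixtures=True, recall=True, pinned=True, skills=True),
    "brain-no-pin": Arm(fixtures=True, recall=True, pinned=False, skills=True),
    "brain-no-skills": Arm(fixtures=True, recall=True, pinned=True, skills=False),
}


def differing_features(a: str, b: str) -> set[str]:
    """Return the flags on which two arms differ."""
    flags_a, flags_b = asdict(ARMS[a]), asdict(ARMS[b])
    return {name for name in flags_a if flags_a[name] != flags_b[name]}


print(differing_features("brain-real", "brain-no-pin"))     # {'pinned'}
print(differing_features("brain-real", "brain-no-skills"))  # {'skills'}
```

When two arms differ in exactly one flag, any gap in their results can be attributed to that feature, which is the point of running ablation arms rather than a single "with vs. without memory" comparison.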

Where to look next

  • Methodology — How retrieval is scored, why the judge is cross-family, what distractor haystacks contain, references to the SOTA papers the design follows.
  • Scenarios — The full pitch, oracle answer, and rubric for each of the six scenarios.
  • Results — Current numbers from the smoke runs, with raw JSON downloads.
  • Source: benchmark/ in the repository.