Benchmarks
Brain Memory ships with a controlled benchmark suite that measures whether persistent, retrieval-based memory actually helps a coding agent — under realistic noise, across multiple sessions, and against honest baselines.
The system itself is a direct implementation of the CoALA agent-memory model (Sumers et al., 2023): Pinned Tier maps to CoALA's semantic memory, procedural skills to procedural memory, and the session-start aggregator to CoALA's working-memory channel. The benchmark methodology follows the 2025-2026 state of the art: distractor haystacks (LongMemEval), cross-family LLM judges with explicit rubrics (preference-leakage mitigation, arXiv:2502.01534), continual-learning metrics on a repo-scoped task sequence (SWE-Bench-CL), and tokens-per-successful-task as the headline efficiency axis (Mem0 / BEAM). The full reference list is on the Methodology page.
This is a redesign of the original benchmark. The legacy 5-scenario suite (still on disk under scenarios/scenario-1-* through scenario-5-*) used regex evaluators and prepended every seeded memory verbatim — that tested long context, not memory architecture. The new suite shells out to the real brain CLI for retrieval and uses an LLM judge with explicit rubrics.
What the benchmark measures
Four headline axes, every scenario, every run:
| Metric | What it answers |
|---|---|
| Task pass rate | Did the produced code satisfy the rubric? Scored by a cross-family LLM judge (Claude judges Gemini, Gemini judges Claude) to mitigate preference leakage. |
| Tokens per successful task | (input + output tokens) / passes. The honest efficiency measure — Mem0/BEAM standard. |
| Recall@k | For arms that use the real brain CLI, did the right memory IDs surface in the top-k against the scenario's oracle set? |
| Judge rationale | Captured verbatim per run for spot-checks. |
Plus, where applicable: per-task pass rate (continual scenarios), forward-transfer Δ tokens, confabulation rate, and write-side cost (memorize + sleep + skill distillation).
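The three numeric headline metrics can be sketched in a few lines of Python. This is a minimal illustration, not the suite's actual scoring code: `RunResult`, its field names, and the helper functions are all hypothetical, and the sketch assumes that tokens from failed runs still count toward the numerator of tokens per successful task.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """One benchmark run. All field names here are hypothetical."""
    passed: bool          # judge verdict against the rubric
    input_tokens: int
    output_tokens: int
    retrieved_ids: list[str] = field(default_factory=list)  # ranked, best first


def task_pass_rate(runs: list[RunResult]) -> float:
    """Fraction of runs the judge marked as passing."""
    return sum(r.passed for r in runs) / len(runs) if runs else 0.0


def tokens_per_successful_task(runs: list[RunResult]) -> float:
    """(input + output tokens) / passes.

    Assumption: tokens spent on failed runs are included in the total,
    so an arm that fails often is penalised for the wasted spend.
    """
    passes = sum(r.passed for r in runs)
    total = sum(r.input_tokens + r.output_tokens for r in runs)
    return total / passes if passes else float("inf")


def recall_at_k(retrieved_ids: list[str], oracle_ids: set[str], k: int) -> float:
    """Share of the oracle memory IDs that appear in the top-k results."""
    if not oracle_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & oracle_ids) / len(oracle_ids)


def judge_family(agent_family: str) -> str:
    """Cross-family pairing: the judge never shares a family with the agent."""
    return {"claude": "gemini", "gemini": "claude"}[agent_family]
```

Returning infinity when no run passes keeps an arm with zero passes from looking cheap; a real harness might prefer to report that case separately.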
The 6 scenarios
Each scenario has a one-sentence pitch you can use directly:
| Scenario | Pitch | What it tests |
|---|---|---|
| A Noisy Project Folder | "Your brain has 200 memories from 6 projects. I ask you to add a feature to project X. Do you find the 3 relevant memories?" | Retrieval under distractors (LongMemEval-S analog) |
| B Three Sessions, One Decision | "On Monday we picked Postgres. On Wednesday I rewrote the API. On Friday I add a new resource — does it still use Postgres?" | Multi-session continuity + pinned tier ablation |
| C The Contradiction Test | "Three weeks ago I told you tabs. Two weeks ago, spaces. Last week, tabs again. New file — which do you use?" | Decay-weighted recency + contradiction handling |
| D Skill Progressive Disclosure | "You have a pg-migration skill. I ask you to add a migration. Did you load the full SKILL.md, or just see the index entry and ignore it?" | Procedural skills — three-rung L0/L1/L2 token efficiency |
| E Continual Coding | "Five async bugs in the same repo, in order. Does session 5 finish faster because of sessions 1-4?" | Forward transfer + tokens per resolved task. The agent writes its own memories between bugs — exercises the write side end-to-end. |
| F Abstention | "I never told you my deployment target. Where do you deploy this?" | Confabulation resistance — does the agent invent or notice the gap? |
Read more: Methodology · Scenario details · Latest results
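To make Scenario C concrete, here is one way decay-weighted recency could resolve the tabs/spaces contradiction. It is a sketch under stated assumptions, not Brain Memory's actual scoring: the exponential half-life, the 14-day default, and the `resolve_preference` function are all hypothetical.

```python
from collections import defaultdict


def resolve_preference(observations, half_life_days=14.0):
    """Pick the value with the highest decay-weighted support.

    observations: iterable of (value, age_days) pairs.
    Assumption: each observation's weight halves every `half_life_days`.
    """
    scores = defaultdict(float)
    for value, age_days in observations:
        scores[value] += 0.5 ** (age_days / half_life_days)
    if not scores:
        return None  # nothing stored: abstain rather than guess (Scenario F)
    return max(scores, key=scores.get)


# Scenario C: tabs three weeks ago, spaces two weeks ago, tabs last week.
history = [("tabs", 21), ("spaces", 14), ("tabs", 7)]
choice = resolve_preference(history)
```

Under these assumptions `choice` is `"tabs"`, because the most recent statement carries the largest weight and the older "tabs" statement adds to it. Returning `None` on an empty history mirrors the abstention behaviour Scenario F looks for.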
Arm matrix
Every scenario runs across multiple arms. The arms vary independently so we can attribute gains to specific features (Pinned Tier, Skills, recall, distractors), rather than just saying "with vs. without memory."
| Arm | What it does | What it isolates |
|---|---|---|
| bare | No memory, no fixtures | Floor |
| fixture-only | Realistic project files, no brain (= old "without-brain") | Honest baseline |
| brain-real | Full brain via brain session-start, distractor haystack, pin + skills on | What ships in production |
| brain-no-recall | All oracle memories prepended verbatim | Quantifies long-context vs retrieval value |
| brain-no-pin | brain-real with pinned tier disabled | CoALA Phase-1 attribution |
| brain-no-skills | brain-real with skills disabled | CoALA Phase-2 attribution |
| brain-skills-L0 / brain-skills-loaded / brain-skills-all-loaded | Scenario D's progressive-disclosure ablation | Per-tier skill cost |
| dump-all-chrono / context-dump | Full memory contents concatenated | Upper bound: proves memory ≠ long-context |
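One way to picture the arm matrix is as a set of feature flags, where an ablation arm differs from brain-real in exactly one flag. The sketch below is illustrative only: the `Arm` class, its flag names, and both helper functions are hypothetical, and it covers a subset of the arms in the table.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Arm:
    """Hypothetical feature-flag view of a benchmark arm."""
    name: str
    fixtures: bool = True   # realistic project files present
    recall: bool = False    # retrieval through the brain CLI
    pinned: bool = False    # pinned tier enabled
    skills: bool = False    # procedural skills enabled


ARMS = {
    arm.name: arm
    for arm in [
        Arm("bare", fixtures=False),
        Arm("fixture-only"),
        Arm("brain-real", recall=True, pinned=True, skills=True),
        Arm("brain-no-pin", recall=True, skills=True),
        Arm("brain-no-skills", recall=True, pinned=True),
    ]
}


def differing_flags(a: Arm, b: Arm) -> list[str]:
    """Names of the feature flags on which two arms disagree."""
    return [
        f.name
        for f in fields(Arm)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    ]


def attribution(pass_rates: dict[str, float], full: str, ablated: str) -> float:
    """Pass-rate gain attributed to the feature the ablated arm turns off."""
    return pass_rates[full] - pass_rates[ablated]
```

Because `differing_flags(ARMS["brain-real"], ARMS["brain-no-pin"])` contains only the pinned flag in this model, any pass-rate difference between those two arms can be attributed to the pinned tier rather than to memory in general.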
Where to look next
- Methodology — How retrieval is scored, why the judge is cross-family, what distractor haystacks contain, references to the SOTA papers the design follows.
- Scenarios — The full pitch, oracle answer, and rubric for each of the six scenarios.
- Results — Current numbers from the smoke runs, with raw JSON downloads.
- Source: benchmark/ in the repository.