Benchmark Methodology

Every choice in the harness is grounded in a specific finding from the 2023–2026 literature on long-term memory and LLM-as-judge evaluation. This page documents what each component does and why.

Cross-family judge panel

harness/judge.js evaluates each run against an explicit per-question rubric of binary criteria. Rather than a single judge, the harness uses a panel of three judges, every one from a different family than the agent under test — so no judge can preference-leak toward its own outputs (Preference Leakage, 2502.01534: same-family judging inflates scores ~24% vs ~3% cross-family). A panel of smaller, disjoint-family judges also beats a single large judge on human agreement with less intra-model bias (Panel of LLM evaluators / PoLL, 2404.18796).

For the current suite the agent under test is DeepSeek V4 Pro, judged by:

Judge	Family	Host
Gemini 2.5 Flash	Google	cloud API
Gemma-4 12B	Google (open)	local (Ollama)
Qwen-3.5 9B	Alibaba	local (Ollama)

A candidate passes only on a majority vote, and each rubric criterion is decided by majority across judges — so a single small-model judge's noise is averaged out rather than deciding the result. Every per-judge verdict and the panel agreement are stored for transparency. The panel composition is the one knob tied to the agent: any judge sharing the agent's family is excluded. (Gemma and Gemini share Google lineage — a minor judge-to-judge correlation — but neither is the DeepSeek family, which is what matters for agent objectivity.)
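The aggregation described above can be sketched as follows. This is a minimal illustration, not the code in harness/judge.js: the verdict shape (`judge`, `pass`, `criteria`), the criterion ids, the judge labels, and the definition of `agreement` are all assumptions made for the example.

```javascript
// Hypothetical verdict shape: { judge, pass, criteria: { [criterionId]: boolean } }.

// True when strictly more than half of the votes are true.
function majority(votes) {
  const yes = votes.filter(Boolean).length;
  return yes * 2 > votes.length;
}

// Combine per-judge verdicts into one panel result.
function aggregatePanel(verdicts) {
  // Overall outcome: majority of the judges' pass/fail votes.
  const passVotes = verdicts.map((v) => v.pass === true);
  const passed = majority(passVotes);

  // Each rubric criterion is decided by its own majority across judges.
  const criteria = {};
  for (const id of Object.keys(verdicts[0].criteria)) {
    criteria[id] = majority(verdicts.map((v) => v.criteria[id] === true));
  }

  // Assumed agreement measure: share of judges matching the panel outcome.
  const agreement =
    passVotes.filter((p) => p === passed).length / verdicts.length;

  // Per-judge verdicts are kept alongside the aggregate for transparency.
  return { passed, criteria, agreement, perJudge: verdicts };
}

const panelResult = aggregatePanel([
  { judge: "gemini-2.5-flash", pass: true, criteria: { c1: true, c2: true } },
  { judge: "gemma-4-12b", pass: true, criteria: { c1: true, c2: true } },
  { judge: "qwen-3.5-9b", pass: false, criteria: { c1: true, c2: false } },
]);
console.log(panelResult.passed, panelResult.criteria);
```

In this example one judge dissents on criterion `c2`, but the other two outvote it, so the dissent is recorded in `perJudge` without deciding the result.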

For pairwise judgments (arm-A vs arm-B), the judge is called twice with the candidate order swapped and the verdict kept only if it survives both orderings — the swap-consistency check from MT-Bench, 2306.05685. Each judgment uses a binary criteria list; LongMemEval reports 97% agreement with humans using this format (LongMemEval, 2410.10813).
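A minimal sketch of the swap-consistency check, assuming a hypothetical async `judgeFn(first, second)` that answers `"first"`, `"second"`, or `"tie"`. The function name, the return values, and the stub judges are illustrative and not taken from the harness.

```javascript
// Ask the judge twice with the candidate order swapped and keep the verdict
// only if it names the same arm both times.
async function swapConsistentVerdict(judgeFn, armA, armB) {
  const forward = await judgeFn(armA, armB); // arm A shown first
  const reversed = await judgeFn(armB, armA); // arm B shown first

  // Map positional answers back to arm labels.
  const forwardWinner =
    forward === "first" ? "A" : forward === "second" ? "B" : "tie";
  const reversedWinner =
    reversed === "first" ? "B" : reversed === "second" ? "A" : "tie";

  // Inconsistent verdicts are discarded (null) rather than counted.
  return forwardWinner === reversedWinner ? forwardWinner : null;
}

// Stub judges for illustration only.
// A purely position-biased judge always prefers whatever it sees first.
const positionBiased = async () => "first";
// A content-based judge prefers the answer that mentions the null check.
const contentBased = async (x, y) =>
  x.includes("null check") ? "first" : y.includes("null check") ? "second" : "tie";

const answerA = "adds a null check before the call";
const answerB = "retries the call three times";

swapConsistentVerdict(positionBiased, answerA, answerB).then((v) =>
  console.log("position-biased judge:", v) // null: verdict flipped with order
);
swapConsistentVerdict(contentBased, answerA, answerB).then((v) =>
  console.log("content-based judge:", v) // "A": verdict survived the swap
);
```

A judge whose answer depends only on presentation order can never survive both orderings, which is why the check filters out position bias.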

Same-family judging (GPT-4 grading GPT-4, Claude grading Claude) inflates scores and is common in older benchmarks. A cross-family panel is the stronger guard: it removes self-preference and averages out the per-judge noise that any single LLM judge carries.

Distractor haystacks

harness/distractors.js generates a deterministic pool of 50–1000 distractor memories spread across 6 fake projects, 12 topic clusters, and 8 memory types, plus per-scenario hard negatives (plausibly related but superseded memories). The same seed always yields the same corpus. Scenario A runs the full 1000-distractor haystack; the easier scenarios use 50–100.
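The "same seed, same corpus" property can be illustrated with a small seeded generator. This sketch assumes a mulberry32 PRNG and a hypothetical memory shape (`id`, `project`, `cluster`, `type`); the actual generator and fields in harness/distractors.js may differ.

```javascript
// mulberry32: a small 32-bit seeded PRNG returning floats in [0, 1).
// Used here instead of Math.random(), which cannot be seeded.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Build `count` distractor memories; all randomness flows through `rand`,
// so the output depends only on (seed, count).
function generateDistractors(seed, count) {
  const rand = mulberry32(seed);
  const pick = (n) => Math.floor(rand() * n);
  const pool = [];
  for (let i = 0; i < count; i++) {
    pool.push({
      id: `distractor-${i}`,
      project: `project-${pick(6)}`, // 6 fake projects
      cluster: `cluster-${pick(12)}`, // 12 topic clusters
      type: `type-${pick(8)}`, // 8 memory types
    });
  }
  return pool;
}

const firstPool = generateDistractors(42, 50);
const secondPool = generateDistractors(42, 50);
console.log(JSON.stringify(firstPool) === JSON.stringify(secondPool)); // true
```

Because no call reaches `Math.random()` or the clock, two runs with the same seed produce byte-identical corpora, which is what makes results comparable across machines.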

This matters because without distractors, recall is trivially perfect — a benchmark with 3 oracle memories all on-topic for the task can only ever measure whether the agent uses what it's handed, never whether the memory system can find it. LongMemEval-S/M built distractor haystacks of 50 / 500 sessions explicitly to expose this ceiling.

The pool is seeded into the isolated ~/.brain/ for any arm that declares seed: "scenario+distractors". Oracle memories are layered on top; the brain CLI's TF-IDF, decay, context-match, and spreading-activation scoring must surface the right ones.

Real brain CLI, not prompt injection

The harness shells out to the real brain session-start and brain recall CLIs in the isolated HOME. The agent receives whatever the production brain would surface for the current project context — no shortcuts, no manual memory concatenation.

For the single-shot agent, the content of each recalled memory is injected verbatim, at the same depth the retriever-baseline arms inject. The only thing that varies across arms is therefore the retrieval method, never how much of a surfaced memory the agent sees. (A live multi-turn agent would instead recall an index and read full content on demand; the single-shot path hands it the content directly, keeping the arms apples-to-apples.)

harness/recall-probe.js parses the JSON output and computes:

Recall@k: fraction of setup.oracle_memory_ids[] that appear in the top-k
NDCG@k: normalized discounted cumulative gain, which weights each hit by its rank position

This separates retrieval failure from application failure: if Recall@5 = 1.0 but the judge fails, the agent had the memory and ignored it.
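The two metrics can be sketched as below, assuming binary relevance (a memory is either an oracle memory or it is not) and a ranked list without duplicate ids. The function names and signatures are illustrative and may not match harness/recall-probe.js.

```javascript
// Fraction of oracle memories that appear in the top-k of the ranking.
function recallAtK(rankedIds, oracleIds, k) {
  if (oracleIds.length === 0) return 0;
  const topK = new Set(rankedIds.slice(0, k));
  const hits = oracleIds.filter((id) => topK.has(id)).length;
  return hits / oracleIds.length;
}

// Normalized DCG with binary gains: a hit at 0-based rank i contributes
// 1 / log2(i + 2), and the sum is divided by the best achievable sum.
function ndcgAtK(rankedIds, oracleIds, k) {
  const oracle = new Set(oracleIds);

  let dcg = 0;
  rankedIds.slice(0, k).forEach((id, i) => {
    if (oracle.has(id)) dcg += 1 / Math.log2(i + 2);
  });

  // Ideal ranking: every oracle memory packed at the top of the list.
  let idcg = 0;
  for (let i = 0; i < Math.min(k, oracle.size); i++) {
    idcg += 1 / Math.log2(i + 2);
  }
  return idcg === 0 ? 0 : dcg / idcg;
}

const ranked = ["m7", "m2", "m9", "m1", "m4"];
const oracle = ["m2", "m1", "m3"];
console.log(recallAtK(ranked, oracle, 5)); // 2 of 3 oracle memories surfaced
console.log(ndcgAtK(ranked, oracle, 5)); // below 1: hits sit at ranks 2 and 4
```

Recall@k ignores where in the top-k a memory lands, while NDCG@k rewards surfacing it earlier, so reporting both distinguishes "found it" from "found it near the top".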

N-arm matrix

harness/arm-runner.js runs each scenario across multiple arms in parallel. Each arm independently configures:

seed — none / scenario / scenario+distractors / distractors-only
memory_injection — none / session-start / recall / dump-bodies / dump-contents
pin — enable / disable the pinned tier
skills — enable / disable skills
skill_load — relevant / all (Scenario D progressive-disclosure ablation)

This means every result is attributable — when brain-real beats brain-no-pin, you know the pinned tier was the cause.
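As an illustration of why single-knob arms make results attributable, the sketch below defines two arms and lists the knobs on which they differ. The knob names come from the list above; the concrete values assigned to brain-real and brain-no-pin, and the boolean encoding of `pin` and `skills`, are assumptions for the example rather than the harness's actual configuration.

```javascript
// Hypothetical arm definitions (values are illustrative assumptions).
const arms = {
  "brain-real": {
    seed: "scenario+distractors",
    memory_injection: "session-start",
    pin: true,
    skills: true,
    skill_load: "relevant",
  },
  "brain-no-pin": {
    seed: "scenario+distractors",
    memory_injection: "session-start",
    pin: false,
    skills: true,
    skill_load: "relevant",
  },
};

// Return the knobs on which two arm configs differ.
// A clean ablation pair differs on exactly one knob.
function differingKnobs(a, b) {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  return [...keys].filter((key) => a[key] !== b[key]);
}

console.log(differingKnobs(arms["brain-real"], arms["brain-no-pin"])); // [ 'pin' ]
```

A check like this could run before a comparison is reported: if two arms differ on more than one knob, a score gap between them cannot be pinned on any single component.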

Continual mode

Scenario E runs in "continual mode": one persistent workspace per arm, five chronological bug-fix tasks, with the agent invoking brain memorize between tasks. This exercises the write side end-to-end — if the agent doesn't write a memory, task N+1's session-start payload reflects that.

This mirrors SWE-Bench-CL, which reorganizes SWE-Bench Verified into chronological repo-scoped sequences and measures forward transfer (Δ tokens between task 1 and task N), forgetting, and tokens-per-resolved-issue.

Pitfalls the harness avoids

Pitfall	Where it shows up	How the harness avoids it
Memory dumped verbatim into prompt	Old benchmarks' "with-memory" arm	`memory_injection: "session-start"` uses the real CLI; `dump-bodies`/`dump-contents` exist only as labelled ablation arms
No distractors → trivial recall	Most conversational QA benchmarks	50–1000 deterministic distractors per scenario, depending on difficulty
Regex evaluation → gameable	Anything using pattern matching for pass/fail	LLM judge with explicit rubric; regex kept only as a supplementary check
Same-family judging	GPT-4 grading GPT-4	Cross-family map enforced in `harness/judge.js`
Headline = "+N% accuracy"	First-generation memory benchmarks	Headline = tokens-per-successful-task; pass-rate and Recall@k reported alongside
Write-side cost ignored	Almost everything	Continual mode tracks write-side tokens separately
Ceiling / floor effects	Scenarios that are too easy or too hard	Multiple arms create a spread; baselines bound both ends
Benchmark memorization (leakage from training data)	Public benchmarks	Synthetic-but-grounded distractor corpus, deterministic seeding

Tokens-per-successful-task

The headline efficiency metric is:

tokens_per_success = total_tokens_across_runs / number_of_successful_runs

Token "overhead" from injecting memory only matters if it doesn't translate into more successes. Dividing by passes accounts for that: a memory arm that uses 1.5× the tokens but passes 2× as often spends 1.5 / 2 = 0.75× the tokens per success, so it still wins.
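A worked instance of the formula, using made-up run data: the run shape (`tokens`, `passed`) and the numbers are assumptions chosen to reproduce the 1.5× tokens, 2× passes case.

```javascript
// tokens_per_success = total tokens across all runs / number of passing runs.
// Failed runs still count toward the token total.
function tokensPerSuccess(runs) {
  const totalTokens = runs.reduce((sum, run) => sum + run.tokens, 0);
  const successes = runs.filter((run) => run.passed).length;
  // With zero successes the cost per success is unbounded.
  return successes === 0 ? Infinity : totalTokens / successes;
}

// Baseline arm: 4 runs at 1000 tokens each, 1 pass.
const baselineRuns = [
  { tokens: 1000, passed: true },
  { tokens: 1000, passed: false },
  { tokens: 1000, passed: false },
  { tokens: 1000, passed: false },
];

// Memory arm: 1.5x the tokens per run, but 2 passes instead of 1.
const memoryRuns = [
  { tokens: 1500, passed: true },
  { tokens: 1500, passed: true },
  { tokens: 1500, passed: false },
  { tokens: 1500, passed: false },
];

console.log(tokensPerSuccess(baselineRuns)); // 4000
console.log(tokensPerSuccess(memoryRuns)); // 3000
```

The memory arm spends more tokens in total (6000 vs 4000) yet costs 3000 tokens per success against the baseline's 4000, which is the 0.75× ratio.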

This framing comes from Mem0 (arxiv 2504.19413), which reports accuracy and context-tokens-per-query as co-equal axes rather than apologizing for token overhead.

Reproducibility

3 runs per scenario by default (raise to 5 via --runs 5 for statistical confidence)
Deterministic seeded RNG for distractors — same seed, same corpus, every time
Isolated workspace per run: fresh HOME, fresh brain, no cross-contamination
Median values reported so a single timeout doesn't poison the metric
Results saved per run to benchmark/results/ as JSON + Markdown
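The reason for reporting medians can be shown with a three-run example. The helper below is a generic sketch, and the token counts are invented for illustration, not measured results.

```javascript
// Median of a list of numbers; the input array is not mutated.
// Returns NaN for an empty list.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Three runs of one scenario; the third hit a timeout and burned far more tokens.
const tokensPerRun = [4100, 3800, 60000];

console.log(median(tokensPerRun)); // 4100
console.log(mean(tokensPerRun)); // far above any typical run
```

With the default of 3 runs the median is simply the middle run, so one outlier in either direction cannot move the reported value.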

References

Foundations

The Brain Memory architecture itself is a direct implementation of the CoALA agent-memory model, with companion influences from Park et al.'s recency · importance · relevance retrieval blend and Packer et al.'s paging-style working memory.

CoALA — Cognitive Architectures for Language Agents (arxiv 2309.02427) — Sumers, Yao, Narasimhan, Griffiths. The agent-memory model Brain implements. Pinned Tier (Phase 1), procedural skills (Phase 2), and budget-aware working memory (Phase 0) all map to CoALA's semantic / procedural / episodic decomposition.
MemGPT — Towards LLMs as Operating Systems (arxiv 2310.08560) — Packer et al. Paging-style memory management that motivated the budget-bounded session-start aggregator.
Generative Agents — Interactive Simulacra of Human Behavior (arxiv 2304.03442) — Park et al. Recency · importance · relevance retrieval blend that underlies Brain's TF-IDF + decay + salience scoring.
Ebbinghaus — Über das Gedächtnis (1885). Original forgetting curve. Brain's per-memory exponential decay rates and spaced-reinforcement boosts follow directly.

Memory benchmarks

LongMemEval (ICLR 2025, arxiv 2410.10813) — distractor haystacks (S / M / Oracle), abstention category, GPT-4o judge with 97% human agreement. The closest analog to Brain's Scenarios A and F.
MemoryAgentBench (arxiv 2507.05257) — four competencies: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. FactConsolidation directly inspired Scenario C (The Contradiction Test).
SWE-Bench-CL (arxiv 2507.00014) — repo-scoped chronological evaluation with forward-transfer / forgetting metrics. Template for Scenario E (Continual Coding).
Mem0 (arxiv 2504.19413) — tokens-per-query co-reported with accuracy. Source of the tokens-per-successful-task headline metric.
BEAM (arxiv 2510.27246) — benchmarks memory up to 10M tokens; structured memory beats a long-context window and the gap widens with scale.
LoCoMo (arxiv 2402.17753) — long-conversation memory benchmark; considered solved since 2025.
MIRIX (arxiv 2507.07957) — realistic synthetic-but-grounded memory benchmarks via multimodal screenshots.

Methodology — judging and benchmark hygiene

Preference Leakage in LLM-as-judge (arxiv 2502.01534) — same-family judging inflates scores (~24% vs ~3% cross-family). The benchmark judges with a panel of non-agent families.
Panel of LLM evaluators / PoLL (arxiv 2404.18796) — a jury of smaller, disjoint-family judges beats a single large judge on human agreement, with less intra-model bias. The basis for the three-judge panel.
MT-Bench — Judging LLM-as-a-Judge (arxiv 2306.05685) — documents position, verbosity, and self-enhancement bias; origin of the swap-consistency check.
When Judgment Becomes Noise (arxiv 2509.20293) — LLM-judge verdicts carry large unexplained variance; report uncertainty rather than aggregate it away. Why results ship with n=3 error bars.
Silent Judge — LLM evaluator shortcuts (arxiv 2509.26072) — shortcut biases that motivate rubric-based judging with explicit oracle answers.
LastingBench (arxiv 2506.21614) — benchmark-leakage defense. Brain's distractor corpus is deterministic synthetic data for the same reason.