Benchmark Results

This page tracks current measurements from the benchmark suite. The numbers are reproducible up to model-provider non-determinism: the harness itself is deterministic, and the same scenario JSON and distractor seed always produce the same memory pool. Regenerate any table below with `node harness/summarize.js results/<file>.json` and the statistics with `node harness/analyze.js results/<file>.json`.

warning

Run at n=3 per arm. The token differences are directional, not statistically significant: Mann–Whitney U is not significant at this sample size (see the 90% CIs in the stats file). The pass-rate gradient across arms is the main result. Two scenarios (C, F) are honest nulls, reported as-is rather than dropped.
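
To see why n=3 per arm cannot support a significance claim: an exact two-sided Mann–Whitney U test on two groups of three has only 20 equally likely rank arrangements under the null, so its smallest attainable p-value is 2/20 = 0.10. The sketch below enumerates them. It is a standalone Python illustration (standard library only), not part of the harness, and it assumes no tied values.

```python
from fractions import Fraction
from itertools import combinations

def min_two_sided_p(n_a: int, n_b: int) -> Fraction:
    """Smallest exact two-sided Mann-Whitney p-value for two arm sizes (no ties)."""
    ranks = range(1, n_a + n_b + 1)
    # U statistic of arm A under every possible assignment of ranks to arm A.
    u_values = [sum(c) - n_a * (n_a + 1) // 2 for c in combinations(ranks, n_a)]
    max_u = n_a * n_b
    # The most extreme outcomes are complete separation in either direction.
    extreme = sum(1 for u in u_values if u == 0 or u == max_u)
    return Fraction(extreme, len(u_values))

print(float(min_two_sided_p(3, 3)))  # 0.1
print(float(min_two_sided_p(5, 5)))  # about 0.0079
```

At 5 runs per arm the floor falls to roughly 0.008, so a real token difference could then register as significant.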

Run metadata

Field	Value
Date	2026-06-26
Agent under test	DeepSeek V4 Pro (single-shot, `deepseek-v4-pro`)
Judge	Cross-family panel — Gemini 2.5 Flash + Gemma-4 12B + Qwen-3.5 9B, majority vote, each criterion majority-voted
Runs per arm	3
Scenarios	A, B, C, D, F (E — continual — deferred)
Distractor haystack	up to 1000 deterministic memories (seed=42)
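
The per-criterion majority vote in the Judge row can be sketched as follows. This is an illustration rather than the harness's code: the judge keys, the criterion names, and the `majority_vote` helper are hypothetical. How per-criterion votes roll up into the rubric score and pass flag is not specified on this page, so the sketch stops at the per-criterion verdicts.

```python
from collections import Counter

def majority_vote(verdicts: list[bool]) -> bool:
    """Return True when more than half of the judges voted True."""
    counts = Counter(verdicts)
    return counts[True] > len(verdicts) / 2

# Hypothetical per-criterion verdicts from a three-judge panel.
panel = {
    "uses_project_convention": {"gemini": True, "gemma": True, "qwen": False},
    "cites_correct_decision": {"gemini": False, "gemma": True, "qwen": False},
}

per_criterion = {
    name: majority_vote(list(votes.values())) for name, votes in panel.items()
}
print(per_criterion)
```

With three judges there are no ties, which is one practical reason to keep the panel size odd.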

Why single-shot? The model receives the task plus whatever the arm injects (nothing, a retriever's top-k, brain's session-start payload, or the oracle) and replies in one turn. This isolates memory's effect — there is no agentic file exploration to rediscover conventions — and keeps token counts clean and comparable (no cache-read inflation). It mirrors the Mem0 / LongMemEval framing.

Scenario A — Noisy Project Folder (retrieval under 1000 distractors)

Arm	Tokens/success	Recall@5	Rubric score	Pass rate
no-memory (floor)	—	—	0.43	0%
vector (embeddings)	—	0.00	0.33	0%
keyword (BM25)	—	0.33	0.48	0%
brain-full	3,199	0.67	1.00	100%
brain-no-pin	5,135	0.33	0.67	67%
brain-no-recall	3,643	—	0.95	100%
oracle (ceiling)	3,572	1.00	1.00	100%
context-dump 8k	5,856	—	1.00	100%
context-dump 60k	20,689	—	1.00	100%

Headline. Under a 1000-distractor haystack, brain is the only retrieval method whose memories let the model succeed. Both BM25 (Recall@5 0.33) and a local vector store (Recall@5 0.00) fail to surface enough of the three oracle memories, and the model fails with them. Brain retrieves 2 of 3 (Recall@5 0.67), which is enough to pass, and it does so at the lowest tokens-per-success of any passing arm (3,199, below the oracle's 3,572 and far below the 8k and 60k dumps). Disabling the pinned tier drops brain's pass rate to 67% and halves its Recall@5 (0.67 to 0.33); this is the ablation that isolates pinning's value.
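
For reference, the two metrics in this table can be computed as below. Recall@k follows the text: the fraction of oracle memories that appear in the top k. The tokens-per-success formula (total tokens across runs divided by the number of passing runs) is an assumption about how the harness defines it; it is at least consistent with the 0%-pass arms showing no value. The memory IDs and token counts are made up.

```python
from typing import Optional

def recall_at_k(retrieved: list[str], oracle: set[str], k: int = 5) -> float:
    """Fraction of oracle memories found among the top-k retrieved IDs."""
    return len(oracle & set(retrieved[:k])) / len(oracle)

def tokens_per_success(tokens: list[int], passed: list[bool]) -> Optional[float]:
    """Total tokens across runs per passing run (assumed definition); None if no run passed."""
    successes = sum(passed)
    return sum(tokens) / successes if successes else None

oracle = {"m1", "m2", "m3"}
ranked = ["m1", "d7", "m3", "d2", "d9", "m2"]  # m2 falls just outside the top 5

print(round(recall_at_k(ranked, oracle), 2))                          # 0.67
print(tokens_per_success([1000, 1100, 1200], [True, True, False]))    # 1650.0
print(tokens_per_success([1000, 1100, 1200], [False, False, False]))  # None
```

Under this definition a failed run still costs tokens, so an arm that passes 2 of 3 runs is charged for all three.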

Scenario B — Three Sessions, One Decision (continuity)

Arm	Tokens/success	Recall@5	Rubric score	Pass rate
fixture-only (floor)	2,546	—	0.62	67%
keyword (BM25)	2,721	1.00	0.95	100%
brain-real	1,871	1.00	1.00	100%
brain-no-pin	2,072	1.00	1.00	100%
oracle (ceiling)	1,734	1.00	1.00	100%

Headline. Postgres was decided across three sessions despite a discarded Mongo prototype. Brain recalls the decision perfectly (Recall@5 1.00) and passes at the fewest tokens-per-success of any non-oracle arm (1,871). Here the haystack is small (100 distractors), so plain BM25 also retrieves the decision and passes; on this scenario brain's edge is efficiency, not correctness.

Scenario C — The Contradiction Test (tabs → spaces → tabs)

Arm	Tokens/success	Recall@5	Rubric score	Pass rate
fixture-only (floor)	1,218	—	0.78	67%
keyword (BM25)	1,775	1.00	0.72	33%
brain-real	1,572	1.00	0.83	67%
brain-no-pin	1,058	1.00	0.83	67%
oracle (ceiling)	915	1.00	0.83	67%
dump-all-chrono	469	—	0.94	100%

Headline: an honest null. Most arms, including the oracle, sit at 67%; only BM25 (33%) and dump-all-chrono (100%) depart from it. Indentation (tabs vs spaces) is a noisy signal to grade from a single-shot text reply, and the arm that does best is dump-all-chrono, which simply concatenates all three versions in time order so the latest (tabs) wins. Brain retrieves the final decision (Recall@5 1.00) but doesn't convert that into a clear pass-rate win here. We report it rather than hide it.
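
A minimal sketch of what a chronological dump does, assuming each memory carries a timestamp. The `Memory` type, its field names, and the dates are hypothetical; the harness's actual implementation is not shown on this page.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    timestamp: str  # ISO 8601 date, so string order matches time order
    text: str

def dump_all_chrono(memories: list[Memory]) -> str:
    """Concatenate every memory oldest-first so the most recent decision comes last."""
    ordered = sorted(memories, key=lambda m: m.timestamp)
    return "\n".join(f"[{m.timestamp}] {m.text}" for m in ordered)

history = [
    Memory("2026-02-10", "Switch indentation to spaces."),
    Memory("2026-01-05", "Use tabs for indentation."),
    Memory("2026-03-01", "Revert: use tabs for indentation."),
]
print(dump_all_chrono(history))
```

Nothing is ranked or filtered here; the model sees the whole history and the final line carries the current convention.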

Scenario D — Skill Progressive Disclosure (token efficiency)

Arm	Tokens/success	Rubric score	Pass rate
fixture-only (floor)	—	0.33	0%
brain-skills L0 (index only)	—	0.50	0%
brain-skills loaded (L1)	1,580	1.00	100%
brain-skills all-loaded	2,289	1.00	100%

Headline. Loading just the one relevant skill passes 100% at 1,580 tokens-per-success — ~31% leaner than dumping every skill body (2,289). The index-only (L0) and no-skills arms fail: a single-shot model can't act on a skill index the way an agent would (read the index, then load the matching SKILL.md). That on-demand load is exactly what the skills tier automates — and when it fires, it's both correct and the cheapest passing arm.
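
The on-demand load can be sketched as a two-step lookup: match the task against a small index (L0), then inject only the matching skill body (L1). Everything in this sketch is hypothetical, including the skill names, the keyword matching, and the in-memory bodies standing in for SKILL.md files. It illustrates the idea and is not brain's skills tier.

```python
from typing import Optional

# L0: compact index of skill name -> trigger keywords (hypothetical).
SKILL_INDEX = {
    "release-checklist": {"release", "changelog", "tag"},
    "db-migration": {"migration", "schema", "postgres"},
}

# Stand-ins for SKILL.md bodies on disk (hypothetical content).
SKILL_BODIES = {
    "release-checklist": "1. Update the changelog. 2. Tag the release.",
    "db-migration": "1. Write the migration. 2. Run it against staging first.",
}

def select_skill(task: str) -> Optional[str]:
    """Pick the skill whose keywords overlap the task most; None if nothing matches."""
    words = set(task.lower().split())
    best = max(SKILL_INDEX, key=lambda name: len(SKILL_INDEX[name] & words))
    return best if SKILL_INDEX[best] & words else None

def build_context(task: str) -> str:
    """Inject only the matching skill body (L1) instead of every body."""
    skill = select_skill(task)
    return SKILL_BODIES[skill] if skill else ""

print(build_context("add a schema migration for the users table"))

# Token saving reported above: 1,580 vs 2,289 tokens-per-success.
print(round((2289 - 1580) / 2289 * 100))  # 31
```

The second print reproduces the ~31% figure from the table: (2,289 - 1,580) / 2,289.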

Scenario F — Abstention (no confabulation)

Arm	Tokens/success	Rubric score	Pass rate
fixture-only (floor)	775	1.00	100%
keyword (BM25)	1,238	0.89	67%
brain-real	920	1.00	100%
oracle (ceiling)	890	1.00	100%

Headline: a second honest null. Asked to set up a deployment pipeline with no deployment target in memory, DeepSeek abstains correctly with or without brain (100%); the base model already declines to invent a target. A noisy keyword retriever, which injects irrelevant top-k results, actually drags the pass rate down to 67%. Brain's memory neither helps nor hurts abstention in this case.

Reading the suite

Brain wins where it is designed to: retrieval under heavy noise (A), where it passes while the BM25 and vector baselines fail, and procedural-skill efficiency (D), in both cases at the lowest token cost among passing arms. It is efficient and competitive on continuity (B), and neutral on contradiction (C) and abstention (F), where the base model needs no memory. That mixed, baseline-anchored picture (floors, oracle ceilings, real retriever baselines, and brain ablations across the suite) is the point: the wins are attributable to the memory system, and the nulls are reported honestly.

Raw JSON →

Retrieval scoring calibration — in-vivo probe (July 2026)

info

This section is not part of the controlled suite above — no judge panel, no distractor haystack, no floor/ceiling arms. It is a small in-vivo probe on a real working brain, run to measure a scoring-calibration fix. Read it as engineering verification, not as a benchmark claim.

The problem. BM25 relevance was normalized per-query by the top score, so the best available match for any query reported relevance 1.0 — including queries the brain knew nothing about. A four-term nonsense control query returned ten results against a real 177-memory brain, the top one scoring a confident-looking composite of 0.679. An agent trusting those scores injects noise into its context.

The fix (post-beta.30). Three coordinated changes: relevance is scaled by IDF-weighted query coverage (matching one term of a four-term query caps relevance near 0.25, however strong that single match); explicit-query recall applies a relevance floor so zero-relevance memories can't pad the top-N on strength alone; and spreading-activation sources are relevance-gated in query mode — activation spreads outward from the query-relevant set, so a strong-but-irrelevant clique can't rescue itself past the floor, while associates of genuinely relevant memories still surface.
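
To make the first two changes concrete, here is a minimal sketch contrasting top-score normalization with IDF-weighted coverage scaling plus a relevance floor. It is an illustration under assumptions, not brain's scoring code: the function names, the IDF values, the raw scores, and the 0.2 floor are all made up, and spreading activation is left out.

```python
def normalized_by_top(raw_scores: dict[str, float]) -> dict[str, float]:
    """Old behavior: divide by the best raw score, so the top hit is always 1.0."""
    top = max(raw_scores.values())
    return {mem: score / top for mem, score in raw_scores.items()}

def coverage(query_terms: list[str], matched: set[str], idf: dict[str, float]) -> float:
    """Share of the query's total IDF weight carried by the terms a memory matches."""
    total = sum(idf[t] for t in query_terms)
    return sum(idf[t] for t in query_terms if t in matched) / total

def calibrated(raw_scores, matches, query_terms, idf, floor=0.2):
    """New behavior: scale relevance by coverage, then drop results under the floor."""
    base = normalized_by_top(raw_scores)
    scaled = {m: base[m] * coverage(query_terms, matches[m], idf) for m in base}
    return {m: r for m, r in scaled.items() if r >= floor}

query = ["zebra", "quantum", "lasagna", "oboe"]    # nonsense control query
idf = {term: 2.0 for term in query}                # equal weights, for simplicity
raw = {"mem-a": 4.2, "mem-b": 2.1}                 # each matches one query term
matches = {"mem-a": {"zebra"}, "mem-b": {"oboe"}}

print(normalized_by_top(raw))                # {'mem-a': 1.0, 'mem-b': 0.5}
print(calibrated(raw, matches, query, idf))  # {'mem-a': 0.25}
```

With equal IDF weights, a memory matching one of four query terms has coverage 0.25, so even the strongest single-term match cannot report relevance above 0.25.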

The probe. A real working brain (177 memories, 1,245 association edges), 20 tag-derived topical queries with known target memories, plus the nonsense control. Same brain, same queries, before and after:

Metric	Before (beta.30)	After
hit@1	90%	100%
hit@3	100%	100%
MRR	0.950	1.000
Nonsense control	10 results, top score 0.679, relevance "1.0"	1 result, score 0.396, relevance 0.213
Recall latency (p50)	49 ms	46 ms

Coverage scaling did not just fix the calibration — it improved ranking: the two probe queries whose targets previously ranked #2 moved to #1, because partial-match distractors no longer outrank full-coverage targets.
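
The "Before" column is consistent with 18 of the 20 targets ranking first and 2 ranking second, as the paragraph above describes. A short check follows, with the rank lists reconstructed from that description rather than taken from the probe's raw output:

```python
def hit_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose target appears at rank k or better."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr(ranks: list[int]) -> float:
    """Mean reciprocal rank of the target memory across queries."""
    return sum(1 / r for r in ranks) / len(ranks)

before = [1] * 18 + [2] * 2  # reconstructed: two targets ranked #2
after = [1] * 20             # reconstructed: every target ranked #1

print(hit_at_k(before, 1), hit_at_k(before, 3), mrr(before))  # 0.9 1.0 0.95
print(hit_at_k(after, 1), hit_at_k(after, 3), mrr(after))     # 1.0 1.0 1.0
```

Each rank-2 result contributes 0.5 instead of 1.0 to the reciprocal-rank sum, which is why two of them in 20 queries give an MRR of 0.950.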

Caveats. n=20, a single brain, self-run, and tag-derived queries favor the index's weighted tag field. The honest headline is the calibration change — irrelevant queries now return honestly low scores or nothing — with the ranking improvement as directional evidence pending a controlled re-run (below).

What's next

Re-run the controlled suite (Scenario A especially) on the calibrated scoring; coverage scaling and the relevance floor should sharpen brain-full's Recall@5 under the 1000-distractor haystack.
Run more repetitions (5–10 per arm) to put confidence intervals on the token metric.
Validate Scenario E (continual coding, write-side memorize) end-to-end.
Add a second agent under test (e.g. a stronger reasoning model) to separate memory effects from base-model competence.