Benchmark Results

This page tracks current measurements from the benchmark suite. The harness is deterministic up to model-provider non-determinism: the same scenario JSON and distractor seed always produce the same memory pool, so any run-to-run variation comes from the model provider.
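To illustrate how a fixed seed can yield an identical memory pool on every run, here is a minimal sketch using the well-known mulberry32 generator. The function `buildDistractorPool`, the `mem_dist_*` ids, and the topic list are hypothetical and are not taken from the harness; the real generator may work differently.

```javascript
// Small seeded PRNG (mulberry32): same seed, same sequence of numbers.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical topic list, for illustration only.
const TOPICS = ['logging', 'auth', 'ci', 'styling', 'database'];

// Builds `count` distractor memories whose content depends only on `seed`.
function buildDistractorPool(seed, count = 200) {
  const rand = mulberry32(seed);
  const pool = [];
  for (let i = 0; i < count; i++) {
    const topic = TOPICS[Math.floor(rand() * TOPICS.length)];
    pool.push({
      id: `mem_dist_${String(i).padStart(3, '0')}`, // hypothetical id scheme
      topic,
      weight: Number(rand().toFixed(4)),
    });
  }
  return pool;
}

const poolA = buildDistractorPool(42);
const poolB = buildDistractorPool(42);
// Both pools are built from seed 42, so they serialize identically.
console.log(JSON.stringify(poolA) === JSON.stringify(poolB));
```

Because no call to `Math.random()` or the system clock is involved, the pool is a pure function of the seed and the count.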

Warning: Results below are smoke runs (1 run per arm) intended to validate that the harness works end-to-end and surface real signal. Full statistical-confidence runs (3 to 5 runs × all 6 scenarios × multiple agents) will follow. Treat individual numbers as directional, not definitive.

Run metadata

| Field | Value |
| --- | --- |
| Date | 2026-05-28 |
| Scenario | A — The Noisy Project Folder |
| Runs per arm | 1 (smoke) |
| Distractor haystack | 200 deterministic memories (seed=42) |
| Oracle memories | 3 (mem_orcl_cache_redis, mem_orcl_cursor_pagination, mem_orcl_vue_composable) |
| Agents | Gemini Flash · OpenCode → DeepSeek V4 Pro |
| Judge | Cross-family (Claude judges Gemini; Claude also judges OpenCode/DeepSeek) |
| Rubric size | 7 binary criteria |

Scenario A — Gemini Flash

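| Arm | Input | Output | Total | Tok/success | Recall@5 | Pass |
| --- | ---: | ---: | ---: | ---: | ---: | --- |
| bare | 21,749 | 2,356 | 24,105 | 24,105 |  | yes |
| fixture-only | 15,171 | 1,663 | 16,834 | 16,834 |  | yes |
| brain-real | 25,535 | 2,265 | 27,800 | 27,800 | 0.33 | yes |
| brain-no-recall (dump bodies) | 40,381 | 3,218 | 43,599 | 43,599 |  | yes |
| brain-no-pin | 80,952 | 5,386 | 86,338 | 86,338 | 0.33 | yes |
| context-dump (upper bound) | 48,132 | 2,624 | 50,756 | 50,756 |  | yes |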

Raw JSON →
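For readers unfamiliar with the Recall@5 column, the sketch below uses the standard definition of recall at k: the fraction of oracle memories that appear among the top k retrieved items. It is an assumption that the harness computes the metric exactly this way, and the retrieved list is made up for illustration. With 3 oracle memories, a value of 0.33 corresponds to one oracle memory appearing in the top 5.

```javascript
// Recall@k: share of oracle (relevant) ids found in the first k retrieved ids.
function recallAtK(retrievedIds, oracleIds, k = 5) {
  if (oracleIds.length === 0) return 0;
  const topK = new Set(retrievedIds.slice(0, k));
  const hits = oracleIds.filter((id) => topK.has(id)).length;
  return hits / oracleIds.length;
}

// Oracle ids as listed in the run metadata.
const oracle = [
  'mem_orcl_cache_redis',
  'mem_orcl_cursor_pagination',
  'mem_orcl_vue_composable',
];

// Hypothetical retrieval result: one oracle memory among four distractors.
const retrieved = [
  'mem_dist_017',
  'mem_orcl_cache_redis',
  'mem_dist_101',
  'mem_dist_042',
  'mem_dist_150',
];

console.log(recallAtK(retrieved, oracle).toFixed(2)); // prints "0.33"
```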

Headline: brain-real uses 28K tokens, only about 15% more than bare (24K), while the other memory arms (brain-no-recall, context-dump, brain-no-pin) cost 1.6× to 3.1× as much as brain-real. brain-no-pin at 86K is the dramatic case: turning the pinned tier off roughly triples token usage relative to brain-real on this scenario.

All arms pass the judge at 1 run, so this isn't a correctness differentiator yet — it's an efficiency one. Multi-run results will surface whether the harder ablation arms (brain-no-pin, context-dump) also degrade correctness under retry pressure.
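The efficiency comparison can be re-derived from the Total column of the Gemini table. The following is a small sketch of that arithmetic; the object and function names are illustrative and not part of the harness.

```javascript
// Total tokens per arm, copied from the Scenario A Gemini Flash table.
const geminiTotals = {
  'bare': 24105,
  'fixture-only': 16834,
  'brain-real': 27800,
  'brain-no-recall': 43599,
  'brain-no-pin': 86338,
  'context-dump': 50756,
};

// Cost of `arm` expressed as a multiple of `baseline`.
function costRatio(totals, arm, baseline) {
  return totals[arm] / totals[baseline];
}

// Overhead of brain-real over bare, in percent.
const overheadPct = (costRatio(geminiTotals, 'brain-real', 'bare') - 1) * 100;
console.log(overheadPct.toFixed(1)); // prints "15.3"

// The other memory arms relative to brain-real.
for (const arm of ['brain-no-recall', 'context-dump', 'brain-no-pin']) {
  console.log(arm, costRatio(geminiTotals, arm, 'brain-real').toFixed(2));
}
// prints 1.57, 1.83 and 3.11 respectively
```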

Scenario A — OpenCode → DeepSeek V4 Pro

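| Arm | Input | Output | Total | Tok/success | Recall@5 | Pass |
| --- | ---: | ---: | ---: | ---: | ---: | --- |
| bare | 19,624 | 418 | 20,042 | 20,042 |  | yes |
| fixture-only | 13,548 | 214 | 13,762 | 13,762 |  | yes |
| brain-real | 0 | 0 | timeout |  |  | no |
| brain-no-recall | 19,418 | 302 | 19,720 | 19,720 |  | yes |
| brain-no-pin | 17,288 | 221 | 17,509 | 17,509 | 0.33 | yes |
| context-dump | 0 | 0 | timeout |  |  | no |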

Raw JSON →

Headline & caveat: Two arms timed out at the 300-second prompt limit — brain-real and context-dump. DeepSeek V4 Pro is slower per call than Gemini Flash, and the larger prompts (session-start with pinned tier; full memory contents dumped) exceeded the budget. The harness timeout has since been raised to 600s and a SIGKILL fallback added (see benchmark/harness/agents/opencode.js); the next run should not exhibit these timeouts.
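A timeout with a SIGKILL fallback is commonly built as a two-stage kill: send SIGTERM when the budget expires, then SIGKILL if the process has not exited after a short grace period. The sketch below shows that pattern with Node's `child_process.spawn`. It is not the code in benchmark/harness/agents/opencode.js; the function name `runWithTimeout`, the `graceMs` option, and its default value are assumptions made for illustration.

```javascript
const { spawn } = require('child_process');

// Runs a command and resolves with its exit status and captured stdout.
// After `timeoutMs` the child gets SIGTERM; if it is still alive
// `graceMs` later, it gets SIGKILL, which cannot be caught or ignored.
function runWithTimeout(command, args, { timeoutMs = 600000, graceMs = 5000 } = {}) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ['ignore', 'pipe', 'pipe'] });
    let stdout = '';
    let timedOut = false;
    let killTimer = null;

    child.stdout.setEncoding('utf8');
    child.stdout.on('data', (chunk) => { stdout += chunk; });
    child.stderr.resume(); // drain stderr so a full pipe cannot stall the child

    const termTimer = setTimeout(() => {
      timedOut = true;
      child.kill('SIGTERM');
      killTimer = setTimeout(() => child.kill('SIGKILL'), graceMs);
    }, timeoutMs);

    const clearTimers = () => {
      clearTimeout(termTimer);
      clearTimeout(killTimer);
    };

    child.on('error', (err) => { clearTimers(); reject(err); });
    child.on('close', (code, signal) => {
      clearTimers();
      resolve({ code, signal, timedOut, stdout });
    });
  });
}
```

A caller can then record an arm as a timeout when `timedOut` is true instead of waiting indefinitely on a stuck agent process.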

What's interesting in the four arms that completed: DeepSeek's brain-no-pin uses 17,509 tokens, lower than bare (20,042). This is the opposite of Gemini's behavior. DeepSeek appears to use memory context to short-circuit tool-use iterations; Gemini lets memory context extend its exploration. Per-agent behavior matters, so generalizations across agents will need multiple-run, multi-agent data.

Cross-agent reading

The same scenario produces very different patterns across models:

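| Arm | Gemini Flash (total tokens) | OpenCode/DeepSeek V4 Pro (total tokens) |
| --- | ---: | ---: |
| bare | 24,105 | 20,042 |
| fixture-only | 16,834 | 13,762 |
| brain-real | 27,800 | timeout |
| brain-no-recall | 43,599 | 19,720 |
| brain-no-pin | 86,338 | 17,509 |
| context-dump | 50,756 | timeout |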

DeepSeek runs much tighter on tokens overall but is more sensitive to long prompts: 2 of its 6 arms timed out, versus none for Gemini. Gemini consumes far more tokens at the high end (brain-no-pin 86K vs DeepSeek 18K) but completes more reliably. This is exactly the kind of comparison the arm matrix is built to surface.
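One way to make the cross-agent contrast explicit is to express each arm as a percentage change against that agent's own bare run, leaving timed-out arms as missing values. This is a sketch over the totals in the table above; the data layout and the helper name `pctVsBare` are illustrative assumptions.

```javascript
// Total tokens per arm and agent; null marks an arm that timed out.
const totals = {
  gemini: {
    'bare': 24105, 'fixture-only': 16834, 'brain-real': 27800,
    'brain-no-recall': 43599, 'brain-no-pin': 86338, 'context-dump': 50756,
  },
  deepseek: {
    'bare': 20042, 'fixture-only': 13762, 'brain-real': null,
    'brain-no-recall': 19720, 'brain-no-pin': 17509, 'context-dump': null,
  },
};

// Percent change of `arm` relative to the same agent's bare arm.
// Returns null when the arm has no total (timeout).
function pctVsBare(agentTotals, arm) {
  const value = agentTotals[arm];
  if (value == null) return null;
  return (value / agentTotals['bare'] - 1) * 100;
}

for (const [agent, agentTotals] of Object.entries(totals)) {
  const delta = pctVsBare(agentTotals, 'brain-no-pin');
  console.log(agent, delta.toFixed(1));
}
// prints "gemini 258.2" and "deepseek -12.6"
```

The opposite signs for brain-no-pin are the per-agent difference described above; with one run per arm, the magnitudes remain directional only.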

What's next

Pending runs:

  • 5 runs × all 6 scenarios × Gemini Flash — for statistical confidence on the headline metrics
  • 1 run × all 6 scenarios × DeepSeek (V3.2 / deepseek-chat to keep cost manageable, then optionally V4 Pro on a subset)
  • Scenario E continual mode validated end-to-end (write-side memorize step)

Results will be published as JSON in /public/benchmarks/ and summarized inline on this page.