# <YYYY-MM-DD> - <benchmark-name>
**Commit:** `<sha>`
**Bench:** LongMemEval `_s` / coding-agent-life-v1 / ...
**N:** 500 / 15 / ...
**K:** 5
**Hardware:** macos-15 / ubuntu-22.04 / ...
**OpenAI model:** text-embedding-3-small
**Anthropic model:** N/A (no LLM in retrieval loop)
## Headline
agentmemory-hybrid: **R@5 = XX.XX%**, P@5 = XX.XX%, p50 latency = XXms
Beats grep baseline by +X.Xpt R@5, vector by +X.Xpt R@5.
## Per-adapter
| Adapter | P@5 | R@5 | Hit rate | p50 latency |
|---|---|---|---|---|
| grep | | | | |
| vector | | | | |
| agentmemory-hybrid | | | | |
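
The columns above are not defined in this template, so the sketch below assumes the conventional per-question definitions: P@K is hits divided by K, R@K is hits divided by the number of relevant sessions, hit rate is whether at least one relevant session appears in the top K, and p50 latency is the median over per-query timings. The function name and argument shapes are hypothetical, and the actual eval harness may compute these differently.

```python
from statistics import median


def precision_recall_hit(ranked_ids, relevant_ids, k=5):
    """Score one question (assumed definitions, not taken from the harness).

    ranked_ids: session IDs, best first, assumed already deduplicated.
    relevant_ids: set of session IDs that count as correct for the question.
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for sid in top_k if sid in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    hit = 1.0 if hits > 0 else 0.0
    return precision, recall, hit


def p50(latencies_ms):
    """Median latency in milliseconds over all measured queries."""
    return median(latencies_ms)
```

Averaging each value over all N questions would give the table entries; multiply by 100 to report percentages.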
## Per-question-type
| Type | grep R@5 | vector R@5 | agentmemory R@5 |
|---|---|---|---|
| single-session-bug | | | |
| single-session-refactor | | | |
| preference | | | |
| multi-session-causal | | | |
| temporal | | | |
## Methodology
- Sessions ingested via `POST /agentmemory/remember` with `type=eval-session`
- Queries hit `POST /agentmemory/smart-search` with `limit=k*4`
- No LLM in retrieval loop. Direct rank from hybrid scoring.
- Ranked results are deduplicated by `sessionId` before truncating to K
- Latency measured as init+query for LongMemEval (per-question fresh state), query-only for coding-life (shared state)
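
The dedup-then-truncate step can be sketched as follows. This is a minimal illustration, assuming the search response is a best-first list of objects that each carry a `sessionId` field; the response shape and the function name are assumptions, not the documented API. Over-fetching with `limit=k*4` presumably leaves enough distinct sessions to fill K slots after duplicates are dropped.

```python
def dedup_by_session(results, k=5):
    """Keep the first (highest-ranked) result per session, then cut to k.

    results: list of dicts sorted best-first, each with a "sessionId" key
             (hypothetical shape of the smart-search response).
    """
    seen = set()
    ranked = []
    for result in results:
        sid = result["sessionId"]
        if sid in seen:
            continue  # a better-ranked result from this session is already kept
        seen.add(sid)
        ranked.append(sid)
        if len(ranked) == k:
            break
    return ranked
```

If fewer than K distinct sessions come back, the list is simply shorter than K; whether P@K should still divide by K in that case is a choice the harness has to make.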
## Reproduce
```sh
git checkout <sha>
npm install --legacy-peer-deps
OPENAI_API_KEY=sk-... AGENTMEMORY_BASE_URL=http://localhost:3111 \
npm run eval:longmemeval -- --stratify 10
```
## Notes
<what surprised, what regressed, what's load-bearing>