# <YYYY-MM-DD> <benchmark-name>

**Commit:** `<sha>`
**Bench:** LongMemEval `_s` / coding-agent-life-v1 / ...
**N:** 500 / 15 / ...
**K:** 5
**Hardware:** macos-15 / ubuntu-22.04 / ...
**OpenAI model:** text-embedding-3-small
**Anthropic model:** N/A (no LLM in retrieval loop)

## Headline

agentmemory-hybrid: **R@5 = XX.XX%**, P@5 = XX.XX%, p50 latency = XXms

Beats the grep baseline by +X.X pt R@5 and the vector baseline by +X.X pt R@5.

## Per-adapter

| Adapter | P@5 | R@5 | Hit rate | p50 latency |
|---|---|---|---|---|
| grep | | | | |
| vector | | | | |
| agentmemory-hybrid | | | | |
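
For readers filling in the table, here is a minimal sketch of how the columns could be computed for one question, assuming the standard information-retrieval definitions (P@K = relevant hits in the top K divided by K, R@K = relevant hits in the top K divided by the number of relevant sessions, hit rate = share of questions with at least one relevant hit). The function names and the sample IDs are hypothetical and not taken from the eval harness.

```typescript
// Hypothetical helpers; `ranked` is assumed to be a best-first list of
// unique session IDs, `relevant` the gold session IDs for the question.

function hitsAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return ranked.slice(0, k).filter((id) => relevant.has(id)).length;
}

// Precision@K: assumes the denominator is K even if fewer results came back.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return k > 0 ? hitsAtK(ranked, relevant, k) / k : 0;
}

// Recall@K: fraction of the relevant sessions found in the top K.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return relevant.size > 0 ? hitsAtK(ranked, relevant, k) / relevant.size : 0;
}

// Per-question hit indicator; averaging it over all questions gives hit rate.
function hitAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return hitsAtK(ranked, relevant, k) > 0 ? 1 : 0;
}

// p50 latency: median of the per-question latencies in milliseconds.
function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Illustrative data only: one relevant session ("s1") appears in the top 5.
const sampleRanked = ["s3", "s7", "s1", "s9", "s4"];
const sampleRelevant = new Set(["s1", "s2"]);
console.log(precisionAtK(sampleRanked, sampleRelevant, 5)); // 0.2
console.log(recallAtK(sampleRanked, sampleRelevant, 5)); // 0.5
console.log(median([12, 40, 18, 25])); // 21.5
```

If the harness divides precision by the number of results actually returned rather than by K, the P@5 column will differ for questions with fewer than five results, so it is worth stating which convention a given run used.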

## Per-question-type

| Type | grep R@5 | vector R@5 | agentmemory-hybrid R@5 |
|---|---|---|---|
| single-session-bug | | | |
| single-session-refactor | | | |
| preference | | | |
| multi-session-causal | | | |
| temporal | | | |

## Methodology

- Sessions ingested via `POST /agentmemory/remember` with `type=eval-session`
- Queries hit `POST /agentmemory/smart-search` with `limit=k*4`
- No LLM in the retrieval loop; results are ranked directly by hybrid scoring.
- Ranked results are deduplicated by `sessionId` before truncating to K.
- Latency is measured as init + query for LongMemEval (fresh state per question) and as query-only for coding-life (shared state).
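
The over-fetch and dedup steps in the list above can be sketched as follows. This is an illustration under assumptions, not the harness code: the `SearchHit` shape, the `score` field, and the function name are hypothetical, and the hits are assumed to arrive sorted best-first from `smart-search`.

```typescript
// Hypothetical result shape; the real smart-search response may differ.
interface SearchHit {
  sessionId: string;
  score: number;
}

// Keep the first (highest-ranked) hit per session, then stop at K sessions.
function dedupeBySession(hits: SearchHit[], k: number): string[] {
  const seen = new Set<string>();
  const sessions: string[] = [];
  for (const hit of hits) {
    if (sessions.length >= k) break;
    if (seen.has(hit.sessionId)) continue;
    seen.add(hit.sessionId);
    sessions.push(hit.sessionId);
  }
  return sessions;
}

// Illustrative data only: six hits spanning four sessions, K = 3.
const sampleHits: SearchHit[] = [
  { sessionId: "a", score: 0.91 },
  { sessionId: "a", score: 0.88 },
  { sessionId: "b", score: 0.8 },
  { sessionId: "c", score: 0.75 },
  { sessionId: "b", score: 0.7 },
  { sessionId: "d", score: 0.6 },
];
console.log(dedupeBySession(sampleHits, 3)); // [ 'a', 'b', 'c' ]
```

Requesting `k*4` hits gives the dedup step headroom: if several top hits belong to the same session, there are still enough distinct sessions left to fill K slots.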

## Reproduce

```sh
git checkout <sha>
npm install --legacy-peer-deps
OPENAI_API_KEY=sk-... AGENTMEMORY_BASE_URL=http://localhost:3111 \
  npm run eval:longmemeval -- --stratify 10
```

## Notes

<what surprised, what regressed, what's load-bearing>