08 — Eval

Six suites ship with the library. Run them all:

memnex eval --suite full

Suites

| Suite | Measures | Target |
|---|---|---|
| identity_resolution | F1 on resolving synthetic identifiers | >99% deterministic, >85% fuzzy |
| recall | F1 on factual questions after cross-channel sessions | >80% (vs Mem0 ≈49%, Zep ≈64%) |
| handoff | Voice → WhatsApp info retention + noise | >90% retention, <10% noise |
| latency | p50/p95/p99 for write/read/resolve over 500 iterations | read p50 <5 ms (cached) |
| conflict | Precision/recall on contradictory fact pairs | >85% precision, >80% recall |
| load | Concurrent ops/s at N agents | scales linearly to N=10k |
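
For readers unfamiliar with the metrics named in the table, here is a minimal sketch of how precision/recall/F1 and latency percentiles are commonly computed. It is not taken from the memnex source: the function names, the nearest-rank percentile method, and the timing loop are assumptions made for illustration only.

```python
import math
import time


def precision_recall_f1(tp: int, fp: int, fn: int) -> dict:
    """Standard precision/recall/F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile (an assumed method; the library may interpolate instead)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_operation(op, iterations: int = 500) -> dict:
    """Call op() `iterations` times and summarize wall-clock durations in milliseconds."""
    durations_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        op()
        durations_ms.append((time.perf_counter() - start) * 1000)
    return {
        "p50": percentile(durations_ms, 50),
        "p95": percentile(durations_ms, 95),
        "p99": percentile(durations_ms, 99),
    }
```

With counts of tp=1, fp=0, fn=2, `precision_recall_f1` gives a precision of 1.0 and a recall of about 0.33, which shows how a suite can score perfect precision while missing most of the true pairs.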

Datasets

Small, reproducible, shipped in-tree:

Swap in your own — same schema.

Sample output

{
  "results": {
    "identity_resolution": {"f1": 1.0, "precision": 1.0, "recall": 1.0},
    "recall":              {"f1": 1.0, "questions": 3},
    "handoff":             {"retention": 0.67, "format_appropriate_rate": 1.0},
    "latency":             {"write_ms": {"p50": 0.19, "p95": 0.23, "p99": 0.26}},
    "conflict":            {"precision": 1.0, "recall": 0.33, "tp": 1, "fp": 0, "fn": 2},
    "load":                {"throughput_ops_s": 12092}
  }
}

(Measured on the in-memory backend. Postgres and Redis backends will show higher latency, with the same ratios.)
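
A report like the one above can be checked against the targets from the Suites table in a few lines. The sketch below is an illustration rather than part of memnex: `check_targets` and the `TARGETS` list are hypothetical names, and only targets that map unambiguously onto a key in the sample output are included (the identity_resolution, latency, and load targets are left out because the sample does not report them in a directly comparable form).

```python
import json

# The sample report from this page, embedded so the sketch is self-contained.
SAMPLE = json.loads("""
{
  "results": {
    "identity_resolution": {"f1": 1.0, "precision": 1.0, "recall": 1.0},
    "recall":              {"f1": 1.0, "questions": 3},
    "handoff":             {"retention": 0.67, "format_appropriate_rate": 1.0},
    "latency":             {"write_ms": {"p50": 0.19, "p95": 0.23, "p99": 0.26}},
    "conflict":            {"precision": 1.0, "recall": 0.33, "tp": 1, "fp": 0, "fn": 2},
    "load":                {"throughput_ops_s": 12092}
  }
}
""")

# (suite, metric, minimum) taken from the Target column; hypothetical structure.
TARGETS = [
    ("recall", "f1", 0.80),
    ("handoff", "retention", 0.90),
    ("conflict", "precision", 0.85),
    ("conflict", "recall", 0.80),
]


def check_targets(report: dict, targets=TARGETS) -> list:
    """Return a human-readable line for every metric that does not exceed its target."""
    failures = []
    for suite, metric, minimum in targets:
        value = report["results"][suite][metric]
        if not value > minimum:
            failures.append(f"{suite}.{metric}={value} (target >{minimum})")
    return failures


if __name__ == "__main__":
    for line in check_targets(SAMPLE):
        print(line)
```

Run against the sample report, this flags handoff retention (0.67) and conflict recall (0.33) as below target; the other checked metrics pass.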

Extending

Write a new suite in src/memnex/eval/suites/<name>.py:

async def run(mx):
    # your benchmark
    return {"suite": "my_suite", "metric": ...}

Register it in eval/runner.py.
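
As a fuller illustration of the `run(mx)` shape, here is a minimal sketch of a suite that writes a few facts and reads them back. Everything about the `mx` object is an assumption: `StubMx` and its `put`/`get` methods are hypothetical stand-ins so the example runs on its own, not memnex's actual API, and the suite name and metric keys are made up. Replace the stub calls with whatever the real object passed by the runner exposes.

```python
import asyncio


class StubMx:
    """Hypothetical stand-in for the object passed to run(); NOT memnex's API."""

    def __init__(self):
        self._facts = {}

    async def put(self, key, value):  # hypothetical method
        self._facts[key] = value

    async def get(self, key):  # hypothetical method
        return self._facts.get(key)


async def run(mx):
    """Round-trip suite: write facts, read them back, report the fraction recovered."""
    facts = {"city": "Lisbon", "plan": "pro", "language": "pt"}
    for key, value in facts.items():
        await mx.put(key, value)

    hits = 0
    for key, expected in facts.items():
        if await mx.get(key) == expected:
            hits += 1

    # Same return shape as the template above: a suite name plus metrics.
    return {"suite": "round_trip", "accuracy": hits / len(facts), "facts": len(facts)}


if __name__ == "__main__":
    print(asyncio.run(run(StubMx())))
```

Keeping the suite as a single coroutine that returns a plain dict means it can be exercised in isolation with a stub before it is registered with the runner.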