Suggestion: Add reproducible memory recall benchmarks for A-MEM evaluation

Hi A-MEM team,

Really impressive work on the Zettelkasten-inspired memory system. The NeurIPS 2025 paper is one of the most cited in the memory space this year, and the link generation + memory evolution mechanism is elegant.

One thing I noticed: while the paper shows strong results on LoCoMo and other benchmarks, there's currently no easy way for the community to reproduce and extend these evaluations with new test data. The evaluation pipeline seems tightly coupled to specific benchmark datasets.

I've been working on [MemTest](https://github.com/yubingz/memtest), which takes a different approach: instead of providing evaluation metrics, it provides **test databases** designed to stress-test different aspects of memory retrieval:

- **Six evaluation dimensions**: storage integrity, retrieval precision (5 query types, including temporal), clustering, forgetting directionality, reasoning (multi-hop chains), and deep retrieval decay
- **Two data generators**: procedural (100 to 10,000 memories with controlled randomness) and corpus-driven (feed in any text corpus; it auto-extracts facts and generates balanced queries)
- **Four Great Classical Novels dataset**: 21,793 memories + 750 queries, with rich entity relationships perfect for testing link generation quality
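
To make "controlled randomness" concrete, here is a minimal sketch of a seeded procedural generator. It is not MemTest's actual generator: the function name `generate_memories`, the sentence template, and the name lists are all made up for illustration. The only point is that a fixed seed yields the same memories and queries on every run.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical vocab for illustration only; not taken from MemTest.
PEOPLE = ["Alice", "Bob", "Chen", "Dara"]
CITIES = ["Paris", "Osaka", "Lima", "Cairo"]


def generate_memories(n: int, seed: int) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Return (memories, queries); each query is (question, expected_memory_id)."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    memories: Dict[str, str] = {}
    queries: List[Tuple[str, str]] = []
    for i in range(n):
        person = rng.choice(PEOPLE)
        city = rng.choice(CITIES)
        year = rng.randint(2000, 2024)
        memory_id = f"m{i}"
        memories[memory_id] = f"{person} visited {city} in {year}."
        queries.append((f"Where did {person} go in {year}?", memory_id))
    return memories, queries


memories, queries = generate_memories(100, seed=7)
```

With a small vocabulary like this, some person/year pairs will repeat, so a real generator would need to deduplicate or accept multiple gold answers per query.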

For A-MEM specifically, this could help:
- **Link generation evaluation**: Do generated links actually connect semantically related memories? MemTest's clustering dimension tests exactly this.
- **Memory evolution quality**: When memories evolve, does retrieval quality improve or degrade?
- **Cross-domain robustness**: How well does A-MEM perform on non-conversational data (literature, technical docs)?
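
As a rough illustration of the link generation check, one option is to score generated links against ground-truth cluster labels. This is a sketch under assumptions: `link_precision` is a name I made up, links are assumed to be exportable as pairs of memory IDs, and each memory is assumed to carry exactly one cluster label. Neither A-MEM nor MemTest is confirmed to expose data in this shape.

```python
from typing import Dict, Iterable, Tuple


def link_precision(links: Iterable[Tuple[str, str]], cluster_of: Dict[str, str]) -> float:
    """Fraction of links whose two endpoints share a ground-truth cluster."""
    link_list = list(links)
    if not link_list:
        return 0.0  # no links generated: report 0 rather than divide by zero
    same_cluster = sum(1 for a, b in link_list if cluster_of[a] == cluster_of[b])
    return same_cluster / len(link_list)


# Toy data: two clusters of two memories each (labels are illustrative).
cluster_of = {"m1": "A", "m2": "A", "m3": "B", "m4": "B"}
links = [("m1", "m2"), ("m3", "m4"), ("m1", "m3"), ("m2", "m4")]
score = link_precision(links, cluster_of)  # 2 of 4 links stay within a cluster
```

Running the same metric before and after memory evolution would give a simple signal for whether evolution improves or degrades link quality.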

We found that on Chinese classical text, TF-IDF + LLM reranking achieves 87% precision vs 9.1% for sentence-transformers, so the embedding model choice matters more than most people realize.
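
For reference, this is the kind of metric I mean by precision. The sketch below assumes precision@k averaged over queries; the function names and the toy data are illustrative and do not reproduce the numbers above.

```python
from typing import Dict, List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k retrieved memory IDs that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for memory_id in retrieved[:k] if memory_id in relevant)
    return hits / k  # divides by k even if fewer than k results came back


def mean_precision_at_k(runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> float:
    """Average precision@k over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(precision_at_k(ids, gold.get(qid, set()), k) for qid, ids in runs.items())
    return total / len(runs)


# Toy example: q1 gets 1 of 2 right, q2 gets 0 of 2 right.
runs = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
gold = {"q1": {"a", "c"}, "q2": {"x"}}
mean_p = mean_precision_at_k(runs, gold, k=2)
```

Scoring two retrievers with the same function on the same queries keeps the comparison apples to apples.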

Would adding a standardized eval module be of interest? I'd be happy to build an A-MEM adapter for MemTest.
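
To give a sense of how small the adapter could be, here is a sketch of the shape I have in mind. Everything in it is hypothetical: `MemoryBackend`, `KeywordBackend`, and `run_queries` are names invented for this example, and the toy keyword backend only stands in for A-MEM so the snippet runs on its own. A real adapter would wrap A-MEM's own add and search calls behind the same two methods.

```python
from typing import Dict, List, Protocol, Set


class MemoryBackend(Protocol):
    """Hypothetical minimal interface an adapter would implement."""

    def add(self, memory_id: str, text: str) -> None: ...
    def search(self, query: str, k: int) -> List[str]: ...


class KeywordBackend:
    """Toy stand-in for a real memory system: ranks by shared lowercase tokens."""

    def __init__(self) -> None:
        self._store: Dict[str, Set[str]] = {}

    def add(self, memory_id: str, text: str) -> None:
        self._store[memory_id] = set(text.lower().split())

    def search(self, query: str, k: int) -> List[str]:
        terms = set(query.lower().split())
        # Sort by overlap (descending), then by ID for a deterministic order.
        ranked = sorted(self._store.items(), key=lambda kv: (-len(terms & kv[1]), kv[0]))
        return [memory_id for memory_id, tokens in ranked[:k] if terms & tokens]


def run_queries(backend: MemoryBackend, memories: Dict[str, str],
                queries: Dict[str, str], k: int = 3) -> Dict[str, List[str]]:
    """Load all memories, then return the top-k memory IDs for each query."""
    for memory_id, text in memories.items():
        backend.add(memory_id, text)
    return {query_id: backend.search(text, k) for query_id, text in queries.items()}


memories = {
    "m1": "Alice moved to Paris in 2021",
    "m2": "Bob adopted a cat",
    "m3": "Alice works in Paris as a chef",
}
results = run_queries(KeywordBackend(), memories, {"q1": "Alice Paris chef"})
```

Keeping the interface to two methods means the same test databases can be run against A-MEM and against baselines without touching the evaluation code.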
