Hi A-MEM team,
Really impressive work on the Zettelkasten-inspired memory system. The NeurIPS 2025 paper is one of the most cited in the memory space this year, and the link generation + memory evolution mechanism is elegant.
One thing I noticed: while the paper shows strong results on LoCoMo and other benchmarks, there's currently no easy way for the community to reproduce and extend these evaluations with new test data. The evaluation pipeline seems tightly coupled to specific benchmark datasets.
I've been working on MemTest, which takes a different approach: instead of providing evaluation metrics, it provides test databases designed to stress-test different aspects of memory retrieval:
- 6 evaluation dimensions: storage integrity, retrieval precision (5 query types including temporal), clustering, forgetting directionality, reasoning (multi-hop chains), deep retrieval decay
- Two data generators: procedural (100-10K memories with controlled randomness) and corpus-driven (feed any text corpus, auto-extract facts + generate balanced queries)
- Four Great Classical Novels dataset: 21,793 memories + 750 queries, with rich entity relationships perfect for testing link generation quality
For A-MEM specifically, this could help:
- Link generation evaluation: Do generated links actually connect semantically related memories? MemTest's clustering dimension tests exactly this.
- Memory evolution quality: When memories evolve, does retrieval quality improve or degrade?
- Cross-domain robustness: How well does A-MEM perform on non-conversational data (literature, technical docs)?
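To make the link generation check concrete, here is a rough sketch of how generated links could be scored against ground-truth clusters. The names `link_precision` and `cluster_of` are hypothetical and are not part of the current A-MEM or MemTest API; the exact scoring in MemTest may differ.

```python
from typing import Dict, Iterable, Tuple


def link_precision(
    links: Iterable[Tuple[str, str]],
    cluster_of: Dict[str, str],
) -> float:
    """Fraction of generated links whose two endpoints share a ground-truth cluster.

    `links` is a list of (source_id, target_id) pairs produced by the memory system.
    `cluster_of` maps a memory id to its ground-truth cluster label (hypothetical format).
    """
    scored = 0
    correct = 0
    for src, dst in links:
        # Skip links that touch a memory with no ground-truth label.
        if src not in cluster_of or dst not in cluster_of:
            continue
        scored += 1
        if cluster_of[src] == cluster_of[dst]:
            correct += 1
    # Return 0.0 rather than dividing by zero when nothing could be scored.
    return correct / scored if scored else 0.0


# Toy usage: one link stays inside a cluster, one crosses clusters.
toy_links = [("a", "b"), ("a", "c")]
toy_clusters = {"a": "x", "b": "x", "c": "y"}
toy_score = link_precision(toy_links, toy_clusters)
```

Running the same score before and after a memory evolution step would give a simple signal for whether evolution improves or degrades link quality.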
We found that on Chinese classical text, TF-IDF + LLM reranking achieves 87% precision vs. 9.1% for sentence-transformers, so the choice of retrieval and embedding model matters more than most people realize.
Would adding a standardized eval module be of interest? I'd be happy to build an A-MEM adapter for MemTest.
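As a starting point for discussion, an adapter could be as thin as the sketch below. Everything here is an assumption for illustration: `MemoryAdapter`, `store`, `retrieve`, and `precision_at_k` are hypothetical names, and `KeywordAdapter` is only a toy stand-in for a real A-MEM-backed implementation.

```python
from typing import Dict, List, Protocol, Set


class MemoryAdapter(Protocol):
    """Hypothetical interface a memory system would implement for evaluation."""

    def store(self, memory_id: str, text: str) -> None: ...

    def retrieve(self, query: str, k: int) -> List[str]: ...


class KeywordAdapter:
    """Toy backend that ranks memories by shared lowercase tokens (illustration only)."""

    def __init__(self) -> None:
        self._memories: Dict[str, Set[str]] = {}

    def store(self, memory_id: str, text: str) -> None:
        self._memories[memory_id] = set(text.lower().split())

    def retrieve(self, query: str, k: int) -> List[str]:
        tokens = set(query.lower().split())
        overlap = {mid: len(tokens & words) for mid, words in self._memories.items()}
        # Keep only memories sharing at least one token; break ties by id.
        ranked = sorted((m for m in overlap if overlap[m] > 0),
                        key=lambda m: (-overlap[m], m))
        return ranked[:k]


def precision_at_k(adapter: MemoryAdapter, query: str,
                   relevant_ids: Set[str], k: int = 5) -> float:
    """Share of the top-k retrieved memory ids that are marked relevant."""
    retrieved = adapter.retrieve(query, k)[:k]
    if not retrieved:
        return 0.0
    hits = sum(1 for memory_id in retrieved if memory_id in relevant_ids)
    return hits / len(retrieved)


adapter = KeywordAdapter()
adapter.store("m1", "Lin Daiyu lives in the Grand View Garden")
adapter.store("m2", "Song Jiang leads the outlaws of Liangshan")
score = precision_at_k(adapter, "leads the outlaws", {"m2"}, k=1)
```

An A-MEM adapter would replace `KeywordAdapter` with calls into the actual memory system, while the query loop and scoring stay unchanged.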