Suggestion: Add reproducible memory recall benchmarks for A-MEM evaluation #30

@yubingz

Hi A-MEM team,

Really impressive work on the Zettelkasten-inspired memory system — the NeurIPS 2025 paper is one of the most cited in the memory space this year. The link generation + memory evolution mechanism is elegant.

One thing I noticed: while the paper shows strong results on LoCoMo and other benchmarks, there's currently no easy way for the community to reproduce and extend these evaluations with new test data. The evaluation pipeline seems tightly coupled to specific benchmark datasets.

I've been working on MemTest, which takes a different approach: instead of providing evaluation metrics, it provides test databases designed to stress-test different aspects of memory retrieval:

  • 6 evaluation dimensions: storage integrity, retrieval precision (5 query types including temporal), clustering, forgetting directionality, reasoning (multi-hop chains), deep retrieval decay
  • Two data generators: procedural (100-10K memories with controlled randomness) and corpus-driven (feed any text corpus, auto-extract facts + generate balanced queries)
  • Four Great Classical Novels dataset: 21,793 memories + 750 queries, with rich entity relationships perfect for testing link generation quality

For A-MEM specifically, this could help:

  • Link generation evaluation: Do generated links actually connect semantically related memories? MemTest's clustering dimension tests exactly this.
  • Memory evolution quality: When memories evolve, does retrieval quality improve or degrade?
  • Cross-domain robustness: How well does A-MEM perform on non-conversational data (literature, technical docs)?
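
To make the link generation check above concrete, here is a minimal sketch of one way to score it: the fraction of generated links whose two endpoints fall in the same ground-truth cluster. This is an illustration only, not MemTest's or A-MEM's actual code; `link_precision`, the memory IDs, and the cluster labels are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def link_precision(links: Iterable[Tuple[str, str]],
                   cluster_of: Dict[str, str]) -> float:
    """Fraction of links whose endpoints share a ground-truth cluster label.

    Links that touch a memory with no known label count as incorrect.
    """
    links = list(links)
    if not links:
        return 0.0
    same = 0
    for a, b in links:
        label_a = cluster_of.get(a)
        if label_a is not None and label_a == cluster_of.get(b):
            same += 1
    return same / len(links)


# Hypothetical toy data: four memories in two clusters, four generated links.
cluster_of = {"m1": "journey", "m2": "journey", "m3": "court", "m4": "court"}
links = [("m1", "m2"), ("m3", "m4"), ("m1", "m3"), ("m2", "m4")]
score = link_precision(links, cluster_of)
print(f"link precision: {score:.2f}")
```

Running the same score before and after a memory evolution step would give a rough signal for the second bullet as well, assuming the cluster labels stay fixed.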

We found that on Chinese classical text, TF-IDF + LLM reranking achieves 87% precision vs 9.1% for sentence-transformers — the embedding model choice matters more than most people realize.
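
For readers who want to try a comparison like this themselves, below is a minimal stdlib-only sketch of TF-IDF retrieval over character bigrams with a precision@k score. It is not the MemTest pipeline: the bigram tokenizer, the smoothed IDF formula, and the toy memories are my assumptions for illustration, and the LLM reranking step is left out (it would reorder the top-k candidates returned by `retrieve`).

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple


def char_bigrams(text: str) -> List[str]:
    """Character bigrams; a simple tokenizer for text without word spaces."""
    text = "".join(text.split())
    grams = [text[i:i + 2] for i in range(len(text) - 1)]
    return grams if grams else ([text] if text else [])


def build_tfidf(docs: Dict[str, str]) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Return (idf, per-document TF-IDF vectors) using a smoothed IDF."""
    counts = {doc_id: Counter(char_bigrams(t)) for doc_id, t in docs.items()}
    df: Counter = Counter()
    for c in counts.values():
        df.update(c.keys())
    n = len(docs)
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    vectors = {doc_id: {tok: tf * idf[tok] for tok, tf in c.items()}
               for doc_id, c in counts.items()}
    return idf, vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, idf: Dict[str, float],
             vectors: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Rank documents by cosine similarity to the query and keep the top k."""
    q = {tok: tf * idf.get(tok, 0.0)
         for tok, tf in Counter(char_bigrams(query)).items()}
    ranked = sorted(vectors, key=lambda d: cosine(q, vectors[d]), reverse=True)
    return ranked[:k]


def precision_at_k(retrieved: List[str], relevant: Set[str]) -> float:
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)


# Hypothetical toy memories; real runs would use the corpus-driven dataset.
docs = {"a": "孙悟空大闹天宫", "b": "林黛玉进贾府", "c": "孙悟空三打白骨精"}
idf, vectors = build_tfidf(docs)
top = retrieve("孙悟空", idf, vectors, k=2)
print(top, precision_at_k(top, {"a", "c"}))
```

Swapping `retrieve` for an embedding-based retriever while keeping `precision_at_k` fixed is the kind of side-by-side comparison the numbers above come from.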

Would adding a standardized eval module be of interest? I'd be happy to build an A-MEM adapter for MemTest.
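
To give a sense of what such an adapter could look like, here is a rough sketch of a backend interface plus a tiny evaluation loop. Everything here is hypothetical: `MemoryBackend`, `KeywordBackend`, and `run_eval` are illustrative names, and the stand-in backend does naive substring matching. A real A-MEM adapter would implement the same two methods by wrapping whatever add and search calls A-MEM actually exposes, which I have not assumed here.

```python
from typing import Dict, List, Protocol, Set


class MemoryBackend(Protocol):
    """Hypothetical minimal interface an adapter would need to satisfy."""

    def add(self, memory_id: str, text: str) -> None: ...

    def search(self, query: str, k: int) -> List[str]: ...


class KeywordBackend:
    """Trivial stand-in backend, used only to exercise the harness."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def add(self, memory_id: str, text: str) -> None:
        self._store[memory_id] = text

    def search(self, query: str, k: int) -> List[str]:
        words = query.lower().split()
        scored = [(sum(w in text.lower() for w in words), mid)
                  for mid, text in self._store.items()]
        # Highest overlap first; ties broken by memory ID for determinism.
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        return [mid for score, mid in ranked if score > 0][:k]


def run_eval(backend: MemoryBackend, memories: Dict[str, str],
             queries: Dict[str, Set[str]], k: int = 3) -> float:
    """Load memories, then return the hit rate: share of queries with at
    least one relevant memory among the top-k results."""
    for memory_id, text in memories.items():
        backend.add(memory_id, text)
    if not queries:
        return 0.0
    hits = sum(1 for query, relevant in queries.items()
               if set(backend.search(query, k)) & relevant)
    return hits / len(queries)


# Hypothetical toy test database.
memories = {"m1": "Alice moved to Paris in 2021", "m2": "Bob adopted a cat"}
queries = {"Where did Alice move": {"m1"}, "Who adopted a cat": {"m2"}}
rate = run_eval(KeywordBackend(), memories, queries)
print(f"hit rate@3: {rate:.2f}")
```

Keeping the interface this small means the same test databases can run against A-MEM and against baseline retrievers without changing the harness.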
