The NER Enrichment System automatically extracts and tags entities in episode content before the episode is ingested into the Graphiti knowledge graph. This improves search accuracy and provides structured metadata about the entities mentioned in the content.
The system recognizes and tags the following entity types:
| Entity Type | Tag | Description | Examples |
|---|---|---|---|
| Person | PER | People, including fictional characters | Kamala Harris, Gavin Newsom |
| Location | LOC | Countries, cities, states, geographic features | California, San Francisco, United States |
| Organization | ORG | Companies, agencies, institutions | FBI, Google, Stanford University |
| Date | DATE | Absolute or relative dates | January 3, 2011, 2020 |
| Time | TIME | Times smaller than a day | 3:00 PM, morning |
| Money | MONEY | Monetary values | $1 million, €500 |
| Percent | PERCENT | Percentage values | 50%, 3.5% |
| Facility | FAC | Buildings, airports, highways, bridges | Golden Gate Bridge |
| Product | PRODUCT | Objects, vehicles, foods, etc. | iPhone, Tesla Model 3 |
| Event | EVENT | Named hurricanes, battles, wars, sports events | World War II |
| Law | LAW | Named documents made into laws | Constitution, Civil Rights Act |
| Language | LANGUAGE | Any named language | English, Spanish |
| NORP | NORP | Nationalities, religious/political groups | American, Republican |
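spaCy's English models emit labels such as `PERSON` and `GPE` rather than the `PER` and `LOC` tags used in the table, so some translation step is presumably involved. The sketch below shows one way that mapping could look; the names `SPACY_TO_TAG` and `to_tag` are hypothetical and are not taken from `maintest.py`.

```python
# Hypothetical mapping from spaCy's built-in labels to the tags in the table.
# The left-hand labels are spaCy's own; the mapping itself is an assumption.
SPACY_TO_TAG = {
    "PERSON": "PER",
    "GPE": "LOC",   # countries, cities, states
    "LOC": "LOC",   # other locations, e.g. mountain ranges, bodies of water
    "ORG": "ORG",
    "DATE": "DATE",
    "TIME": "TIME",
    "MONEY": "MONEY",
    "PERCENT": "PERCENT",
    "FAC": "FAC",
    "PRODUCT": "PRODUCT",
    "EVENT": "EVENT",
    "LAW": "LAW",
    "LANGUAGE": "LANGUAGE",
    "NORP": "NORP",
}


def to_tag(spacy_label):
    """Return the table tag for a spaCy label, or None if the label is not in the table."""
    return SPACY_TO_TAG.get(spacy_label)


print(to_tag("PERSON"))    # PER
print(to_tag("GPE"))       # LOC
print(to_tag("CARDINAL"))  # None
```

Labels that spaCy produces but the table does not list (for example `CARDINAL` or `ORDINAL`) map to `None` here, so a caller can simply skip them.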
The system uses two methods for entity extraction:
- spaCy-based extraction:
  - Uses the `en_core_web_sm` model
  - High accuracy for entity recognition
  - Supports all entity types listed above
  - Install: `pip install spacy && python -m spacy download en_core_web_sm`
- Pattern-based extraction (fallback when spaCy is not available):
  - Regex-based extraction
  - Works without external dependencies
  - Limited to:
    - Capitalized names (PER/LOC)
    - Date patterns (DATE)
    - Year patterns (DATE)
  - Lower accuracy but always available
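To make the pattern-based fallback concrete, here is a minimal sketch using only the standard library. The regular expressions and the function name `extract_entities_regex` are assumptions for illustration; the patterns actually used in `maintest.py` are not shown in this document.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Hypothetical patterns for the three categories listed above.
DATE_RE = re.compile(r"\b(?:" + MONTHS + r")\s+\d{1,2},\s+\d{4}\b")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def extract_entities_regex(text):
    """Return non-overlapping entity dicts sorted by start offset."""
    entities = []
    # Most specific pattern first, so a full date wins over the year
    # (and the capitalized month name) contained in it.
    # A regex cannot tell a person from a place, so names default to PER here.
    for pattern, tag in ((DATE_RE, "DATE"), (YEAR_RE, "DATE"), (NAME_RE, "PER")):
        for m in pattern.finditer(text):
            overlaps = any(
                m.start() < e["end"] and e["start"] < m.end() for e in entities
            )
            if not overlaps:
                entities.append(
                    {"text": m.group(), "type": tag, "start": m.start(), "end": m.end()}
                )
    return sorted(entities, key=lambda e: e["start"])


sample = "Kamala Harris took office on January 3, 2011 and left in 2017."
for ent in extract_entities_regex(sample):
    print(ent["type"], ent["text"], ent["start"], ent["end"])
```

The name pattern also matches any capitalized sentence-initial word (for example "She"), which is one reason this approach is less accurate than a statistical model.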
# Original episode
episode = {
    'content': 'Kamala Harris is the Attorney General of California. She was previously the district attorney for San Francisco.',
    'type': EpisodeType.text,
    'description': 'podcast transcript'
}

# After enrichment
enriched_episode = {
    'content': '...',  # Original content
    'type': EpisodeType.text,
    'description': 'podcast transcript',
    'entities': [
        {'text': 'Kamala Harris', 'type': 'PER', 'start': 0, 'end': 13},
        {'text': 'Attorney General', 'type': 'ORG', 'start': 21, 'end': 37},
        {'text': 'California', 'type': 'LOC', 'start': 41, 'end': 51},
        {'text': 'San Francisco', 'type': 'LOC', 'start': 98, 'end': 111}
    ],
    'entities_by_type': {
        'PER': ['Kamala Harris'],
        'ORG': ['Attorney General'],
        'LOC': ['California', 'San Francisco']
    },
    'entity_count': 4
}

The enriched content is annotated with entity information:
Original: "Kamala Harris is the Attorney General of California..."
Enriched: "Kamala Harris is the Attorney General of California...
[ENTITIES: PER: Kamala Harris; ORG: Attorney General; LOC: California, San Francisco]"
This annotation helps the LLM in Graphiti better understand the entities and their types when building the knowledge graph.
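The annotation format above can be reproduced with a few lines of Python. This is a sketch under assumptions: `group_by_type` and `annotate` are hypothetical helpers, not the actual `create_enriched_content` implementation, and placing the annotation on its own line after the content is inferred from the example.

```python
def group_by_type(entities):
    """Group entity texts by tag, keeping first-seen order and dropping duplicates."""
    grouped = {}
    for ent in entities:
        names = grouped.setdefault(ent["type"], [])
        if ent["text"] not in names:
            names.append(ent["text"])
    return grouped


def annotate(content, entities):
    """Append an [ENTITIES: ...] line to the content; return it unchanged if empty."""
    grouped = group_by_type(entities)
    if not grouped:
        return content
    parts = [f"{tag}: {', '.join(names)}" for tag, names in grouped.items()]
    return f"{content}\n[ENTITIES: {'; '.join(parts)}]"


entities = [
    {"text": "Kamala Harris", "type": "PER"},
    {"text": "Attorney General", "type": "ORG"},
    {"text": "California", "type": "LOC"},
    {"text": "San Francisco", "type": "LOC"},
]
print(annotate("Kamala Harris is the Attorney General of California...", entities))
```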
from graphiti_core.nodes import EpisodeType
from maintest import EntityEnricher

# Initialize enricher
enricher = EntityEnricher(use_spacy=True)

# Enrich a single episode
episode = {
    'content': 'Kamala Harris worked in California.',
    'type': EpisodeType.text,
    'description': 'example'
}
enriched = enricher.enrich_episode(episode)

# Access extracted entities
print(f"Found {enriched['entity_count']} entities")
for entity_type, entities in enriched['entities_by_type'].items():
    print(f"{entity_type}: {', '.join(entities)}")

# Get enriched content for ingestion
enriched_content = enricher.create_enriched_content(enriched)

# Set ADD = True in maintest.py
ADD = True
# The system will automatically:
# 1. Initialize EntityEnricher
# 2. Extract entities from each episode
# 3. Display extracted entities
# 4. Add enriched content to Graphiti

- Entities are explicitly tagged with their types
- Helps distinguish between "California" (location) vs "California" (organization name)
- Better semantic understanding of content
- Each episode has structured entity information
- Can filter/search by entity type
- Enables entity-centric queries
- LLM receives entity type hints when building the graph
- More accurate node creation and relationship extraction
- Better entity disambiguation
- Extracted entities can be used to boost query matching scores
- Proper nouns are identified and weighted appropriately
- Location/person/organization queries are more accurate
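As an illustration of filtering by entity type, the sketch below selects enriched episodes from a plain Python list using the `entities_by_type` field shown earlier. The helper `episodes_mentioning` is hypothetical; it does not query Graphiti itself.

```python
def episodes_mentioning(episodes, entity_type, text=None):
    """Return episodes that contain an entity of the given type.

    If `text` is given, the episode must contain that exact entity text
    under the given type.
    """
    matches = []
    for ep in episodes:
        names = ep.get("entities_by_type", {}).get(entity_type, [])
        if names and (text is None or text in names):
            matches.append(ep)
    return matches


episodes = [
    {"description": "ep0", "entities_by_type": {"PER": ["Kamala Harris"], "LOC": ["California"]}},
    {"description": "ep1", "entities_by_type": {"ORG": ["California Pizza Kitchen"]}},
]

# Only ep0 mentions "California" as a location; ep1 mentions it inside an ORG name.
for ep in episodes_mentioning(episodes, "LOC", "California"):
    print(ep["description"])
```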
================================================================================
📝 EPISODE INGESTION WITH NER ENRICHMENT
================================================================================
[Episode 0] Processing...
✓ Extracted 4 entities:
- PER: Kamala Harris
- ORG: Attorney General
- LOC: California, San Francisco
✓ Added to graph: Freakonomics Radio 0 (text)
[Episode 1] Processing...
✓ Extracted 3 entities:
- PER: Harris
- DATE: January 3, 2011, January 3, 2017
✓ Added to graph: Freakonomics Radio 1 (text)
================================================================================
✅ INGESTION COMPLETE
================================================================================
# Install spaCy and model
pip install spacy
python -m spacy download en_core_web_sm
# Use in code
enricher = EntityEnricher(use_spacy=True)

# No installation needed
# Automatically used if spaCy is not available
enricher = EntityEnricher(use_spacy=False)

The NER system enhances the query term matching component of the multi-factor ranking:
- Entity Extraction: Identifies proper nouns and their types
- Query Analysis: Extracts entities from the query
- Type-Aware Matching: Matches query entities with content entities
- Weighted Scoring: Entities get higher weights in query matching
Example:
Query: "Who was the California Attorney General?"
Without NER:
- Simple text matching
- "California" matches as generic term
With NER:
- "California" identified as LOC
- "Attorney General" identified as ORG/role
- Nodes with matching LOC and ORG entities rank higher
- More precise results
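The type-aware, weighted matching described above might be sketched as follows. The function `match_score`, the weight of 2.0, and the node layout are all assumptions chosen for illustration; the actual ranking formula is not given in this document.

```python
def match_score(query_terms, query_entities, node, entity_weight=2.0):
    """Score a node: 1.0 per plain term hit, plus a bonus per type-matched entity.

    query_entities is a list of (text, tag) pairs extracted from the query.
    """
    content = node["content"].lower()
    score = sum(1.0 for term in query_terms if term.lower() in content)
    by_type = node.get("entities_by_type", {})
    for text, tag in query_entities:
        if text in by_type.get(tag, []):
            score += entity_weight
    return score


query_terms = ["california", "attorney", "general"]
query_entities = [("California", "LOC"), ("Attorney General", "ORG")]

node_a = {
    "content": "Kamala Harris is the Attorney General of California.",
    "entities_by_type": {"PER": ["Kamala Harris"], "ORG": ["Attorney General"], "LOC": ["California"]},
}
node_b = {
    "content": "California Pizza Kitchen opened a new restaurant.",
    "entities_by_type": {"ORG": ["California Pizza Kitchen"]},
}

print(match_score(query_terms, query_entities, node_a))  # 7.0
print(match_score(query_terms, query_entities, node_b))  # 1.0
```

With plain text matching alone, both nodes would match "California"; the entity bonus is what separates the location from the organization name.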
Limitations of each method:
- spaCy:
  - Requires installation and model download (~12 MB)
  - Slightly slower processing
  - May require more memory
- Pattern-based:
  - Limited entity types (PER, LOC, DATE only)
  - Lower accuracy
  - May miss complex entity patterns
  - No context-aware disambiguation
Potential improvements:
- Custom Entity Types: Add domain-specific entity types (e.g., "POLITICAL_POSITION", "GOVERNMENT_AGENCY")
- Entity Linking: Link entities to knowledge bases (Wikipedia, Wikidata)
- Coreference Resolution: Resolve pronouns to entities ("She" → "Kamala Harris")
- Relationship Extraction: Extract relationships between entities
- Multi-language Support: Support for non-English content
- Entity Confidence Scores: Provide confidence scores for each entity
- Entity Normalization: Normalize entity mentions ("AG" → "Attorney General")
Error: Can't find model 'en_core_web_sm'
Solution: python -m spacy download en_core_web_sm
Problem: High memory usage with spaCy
Solution: Use pattern-based method or process episodes in smaller batches
Problem: Pattern-based method missing entities
Solution: Install spaCy for better accuracy
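One way to handle the missing-model error without crashing is to attempt the load and fall back to the pattern-based method on failure. This is a sketch, not the behavior of `EntityEnricher`; `load_spacy_model` is a hypothetical helper. It relies on `spacy.load` raising `OSError` when the model is not installed.

```python
def load_spacy_model(name="en_core_web_sm"):
    """Return a loaded spaCy pipeline, or None if spaCy or the model is missing."""
    try:
        import spacy  # ImportError if spaCy itself is not installed
        return spacy.load(name)  # OSError if the model has not been downloaded
    except (ImportError, OSError):
        return None


nlp = load_spacy_model()
method = "spacy" if nlp is not None else "pattern"
print(f"Using {method}-based extraction")
```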
Approximate performance of each method:
- spaCy:
  - Speed: ~100-500 ms per episode (depends on length)
  - Accuracy: ~90-95% for common entity types
  - Memory: ~200-500 MB
- Pattern-based:
  - Speed: ~10-50 ms per episode
  - Accuracy: ~60-70% for simple patterns
  - Memory: ~10-20 MB
- Use spaCy for Production: Higher accuracy and more entity types
- Batch Processing: Process episodes in batches for better performance
- Cache Results: Cache enriched episodes to avoid re-processing
- Monitor Entity Counts: Track entity extraction metrics
- Validate Entities: Spot-check extracted entities for quality
- Update Models: Keep spaCy models updated for best performance
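The caching recommendation can be sketched as a small wrapper keyed by a hash of the episode content. `EnrichmentCache` and the stand-in `fake_enrich` function are hypothetical; in practice the wrapped callable would be something like `enricher.enrich_episode`.

```python
import hashlib


class EnrichmentCache:
    """In-memory cache that enriches each distinct episode content only once."""

    def __init__(self, enrich_fn):
        self._enrich_fn = enrich_fn
        self._cache = {}
        self.misses = 0  # number of times enrich_fn was actually called

    def enrich(self, episode):
        key = hashlib.sha256(episode["content"].encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._enrich_fn(episode)
        return self._cache[key]


def fake_enrich(episode):
    # Stand-in for a real enricher; returns the episode with an empty entity list.
    return {**episode, "entities": [], "entity_count": 0}


cache = EnrichmentCache(fake_enrich)
episode = {"content": "Kamala Harris worked in California."}
first = cache.enrich(episode)
second = cache.enrich(episode)  # served from the cache
print(cache.misses)  # 1
```

The cache lives only in memory, so it helps within a single run; persisting results across runs would need a file or database behind the same interface.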