Open
Description
Hi glossAPI maintainers,
I’d like to contribute to the text data pipeline standards by adding a modern workflow for automated text preprocessing and AI enhancements.
The proposed pipeline covers:
- Text Normalization (Unicode, punctuation, whitespace cleanup)
- Heuristic Filtering (low-quality text removal, repetition removal)
- Deduplication (exact and near-duplicate detection)
- Sharding for scalable processing
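To make the four stages above concrete, here is a minimal sketch using only the Python standard library. All function names, thresholds, and the shingle-based near-duplicate check are illustrative assumptions on my side, not existing glossAPI code; a production version would likely replace the quadratic near-duplicate comparison with MinHash/LSH.

```python
import hashlib
import re
import unicodedata

# Illustrative subset of punctuation to unify (curly quotes to ASCII quotes).
_PUNCT_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize(text: str) -> str:
    """NFKC-normalize, unify curly quotes, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).translate(_PUNCT_MAP)
    return re.sub(r"\s+", " ", text).strip()


def passes_heuristics(text: str, min_words: int = 5, max_repeat_ratio: float = 0.5) -> bool:
    """Reject very short documents and documents dominated by one repeated word."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams used for near-duplicate comparison."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold: float = 0.8):
    """Drop exact duplicates (SHA-256) and near duplicates (shingle Jaccard).

    The pairwise comparison is O(n^2); it is only meant to show the idea.
    """
    seen_hashes, kept, kept_shingles = set(), [], []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        sh = shingles(doc)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept


def shard_id(doc: str, num_shards: int) -> int:
    """Assign a document to a shard deterministically from its content hash."""
    digest = hashlib.sha256(doc.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def run_pipeline(raw_docs, num_shards: int = 4) -> dict:
    """Normalize, filter, deduplicate, then group documents by shard."""
    cleaned = [normalize(d) for d in raw_docs]
    filtered = [d for d in cleaned if passes_heuristics(d)]
    shards = {i: [] for i in range(num_shards)}
    for doc in deduplicate(filtered):
        shards[shard_id(doc, num_shards)].append(doc)
    return shards
```

Hashing the content (rather than using Python's built-in `hash`) keeps shard assignment stable across runs and machines, which matters for reproducibility.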
Additionally, optional AI modules could be integrated for:
- PII Detection & Redaction
- Text Quality Scoring
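As a rough illustration of what these optional modules could look like before any model is involved, the sketch below uses regex-based redaction and a toy quality score. The patterns, placeholder tokens, and the scoring formula are assumptions chosen for illustration; real PII detection would need far broader coverage (names, addresses, IDs) and probably a trained model.

```python
import re

# Deliberately simple patterns; they will miss many real-world formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str):
    """Replace e-mail addresses and phone-like numbers with placeholder tokens.

    Returns the redacted text and a count of replacements per category.
    """
    text, n_email = EMAIL_RE.subn("[EMAIL]", text)
    text, n_phone = PHONE_RE.subn("[PHONE]", text)
    return text, {"email": n_email, "phone": n_phone}


def quality_score(text: str) -> float:
    """Toy score in [0, 1]: alphabetic-character ratio times lexical diversity."""
    if not text.strip():
        return 0.0
    non_space = [c for c in text if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    words = text.lower().split()
    diversity = len(set(words)) / len(words)
    return alpha_ratio * diversity
```

Both functions are pure and stateless, so they could be swapped for model-backed implementations later without changing the surrounding pipeline.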
The pipeline should also integrate with a backend API (FastAPI / Node.js) to allow real-time ingestion, processing, and AI-based enhancements.
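One way to keep the backend choice open is to put the processing behind a framework-agnostic handler that a FastAPI route (or a Node.js service calling into Python) could wrap. The sketch below is only an assumption about how that boundary might look: the payload shape, `IngestResult`, `handle_ingest`, and the `min_chars` cutoff are all hypothetical names and values, not part of glossAPI.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Callable, List

# A stage is any function that takes text and returns transformed text.
Stage = Callable[[str], str]


def normalize(text: str) -> str:
    """Example stage: NFKC normalization plus whitespace cleanup."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text)).strip()


@dataclass
class IngestResult:
    accepted: List[str] = field(default_factory=list)
    rejected: int = 0


def handle_ingest(payload: dict, stages: List[Stage], min_chars: int = 20) -> IngestResult:
    """Validate an ingestion payload and run each document through the stages.

    Expects a payload of the (hypothetical) form {"documents": ["...", "..."]}.
    """
    documents = payload.get("documents")
    if not isinstance(documents, list) or not all(isinstance(d, str) for d in documents):
        raise ValueError("payload must contain a 'documents' list of strings")

    result = IngestResult()
    for doc in documents:
        for stage in stages:
            doc = stage(doc)
        # Documents that end up too short after processing are rejected.
        if len(doc) >= min_chars:
            result.accepted.append(doc)
        else:
            result.rejected += 1
    return result
```

A web route would then only parse the request body, call `handle_ingest(body, stages=[normalize, ...])`, and serialize the result, which keeps the pipeline testable without starting a server.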
Goals:
- Create a reproducible, modular, and automated pipeline
- Improve data quality and consistency across large-scale text datasets
I’ve also prepared a simple schematic diagram of the workflow, which I can attach later.
Looking forward to your feedback and suggestions. I’m happy to submit a Pull Request once the idea is approved.
Thank you!