A production-inspired hybrid log classification system that intelligently combines Rule-Based Logic, Traditional ML, and Large Language Models (LLMs) to handle logs of varying complexity and structure.
This project implements a three-layer hybrid architecture to classify system logs efficiently:
- Regex Layer → Fast & deterministic rule-based filtering
- Sentence Transformer + Logistic Regression (BERT Layer) → Structured ML-based semantic classification
- LLM Layer (Groq + Llama 3.3) → Intelligent reasoning for complex and legacy log patterns
The system dynamically routes logs based on their source and complexity.
- Handles predictable and well-structured patterns.
- Ideal for:
User ActionSystem Notification
- Fast and explainable.
- Used as the first classification layer for non-legacy systems.
- Uses
all-MiniLM-L6-v2for embedding generation. - Applies a pre-trained Logistic Regression classifier.
- Handles complex patterns when labeled training data exists.
- Used as fallback when Regex does not classify.
- Used for ambiguous or poorly structured logs.
- Specifically routed for
LegacyCRMsource logs. - Classifies into:
Workflow ErrorDeprecation WarningUnclassified
This enables intelligent reasoning where rule-based and traditional ML fall short.
├── models/
│ └── log_classifier.joblib
├── resources/
│ ├── output.csv
│ └── test.csv
├── training/
│ ├── dataset/
│ │ └── synthetic_logs.csv
│ └── training.ipynb
├── .gitignore
├── classify.py
├── processor_bert.py
├── processor_llm.py
├── processor_regex.py
├── requirements.txt
└── server.py
Make sure Python is installed (recommended: Python 3.11).
pip install -r requirements.txt.env file configured with:
GROQ_API_KEY=your_api_key_here
python -m uvicorn server:app --reloadOnce running, access:
- API Base:
http://127.0.0.1:8000/ - Swagger Docs:
http://127.0.0.1:8000/docs
Upload a CSV file to the /classify/ endpoint.
| source | log_message |
|---|---|
| ModernCRM | ... |
| LegacyCRM | ... |
- LegacyCRM → LLM
- Others → Regex → (if no match) → BERT
The system returns a processed CSV file containing:
sourcelog_messagetarget_label
| source | log_message | target_label |
|---|---|---|
| ModernCRM | ... | User Action |
| LegacyCRM | ... | Workflow Error |
- ⚡ Hybrid Pipeline: Integrates Deterministic (Regex), ML (BERT), and Generative AI (LLM)
- 🧩 Source-based routing: Logic specifically handles legacy vs. modern system logs
- 🧠 Semantic fallback: Uses embedding-based classification when rules fail
- 🤖 LLM reasoning: Leverages Groq for complex edge cases
- 🚀 API-ready: Built with FastAPI for easy integration
- Do NOT commit:
.env,venv/, or large model files - Compatibility: Ensure
scikit-learnversion matches the training version when loading models - Connectivity: LLM calls require internet access and a valid Groq API key
- Confidence scoring aggregation across layers
- Batch LLM inference
- Model caching
- Monitoring & logging layer
- Dockerized deployment
Ninad Amane | LinkedIn | ninadamane@gmail.com