English | 简体中文
A highly automated, low-manual-intervention tool for generating bilingual anime subtitles (ASS). This project integrates advanced Voice Activity Detection (fsmn-vad), Audio Event Classification (SenseVoice), Speech Recognition (Stable-Whisper), and LLM Translation with Correction (DeepSeek API). Through a four-stage "Phonetic Alchemy" pipeline, it batch-converts raw video files into styled, high-quality Chinese-Japanese bilingual subtitles at high speed.
- Python 3.10+
- GPU (Recommended):
- NVIDIA GPU (RTX 3060 Ti or equivalent recommended) + CUDA 11.8+
- AMD GPU (RX 6000/7000 series) + ROCm 6.x (Linux only)
- Apple Silicon Mac (M1/M2/M3 series)
- CPU Mode: Supported without GPU, but slower (~10-20x longer)
- ffmpeg (Required for audio extraction)
| Device Type | RAM | VRAM | Whisper Model |
|---|---|---|---|
| NVIDIA GPU (10GB+) | 3 GB | 10 GB+ | large-v3 |
| NVIDIA GPU (6-10GB) | 3 GB | 6-10 GB | medium |
| AMD GPU (ROCm) | 3 GB | 8 GB+ | medium |
| Apple Silicon | 8 GB+ | Shared | medium |
| CPU | 8 GB+ | - | small |
Actual consumption may vary based on video length and concurrency settings.
# 1. Clone the project
git clone https://github.com/misakayyds/AnimeTranslator.git
cd AnimeTranslator
# 2. Create virtual environment
python -m venv env
# 3. Activate virtual environment
# Windows:
env\Scripts\activate
# Linux/Mac:
source env/bin/activate
# 4. Install PyTorch (Choose based on your device)
# NVIDIA GPU (CUDA 12.8):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128
# NVIDIA GPU (CUDA 12.1):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
# NVIDIA GPU (CUDA 11.8):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
# AMD GPU (ROCm 6.2, Linux only):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
# Apple Silicon Mac (MPS):
pip install torch torchaudio
# CPU only (no GPU):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
# 5. Install AnimeTranslator
pip install -e .
# 6. Configure API Key
copy .env.example .env
# Edit .env file and fill in your DeepSeek API KeyThis project requires ffmpeg for audio extraction.
Windows:
# Option 1: Using winget
winget install ffmpeg
# Option 2: Using Chocolatey
choco install ffmpeg
# Option 3: Manual download
# Download from https://www.gyan.dev/ffmpeg/builds/, extract and add bin directory to PATHLinux (Ubuntu/Debian):
sudo apt update && sudo apt install ffmpegmacOS:
brew install ffmpegVerify installation:
ffmpeg -version- Visit DeepSeek Platform
- Register and login
- Create a new API Key on the API Keys page
- Fill the API Key in
DEEPSEEK_API_KEYin the.envfile
# Ensure you're in the project directory with virtual environment activated
animetranslator webui
# Or using Python module
python -m animetranslator webui
# Custom port
animetranslator webui --port 8080
# Generate public share link
animetranslator webui --shareAfter startup, open http://127.0.0.1:7860 in your browser
# Permanent monitoring mode
animetranslator watch
# Auto-shutdown mode (shutdown 60 seconds after completion)
animetranslator watch --shutdown- Place video files in the
Input/directory (supports nested subfolders)- Supported formats:
.mkv,.mp4,.avi,.mov,.flv,.wmv,.webm
- Supported formats:
- Run
animetranslator webuioranimetranslator watch - Wait for processing to complete
- Get bilingual
.asssubtitle files in theOutput/directory
Raw Video File
│
▼
📡 Stage 1: Resonant Perception (SenseVoice)
│ fsmn-vad precision ranging → start/end time for each audio segment
│ SenseVoice label recognition → BGM / Speech / MUSIC classification
│
▼
🚧 Stage 2: Impurity Separation (Filtering)
│ OP/ED continuity detection (85~95s continuous music → discard whole block)
│ Pure music filtering (BGM/MUSIC label + no text → discard)
│
▼
🎯 Stage 3: Essence Extraction (Whisper)
│ Tensor slicing (in-memory, zero disk IO)
│ Feed each segment to Stable-Whisper Large-v3
│ Smart sentence breaking + deduplication + breath filler filtering
│
▼
🔬 Stage 4: Transmutation (Quality Check + Translation)
│ no_speech_prob > 0.7 → discard (ambient noise disguise)
│ compression_ratio > 2.8 → discard (repetition hallucination)
│
▼
✅ Output _alignment.json draft → DeepSeek translation → bilingual ASS subtitle
-
SenseVoice Smart Label Classification: Replaces traditional Demucs vocal separation, directly labeling BGM/Speech/MUSIC on original audio, preserving full context. GPU time reduced from minutes to seconds.
-
OP/ED Impurity Auto-Detection: Anime-specialized feature that automatically detects continuous music segments (85~95 seconds) within the first and last 5 minutes, discarding entire blocks to prevent Whisper from transcribing lyrics.
-
In-Memory Tensor Slicing Zero IO: Audio waveform loaded to VRAM once, all subsequent segments extracted via Tensor Slicing with 0.3s padding on both ends to prevent audio clipping, avoiding repeated ffmpeg encoding/decoding.
-
Four-Layer Hallucination Prevention: SenseVoice labels → OP/ED filtering → Whisper no_speech_prob / compression_ratio quality check → breath filler regex filtering, intercepting at every layer.
-
DeepSeek Context-Aware Translation:
- Infers and corrects homophones, character names, chuunibyou coined words and spells based on pronunciation
- Contextually translates onomatopoeia based on actions, emotions, and even physiological reactions, preserving visual gags
- Intelligently filters out common Whisper hallucinations like "Thanks for watching" and "Please subscribe"
-
Smart Hardware & Cache Management:
- Three engines resident in VRAM: fsmn-vad + SenseVoice + Stable-Whisper large model, no need to reload per episode
- Periodic VRAM cleanup strategy, avoiding OOM ghost bugs
-
Structured Logging System:
- Unified logging with both console and file output
- Log files saved in
Output/logs/directory, format:animetranslator_YYYYMMDD_HHMMSS.log - Timestamped log format for easier troubleshooting
-
Automatic Configuration Validation:
- Automatic validity check of configuration items at startup
- API Key format validation, parameter range checking
- Clear problem prompts, reducing configuration errors
Edit the .env file in the project root:
# DeepSeek API key configuration (required)
DEEPSEEK_API_KEY=sk-xxxxxx
# Device configuration (optional: auto/cuda/mps/cpu, default auto)
# auto - Auto-detect, priority: CUDA > MPS > CPU
# cuda - Force NVIDIA GPU
# mps - Force Apple Silicon GPU
# cpu - Force CPU (slower)
DEVICE=auto
# Whisper model size (optional, auto-selected based on device)
# large-v3 / medium / small
# WHISPER_MODEL=large-v3
# Async concurrent LLM translation tasks (recommended 3-5)
MAX_API_WORKERS=3
# Engine VRAM residency config (set how many episodes before clearing model VRAM, recommended 1-5)
ALIGNMENT_BATCH_SIZE=3
# Quality check thresholds (usually no need to modify)
NO_SPEECH_PROB_THRESHOLD=0.7
COMPRESSION_RATIO_THRESHOLD=2.8AnimeTranslator/
├── pyproject.toml # Package config
├── README.md
├── LICENSE
├── .env.example # Config template
├── .gitignore
├── requirements.txt # Dependencies reference
├── src/
│ └── animetranslator/ # Core code
│ ├── __init__.py
│ ├── __main__.py
│ ├── cli.py # CLI entry
│ ├── config.py # Config management
│ ├── logger.py # Logging module
│ ├── device.py # Device management
│ ├── alignment.py # Alignment engine
│ ├── translation.py # Translation module
│ ├── watcher.py # Background watcher
│ └── webui.py # WebUI interface
├── tests/
├── Input/ # Input directory (videos to process)
├── Output/ # Output directory (generated subtitles + logs)
│ └── logs/ # Log files directory
└── env/ # Virtual environment (user created)
During the auto-shutdown countdown, press Win+R and enter shutdown /a to cancel.
On first run, FunASR automatically downloads SenseVoice-Small and fsmn-vad models from ModelScope (~1-2GB), please wait patiently.
Multiple common video formats are supported: .mkv, .mp4, .avi, .mov, .flv, .wmv, .webm
- WebUI mode: View real-time logs in browser
- Watch mode: View output logs in terminal
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Code formatting
black src/
# Code linting
ruff check src/See docs/ROADMAP.md for detailed development roadmap
- Lightweight model options: Support Whisper Medium/Small, reducing VRAM requirement to 4GB
- Alternative ASR engines: Integrate WhisperX, Moonshine and other alternatives
- Multi-language support: Extend to Korean, English and other language subtitle generation
- OpenAI GPT: Support GPT-4o, GPT-4-turbo models
- Claude: Support Anthropic Claude series
- Google Gemini: Support Google's latest LLM
- Qwen: Support Alibaba Cloud Qwen API
- GLM: Support Zhipu ChatGLM API
- OpenAI-compatible interfaces: Support any OpenAI-compatible API service
- Ollama integration: Support running local models via Ollama
- llama.cpp: Support GGUF format quantized models
- vLLM: Support high-performance local inference
- Recommended models: Qwen2.5, Llama3, GLM-4 and other open-source model adaptations
- Offline mode: Complete offline operation solution
- NVIDIA GPU (CUDA): RTX series, CUDA 11.8+
- AMD GPU (ROCm): RX 6000/7000 series, Linux only, requires ROCm driver and PyTorch ROCm version
- Apple Silicon (MPS): M1/M2/M3 series GPU acceleration
- CPU-only mode: Support for users without GPU, trading speed for accessibility
- Auto device detection: Automatically select optimal device and model based on hardware
- WebUI internationalization (i18n)
- Docker one-click deployment
- More video format support (MP4, AVI, MOV, FLV, etc.)
- Subtitle editor: Real-time preview and manual correction
- Batch processing progress tracking and resume optimization
Contributions via Issues or PRs are welcome!
This project is licensed under the MIT License. See LICENSE file for details.
- FunASR - SenseVoice speech recognition
- Stable-Whisper - Stable Whisper
- DeepSeek - LLM translation
- Gradio - WebUI framework