A comprehensive, modular transcription pipeline featuring AI-enhanced accuracy, flexible output formats, and professional-grade subtitle generation.
- 🔊 Advanced Audio Processing - Noise reduction and optimization for better transcription
- 🧠 Multiple Whisper Models - From fast (tiny) to most accurate (large)
- 🤖 AI Context Correction - GPT-4 powered grammar and homophone fixes
- 🌐 Multi-language Translation - Translate transcripts while preserving timing
- 🎭 Speaker Diarization - Automatic speaker identification
- 📄 Multiple Output Formats - FCPXML, ITT, Markdown, JSON
- 🔧 Modular Architecture - Use individual modules or the unified pipeline
- 💡 User-Friendly Interface - Clear explanations and smart defaults
# Clone or download the project
cd transcriber-v2.0
# Run the installation script
./install.sh
# Or install manually:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Copy the template and add your API keys
cp .env.template .env
# Edit .env and add your API keys:
# - OpenAI API key for AI features (context correction, translation)
# - Hugging Face token for advanced speaker diarization

OpenAI API Key (optional)
- Purpose: Enables AI context correction and multi-language translation
- Get your key: OpenAI API Keys
- Add to .env:
OPENAI_API_KEY=sk-your_key_here
- Without it: Basic transcription still works, but no AI enhancements

Hugging Face Token (optional)
- Purpose: Enables advanced AI-powered speaker diarization
- Get your token: Hugging Face Tokens
- Add to .env:
HUGGINGFACE_TOKEN=hf_your_token_here
- Without it: Falls back to simple timing-based speaker detection
- Note: Required to access the pyannote/speaker-diarization-3.1 model
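Both keys are optional, and the pipeline degrades gracefully when they are missing. A minimal sketch of that feature gating (the real check in v2.0 may differ; with python-dotenv, `load_dotenv()` would populate `os.environ` from the `.env` file first):

```python
import os

def feature_flags(env):
    """Decide which optional features are available from environment variables."""
    return {
        "ai_features": bool(env.get("OPENAI_API_KEY")),             # context correction, translation
        "advanced_diarization": bool(env.get("HUGGINGFACE_TOKEN")),  # pyannote models
    }

flags = feature_flags(os.environ)
if not flags["ai_features"]:
    print("No OPENAI_API_KEY found: running basic transcription only.")
if not flags["advanced_diarization"]:
    print("No HUGGINGFACE_TOKEN found: using timing-based speaker detection.")
```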
# Start the interactive pipeline
python run_transcription_pipeline_v2.py
# Follow the guided interface to:
# 1. Choose output format (FCPXML, ITT, Markdown, JSON)
# 2. Select input files
# 3. Configure processing options
# 4. Generate results

- Python 3.8+ (Python 3.9+ recommended)
- Disk Space: ~3-4GB for full installation with models
- RAM: 4GB minimum, 8GB+ recommended for large models
- FFmpeg (for audio preprocessing - highly recommended)
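The guided steps in the quick start above amount to a short prompt loop with validated choices and smart defaults. A minimal, hypothetical sketch (the actual interface in run_transcription_pipeline_v2.py may differ):

```python
FORMATS = ["fcpxml", "itt", "markdown", "json"]
MODELS = ["tiny", "base", "small", "medium", "large"]

def choose(prompt, options, default, answer=None):
    """Return a validated choice, falling back to the default on bad input."""
    raw = answer if answer is not None else input(f"{prompt} {options} [{default}]: ")
    raw = raw.strip().lower()
    return raw if raw in options else default

if __name__ == "__main__":
    fmt = choose("Output format", FORMATS, "markdown")
    model = choose("Whisper model", MODELS, "base")
```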
The following packages are automatically installed via pip install -r requirements.txt:
- torch (~500MB+) - PyTorch for ML model support
- torchaudio - Audio processing for PyTorch
- openai-whisper - Speech transcription models
- pyannote.audio - AI speaker diarization (requires HF token)
- whisperx - Enhanced Whisper with word-level alignment
- librosa - Advanced audio analysis (optional)
- numpy, scipy - Mathematical operations
- openai - OpenAI API client for AI features
- python-dotenv - Environment variable management
- huggingface_hub - Hugging Face model access
- lxml - XML processing (FCPXML, ITT generation)
- tqdm - Progress bars
- ffprobe-python - Video metadata extraction
- FFmpeg - Audio/video processing (install separately)
- macOS: brew install ffmpeg
- Ubuntu: sudo apt-get install ffmpeg
- Windows: Download from ffmpeg.org
- OpenAI API Key (optional - for AI context correction and translation)
- Hugging Face Token (optional - for advanced speaker diarization)
The v2.0 system is built on modular principles:
- Modularity: Each script does one thing well
- User Choice: Maximum flexibility at each decision point
- Transparency: Clear cost estimates and feature explanations
- Quality: Professional-grade outputs for video production workflows
Each module can be used independently:
# 1. Basic transcription
python scripts/transcribe.py --input-dir input --output-dir output --model base --preprocessing
# 2. Speaker diarization
python scripts/diarize_transcript.py --input-dir transcripts --output-dir diarized
# 3. AI context correction
python scripts/context_correct_transcript.py --input-dir transcripts --output-dir corrected --api-key $OPENAI_API_KEY
# 4. Translation
python scripts/translate_transcript.py --transcript-dir transcripts --output-dir translated --target-language Spanish --api-key $OPENAI_API_KEY
# 5. Generate outputs
python scripts/generate_fcpxml.py --input-dir transcripts --output-dir fcpxml
python scripts/generate_itt.py --input-dir transcripts --output-dir itt
python scripts/generate_markdown.py --input-dir transcripts --output-dir markdown --include-timecodes --include-speakers
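The same modules can be chained from Python instead of the shell. A sketch, assuming each script exits non-zero on failure (the directory names mirror the CLI examples above):

```python
import subprocess
import sys

STEPS = [
    [sys.executable, "scripts/transcribe.py", "--input-dir", "input",
     "--output-dir", "output", "--model", "base", "--preprocessing"],
    [sys.executable, "scripts/diarize_transcript.py", "--input-dir", "output",
     "--output-dir", "diarized"],
    [sys.executable, "scripts/generate_markdown.py", "--input-dir", "diarized",
     "--output-dir", "markdown", "--include-timecodes", "--include-speakers"],
]

def run_steps(steps, runner=subprocess.run):
    """Run each module in order; check=True stops the chain on the first failure."""
    for cmd in steps:
        runner(cmd, check=True)
```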
Transcription (Whisper)
- Script: transcribe.py
- Function: Convert audio to text with timestamps.
- Options: Select model size (tiny, base, small, medium, large).

Diarization
- Script: diarize_transcript.py
- Function: Add speaker labels to the transcript using AI models.
- Options: Advanced AI-based (requires HF token) or simple timing-based.
- Models: pyannote.audio for professional speaker identification.

Context Correction (AI)
- Script: context_correct_transcript.py
- Function: Automatic homophone/grammar correction.
- Options: Enable/disable context correction.

Translation (OpenAI)
- Script: translate_transcript.py
- Function: Translate text to the selected language.
- Options: Enable/disable translation, select target language.

Subtitle Generation (FCPXML, ITT)
- Scripts: generate_fcpxml.py, generate_itt.py
- Function: Convert transcript to subtitle files.
- Options: Include timecodes, speaker names.

Markdown Generation
- Script: generate_markdown.py
- Function: Output readable transcripts.
- Options: Include timecodes, speaker names.

JSON Export
- Script: export_json.py
- Function: Output raw transcription data.
- Options: Enable/disable preprocessing.
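The modules pass transcripts between each other as JSON. The exact schema isn't documented here, but openai-whisper's output is a dict with a `segments` list of `{id, start, end, text}` entries; the illustration below adds an assumed `speaker` field for diarized output:

```python
import json

# Whisper-style transcript; the "speaker" field is an assumption about
# what diarize_transcript.py adds, not a documented schema.
transcript = {
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.4, "text": " Hello there.", "speaker": "SPEAKER_00"},
        {"id": 1, "start": 2.4, "end": 5.1, "text": " Hi, welcome back.", "speaker": "SPEAKER_01"},
    ],
}

def speakers(t):
    """Unique speakers in order of first appearance."""
    seen = []
    for seg in t["segments"]:
        if seg.get("speaker") and seg["speaker"] not in seen:
            seen.append(seg["speaker"])
    return seen
```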
Primary Test Video: sample_video.mp4 (example test file)
- Size: Variable depending on your test file
- Should contain clear speech suitable for transcription, translation, and diarization testing
- Use the same test file for ALL test scenarios to ensure consistent comparison across routes
fcpxml/
├── basic_english/ # Raw transcription → FCPXML
├── corrected_english/ # Transcription → Context Correct → FCPXML
├── translated_spanish/ # Transcription → Translate(Spanish) → FCPXML
├── corrected_translated_mandarin/ # Transcription → Context Correct → Translate(Mandarin) → FCPXML
└── model_comparison/ # Same file with different Whisper models
itt/
├── basic_english/ # Raw transcription → ITT
├── corrected_english/ # Transcription → Context Correct → ITT
├── translated_french/ # Transcription → Translate(French) → ITT
└── multilanguage_comparison/ # Same content in multiple target languages
markdown/
├── basic_transcript/ # Raw transcription → Markdown (no speakers, no timecodes)
├── with_timecodes/ # Raw transcription → Markdown (timecodes only)
├── with_speakers/ # Transcription → Diarize → Markdown (speakers, no timecodes)
├── full_featured/ # Transcription → Diarize → Markdown (speakers + timecodes)
├── corrected_content/ # Transcription → Context Correct → Markdown
├── translated_content/ # Transcription → Translate → Markdown
└── complete_pipeline/ # Transcription → Diarize → Context Correct → Translate → Markdown
json/
├── raw_whisper_output/ # Pure Whisper transcription
├── diarized_content/ # Transcription → Diarize → JSON
├── context_corrected/ # Transcription → Context Correct → JSON
├── translated_versions/ # Transcription → Translate → JSON
└── full_processing/ # All processing steps → JSON
- Extract transcribe.py from existing transcription logic
- Extract context_correct_transcript.py from context correction functions
- Copy and adapt translate_transcript.py (already modular)
- Extract generate_fcpxml.py from FCPXML generation logic
- Extract generate_itt.py from ITT generation logic
- Create new generate_markdown.py with flexible options
- Create diarize_transcript.py from existing diarization logic
- Create run_transcription_pipeline_v2.py to orchestrate the modules
- Implement a decision tree covering all user choices
- Preserve existing features: cost estimation, video metadata, JSON reuse
- Add new features: flexible markdown options, module selection
- Create test runner script
- Execute all test scenarios automatically
- Generate outputs in organized directory structure
- Create validation reports
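One way to sketch that test runner: map each scenario name from the directory layout above to its processing steps, then expand the map into output paths (scenario names and step lists here are illustrative, not the final set):

```python
SCENARIOS = {
    "markdown/basic_transcript": ["transcribe"],
    "markdown/with_speakers": ["transcribe", "diarize"],
    "fcpxml/corrected_english": ["transcribe", "context_correct", "fcpxml"],
}

def plan(scenarios, root="output"):
    """Return (output_dir, steps) pairs; the real runner would also mkdir and execute."""
    return [(f"{root}/{name}", steps) for name, steps in scenarios.items()]
```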
transcriber-v2.0/
├── README.md # Project documentation
├── LICENSE # MIT License
├── requirements.txt # Python dependencies
├── setup.py # Package setup
├── install.sh # Installation script
├── .env.template # Environment template
├── .gitignore # Git ignore rules
├── run_transcription_pipeline_v2.py # Unified pipeline interface
├── scripts/ # Modular scripts
│ ├── transcribe.py # Whisper transcription
│ ├── diarize_transcript.py # Speaker identification
│ ├── context_correct_transcript.py # AI grammar correction
│ ├── translate_transcript.py # Multi-language translation
│ ├── generate_fcpxml.py # Final Cut Pro XML
│ ├── generate_itt.py # ITT subtitles
│ ├── generate_markdown.py # Readable transcripts
│ └── run_all_tests.py # Automated testing
├── input/ # Input media files
├── output/ # Generated results
├── tests/ # Unit tests
├── docs/ # Documentation
└── sample_inputs/ # Sample files
- Noise reduction using bandpass filtering
- 16kHz resampling for optimal Whisper performance
- Automatic fallback if FFmpeg unavailable
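The preprocessing step can be sketched as building an FFmpeg command, returning None when FFmpeg is absent so the caller falls back to the raw file. The filter values are illustrative assumptions, not the pipeline's actual settings:

```python
import shutil

def preprocess_cmd(src, dst, ffmpeg=shutil.which("ffmpeg")):
    """Build an FFmpeg command that band-passes speech and resamples to 16 kHz,
    or return None to signal fallback to the unprocessed file."""
    if ffmpeg is None:
        return None
    return [ffmpeg, "-y", "-i", src,
            "-af", "highpass=f=80,lowpass=f=8000",  # rough speech bandpass
            "-ar", "16000", "-ac", "1",             # 16 kHz mono for Whisper
            dst]
```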
- Context Correction: Fixes grammar and homophones using GPT-4
- Translation: Supports 8+ languages with natural translation
- Cost transparency: Clear estimates before processing
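A pre-flight estimate can be as simple as a character-count heuristic. Both the ~4 characters/token rule of thumb and the price constant below are assumptions, not actual OpenAI pricing; check current rates before relying on this:

```python
def estimate_cost(text, usd_per_1k_tokens=0.01):
    """Rough USD estimate for sending `text` to an AI feature."""
    tokens = max(1, len(text) // 4)  # crude chars -> tokens heuristic
    return round(tokens / 1000 * usd_per_1k_tokens, 4)
```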
- FCPXML: Professional subtitles for Final Cut Pro
- ITT: Standard subtitles for video players
- Markdown: Human-readable transcripts with speakers/timecodes
- JSON: Raw data for custom processing
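The timecode math behind the subtitle formats differs: FCPXML expresses time as rational seconds snapped to a frame boundary, while ITT uses HH:MM:SS.mmm. A sketch of both conversions (the exact conventions in generate_fcpxml.py and generate_itt.py may differ):

```python
def fcpxml_time(seconds, timebase=2400, fps=24):
    """Snap to the nearest frame and render as 'N/timebase s' rational time."""
    frames = round(seconds * fps)
    return f"{frames * (timebase // fps)}/{timebase}s"

def itt_time(seconds):
    """Render seconds as an HH:MM:SS.mmm timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"
```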
# Run all automated tests
python scripts/run_all_tests.py
# Run unit tests
python -m pytest tests/
# Test individual modules
python scripts/transcribe.py --help

# If torch installation fails, try installing separately:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
# For GPU support (NVIDIA):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

- macOS: brew install ffmpeg
- Ubuntu/Debian: sudo apt update && sudo apt install ffmpeg
- Windows: Download from ffmpeg.org and add to PATH
# If pyannote.audio fails to load models:
# 1. Ensure you have a valid Hugging Face token
# 2. Accept the model license at: https://huggingface.co/pyannote/speaker-diarization-3.1
# 3. Check token permissions include model access

- Large Whisper models: Use smaller models (tiny/base) for limited RAM
- Speaker diarization: Disable if encountering OOM errors
- Long audio files: Process in shorter segments
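Chunked processing of long files boils down to computing segment boundaries; a small overlap avoids cutting words exactly at a boundary. A sketch (chunk and overlap lengths are illustrative):

```python
def segment_bounds(duration, chunk=600.0, overlap=5.0):
    """Return (start, end) windows in seconds covering `duration`."""
    bounds, start = [], 0.0
    while start < duration:
        end = min(start + chunk, duration)
        bounds.append((start, end))
        # step back by `overlap` unless this window already reaches the end
        start = end - overlap if end < duration else end
    return bounds
```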
# If model downloads fail:
# 1. Check internet connection
# 2. Verify Hugging Face token is valid
# 3. Try downloading models manually:
python -c "import whisper; whisper.load_model('base')"

- Interactive Pipeline: Run python run_transcription_pipeline_v2.py for guided setup
- Module Help: Each script has --help for detailed options
- Cost Estimation: AI features show cost estimates before processing
- Error Handling: Graceful fallbacks and clear error messages
The modular architecture makes it easy to:
- Add new output formats
- Integrate different AI models
- Customize processing steps
- Add new languages
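For example, adding a new output format means writing one more generator over the same segment list the other generators consume. A sketch of a hypothetical generate_srt.py (the script name and segment fields are assumptions, mirroring the JSON illustration above):

```python
def to_srt(segments):
    """Render a list of {start, end, text} segments as SubRip (SRT) text."""
    def stamp(t):
        ms = round(t * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"  # SRT uses a comma before ms

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"
```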
MIT License - see LICENSE file for details.
Transcriber v2.0 - Professional transcription made simple and modular.