Mime

Universal speech, visualized in real-time

A high-performance, low-latency pipeline for real-time speech-to-speech translation with 3D facial animation support, designed for Windows/Zoom environments.

Getting Started

1. Prerequisites

Python 3.10+
FFmpeg (Required for audio processing)
Git LFS (Required to pull large model weights and assets)
NVIDIA GPU (Highly recommended for low-latency inference)
Windows OS (Required for Zoom bridge functionality)
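
The command-line prerequisites above can be checked up front with a short script. This is a minimal sketch and not part of the repository: the helper name check_prerequisites is hypothetical, it only tests whether the ffmpeg and git-lfs executables are on PATH (not whether they work), and it does not probe for an NVIDIA GPU.

```python
import platform
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a dict mapping each prerequisite name to True/False.

    Hypothetical helper for illustration; not shipped with Mime.
    """
    return {
        # Compare only (major, minor) against the required minimum.
        "python": sys.version_info[:2] >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "git-lfs": shutil.which("git-lfs") is not None,
        # Only the Zoom bridge needs Windows.
        "windows": platform.system() == "Windows",
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:10s} {'ok' if ok else 'MISSING'}")
```

A "MISSING" result for windows only matters if you plan to use the Zoom bridge.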

2. Environment Setup

We use uv for fast, reproducible dependency management.

# 1. Install Git LFS (if not already installed)
# Windows: download the installer from git-lfs.github.com
# macOS: brew install git-lfs
# Then enable LFS for your user account (run once):
git lfs install

# 2. Clone the repository
git clone https://github.com/2026W-COMP4107/Mime.git
cd Mime

# 3. Pull Large File Storage (LFS) assets
git lfs pull

# 4. Setup virtual environment and dependencies
# If you don't have uv: pip install uv
uv sync

Optional: GPU & CUDA Optimization

If you need specific CUDA-enabled PyTorch wheels (e.g., +cu118), the pyproject.toml is pre-configured to route to the PyTorch CUDA index.

# Refresh and sync locked GPU dependencies
uv lock --refresh
uv sync

# Quick GPU verification
uv run python -c "import torch; print(f'Torch: {torch.__version__} | CUDA: {torch.version.cuda} | Available: {torch.cuda.is_available()}')"

3. Configuration

Create a .env file in the root directory:

# Groq (ASR & MT)
GROQ_API_KEY=gsk_your_key_here

# Inworld (TTS)
INWORLD_API_KEY=your_base64_key_here

# Hugging Face (Model Access)
HF_TOKEN=hf_your_token_here
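
Before launching, it can help to confirm that all three keys are actually set. The following is a minimal sketch using only the standard library; the function names parse_env_file and missing_keys are hypothetical, and how Mime itself loads .env is not shown in this README. Values are split on the first "=" only, so base64 keys that end in "=" padding are preserved.

```python
from pathlib import Path

# Key names taken from the .env template above.
REQUIRED_KEYS = ("GROQ_API_KEY", "INWORLD_API_KEY", "HF_TOKEN")


def parse_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # partition splits on the first "=" only.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    found = parse_env_file(env_path) if env_path.exists() else {}
    missing = missing_keys(found)
    print("All keys present" if not missing else f"Missing: {', '.join(missing)}")
```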

Project Structure

mime/
├── assets/          # 3D meshes (avatar.glb) and LFS tracked assets
├── data/            # BEAT dataset (downloaded separately)
├── models/          # Trained checkpoints (.pt files)
├── notebooks/       # Lip-sync training & grid search (ABS_train.ipynb)
├── reports/         # Proposal and final documentation
├── src/
│   ├── utils/       # Engines: ASR (Whisper), MT (LLaMA), TTS (Inworld)
│   ├── client_main.py   # Main entry point
│   └── sts_main.py      # Standalone speech-to-speech (STS) runner
└── .env             # Local environment secrets
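
The three engines listed under src/utils/ form a chain: speech is transcribed (ASR), the text is translated (MT), and the translation is synthesized back to speech (TTS). The sketch below illustrates that data flow only. The class name SpeechToSpeechPipeline and its callable fields are hypothetical, and the real engines may expose different interfaces (streaming, async, or otherwise).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechToSpeechPipeline:
    """Chains three stages: audio -> text -> translated text -> audio.

    Hypothetical structure for illustration; not the repository's API.
    """
    transcribe: Callable[[bytes], str]   # ASR stage
    translate: Callable[[str], str]      # MT stage
    synthesize: Callable[[str], bytes]   # TTS stage

    def run(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)
        translated = self.translate(text)
        return self.synthesize(translated)


if __name__ == "__main__":
    # Stub stages stand in for the network-backed engines.
    pipeline = SpeechToSpeechPipeline(
        transcribe=lambda audio: audio.decode("utf-8"),
        translate=lambda text: text.upper(),
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(pipeline.run(b"hello"))
```

Keeping each stage behind a plain callable makes it easy to swap in stubs when testing without API keys.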

Zoom Mime Client (Windows)

This bridge routes the generated avatar video and TTS audio directly into Zoom through a virtual camera and a virtual audio cable.

1. Windows Dependencies

OBS Studio: Provides the virtual camera backend.
VB-CABLE: Virtual audio cable for routing TTS to Zoom input.
Zoom Desktop Client.

2. Launching the Pipeline

# Activate environment
.venv\Scripts\activate  # macOS/Linux: source .venv/bin/activate

# Launch with Zoom bridge enabled
python src/client_main.py --enable-zoom-bridge --audio-device-name "CABLE Input"
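
The --audio-device-name value has to match a playback device on your machine. As a rough sketch of how such a name could be resolved to a device index, the function below does a case-insensitive substring match over a device list shaped like the output of sounddevice.query_devices() (dicts with "name" and "max_output_channels" keys). Whether client_main.py uses sounddevice or this matching rule is an assumption, and find_output_device is a hypothetical name.

```python
def find_output_device(devices, name):
    """Return the index of the first output-capable device whose name
    contains `name` (case-insensitive), or None if nothing matches.

    `devices` is a list of dicts with "name" and "max_output_channels"
    keys. Hypothetical helper; not part of the Mime repository.
    """
    wanted = name.lower()
    for index, device in enumerate(devices):
        can_play = device.get("max_output_channels", 0) > 0
        if can_play and wanted in device["name"].lower():
            return index
    return None


if __name__ == "__main__":
    # Hand-written sample list; real names vary by driver and host API.
    sample = [
        {"name": "Microphone (Realtek Audio)", "max_output_channels": 0},
        {"name": "Speakers (Realtek Audio)", "max_output_channels": 2},
        {"name": "CABLE Input (VB-Audio Virtual Cable)", "max_output_channels": 2},
        {"name": "CABLE Output (VB-Audio Virtual Cable)", "max_output_channels": 0},
    ]
    print(find_output_device(sample, "CABLE Input"))
```

The naming is easy to mix up: the application plays audio into CABLE Input (a playback device), and Zoom records from CABLE Output (a recording device).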

3. Zoom Configuration

Video: Select OBS Virtual Camera as your Camera (the pipeline sends frames to it via pyvirtualcam).
Audio: Select CABLE Output (VB-Audio) as your Microphone.

Training & Dataset

1. Download BEAT Dataset

# Install the Hugging Face CLI (hf) if you don't have it
# Windows (PowerShell): powershell -ExecutionPolicy ByPass -c "irm https://hf.co/cli/install.ps1 | iex"

# Download the dataset into data/
hf download H-Liu1997/BEAT --repo-type dataset --local-dir data

2. Train Lip-Sync Model

# Add the Jupyter kernel as a dev dependency
uv add --dev ipykernel

# Open the training notebook
jupyter notebook notebooks/ABS_train.ipynb

The notebook supports either a custom CNN or a Wav2Vec 2.0 backbone. Monitor training progress with TensorBoard:

tensorboard --logdir=logs

Troubleshooting

Missing Models: Ensure you ran git lfs pull after cloning.
Triton Errors (Windows): Run uv lock --refresh then uv sync to fix platform-specific wheel issues.
No Video in Zoom: Ensure OBS is installed and no other app is locking the virtual camera device.
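
For the missing-models case, one way to confirm the diagnosis is to look for files that are still Git LFS pointer stubs. A pointer is a small text file that begins with the line "version https://git-lfs.github.com/spec/v1". The sketch below scans for such stubs; the function names are hypothetical, and the choice of *.pt and *.glb patterns is an assumption based on the project structure, since the actual tracked patterns live in .gitattributes.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """True if the file is an un-pulled Git LFS pointer stub."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_unpulled(root, patterns=("*.pt", "*.glb")):
    """List files under `root` matching `patterns` that are still pointers."""
    root = Path(root)
    return sorted(
        p
        for pattern in patterns
        for p in root.rglob(pattern)
        if p.is_file() and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    for folder in ("models", "assets"):
        if Path(folder).is_dir():
            for stub in find_unpulled(folder):
                print(f"Not pulled: {stub}")
```

If the script prints any paths, run git lfs pull again from the repository root.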
