VoiceForge is a browser-based assistive video tool that lets a user type during calls and output cloned speech with a lip-synced face preview.
- Why This Exists
- Tech Stack
- Browser Compatibility
- Prerequisites
- Setup
- Environment Variables
- Using VoiceForge In A Call
- OBS Virtual Camera Setup
- API
- Roadmap
- License
- About
Deaf and speech-impaired people on video calls are often pushed into chat boxes, delayed interpretation, or awkward turn-taking. VoiceForge explores a local-first interface where typed intent can become spoken audio and a synchronized visual feed, helping the user participate in the same conversational channel as everyone else.
VoiceForge targets Chrome and Edge only. WebRTC Insertable Streams and canvas capture APIs are still uneven across browsers, so Firefox and Safari are not supported for the virtual camera MVP.
VoiceForge's voice cloning engine is free to use: no paid API plan, no account sign-up, and no API key required.
It is powered by ResembleAI/Chatterbox-Multilingual-TTS, a production-grade multilingual voice cloning model hosted as a public Hugging Face Space. The server connects to it using the official @gradio/client bridge package, which is installed automatically with npm install.
What you need:
- Node.js 18 or newer
- npm 9 or newer
- Chrome or Edge (for the virtual camera feature)
- An internet connection when running in live mode (see Environment Variables for offline mock mode)
- Install Node.js 18 or newer.
- From the repository root, install all dependencies (this includes @gradio/client):

      npm install

- Copy the example environment file:

      cp .env.example .env

- (Optional) Open .env and review the settings. The defaults run in offline mock mode, so no API key or internet access is needed. See Environment Variables for the full reference.
- Start the client and server together:

      npm run dev

- Open http://localhost:5173 in Chrome or Edge.
All variables live in your local .env file (copy from .env.example). None of them require a paid account or API key.
| Variable | Default | Description |
|---|---|---|
| VOICE_ENGINE_SPACE | (commented out) | The Hugging Face Gradio space used for voice synthesis. See the dual-mode setup below. |
| MOCK_CHATTERBOX | true | Controls whether the live AI or an offline test stub is used. See below. |
| PORT | 3001 | Express API port. |
| CLIENT_URL | http://localhost:5173 | Allowed CORS origin for the Vite dev server. |
| STREAM_SECRET | (auto-generated) | AES-256-GCM signing key for speech stream tokens. Set a fixed value to survive server restarts. |
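The STREAM_SECRET row explains why a fixed value matters: a key generated fresh at startup cannot open tokens issued before a restart. A minimal sketch of how such tokens might be sealed and opened with Node's built-in crypto module follows. The helper names (loadKey, sealToken, openToken), the SHA-256 key derivation, and the iv + tag + ciphertext layout are all assumptions for illustration, not the project's actual implementation.

```javascript
const nodeCrypto = require("node:crypto");

// Assumption: derive a 32-byte key by hashing STREAM_SECRET; with no secret,
// fall back to a random key that only lives as long as this process.
function loadKey(secret) {
  return secret
    ? nodeCrypto.createHash("sha256").update(secret).digest()
    : nodeCrypto.randomBytes(32);
}

// Encrypt and authenticate a JSON payload. Assumed layout: iv (12) | tag (16) | data.
function sealToken(key, payload) {
  const iv = nodeCrypto.randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = nodeCrypto.createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([
    cipher.update(JSON.stringify(payload), "utf8"),
    cipher.final(),
  ]);
  return Buffer.concat([iv, cipher.getAuthTag(), data]).toString("base64url");
}

// Decrypt a token; throws if the key is wrong or the token was modified.
function openToken(key, token) {
  const raw = Buffer.from(token, "base64url");
  const decipher = nodeCrypto.createDecipheriv("aes-256-gcm", key, raw.subarray(0, 12));
  decipher.setAuthTag(raw.subarray(12, 28));
  const data = Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]);
  return JSON.parse(data.toString("utf8"));
}
```

With a fixed secret, loadKey returns the same key on every start, so a token sealed before a restart still opens afterwards. With an auto-generated key, every restart invalidates outstanding tokens.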
VoiceForge's engine routing is controlled entirely from .env. There is an offline mock mode, plus two live modes that route to either the official Space or a community mirror.
The checked-in .env.example uses mock mode by default:

    MOCK_CHATTERBOX=true

This skips all Hugging Face network calls. The server returns a fixture voice_id instantly on clone and streams a short silent audio file on speak. This is ideal for contributors working on UI changes, automated CI pipelines, or offline environments.
Safety: MOCK_CHATTERBOX=true has no effect when NODE_ENV=production. The server logs a yellow warning at startup whenever mock mode is active so it can never be silently enabled.
Leave VOICE_ENGINE_SPACE commented out with a #. The server then routes all synthesis requests to the official production Space:
    # VOICE_ENGINE_SPACE=ResembleAI/Chatterbox-Multilingual-TTS
    MOCK_CHATTERBOX=false

This is the recommended setting for end-users and deployed environments.
If the official space is temporarily busy or you prefer to route through an independent mirror, uncomment the line and point it at the community-maintained backup:
    VOICE_ENGINE_SPACE=itzzavdheshh/voiceforge-engine
    MOCK_CHATTERBOX=false

This mirror runs the same Chatterbox Multilingual model. Useful when the primary space is under heavy load or during extended development sessions.
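The routing rules described above can be summarized in a few lines. This is a sketch only: resolveEngineConfig is a hypothetical helper, and treating an unset MOCK_CHATTERBOX as false is an assumption. The isMock and space field names mirror what GET /api/voice/status is documented to report.

```javascript
// Official Space used when VOICE_ENGINE_SPACE is commented out or blank.
const DEFAULT_SPACE = "ResembleAI/Chatterbox-Multilingual-TTS";

// Hypothetical helper: turn environment variables into an engine config.
function resolveEngineConfig(env) {
  const isProduction = env.NODE_ENV === "production";
  // Mock mode is ignored in production, matching the safety note.
  const isMock = env.MOCK_CHATTERBOX === "true" && !isProduction;
  // A missing or blank VOICE_ENGINE_SPACE falls back to the official Space.
  const space = (env.VOICE_ENGINE_SPACE || "").trim() || DEFAULT_SPACE;
  return { isMock, space };
}

// Example: the community mirror in live mode.
const config = resolveEngineConfig({
  MOCK_CHATTERBOX: "false",
  VOICE_ENGINE_SPACE: "itzzavdheshh/voiceforge-engine",
});
console.log(config.isMock, config.space);
```

In a real server the argument would typically be process.env. Passing it explicitly keeps the function easy to exercise in tests.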
- Open VoiceForge in Chrome or Edge.
- Record a 10-second consent-based reference clip.
- Clone the voice and continue to the Call page.
- Allow webcam access.
- Type a phrase and press Enter or Speak.
- Turn on Go Live to expose the canvas stream inside the browser.
- In Zoom, Google Meet, or Microsoft Teams, open camera settings and select the virtual camera source you have configured.
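The Go Live step exposes the preview canvas as a media stream. A minimal browser-side sketch of that idea uses the standard HTMLCanvasElement.captureStream API. The function names and the 30 fps default are assumptions; how VoiceForge wires the stream internally may differ.

```javascript
// Hypothetical helper: capture a canvas as a MediaStream at the given frame rate.
function startCanvasStream(canvas, fps = 30) {
  if (typeof canvas.captureStream !== "function") {
    throw new Error("canvas.captureStream is not available in this browser");
  }
  // captureStream(frameRate) returns a MediaStream carrying one video track.
  return canvas.captureStream(fps);
}

// Hypothetical helper: stop every track so the capture is released.
function stopStream(stream) {
  stream.getTracks().forEach((track) => track.stop());
}

// Usage in a page (the element id "preview" is an assumption):
// const stream = startCanvasStream(document.getElementById("preview"));
// someVideoElement.srcObject = stream;
// ...later: stopStream(stream);
```

The feature check fails fast in browsers without canvas capture, which fits the Chrome and Edge only support statement.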
Most video call apps cannot directly select a browser tab as a system camera. For the MVP, install OBS Studio and use OBS Virtual Camera as the bridge.
- Install OBS Studio.
- Add a Browser Source pointing to http://localhost:5173. Set the width to 1920 and height to 1080 to capture the full interface.
- Crop the source to focus on the lip-synced output preview.
- Click Start Virtual Camera in the OBS Controls panel.
- Select OBS Virtual Camera as your camera in your preferred video call application.
Zoom: Go to Settings > Video > Camera and select OBS Virtual Camera.
Google Meet: Go to Settings > Video > Camera and select OBS Virtual Camera.
Microsoft Teams: Go to Settings > Devices > Camera and select OBS Virtual Camera.
For detailed setup guides (including Discord and Webex) and troubleshooting tips, see our Virtual Camera Guide.
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/voice/clone | Upload reference audio. Stores it server-side and returns a voice_id. No external API call in mock mode. |
| POST | /api/voice/speak | Send text, voice_id, and optional voice settings. Returns a signed speechId and streaming audioUrl. |
| GET | /api/voice/speak/stream?t=<speechId> | Stream the Chatterbox-generated audio for a pending signed speech token (t). Proxied from the Hugging Face Space. |
| GET | /api/voice/status | Returns current engine mode (isMock, space) for debugging. |
| GET | /api/health | Returns local API health status. |
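As a rough illustration of the speak flow, the helpers below build the POST request and the stream URL. The helper names are hypothetical, and the JSON field names (text, voice_id, voice_settings) are assumed from the endpoint descriptions and the roadmap rather than from a published schema.

```javascript
// Hypothetical helper: build the URL and fetch options for POST /api/voice/speak.
function buildSpeakRequest(apiBase, text, voiceId, voiceSettings) {
  const body = { text, voice_id: voiceId };
  // voice_settings is optional, so only include it when provided.
  if (voiceSettings !== undefined) body.voice_settings = voiceSettings;
  return {
    url: new URL("/api/voice/speak", apiBase).toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Hypothetical helper: build the stream URL for a signed speech token.
function buildStreamUrl(apiBase, speechId) {
  const url = new URL("/api/voice/speak/stream", apiBase);
  url.searchParams.set("t", speechId); // the token travels in the `t` query parameter
  return url.toString();
}

// Usage sketch in the browser (voiceId comes from an earlier clone call):
// const { url, init } = buildSpeakRequest("http://localhost:3001", "Hello", voiceId);
// const { speechId } = await (await fetch(url, init)).json();
// new Audio(buildStreamUrl("http://localhost:3001", speechId)).play();
```

Because the speak response also includes an audioUrl, a client could use that value directly instead of rebuilding the stream URL.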
- Done: Store cloned voice profiles and reference audio Blobs in IndexedDB via client/src/utils/db.js.
- Done: Stream TTS audio through POST /api/voice/speak and GET /api/voice/speak/stream.
- Done: Replaced ElevenLabs with the free ResembleAI Chatterbox Multilingual TTS engine via @gradio/client.
- In progress: Voice tuning controls are wired through persisted voice_settings; multilingual output supports 23 languages via Chatterbox, with dedicated language controls in the UI.
- In progress: The MVP virtual camera uses canvas capture; full WebRTC Insertable Streams frame replacement remains future work.
- TODO: Replace the placeholder models/wav2lip.onnx with a real lightweight browser Wav2Lip ONNX model.
- TODO: Implement real ONNX Runtime Web Wav2Lip inference.
- Done: Replace the fallback mouth animation with model-driven mouth movement.
- Done: Add richer virtual camera documentation for OBS and each call provider.
- TODO: Add automated browser tests for camera and microphone permission flows.
- TODO: Persist voice profiles across server restarts (database or object-store backend).
MIT




