This directory provides a complete finetuning workflow built on the MossTTSDelay architecture:
- `prepare_data.py`: pre-extract target `audio_codes`, with rank-sharded output support
- `dataset.py`: pack `text` / `instruction` / `ambient_sound` / `reference` and related fields into teacher-forcing samples
- `sft.py`: supports single-GPU, data parallel training, and 8B-scale FSDP / DeepSpeed ZeRO-3 sharded training
- `convert_seed_tts_eval_to_jsonl.py`: convert `seed-tts-eval` folders into training JSONL
- `run_train.sh`: one-click launcher
Install training dependencies first:
```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,finetune]"
```

If your environment supports FlashAttention 2, you can also follow the installation notes in the root README.
If you plan to use DeepSpeed ZeRO-3, install the extra dependency group as well:
```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,finetune-deepspeed]"
```

All tasks share the same basic idea:
- `audio`: target training audio path; `prepare_data.py` will encode it into `audio_codes`
- all other fields are mapped directly into `processor.build_user_message(...)`
This format does not require reference audio and is supported directly:
```jsonl
{"audio":"./data/utt0001.wav","text":"Actually, I noticed that I am very sensitive to other people's emotions.","language":"en"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","language":"en"}
```

To clone a specific voice, add a `ref_audio` field:

```jsonl
{"audio":"./data/utt0001.wav","text":"Actually, I noticed that I am very sensitive to other people's emotions.","ref_audio":"./data/ref.wav","language":"en"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav","language":"en"}
```

MOSS-TTSD shares the same `prepare_data.py` / `sft.py` pipeline as MOSS-TTS, and the format can stay the same.
The only difference is that `reference` may be a multi-speaker list, and list elements may be `null`, meaning that speaker has no cloning reference:
```json
{
  "audio": "./data/dialog_target.wav",
  "text": "[S1] This is the prefix from speaker one. [S2] This is the prefix from speaker two. [S1] Now continue the next turn.",
  "reference": ["./data/s1_ref.wav", null],
  "language": "en"
}
```

Notes:

- `prepare_data.py` always encodes `audio`
- by default it also encodes any reference audio found in `reference` / `ref_audio` / `reference_audio`
- `null` entries inside `reference` are preserved as `None` during training and will not be encoded incorrectly
- no extra `prompt_audio` field is required; autoregressive continuation is already learned through standard teacher-forcing training
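Before running `prepare_data.py`, it can be worth validating raw JSONL records offline. A minimal sketch (not part of this repo; field names follow the format described above):

```python
import json
from pathlib import Path


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one raw JSONL record."""
    problems = []
    rec = json.loads(line)
    if "audio" not in rec:
        problems.append("missing required 'audio' field")
    elif not Path(rec["audio"]).exists():
        problems.append(f"target audio not found: {rec['audio']}")
    ref = rec.get("reference")
    if isinstance(ref, list):
        # null entries are allowed: that speaker has no cloning reference
        for i, item in enumerate(ref):
            if item is not None and not Path(item).exists():
                problems.append(f"reference[{i}] not found: {item}")
    return problems
```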
If you use OpenMOSS-Team/MOSS-TTSD-v1.0 as the base model, the safest and simplest workflow is: before preprocessing, training, and inference, replace the supporting `.py` files under this repo's `moss_tts_delay` folder with the corresponding versions from the OpenMOSS-Team/MOSS-TTSD-v1.0 Hugging Face repository.
Specifically, replace:

- `processing_moss_tts.py`
- `modeling_moss_tts.py`
- `configuration_moss_tts.py`
- `inference_utils.py`
The reason is that the prompt template and some implementation details in OpenMOSS-Team/MOSS-TTSD-v1.0 are not exactly the same as the default moss_tts_delay version in this repo. If you mix them, a common failure mode is that the training loss looks normal but inference collapses into gibberish.
Also note that OpenMOSS-Team/MOSS-TTSD-v1.0 uses `n_vq = 16`, so we recommend explicitly passing `--n-vq 16` in both preprocessing and training to keep data preparation, training, and inference consistent.
MOSS-SoundEffect uses the same pipeline, with `ambient_sound` as the user-side field:

```jsonl
{"audio":"./data/rain.wav","ambient_sound":"Rolling thunder with steady rainfall."}
{"audio":"./data/footsteps.wav","ambient_sound":"Clear footsteps echoing on concrete at a steady rhythm.","tokens":160}
```

MOSS-VoiceGenerator also shares the same training flow, using `text` + `instruction`:

```jsonl
{"audio":"./data/old_man.wav","text":"My old back is really giving me trouble these days.","instruction":"A tired, hoarse elderly voice complaining slowly with a faint groan."}
{"audio":"./data/tavern.wav","text":"Hey there, stranger!","instruction":"Hearty, jovial tavern owner's voice, loud and welcoming with a slightly gruff tone."}
```

Run the basic preprocessing command:

```bash
python moss_tts_delay/finetuning/prepare_data.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
  --device auto \
  --input-jsonl train_raw.jsonl \
  --output-jsonl train_with_codes.jsonl
```

By default, `prepare_data.py` pre-encodes reference audio as well. If you only want target audio codes, disable it explicitly:
```bash
python moss_tts_delay/finetuning/prepare_data.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
  --device auto \
  --input-jsonl train_raw.jsonl \
  --output-jsonl train_with_codes.jsonl \
  --skip-reference-audio-codes
```

`prepare_data.py` follows the `accelerate launch` multi-process model directly.
For example, with 2 nodes and 16 GPUs in total, the dataset is split into 16 shards and each rank writes one shard:
```bash
accelerate launch --num_processes 16 moss_tts_delay/finetuning/prepare_data.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
  --device auto \
  --input-jsonl train_raw.jsonl \
  --output-jsonl prepared/train_with_codes.jsonl
```

The output will look like:

```
prepared/train_with_codes.rank00000-of-00016.jsonl
prepared/train_with_codes.rank00001-of-00016.jsonl
...
prepared/train_with_codes.rank00015-of-00016.jsonl
```
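Conceptually, the rank sharding is a round-robin split plus a deterministic output name. A sketch of the idea (not the actual `prepare_data.py` code):

```python
def shard_for_rank(items: list, rank: int, world_size: int) -> list:
    """Round-robin split: rank r processes items r, r + world_size, ..."""
    return items[rank::world_size]


def shard_filename(stem: str, rank: int, world_size: int) -> str:
    # Matches the rankNNNNN-of-NNNNN naming pattern shown above.
    return f"{stem}.rank{rank:05d}-of-{world_size:05d}.jsonl"
```

For example, rank 3 of 16 would write `prepared/train_with_codes.rank00003-of-00016.jsonl`, and the union of all shards covers the dataset exactly once.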
During training, `sft.py` can read:

- a single JSONL file
- a directory
- a glob such as `prepared/train_with_codes.rank*.jsonl`
- a comma-separated list of files
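Resolving such an input spec could look like the following sketch (illustrating the behavior, not `sft.py`'s actual implementation):

```python
import glob
from pathlib import Path


def resolve_train_inputs(spec: str) -> list[str]:
    """Expand a file / directory / glob / comma-separated spec into JSONL paths."""
    files: list[str] = []
    for part in spec.split(","):
        part = part.strip()
        p = Path(part)
        if p.is_dir():
            # A directory contributes all of its .jsonl files, sorted.
            files.extend(str(f) for f in sorted(p.glob("*.jsonl")))
        elif any(ch in part for ch in "*?["):
            # A glob pattern is expanded and sorted for determinism.
            files.extend(sorted(glob.glob(part)))
        else:
            files.append(part)
    return files
```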
If your platform already injects distributed communication environment variables, accelerate launch will reuse them directly, so you usually do not need to write torchrun-style communication arguments yourself.
```bash
accelerate launch moss_tts_delay/finetuning/sft.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --train-jsonl train_with_codes.jsonl \
  --output-dir output/moss_tts_sft \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 8 \
  --learning-rate 1e-5 \
  --warmup-ratio 0.03 \
  --num-epochs 3 \
  --mixed-precision bf16 \
  --channelwise-loss-weight 1,32 \
  --gradient-checkpointing
```

For single-node 8-GPU data parallel training, you can use:
```bash
accelerate launch \
  --config_file moss_tts_delay/finetuning/configs/accelerate_ddp_8gpu.yaml \
  moss_tts_delay/finetuning/sft.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
  --output-dir output/moss_tts_sft_ddp \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 4 \
  --mixed-precision bf16 \
  --channelwise-loss-weight 1,32 \
  --gradient-checkpointing
```

For the 8B MOSS-TTS model, the following approaches are recommended over naive single-GPU training:
- FSDP: shard parameters, gradients, and optimizer states across ranks
- DeepSpeed ZeRO-3: fully shard parameters, gradients, and optimizer states; better suited for larger models and multi-node setups
FSDP launch:

```bash
accelerate launch \
  --config_file moss_tts_delay/finetuning/configs/accelerate_fsdp_8b.yaml \
  moss_tts_delay/finetuning/sft.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
  --output-dir output/moss_tts_sft_fsdp \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 4 \
  --mixed-precision bf16 \
  --channelwise-loss-weight 1,32 \
  --gradient-checkpointing
```

DeepSpeed ZeRO-3 launch:

```bash
accelerate launch \
  --config_file moss_tts_delay/finetuning/configs/accelerate_zero3_8b.yaml \
  moss_tts_delay/finetuning/sft.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
  --output-dir output/moss_tts_sft_zero3 \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 4 \
  --mixed-precision bf16 \
  --channelwise-loss-weight 1,32 \
  --gradient-checkpointing
```

ZeRO-3 requires the `deepspeed` package. If you only use DDP or FSDP, you do not need it.
`sft.py` now exposes the common training hyperparameters directly:

- optimizer: `--learning-rate`, `--weight-decay`, `--adam-beta1`, `--adam-beta2`, `--adam-eps`
- LR schedule: `--lr-scheduler-type`, `--warmup-steps`, `--warmup-ratio`
- stability: `--max-grad-norm`, `--gradient-checkpointing`, `--mixed-precision`
- RVQ multi-head loss weighting: `--channelwise-loss-weight`

`--channelwise-loss-weight` supports two forms:

- `n_vq + 1` values: `text_head,vq0,...,vqN`
- two values: `text_weight,total_audio_weight`

The default is `1,32`, which means the text head and each individual audio head have equal weight.
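A sketch of how the two-value form could expand into per-head weights (this mirrors the description above, not the actual `sft.py` code; `n_vq` is the number of audio heads):

```python
def expand_loss_weights(spec: str, n_vq: int) -> list[float]:
    """Return [text_head, vq0, ..., vqN] weights from a --channelwise-loss-weight spec."""
    values = [float(v) for v in spec.split(",")]
    if len(values) == n_vq + 1:
        return values  # explicit per-head form
    if len(values) == 2:
        text_weight, total_audio_weight = values
        # Spread the total audio weight evenly over the n_vq audio heads.
        return [text_weight] + [total_audio_weight / n_vq] * n_vq
    raise ValueError(f"expected 2 or {n_vq + 1} values, got {len(values)}")
```

With the default `1,32` and 32 audio heads, every head (text and each VQ head) ends up with weight 1.0, matching the statement above.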
Training logs now print:
- timestamped log prefixes
- `global_batch_size` and its formula
- `step_time`, `steps_per_sec`, `samples_per_sec`, `eta`
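The `global_batch_size` formula is the usual data-parallel product (a sketch with hypothetical values, matching the flags used in the launch commands):

```python
def global_batch_size(per_device_batch_size: int,
                      gradient_accumulation_steps: int,
                      num_processes: int) -> int:
    # Samples consumed per optimizer step across all ranks.
    return per_device_batch_size * gradient_accumulation_steps * num_processes
```

For example, `--per-device-batch-size 1` with `--gradient-accumulation-steps 4` on 8 GPUs yields a global batch size of 32.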
Update the following fields in the config file for your cluster:
- `num_machines`
- `num_processes`
- `machine_rank`
- `main_process_ip`
- `main_process_port`

For example, for 2 nodes and 16 GPUs:

- node 0: `machine_rank: 0`
- node 1: `machine_rank: 1`
- on both nodes: `num_machines: 2`, `num_processes: 16`
The training command itself can stay unchanged.
Each checkpoint saved by `sft.py` contains the model config, runtime Python files, tokenizer files, and processor metadata, so you can call `from_pretrained` directly on that checkpoint directory:
```python
from pathlib import Path
import importlib.util

import torch
import torchaudio
from transformers import AutoProcessor

from moss_tts_delay.modeling_moss_tts import MossTTSDelayModel

torch.backends.cuda.enable_cudnn_sdp(False)
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)


def resolve_attn_implementation(device: str, dtype: torch.dtype) -> str:
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"
    if device == "cuda":
        return "sdpa"
    return "eager"


model_path = "output/moss_tts_sft/checkpoint-epoch-2"
reference_audio = "./assets/audio/reference_en_0.mp3"
text = "This is a quick finetuning smoke test for MOSS-TTS."

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
attn_implementation = resolve_attn_implementation(device, dtype)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

model = MossTTSDelayModel.from_pretrained(
    model_path,
    torch_dtype=dtype,
    attn_implementation=attn_implementation,
).to(device)
model.eval()

conversation = [[
    processor.build_user_message(
        text=text,
        reference=[reference_audio],
    )
]]

batch = processor(conversation, mode="generation")
outputs = model.generate(
    input_ids=batch["input_ids"].to(device),
    attention_mask=batch["attention_mask"].to(device),
    max_new_tokens=4096,
)

message = processor.decode(outputs)[0]
audio = message.audio_codes_list[0]
Path("demo_outputs").mkdir(parents=True, exist_ok=True)
torchaudio.save("demo_outputs/finetuned_sample.wav", audio.unsqueeze(0), processor.model_config.sampling_rate)
```

Run directly:

```bash
bash moss_tts_delay/finetuning/run_train.sh
```

Common environment variables:
- `RAW_JSONL`: raw training JSONL
- `PREPARED_JSONL`: output file from `prepare_data.py`
- `TRAIN_JSONL`: optional; training input, which can be a single file, directory, or glob. If unset, it is inferred automatically from `PREPARED_JSONL`
- `OUTPUT_DIR`: training output directory
- `ACCELERATE_CONFIG_FILE`: optional; DDP / FSDP / ZeRO-3 config file
- `SKIP_PREPARE`: set to `1` to skip preprocessing and train directly from an existing `TRAIN_JSONL` / `PREPARED_JSONL`
- `PREP_EXTRA_ARGS_STR`: extra arguments passed to `prepare_data.py`
- `PREP_ACCELERATE_ARGS_STR`: if you want preprocessing to also launch through `accelerate`, set this, for example `--num_processes 16` or `--config_file moss_tts_delay/finetuning/configs/accelerate_ddp_8gpu.yaml`
- `TRAIN_EXTRA_ARGS_STR`: extra arguments passed to `sft.py`
For example, to launch with ZeRO-3:
```bash
RAW_JSONL=train_raw.jsonl \
PREPARED_JSONL=prepared/train_with_codes.jsonl \
OUTPUT_DIR=output/moss_tts_sft_zero3 \
ACCELERATE_CONFIG_FILE=moss_tts_delay/finetuning/configs/accelerate_zero3_8b.yaml \
PREP_ACCELERATE_ARGS_STR='--config_file moss_tts_delay/finetuning/configs/accelerate_ddp_8gpu.yaml' \
PREP_EXTRA_ARGS_STR='' \
TRAIN_EXTRA_ARGS_STR='--per-device-batch-size 1 --gradient-accumulation-steps 4 --num-epochs 3 --warmup-ratio 0.03 --mixed-precision bf16 --channelwise-loss-weight 1,32 --gradient-checkpointing' \
bash moss_tts_delay/finetuning/run_train.sh
```

The remaining tasks do not require a separate trainer; you only need to switch the JSONL fields:
- MOSS-TTS: use `text`, optionally `ref_audio`
- MOSS-TTSD: use `text` + `reference`, where `reference` supports `null`
- MOSS-SoundEffect: use `ambient_sound`
- MOSS-VoiceGenerator: use `text` + `instruction`

Shared fields:

- `audio`: required target audio
- `language`, `tokens`, `quality`, `sound_event`, `ambient_sound`, `instruction`: fill them in as needed by the task
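As a sanity check, the per-task records above can be emitted with plain `json.dumps` (field names from this guide; all paths and text are placeholders):

```python
import json

records = [
    # MOSS-TTS: text, optionally ref_audio
    {"audio": "./data/utt0001.wav", "text": "Hello there.", "ref_audio": "./data/ref.wav", "language": "en"},
    # MOSS-TTSD: reference may be a multi-speaker list with null entries
    {"audio": "./data/dialog.wav", "text": "[S1] Hi. [S2] Hello.", "reference": ["./data/s1_ref.wav", None]},
    # MOSS-SoundEffect: ambient_sound only
    {"audio": "./data/rain.wav", "ambient_sound": "Rolling thunder with steady rainfall."},
    # MOSS-VoiceGenerator: text + instruction
    {"audio": "./data/old_man.wav", "text": "My back hurts.", "instruction": "Tired elderly voice."},
]

with open("train_raw.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        # Python None serializes to JSON null, which prepare_data.py preserves.
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```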
Shared scripts:
- use `prepare_data.py` for data preparation
- use `sft.py` for training
- `--train-jsonl` supports a single file, directory, glob, or multi-file list