This directory provides a complete finetuning workflow built on the MossTTSDelay architecture:
- `prepare_data.py`: pre-extract target `audio_codes`, with rank-sharded output support
- `dataset.py`: pack `text` / `instruction` / `ambient_sound` / `reference` and related fields into teacher-forcing samples
- `sft.py`: supports single-GPU, data parallel training, and 8B-scale FSDP / DeepSpeed ZeRO-3 sharded training
- `convert_seed_tts_eval_to_jsonl.py`: convert `seed-tts-eval` folders into training JSONL
- `run_train.sh`: one-click launcher
Install training dependencies first:
```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,finetune]"
```

If your environment supports FlashAttention 2, you can also follow the installation notes in the root README.
If you plan to use DeepSpeed ZeRO-3, install the extra dependency group as well:
```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,finetune-deepspeed]"
```

All tasks share the same basic idea:
- `audio`: target training audio path; `prepare_data.py` will encode it into `audio_codes`
- all other fields are mapped directly into `processor.build_user_message(...)`
This format does not require reference audio and is supported directly:
```jsonl
{"audio":"./data/utt0001.wav","text":"Actually, I noticed that I am very sensitive to other people's emotions.","language":"en"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","language":"en"}
```

To clone a specific voice, add a `ref_audio` field:

```jsonl
{"audio":"./data/utt0001.wav","text":"Actually, I noticed that I am very sensitive to other people's emotions.","ref_audio":"./data/ref.wav","language":"en"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav","language":"en"}
```

MOSS-TTSD shares the same `prepare_data.py` / `sft.py` pipeline as MOSS-TTS, and the format can stay the same.
The only difference is that `reference` may be a multi-speaker list, and list elements may be `null`, meaning that speaker has no cloning reference:
```json
{
  "audio": "./data/dialog_target.wav",
  "text": "[S1] This is the prefix from speaker one. [S2] This is the prefix from speaker two. [S1] Now continue the next turn.",
  "reference": ["./data/s1_ref.wav", null],
  "language": "en"
}
```

Notes:

- `prepare_data.py` always encodes `audio`
- by default it also encodes any reference audio found in `reference` / `ref_audio` / `reference_audio`
- `null` entries inside `reference` are preserved as `None` during training and will not be encoded incorrectly
- no extra `prompt_audio` field is required; autoregressive continuation is already learned through standard teacher-forcing training
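Before running `prepare_data.py`, it can be worth validating raw JSONL records offline. A minimal sketch (not part of this repo; field names follow the format described above):

```python
import json
from pathlib import Path


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one raw JSONL record."""
    problems = []
    rec = json.loads(line)
    if "audio" not in rec:
        problems.append("missing required 'audio' field")
    elif not Path(rec["audio"]).exists():
        problems.append(f"target audio not found: {rec['audio']}")
    ref = rec.get("reference")
    if isinstance(ref, list):
        # null entries are allowed: that speaker has no cloning reference
        for i, item in enumerate(ref):
            if item is not None and not Path(item).exists():
                problems.append(f"reference[{i}] not found: {item}")
    return problems
```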
If you use OpenMOSS-Team/MOSS-TTSD-v1.0 as the base model, the safest and simplest workflow is: before preprocessing, training, and inference, replace the supporting `.py` files under this repo's `moss_tts_delay` folder with the corresponding versions from the OpenMOSS-Team/MOSS-TTSD-v1.0 Hugging Face repository.
Specifically, replace:

- `processing_moss_tts.py`
- `modeling_moss_tts.py`
- `configuration_moss_tts.py`
- `inference_utils.py`
The reason is that the prompt template and some implementation details in OpenMOSS-Team/MOSS-TTSD-v1.0 are not exactly the same as the default moss_tts_delay version in this repo. If you mix them, a common failure mode is that the training loss looks normal but inference collapses into gibberish.
Also note that OpenMOSS-Team/MOSS-TTSD-v1.0 uses `n_vq = 16`, so we recommend explicitly passing `--n-vq 16` in both preprocessing and training to keep data preparation, training, and inference consistent.
MOSS-SoundEffect uses the same pipeline, with `ambient_sound` as the user-side field:

```jsonl
{"audio":"./data/rain.wav","ambient_sound":"Rolling thunder with steady rainfall."}
{"audio":"./data/footsteps.wav","ambient_sound":"Clear footsteps echoing on concrete at a steady rhythm.","tokens":160}
```

MOSS-VoiceGenerator also shares the same training flow, using `text` + `instruction`:

```jsonl
{"audio":"./data/old_man.wav","text":"My old back is really giving me trouble these days.","instruction":"A tired, hoarse elderly voice complaining slowly with a faint groan."}
{"audio":"./data/tavern.wav","text":"Hey there, stranger!","instruction":"Hearty, jovial tavern owner's voice, loud and welcoming with a slightly gruff tone."}
```

Run the basic preprocessing command:

```bash
python moss_tts_delay/finetuning/prepare_data.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
  --device auto \
  --input-jsonl train_raw.jsonl \
  --output-jsonl train_with_codes.jsonl
```

By default, `prepare_data.py` pre-encodes reference audio as well. If you only want target audio codes, disable it explicitly:
```bash
python moss_tts_delay/finetuning/prepare_data.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
  --device auto \
  --input-jsonl train_raw.jsonl \
  --output-jsonl train_with_codes.jsonl \
  --skip-reference-audio-codes
```

`prepare_data.py` follows the `accelerate launch` multi-process model directly.
For example, with 2 nodes and 16 GPUs in total, the dataset is split into 16 shards and each rank writes one shard:
```bash
accelerate launch --num_processes 16 moss_tts_delay/finetuning/prepare_data.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --codec-path OpenMOSS-Team/MOSS-Audio-Tokenizer \
  --device auto \
  --input-jsonl train_raw.jsonl \
  --output-jsonl prepared/train_with_codes.jsonl
```

The output will look like:

```
prepared/train_with_codes.rank00000-of-00016.jsonl
prepared/train_with_codes.rank00001-of-00016.jsonl
...
prepared/train_with_codes.rank00015-of-00016.jsonl
```
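Conceptually, the rank sharding is a round-robin split plus a deterministic output name. A sketch of the idea (not the actual `prepare_data.py` code):

```python
def shard_for_rank(items: list, rank: int, world_size: int) -> list:
    """Round-robin split: rank r processes items r, r + world_size, ..."""
    return items[rank::world_size]


def shard_filename(stem: str, rank: int, world_size: int) -> str:
    # Matches the rankNNNNN-of-NNNNN naming pattern shown above.
    return f"{stem}.rank{rank:05d}-of-{world_size:05d}.jsonl"
```

For example, rank 3 of 16 would write `prepared/train_with_codes.rank00003-of-00016.jsonl`, and the union of all shards covers the dataset exactly once.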
During training, `sft.py` can read:

- a single JSONL file
- a directory
- a glob such as `prepared/train_with_codes.rank*.jsonl`
- a comma-separated list of files
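Resolving such an input spec could look like the following sketch (illustrating the behavior, not `sft.py`'s actual implementation):

```python
import glob
from pathlib import Path


def resolve_train_inputs(spec: str) -> list[str]:
    """Expand a file / directory / glob / comma-separated spec into JSONL paths."""
    files: list[str] = []
    for part in spec.split(","):
        part = part.strip()
        p = Path(part)
        if p.is_dir():
            # A directory contributes all of its .jsonl files, sorted.
            files.extend(str(f) for f in sorted(p.glob("*.jsonl")))
        elif any(ch in part for ch in "*?["):
            # A glob pattern is expanded and sorted for determinism.
            files.extend(sorted(glob.glob(part)))
        else:
            files.append(part)
    return files
```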
If your platform already injects distributed communication environment variables, accelerate launch will reuse them directly, so you usually do not need to write torchrun-style communication arguments yourself.
```bash
accelerate launch moss_tts_delay/finetuning/sft.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --train-jsonl train_with_codes.jsonl \
  --output-dir output/moss_tts_sft \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 8 \
  --learning-rate 1e-5 \
  --warmup-ratio 0.03 \
  --num-epochs 3 \
  --mixed-precision bf16 \
  --channelwise-loss-weight 1,32 \
  --gradient-checkpointing
```

For single-node 8-GPU data parallel training, you can use:
```bash
accelerate launch \
  --config_file moss_tts_delay/finetuning/configs/accelerate_ddp_8gpu.yaml \
  moss_tts_delay/finetuning/sft.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
  --output-dir output/moss_tts_sft_ddp \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 4 \
  --mixed-precision bf16 \
  --channelwise-loss-weight 1,32 \
  --gradient-checkpointing
```

For the 8B MOSS-TTS model, the following approaches are recommended over naive single-GPU training:
- FSDP: shard parameters, gradients, and optimizer states across ranks
- DeepSpeed ZeRO-3: fully shard parameters, gradients, and optimizer states; better suited for larger models and multi-node setups
FSDP launch:

```bash
accelerate launch \
  --config_file moss_tts_delay/finetuning/configs/accelerate_fsdp_8b.yaml \
  moss_tts_delay/finetuning/sft.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
  --output-dir output/moss_tts_sft_fsdp \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 4 \
  --mixed-precision bf16 \
  --channelwise-loss-weight 1,32 \
  --gradient-checkpointing
```

DeepSpeed ZeRO-3 launch:

```bash
accelerate launch \
  --config_file moss_tts_delay/finetuning/configs/accelerate_zero3_8b.yaml \
  moss_tts_delay/finetuning/sft.py \
  --model-path OpenMOSS-Team/MOSS-TTS \
  --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
  --output-dir output/moss_tts_sft_zero3 \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 4 \
  --mixed-precision bf16 \
  --channelwise-loss-weight 1,32 \
  --gradient-checkpointing
```

ZeRO-3 requires the `deepspeed` package. If you only use DDP or FSDP, you do not need it.
`sft.py` now exposes the common training hyperparameters directly:

- optimizer: `--learning-rate`, `--weight-decay`, `--adam-beta1`, `--adam-beta2`, `--adam-eps`
- LR schedule: `--lr-scheduler-type`, `--warmup-steps`, `--warmup-ratio`
- stability: `--max-grad-norm`, `--gradient-checkpointing`, `--mixed-precision`
- RVQ multi-head loss weighting: `--channelwise-loss-weight`

`--channelwise-loss-weight` supports two forms:

- `n_vq + 1` values: `text_head,vq0,...,vqN`
- two values: `text_weight,total_audio_weight`

The default is `1,32`, which means the text head and each individual audio head have equal weight.
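A sketch of how the two-value form could expand into per-head weights (this mirrors the description above, not the actual `sft.py` code; `n_vq` is the number of audio heads):

```python
def expand_loss_weights(spec: str, n_vq: int) -> list[float]:
    """Return [text_head, vq0, ..., vqN] weights from a --channelwise-loss-weight spec."""
    values = [float(v) for v in spec.split(",")]
    if len(values) == n_vq + 1:
        return values  # explicit per-head form
    if len(values) == 2:
        text_weight, total_audio_weight = values
        # Spread the total audio weight evenly over the n_vq audio heads.
        return [text_weight] + [total_audio_weight / n_vq] * n_vq
    raise ValueError(f"expected 2 or {n_vq + 1} values, got {len(values)}")
```

With the default `1,32` and 32 audio heads, every head (text and each VQ head) ends up with weight 1.0, matching the statement above.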
Training logs now print:
- timestamped log prefixes
- `global_batch_size` and its formula
- `step_time`, `steps_per_sec`, `samples_per_sec`, `eta`
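The `global_batch_size` formula is the usual data-parallel product (a sketch with hypothetical values, matching the flags used in the launch commands):

```python
def global_batch_size(per_device_batch_size: int,
                      gradient_accumulation_steps: int,
                      num_processes: int) -> int:
    # Samples consumed per optimizer step across all ranks.
    return per_device_batch_size * gradient_accumulation_steps * num_processes
```

For example, `--per-device-batch-size 1` with `--gradient-accumulation-steps 4` on 8 GPUs yields a global batch size of 32.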
Update the following fields in the config file for your cluster:
- `num_machines`
- `num_processes`
- `machine_rank`
- `main_process_ip`
- `main_process_port`

For example, for 2 nodes and 16 GPUs:

- node 0: `machine_rank: 0`
- node 1: `machine_rank: 1`
- on both nodes: `num_machines: 2`, `num_processes: 16`
The training command itself can stay unchanged.
Each checkpoint saved by `sft.py` contains the model config, runtime Python files, tokenizer files, and processor metadata, so you can call `from_pretrained` directly on that checkpoint directory:
```python
from pathlib import Path
import importlib.util

import torch
import torchaudio
from transformers import AutoProcessor

from moss_tts_delay.modeling_moss_tts import MossTTSDelayModel

torch.backends.cuda.enable_cudnn_sdp(False)
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)


def resolve_attn_implementation(device: str, dtype: torch.dtype) -> str:
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"
    if device == "cuda":
        return "sdpa"
    return "eager"


model_path = "output/moss_tts_sft/checkpoint-epoch-2"
reference_audio = "./assets/audio/reference_en_0.mp3"
text = "This is a quick finetuning smoke test for MOSS-TTS."

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
attn_implementation = resolve_attn_implementation(device, dtype)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

model = MossTTSDelayModel.from_pretrained(
    model_path,
    torch_dtype=dtype,
    attn_implementation=attn_implementation,
).to(device)
model.eval()

conversation = [[
    processor.build_user_message(
        text=text,
        reference=[reference_audio],
    )
]]

batch = processor(conversation, mode="generation")
outputs = model.generate(
    input_ids=batch["input_ids"].to(device),
    attention_mask=batch["attention_mask"].to(device),
    max_new_tokens=4096,
)

message = processor.decode(outputs)[0]
audio = message.audio_codes_list[0]
Path("demo_outputs").mkdir(parents=True, exist_ok=True)
torchaudio.save("demo_outputs/finetuned_sample.wav", audio.unsqueeze(0), processor.model_config.sampling_rate)
```

Run directly:

```bash
bash moss_tts_delay/finetuning/run_train.sh
```

Common environment variables:
- `RAW_JSONL`: raw training JSONL
- `PREPARED_JSONL`: output file from `prepare_data.py`
- `TRAIN_JSONL`: optional; training input, which can be a single file, directory, or glob. If unset, it is inferred automatically from `PREPARED_JSONL`
- `OUTPUT_DIR`: training output directory
- `ACCELERATE_CONFIG_FILE`: optional; DDP / FSDP / ZeRO-3 config file
- `SKIP_PREPARE`: set to `1` to skip preprocessing and train directly from an existing `TRAIN_JSONL` / `PREPARED_JSONL`
- `PREP_EXTRA_ARGS_STR`: extra arguments passed to `prepare_data.py`
- `PREP_ACCELERATE_ARGS_STR`: if you want preprocessing to also launch through `accelerate`, set this, for example `--num_processes 16` or `--config_file moss_tts_delay/finetuning/configs/accelerate_ddp_8gpu.yaml`
- `TRAIN_EXTRA_ARGS_STR`: extra arguments passed to `sft.py`
For example, to launch with ZeRO-3:
```bash
RAW_JSONL=train_raw.jsonl \
PREPARED_JSONL=prepared/train_with_codes.jsonl \
OUTPUT_DIR=output/moss_tts_sft_zero3 \
ACCELERATE_CONFIG_FILE=moss_tts_delay/finetuning/configs/accelerate_zero3_8b.yaml \
PREP_ACCELERATE_ARGS_STR='--config_file moss_tts_delay/finetuning/configs/accelerate_ddp_8gpu.yaml' \
PREP_EXTRA_ARGS_STR='' \
TRAIN_EXTRA_ARGS_STR='--per-device-batch-size 1 --gradient-accumulation-steps 4 --num-epochs 3 --warmup-ratio 0.03 --mixed-precision bf16 --channelwise-loss-weight 1,32 --gradient-checkpointing' \
bash moss_tts_delay/finetuning/run_train.sh
```

The remaining tasks do not require a separate trainer; you only need to switch the JSONL fields:
- MOSS-TTS: use `text`, optionally `ref_audio`
- MOSS-TTSD: use `text` + `reference`, where `reference` supports `null`
- MOSS-SoundEffect: use `ambient_sound`
- MOSS-VoiceGenerator: use `text` + `instruction`

Shared fields:

- `audio`: required target audio
- `language`, `tokens`, `quality`, `sound_event`, `ambient_sound`, `instruction`: fill them in as needed by the task
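As a sanity check, the per-task records above can be emitted with plain `json.dumps` (field names from this guide; all paths and text are placeholders):

```python
import json

records = [
    # MOSS-TTS: text, optionally ref_audio
    {"audio": "./data/utt0001.wav", "text": "Hello there.", "ref_audio": "./data/ref.wav", "language": "en"},
    # MOSS-TTSD: reference may be a multi-speaker list with null entries
    {"audio": "./data/dialog.wav", "text": "[S1] Hi. [S2] Hello.", "reference": ["./data/s1_ref.wav", None]},
    # MOSS-SoundEffect: ambient_sound only
    {"audio": "./data/rain.wav", "ambient_sound": "Rolling thunder with steady rainfall."},
    # MOSS-VoiceGenerator: text + instruction
    {"audio": "./data/old_man.wav", "text": "My back hurts.", "instruction": "Tired elderly voice."},
]

with open("train_raw.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        # Python None serializes to JSON null, which prepare_data.py preserves.
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```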
Shared scripts:
- use `prepare_data.py` for data preparation
- use `sft.py` for training
- `--train-jsonl` supports a single file, directory, glob, or multi-file list