I revisit my earlier critique of open-source text-to-speech and explain why the landscape has changed with Qwen3-TTS: open weights, strong inference via vLLM-Omni, voice cloning that actually works, and fewer artifacts on long passages. I share notes on fine-tuning a 0.6B base model with a small custom dataset, show sample outputs, and flag remaining caveats such as a fine-tuning acceleration bug and limited style guidance. My conclusion: TTS finally doesn’t suck anymore.
Relevant links:
- Original article
- Earlier rant: open source TTS still sucked
- Qwen (AI lab)
- Qwen3-TTS (GitHub)
- Voxtral (Mistral)
- Reddit: Missing piece of Voxtral TTS to enable voice cloning
- vLLM-Omni docs
- Fish-Speech license (beware)
- Speech Arena: Open Weights TTS leaderboard
- Artificial Analysis: Qwen model family entries
- Qwen3-TTS fine-tuning acceleration bug (PR)
- Qwen3-TTS base models lack "style-guidance"
- Fine-tuning dataset (Hugging Face)
- MacWhisper
- Parakeet TDT 0.6B v2 (Hugging Face)
- Fine-tuning code tweaks
- Scaleway H100
- podcaster package (GitHub)
- Modal
- Coolify
- Audio sample: Chatterbox (original)
- Fine-tuned Qwen3-TTS 0.6B model (Hugging Face)
- Audio sample: Qwen3-TTS fine-tune
- OmniVoice (GitHub)
- Step-Audio-EditX (GitHub)