Duarte O.Carmo's articles

#67 TTS doesn't suck anymore


I revisit my earlier critique of open-source text-to-speech and explain why the landscape has changed with Qwen3-TTS: open weights, strong inference via vLLM-Omni, voice cloning that actually works, and fewer artifacts on long passages. I share notes on fine-tuning a 0.6B base model with a small custom dataset, show sample outputs, and flag the remaining caveats, such as a fine-tuning bug and limited style guidance. My conclusion: TTS finally doesn't suck anymore.

Relevant links:

  • Original article
  • Earlier rant: open source TTS still sucked
  • Qwen (AI lab)
  • Qwen3-TTS (GitHub)
  • Voxtral (Mistral)
  • Reddit: Missing piece of Voxtral TTS to enable voice cloning
  • vLLM-Omni docs
  • Fish-Speech license (beware)
  • Speech Arena: Open Weights TTS leaderboard
  • Artificial Analysis: Qwen model family entries
  • Qwen3-TTS fine-tuning acceleration bug (PR)
  • Qwen3-TTS base models lack "style-guidance"
  • Fine-tuning dataset (Hugging Face)
  • MacWhisper
  • Parakeet TDT 0.6B v2 (Hugging Face)
  • Fine-tuning code tweaks
  • Scaleway H100
  • podcaster package (GitHub)
  • Modal
  • Coolify
  • Audio sample: Chatterbox (original)
  • Fine-tuned Qwen3-TTS 0.6B model (Hugging Face)
  • Audio sample: Qwen3-TTS fine-tune
  • OmniVoice (GitHub)
  • Step-Audio-EditX (GitHub)

By Duarte O.Carmo