May 25, 2026

#67 TTS doesn't suck anymore

I revisit my earlier critique of open-source text-to-speech and explain why the landscape has changed with Qwen3-TTS: open weights, strong inference via vLLM-Omni, voice cloning that actually works, and fewer artifacts on long passages. I share notes on fine-tuning a 0.6B base model with a small custom dataset, show sample outputs, and flag remaining caveats such as a fine-tuning bug and limited style guidance. My conclusion: TTS finally doesn’t suck anymore.

Relevant links:

Original article

Earlier rant: open source TTS still sucked

Qwen (AI lab)

Qwen3-TTS (GitHub)

Voxtral (Mistral)

Reddit: Missing piece of Voxtral TTS to enable voice cloning

vLLM-Omni docs

Fish-Speech license (beware)

Speech Arena: Open Weights TTS leaderboard

Artificial Analysis: Qwen model family entries

Qwen3-TTS fine-tuning acceleration bug (PR)

Qwen3-TTS base models lack "style-guidance"

Fine-tuning dataset (Hugging Face)

MacWhisper

Parakeet TDT 0.6B v2 (Hugging Face)

Fine-tuning code tweaks

Scaleway H100

podcaster package (GitHub)

Modal

Coolify

Audio sample: Chatterbox (original)

Fine-tuned Qwen3-TTS 0.6B model (Hugging Face)

Audio sample: Qwen3-TTS fine-tune

OmniVoice (GitHub)

Step-Audio-EditX (GitHub)

By Duarte O.Carmo