
How do you teach a sophisticated speech AI to understand and discuss images, especially when paired image-speech data is rare?
This episode unpacks MoshiVis, a new model that achieves just that. We explore the challenges of building Vision-Speech Models and how MoshiVis overcomes them with a unique one-stage training pipeline, synthetic dialogues, and efficient "perceptual augmentation" techniques built upon the Moshi speech LLM.
Join us for a deep dive into the tech that lets AI see, speak, and converse fluidly about the visual world.