Moshi AI
Moshi AI is a speech-native conversational model from Kyutai, a Paris-based open-science research lab. Instead of chaining speech recognition, text generation, and text-to-speech, Moshi processes audio directly and holds full-duplex voice conversations with low latency, around 200 ms in practice.
Its multi-stream design runs separate channels for the user's speech, Moshi's spoken output, and an Inner Monologue text stream that improves coherence. That setup lets Moshi listen and talk at the same time, so it can handle overlaps, interruptions, and backchanneling as they occur in a real conversation instead of enforcing rigid speaker turns.
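To make the multi-stream idea concrete, here is a minimal Python sketch of how the three channels could be laid out side by side, one frame per time step. It is an illustration only, not Kyutai's implementation: the `Frame` class, its field names, the token values, and the `overlapping_frames` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    """One time step of a hypothetical multi-stream conversation."""
    user_audio: List[int]      # audio tokens heard from the user (empty = silence)
    moshi_audio: List[int]     # audio tokens Moshi speaks in the same step
    moshi_text: Optional[str]  # Inner Monologue text aligned with Moshi's speech


def overlapping_frames(frames: List[Frame]) -> List[int]:
    """Return the indices of frames in which both sides speak at once."""
    return [i for i, f in enumerate(frames) if f.user_audio and f.moshi_audio]


# Made-up token values; every frame carries all three streams in parallel.
conversation = [
    Frame(user_audio=[12, 7], moshi_audio=[], moshi_text=None),          # user talks
    Frame(user_audio=[3, 9], moshi_audio=[41, 5], moshi_text="mm-hm"),   # backchannel
    Frame(user_audio=[], moshi_audio=[8, 22], moshi_text="Sure"),        # Moshi talks
]

print(overlapping_frames(conversation))  # prints [1]
```

Because every frame holds both audio streams, overlap is just a frame where neither side is silent. A turn-based pipeline has no equivalent slot: it must wait for one speaker to finish before the other starts.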
Moshi is built on Helium, a 7B language model, and Mimi, Kyutai's neural audio codec. Weights and inference code ship for PyTorch, Rust, and MLX, and you can try it in the browser at moshi-chat.kyutai.org. Researchers, voice AI developers, and anyone building real-time spoken interfaces will find the most value here.
Key features
Processes speech directly without a text pipeline in the middle
Listens and talks simultaneously with overlap and interruption support
Inner Monologue text stream improves speech quality and reasoning
Runs in real time on an Nvidia L4 GPU or an M3 MacBook Pro, with audio handled by the Mimi codec
Open weights on Hugging Face with PyTorch, Rust, and MLX inference code
Pros
First open full-duplex speech-to-speech model with publicly released weights and code
Latency of around 200 ms in practice, helped by the Mimi codec running at 12.5 Hz
Handles natural conversation dynamics like interruptions and backchanneling
Runs locally on hardware such as an M3 MacBook Pro or an Nvidia L4 GPU
Cons
Browser demo caps conversations at five minutes per session
Experimental status means responses can be unreliable or nonsensical
No managed cloud API; self-hosting requires capable local hardware, such as an Nvidia GPU or an Apple Silicon Mac
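The figures in the lists above can be related with a little arithmetic. The sketch below assumes that "12.5 Hz" is the rate at which the Mimi codec emits frames; the function names and constants are illustrative, not part of any Kyutai API.

```python
MIMI_FRAME_RATE_HZ = 12.5   # codec rate quoted above (assumed to mean frames per second)
PRACTICAL_LATENCY_MS = 200  # approximate latency quoted above
DEMO_CAP_SECONDS = 5 * 60   # five-minute browser demo limit


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of a single codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_in(seconds: float, frame_rate_hz: float) -> int:
    """Number of whole codec frames produced in the given time span."""
    return int(seconds * frame_rate_hz)


frame_ms = frame_duration_ms(MIMI_FRAME_RATE_HZ)
print(frame_ms)                                    # prints 80.0
print(PRACTICAL_LATENCY_MS / frame_ms)             # prints 2.5
print(frames_in(DEMO_CAP_SECONDS, MIMI_FRAME_RATE_HZ))  # prints 3750
```

Under that assumption, each frame covers 80 ms of audio, the quoted latency is about two and a half frames, and a full five-minute demo session spans 3,750 frames per audio stream.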
Is Moshi AI free to use?
Yes. Moshi AI is open source, with model weights and inference code released on GitHub and Hugging Face. The online demo at moshi-chat.kyutai.org is free to try, with conversations capped at five minutes per session.
Who developed Moshi AI?
Moshi AI was developed by Kyutai, a nonprofit open-science AI research lab based in Paris. Kyutai is funded by Iliad Group, CMA CGM Group, and Schmidt Sciences.
How is Moshi AI different from typical voice assistants?
Most voice assistants use turn-based pipelines that convert speech to text, generate a reply, then synthesize audio. Moshi AI is speech-native: it generates audio tokens directly and supports full-duplex dialogue where both sides can speak at once.
Can I run Moshi AI locally?
Yes. Kyutai released Moshi model weights along with streaming inference code in PyTorch, Rust, and MLX. The release blog notes real-time performance on an Nvidia L4 GPU or an M3 MacBook Pro.
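As a rough guide to choosing among the three implementations, the sketch below maps a machine to a backend. The `suggest_backend` helper and its rules are assumptions made for illustration: MLX is Apple's framework for Apple Silicon, and pairing PyTorch with Nvidia GPUs follows the hardware mentioned in the release blog. The Rust implementation is left out because this page does not say which hardware it targets.

```python
import platform
from typing import Optional


def suggest_backend(system: str, machine: str, has_nvidia_gpu: bool) -> Optional[str]:
    """Hypothetical helper: pick an inference backend for the given machine.

    Returns "mlx" on Apple Silicon Macs, "pytorch" when an Nvidia GPU is
    present, and None when neither case applies.
    """
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    if has_nvidia_gpu:
        return "pytorch"
    return None


# Inspect the current machine; GPU detection is left to the caller.
print(suggest_backend(platform.system(), platform.machine(), has_nvidia_gpu=False))
```

Treat the result as a starting point only, and check Kyutai's repository for the actual installation steps and hardware requirements of each backend.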
Does Moshi AI support images?
Through MoshiVis, yes. MoshiVis extends Moshi so it can discuss images in real time while keeping the same low-latency conversation flow. A separate demo is available at vis.moshi.chat, with weights and code on GitHub.
What are the demo limitations on moshi-chat.kyutai.org?
The Moshi AI browser demo is experimental and limits each conversation to five minutes. Kyutai notes that Chrome provides the best experience, and users should treat generated responses with caution.

