Meet Pocket TTS: Real-Time Voice AI That Runs on a Laptop
Most AI voice models demand expensive GPUs and cloud APIs to generate speech. Not ideal if you’re building a voice assistant or just want to clone your voice without burning through compute credits.
Kyutai just released Pocket TTS, a text-to-speech model so small (100 million parameters) it runs faster than real-time on your CPU — no fancy GPU needed.
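For readers wondering what "faster than real-time" means in practice: speech systems are commonly described by a real-time factor, which is synthesis time divided by the length of the audio produced. Anything below 1.0 means the audio is ready before it would finish playing. Here is a minimal sketch of that arithmetic. The function name and the timings are made up for illustration; they are not Pocket TTS measurements.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical example: 2 seconds of compute to produce a 10-second clip.
rtf = real_time_factor(synthesis_seconds=2.0, audio_seconds=10.0)
print(f"real-time factor: {rtf}")           # 0.2
print(f"faster than real-time: {rtf < 1}")  # True
```

To measure this yourself, you would time whichever synthesis call you are using and divide by the length of the clip it returns.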
The model delivers high-quality voice cloning from just 5 seconds of audio: give it a short sample of someone’s voice, and it’ll reproduce their tone, accent, emotion, and even the room acoustics and microphone quality.
It’s kinda like how your nephew can do a perfect impression of that one annoying TikTok video he plays on repeat, except now you can do it too. Did anyone else’s extended family ban the phrase “6-7” after last year’s Thanksgiving?
The numbers speak for themselves
- Best-in-class accuracy: Lowest Word Error Rate (1.84%) among competitors — including models 7x larger.
- Truly portable: Runs on Apple M3 or Intel Core Ultra CPUs without dedicated graphics.
- Open everything: Fully open-source under MIT license with full training code and 88k hours of public data.
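A quick aside on the Word Error Rate figure in the list above. WER is the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. For text-to-speech, the transcript typically comes from running a speech recognizer on the generated audio, though the article does not say how Kyutai measured it. The sketch below shows the general metric only; the function name and the sample sentences are illustrative assumptions, not Kyutai's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one wrong word out of nine.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 11.1%
```

By this measure, a WER of 1.84% works out to fewer than two word errors per hundred words of reference text.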
The breakthrough comes from Continuous Audio Language Models (CALM), a new framework that predicts audio directly rather than first converting it into discrete tokens. This eliminates the computational bottleneck that made previous TTS models GPU-dependent.
Why this matters
Voice AI just became accessible to any developer (or even you) with a laptop. No more need for an expensive ElevenLabs subscription, tho don’t cry for them: they just hit $330 million in ARR (annual recurring revenue).
What you can do today that was impossible yesterday:
- A solo game developer can add 50 unique character voices without hiring a single actor or paying for cloud API calls.
- Someone with ALS can bank their voice on a laptop before it deteriorates, keeping their identity in a private file they control.
- A language teacher can create pronunciation guides in their own voice for 200 vocabulary words in an afternoon.
The privacy angle matters most. Until now, voice cloning meant sending audio to someone else’s servers. Medical dictation, legal depositions, confidential business communications: all required trusting a third party. Now? Your voice never leaves your machine.
Developers can start using Pocket TTS immediately; if you wanna try it yourself, the full technical report from Kyutai includes setup instructions and voice samples.
Editor’s note: This content originally ran in the newsletter of our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.
The post Meet Pocket TTS: Real-Time Voice AI That Runs on a Laptop appeared first on eWEEK.