Text to Speech
Real neural voices that run inside your browser. Powered by Supertonic 3 from Supertone (HYBE), with no audio uploads and no API keys. 10 voices × speed and tone presets × expression tags.
What is this?
A text-to-speech tool that generates real neural voice audio without sending your text anywhere. The voice model is Supertonic 3 by Supertone (a HYBE company), released as ONNX weights and run inside your browser via @huggingface/transformers. The first time you use it, the model downloads (about 50 MB) and is cached locally; later visits load it from the cache instead of downloading it again.
How to use it
- Type or paste your text. Up to a few thousand characters works fine on a modern device.
- Pick a voice from the grid. Filter by Female / Male if you want.
- Pick a tone preset (Audiobook, News, Podcast, Kids, Meditation, Energetic) or fine-tune speed and quality steps manually.
- Click Generate. The audio plays automatically and a player appears. Use Download WAV or Download MP3 to save it.
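For the curious, here is a minimal sketch of how raw audio samples can be wrapped into a downloadable WAV file. It assumes mono float samples in the range -1 to 1 and 16-bit PCM output; the function name `encodeWav` is hypothetical, and this is not necessarily the encoder this tool uses.

```typescript
// Encode mono float samples (range -1..1) as a 16-bit PCM WAV file.
// Hypothetical helper for illustration; the sample rate is supplied by the caller.
function encodeWav(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte header + sample data
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp out-of-range values so they do not wrap around when converted.
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    view.setInt16(44 + i * bytesPerSample, Math.round(clamped * 32767), true);
  }
  return new Uint8Array(buffer);
}
```

In a browser, the returned bytes could be wrapped in a `Blob` with type `audio/wav` and offered as a download link.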
Voices
Ten neural voices, five female and five male, each with a distinct character. Combined with the seven tone presets, that gives 10 × 7 = 70 voice-and-tone combinations. Every voice speaks 31 languages out of the box, so the same voice, for example M3 (Hugo), can read English, Hindi, Japanese, Spanish, French, and more.
Expression tags
Supertonic 3 was trained to recognize three inline tags and synthesize them as actual non-verbal sounds:
- <laugh>: light laughter
- <breath>: an inhale or breath pause
- <sigh>: an exhale
Drop them inline in your text. Example: That's hilarious <laugh> let me try again <sigh>. Other tags like <cough> or <cry> are NOT in the model's training set and will be read aloud letter by letter.
Used sparingly, they sound very human. Used on every line, they sound forced.
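If you prepare scripts in bulk, it can help to check them for unsupported tags before generating. The sketch below is one way to do that; the function name `findUnsupportedTags` is hypothetical, and it only looks at lowercase tags written exactly like the three supported ones.

```typescript
// The three inline tags the model was trained on, per the section above.
const SUPPORTED_TAGS = new Set<string>(["laugh", "breath", "sigh"]);

// Return the names of lowercase <tag> markers that are not supported,
// in order of first appearance and without duplicates.
// Hypothetical helper for illustration; not part of the tool itself.
function findUnsupportedTags(text: string): string[] {
  const unsupported: string[] = [];
  const pattern = /<([a-z]+)>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const name = match[1] ?? "";
    if (!SUPPORTED_TAGS.has(name) && unsupported.indexOf(name) === -1) {
      unsupported.push(name);
    }
  }
  return unsupported;
}
```

An empty result means every tag in the text is one of the three supported ones; anything returned would be read aloud rather than synthesized as a sound.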
Privacy and how it works
Your text never leaves your device. The neural voice model runs locally in your browser using either WebGPU (modern desktops, recent iPhones, recent Android) or WASM (CPU fallback). The model weights are fetched from Hugging Face's CDN on first use and cached in your browser's IndexedDB store after that, so subsequent visits work offline.
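The WebGPU-or-WASM choice can be sketched roughly as below. This is an illustration under assumptions, not this tool's actual code: `chooseBackend` and `NavigatorLike` are hypothetical names, and the check only looks at whether the browser exposes `navigator.gpu`.

```typescript
type Backend = "webgpu" | "wasm";

// Only the part of `navigator` that this check reads.
interface NavigatorLike {
  gpu?: unknown;
}

// Prefer WebGPU when the browser exposes `navigator.gpu`, otherwise use WASM.
// A stricter check would also await `navigator.gpu.requestAdapter()` and fall
// back to WASM when it resolves to null.
function chooseBackend(nav: NavigatorLike | undefined): Backend {
  const hasGpu = nav !== undefined && nav.gpu !== undefined && nav.gpu !== null;
  return hasGpu ? "webgpu" : "wasm";
}
```

In a page, you would pass the global `navigator` object; taking it as a parameter keeps the function easy to try outside a browser.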
Use policy
Supertonic 3 is licensed under OpenRAIL-M, which permits commercial use but prohibits certain misuse cases. By using this tool you agree not to:
- Impersonate a real person without consent
- Generate audio for harassment, threats, or hate
- Generate political disinformation or audio designed to deceive about its origin
- Generate sexual content involving minors
- Otherwise violate the underlying Supertonic 3 license
FAQ
Why is the first generation slow?
The voice model is about 50 MB. The first time you click Generate, your browser downloads it once and caches it. After that, every generation runs from local cache.
Can I clone a custom voice?
Voice cloning is gated behind Supertone's hosted Voice Builder web app and not part of this free tool. The 10 stock voices plus 7 tone presets give 70 effective combinations, which covers most practical use cases.
Is the audio output really mine to use?
Yes, with the use-policy caveats above. Personal projects, podcasts, demos, language practice, accessibility narration, and app prototypes are all fine. Just don't use it to impersonate a real person.
Why does my mobile struggle?
Older phones may run out of memory or generate audio slowly. Use fewer quality steps (8-12 instead of 16-24) and avoid very long passages. Recent iPhones and Pixels handle it fine.
Why are some words skipped or read too fast?
This is a known limitation of Supertonic 3 (tracked in the upstream GitHub issues). The diffusion-based synthesis is stochastic, and short words occasionally come out very quiet or sped through. Three things help:
- Click the Re-roll button to try a different seed. About 30 percent of seeds produce clean audio.
- Keep the Anti-skip setting on. It pads short words with commas to give them more duration budget.
- Slow the speed down to 0.90 or lower for tricky passages.
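To make the comma-padding idea concrete, here is a rough sketch of what such a preprocessing step might look like. The real Anti-skip logic is not documented here, so everything in this block is an assumption: the name `padShortWords`, the two-letter threshold, and the rule of skipping words that already carry punctuation.

```typescript
// Append a comma to short, purely alphabetic words so the synthesizer
// gives them a little more time. Words that already contain punctuation
// or digits are left untouched.
// Hypothetical sketch; the threshold and rules are assumptions.
function padShortWords(text: string, maxLen: number = 2): string {
  return text
    .split(" ")
    .map((word) =>
      /^[A-Za-z]+$/.test(word) && word.length <= maxLen ? word + "," : word
    )
    .join(" ");
}
```

For example, with the default threshold, `padShortWords("Go to the store.")` returns `"Go, to, the store."`.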