Convert text or subtitle files into speech audio with options for voice cloning, emotion control, speed, and timeline-accurate dubbing using Kokoro or Noiz b...
Convert any text into speech audio. Supports two backends (Kokoro local, Noiz cloud), two modes (simple or timeline-accurate), and per-segment voice control.
speak is the default — the subcommand can be omitted:
# Basic usage (speak is implicit)
python3 skills/tts/scripts/tts.py -t "Hello world" # add -o path to save
python3 skills/tts/scripts/tts.py -f article.txt -o out.mp3
# Voice cloning — local file path or URL
python3 skills/tts/scripts/tts.py -t "Hello" --ref-audio ./ref.wav
python3 skills/tts/scripts/tts.py -t "Hello" --ref-audio https://example.com/my_voice.wav -o clone.wav
# Voice message format
python3 skills/tts/scripts/tts.py -t "Hello" --format opus -o voice.opus
python3 skills/tts/scripts/tts.py -t "Hello" --format ogg -o voice.ogg
Third-party integration (Feishu/Telegram/Discord) is documented in ref_3rd_party.md.
Timeline mode gives precise per-segment timing (dubbing, subtitles, video narration) and takes an SRT file as input.
If the user doesn't have an SRT file, generate one from text:
python3 skills/tts/scripts/tts.py to-srt -i article.txt -o article.srt
python3 skills/tts/scripts/tts.py to-srt -i article.txt -o article.srt --cps 15 --gap 500
--cps = characters per second (default 4, good for Chinese; ~15 for English). The agent can also write SRT manually.
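To make the --cps arithmetic concrete, here is a minimal Python sketch of how cue timings could be derived from a characters-per-second rate. It is not the to-srt implementation: the function names are hypothetical, and treating --gap as a pause in milliseconds between cues is an assumption.

```python
def fmt_ts(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def sentences_to_srt(sentences, cps=4.0, gap_ms=0):
    """Build SRT text where each cue lasts len(sentence) / cps seconds."""
    blocks, cursor = [], 0
    for i, text in enumerate(sentences, start=1):
        duration = round(len(text) / cps * 1000)
        start, end = cursor, cursor + duration
        blocks.append(f"{i}\n{fmt_ts(start)} --> {fmt_ts(end)}\n{text}\n")
        # Assumption: the gap is a pause (in ms) before the next cue starts.
        cursor = end + gap_ms
    return "\n".join(blocks)
```

With cps=15, a 30-character English sentence gets a 2-second cue; with the default of 4, the same sentence would be stretched to 7.5 seconds, which is why the higher value suits English.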
The voice map is a JSON file that sets default and per-segment voice settings. Keys under segments are either a single index ("3") or a range ("5-8").
Kokoro voice map:
{
"default": { "voice": "zf_xiaoni", "lang": "cmn" },
"segments": {
"1": { "voice": "zm_yunxi" },
"5-8": { "voice": "af_sarah", "lang": "en-us", "speed": 0.9 }
}
}
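One way to read a voice map like the one above is sketched below in Python. The helper name resolve_voice is hypothetical, and the merge behaviour (segment entries override the default, ranges are inclusive, indexes match SRT cue numbers) is an assumption rather than documented behaviour of tts.py.

```python
def resolve_voice(voice_map: dict, index: int) -> dict:
    """Merge the default settings with any segment entry covering `index`."""
    settings = dict(voice_map.get("default", {}))
    for key, override in voice_map.get("segments", {}).items():
        # "3" -> (3, 3); "5-8" -> (5, 8). Ranges are assumed inclusive.
        start, _, end = key.partition("-")
        lo, hi = int(start), int(end or start)
        if lo <= index <= hi:
            settings.update(override)
    return settings
```

Under these assumptions, segment 6 of the Kokoro map would resolve to af_sarah in en-us at speed 0.9, while segment 3 would keep the default voice.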
Noiz voice map (adds emo, reference_audio support). reference_audio can be a local path or a URL (user’s own audio; Noiz only):
{
"default": { "voice_id": "voice_123", "target_lang": "zh" },
"segments": {
"1": { "voice_id": "voice_host", "emo": { "Joy": 0.6 } },
"2-4": { "reference_audio": "./refs/guest.wav" }
}
}
Dynamic Reference Audio Slicing:
When translating or dubbing a video, pass --ref-audio-track instead of setting reference_audio in the map. Each segment then uses the original video's audio at the same timestamps as its reference audio:
python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json --ref-audio-track original_video.mp4 -o output.wav
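The idea behind per-segment slicing can be illustrated with a short Python sketch that turns one cue's SRT timestamps into an ffmpeg invocation. This is only an illustration of the technique, not how tts.py does it internally; the function names and output path are hypothetical.

```python
import re

TS = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def srt_ts_to_seconds(ts: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    h, m, s, ms = map(int, TS.fullmatch(ts.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def slice_command(source: str, start: str, end: str, out_path: str) -> list[str]:
    """ffmpeg arguments that extract the audio between two SRT timestamps."""
    begin = srt_ts_to_seconds(start)
    length = srt_ts_to_seconds(end) - begin
    return [
        "ffmpeg", "-y",
        "-ss", f"{begin:.3f}",   # seek to the cue start
        "-t", f"{length:.3f}",   # read only the cue's duration
        "-i", source,
        "-vn",                   # drop the video stream
        out_path,
    ]
```

The returned list can be passed to subprocess.run to produce one reference clip per cue.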
See examples/ for full samples.
python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json -o output.wav
python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json --backend noiz --auto-emotion -o output.wav
| Need | Recommended |
|---|---|
| Just read text aloud, no fuss | Kokoro (default) |
| EPUB/PDF audiobook with chapters | Kokoro (native support) |
| Voice blending ("v1:60,v2:40") | Kokoro |
| Voice cloning from reference audio | Noiz |
| Emotion control (emo param) | Noiz |
| Exact server-side duration per segment | Noiz |
When the user needs emotion control + voice cloning + precise duration together, Noiz is the only backend that supports all three.
When no API key is configured, tts.py automatically falls back to guest mode — a limited Noiz endpoint that requires no authentication. Guest mode only supports --voice-id, --speed, and --format; voice cloning, emotion, duration, and timeline rendering are not available.
# Guest mode (auto-detected when no API key is set)
python3 skills/tts/scripts/tts.py -t "Hello" --voice-id 883b6b7c -o hello.wav
# Explicit backend override to use kokoro instead
python3 skills/tts/scripts/tts.py -t "Hello" --backend kokoro
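A caller that wants to fail early, before invoking tts.py in guest mode, could check its flags against the guest-mode allowlist. The Python sketch below is hypothetical: guest_violations is not part of tts.py, and treating the input/output flags (-t, -f, -o) as always allowed is an assumption.

```python
# Flags the text above lists as supported in guest mode.
GUEST_FLAGS = {"--voice-id", "--speed", "--format"}
# Assumption: basic input/output flags are accepted in every mode.
IO_FLAGS = {"-t", "-f", "-o"}


def guest_violations(flags) -> list[str]:
    """Return the flags that guest mode would not support, sorted."""
    allowed = GUEST_FLAGS | IO_FLAGS
    return sorted(flag for flag in flags if flag not in allowed)
```

An empty result means the call should work without an API key; anything else needs a configured key or a different backend.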
Available guest voices (15 built-in):
| voice_id | name | lang | gender | tone |
|---|---|---|---|---|
| 063a4491 | 販売員(なおみ) | ja | F | Joyful |
| 4252b9c8 | 落ち着いた女性 | ja | F | Calm |
| 578b4be2 | 熱血漢(たける) | ja | M | Angry |
| a9249ce7 | 安らぎ(みなと) | ja | M | Calm |
| f00e45a1 | 旅人(かいと) | ja | M | Calm |
| b4775100 | 悦悦|社交分享 | zh | F | Joyful |
| 77e15f2c | 婉青|情绪抚慰 | zh | F | Calm |
| ac09aeb4 | 阿豪|磁性主持 | zh | M | Calm |
| 87cb2405 | 建国|知识科普 | zh | M | Calm |
| 3b9f1e27 | 小明|科技达人 | zh | M | Joyful |
| 95814add | Science Narration | en | M | Calm |
| 883b6b7c | The Mentor (Alex) | en | M | Joyful |
| a845c7de | The Naturalist (Silas) | en | M | Calm |
| 5a68d66b | The Healer (Serena) | en | F | Calm |
| 0e4ab6ec | The Mentor (Maya) | en | F | Calm |
Requirements:
- ffmpeg in PATH (timeline mode only)
- Noiz API key, set with: python3 skills/tts/scripts/tts.py config --set-api-key YOUR_KEY (guest mode works without a key but has limited features)
- Pass --backend kokoro to use the local backend
Use only the base64-encoded API key as the Authorization value, with no prefix (e.g. no APIKEY or Bearer). Any prefix causes a 401.
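For anyone calling the Noiz API directly, the Authorization rule (bare base64 key, no prefix) can be sketched in Python as follows. The helper name auth_headers is hypothetical, and the sketch assumes the key is already base64-encoded when issued; it only validates the value and does not contact any endpoint.

```python
import base64
import binascii


def auth_headers(api_key_b64: str) -> dict[str, str]:
    """Return headers carrying the bare base64 key, rejecting any prefix."""
    key = api_key_b64.strip()
    try:
        # A prefix such as "Bearer " contains a space, which is not valid base64.
        base64.b64decode(key, validate=True)
    except binascii.Error as exc:
        raise ValueError("expected a bare base64 key with no prefix") from exc
    return {"Authorization": key}
```

Rejecting prefixed values locally surfaces the mistake before the server answers with a 401.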
For backend details and full argument reference, see reference.md.