Running local text to speech using chatterbox

The open source chatterbox model provides multi-language text to speech capabilities with optional cloning of an existing reference voice. Its small footprint allows it to run locally even on moderate consumer hardware, with near real-time processing when running with GPU support. To top it all off, the MIT license makes it a viable option for both personal and commercial use cases, free of charge.

Installing chatterbox

Chatterbox provides a convenience pip package called chatterbox-tts that handles both usage and model downloads. While chatterbox version 2 has added the long-awaited multi-language support, it was also built specifically for Python 3.11 and will fail to install on newer Python versions.


To install it on, for example, Debian 13 (which ships with Python 3.13), we need an isolated venv with a Python 3.11 installation. The pyenv tool is perfect for installing the version we need:

sudo apt install pyenv
pyenv install 3.11

The new Python version is installed in the current user's home directory and does not change the system's default Python interpreter. To use it, we need to enable it explicitly for the project, then create the venv using its interpreter:

pyenv local 3.11
pyenv exec python3 -m venv venv
source venv/bin/activate

The venv/ directory should now contain a copy of Python 3.11 and related tools. You can verify this by checking the Python version and the locations the pip and python3 commands refer to:

python3 --version
type python3
type pip

If the reported version starts with "3.11" and both paths point to files inside venv/, your setup is correct.
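
On a typical setup the output looks roughly like this (the exact patch version and paths depend on your machine and project directory):

Python 3.11.9
python3 is /home/user/myproject/venv/bin/python3
pip is /home/user/myproject/venv/bin/pip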


Now prepare the venv and install chatterbox dependencies, then the library itself:

pip install --upgrade pip setuptools wheel cython
pip install numpy
pip install chatterbox-tts

And that's all you need to make chatterbox work!
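
To quickly verify that the installation succeeded, you can try importing the library. The import pulls in heavy dependencies like torch, so it may take a few seconds; the success message here is just an example marker of my choosing:

python3 -c "from chatterbox.mtl_tts import ChatterboxMultilingualTTS; print('chatterbox is ready')"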

Simple text to speech

A simple text to speech application using chatterbox will look like this:


simple_tts.py

import torchaudio as ta
import torch
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Pick the best available backend for inference.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# Load the model (downloaded automatically on the first run),
# generate speech and write the result to disk.
model = ChatterboxMultilingualTTS.from_pretrained(device=device)
wav = model.generate(
    "Baking a cake is an easy task if you follow the recipe.",
    language_id="en"
)
ta.save("output.wav", wav, model.sr)

The script is fairly straightforward: we start by importing the necessary libraries, then set the device to the best available backend for inference (cuda for NVIDIA GPUs and for AMD GPUs running ROCm, mps for Apple Silicon, and cpu as a final fallback). Note that running inference on a CPU will be significantly slower than on a GPU!


We then load the model onto the detected device, call the generate() method with the text we want to convert, and finally save the output to an audio file on disk. You can generate speech in other languages as well; just make sure to adjust the language_id parameter accordingly!


Now run the script:

python3 simple_tts.py

On the first run, the code will download the model, which may take a while. If all goes well, you should have a file output.wav next to the script file once it completes, with a voice speaking the words from your input text.
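
If you prefer checking the result from code, you can inspect the file's metadata with torchaudio, which is already available as a chatterbox dependency. A minimal sanity check, assuming output.wav sits in the current directory:

import torchaudio as ta

# Print the sample rate and duration of the generated file.
info = ta.info("output.wav")
print(f"{info.sample_rate} Hz, {info.num_frames / info.sample_rate:.1f} seconds")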

Cloning a voice

Chatterbox is also capable of cloning an existing voice and using it for text to speech tasks. You will need a short audio clip of the desired speaker's voice, preferably 5 to 10 seconds in length with no or minimal background noise.

Then simply pass the path to the audio file as the audio_prompt_path parameter to the model.generate() function:

wav = model.generate(
    "Baking a cake is an easy task if you follow the recipe.",
    language_id="en",
    audio_prompt_path="/voice/to/clone.mp3"
)

The rest of the script remains unchanged.

Configurable parameters

The behavior of speech generation using chatterbox can be controlled through three primary config values:

  • exaggeration controls how much emotional emphasis the output voice should carry. The default value is 0.5; higher values produce more intense output, lower values a more monotone delivery.

  • cfg_weight defines how closely the model should mimic the reference voice (if you are using one). The default value is 0.5; higher values are less strict but better at handling unknown words, while lower values stick closer to the cloned voice but can have trouble with some input texts.

  • temperature controls how "creative" the model is when creating outputs, affecting pacing and pronunciation. Setting it higher than the default value of 0.8 results in more varied and unexpected outputs, while lower values produce more stable results at the cost of some expressiveness.

These parameters can be passed to the model.generate() method by name:

wav = model.generate(
    "Baking a cake is an easy task if you follow the recipe.",
    language_id="en",
    audio_prompt_path="/voice/to/clone.mp3",
    exaggeration=0.5,
    cfg_weight=0.5,
    temperature=0.8
)

In general, leaving these parameters at their default values is a good start for most text to speech use cases, but if you need more control you can adjust them to your needs. Keep in mind that the parameters also affect each other: for example, high exaggeration values tend to audibly increase speech pacing, which can be countered by lowering cfg_weight at the same time.

If a parameter change doesn't result in the desired output quality, try adjusting other parameters as well and see if they improve your results.
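
To make that kind of experimentation easier, you can render the same sentence with several parameter combinations and listen to the results side by side. The following sketch builds on the script from above; the value ranges and output file names are arbitrary choices for illustration, not recommendations from the chatterbox project:

import torchaudio as ta
import torch
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ChatterboxMultilingualTTS.from_pretrained(device=device)

text = "Baking a cake is an easy task if you follow the recipe."

# Render one file per parameter combination, encoding the values
# in the file name so the outputs are easy to compare.
for exaggeration in (0.3, 0.5, 0.7):
    for cfg_weight in (0.3, 0.5, 0.7):
        wav = model.generate(
            text,
            language_id="en",
            exaggeration=exaggeration,
            cfg_weight=cfg_weight,
        )
        ta.save(f"out_ex{exaggeration}_cfg{cfg_weight}.wav", wav, model.sr)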
