Creating your own Voice for Piper

If you’ve read my explanation of why I made a Blog [[A Blog, in 2025]], you might recall what I said: make things like guides for specific technical topics, especially the ones worth saving long term.

And I had already decided what the first Post on here would be, before I even created the Blog:
my own take on a guide for training your own Piper TTS Voice.

What is Piper?

Piper is an open source Text to Speech synthesizer. It lets you make an AI Voice speak. While that might not sound like the most useful application on its own, it integrates easily into Home Assistant, so you can have your home speak to you.
Piper can be found here, and as you can see in the README, it supports a LOT of languages: https://github.com/rhasspy/piper

Why Piper?

Piper is fully integrated into Home Assistant. You can add it with a few clicks and run the voice synthesis locally, or offload it to a beefier system.
Plus, both the voice synthesis and the voice training can be done entirely locally. No shitty cloud or server renting necessary!

The Downsides

So, there are unfortunately some downsides to using Piper. Mainly, it seems abandoned. Commits on GitHub are sparse, to say the least. The most recent one is from March 3rd, 2025, and it only changed a single sentence in a description. The one before that is from October 21st, 2024.
There are ~361 open Issues, and I can tell you that Piper runs on very specific, rather old libraries and packages.
There is no user-friendly interface to train a model either. “Easy to handle” is not a term I would use for this, especially if you are using it for the first time.
Fortunately, I have an instance running and working, and I can share the exact path you need to take. And I will try to simplify it a bit.

1. Requirements

You need a few things.

  • A Computer, ideally with a modern NVIDIA GPU. AMD is unfortunately not supported for much of this, if at all. I am using an RTX 3090.
  • A good Dataset.
  • A Linux System. Don’t worry, Windows will work too, as you’ll see in the next step.

2. Setup

If you have a Windows Machine, we can utilize the Windows Subsystem for Linux.

wsl --install Ubuntu-22.04

You now have a Linux system available at any time, simply by calling wsl.

All commands from now on work the same way on a native Linux system and on WSL.

Choose a main directory you want to work from. I will run from the home directory, but you can use any subfolder or place you like.

No matter if you use Piper Recording Studio or the Custom Dataset approach, you’ll want a folder in your main directory where you store the audio files and their metadata.csv.
That’s why I will create a dataprep folder.

mkdir ~/dataprep

Piper Recording Studio

If you want to clone your own voice, or that of someone close to you, you can use Piper’s own Recording Studio to quickly record a bunch of voice lines. If you want to copy a game character’s voice or something similar, you can skip this.

# Updating our repos
sudo apt update

# Clone the Piper Recording Studio from GitHub
git clone https://github.com/rhasspy/piper-recording-studio.git

# Jump into the directory
cd piper-recording-studio

# Making a Python Virtual Environment. Replace piper_voice_studio with something else if you like.
python3 -m venv .piper_voice_studio

# Activating the Virtual Env
source .piper_voice_studio/bin/activate

# Now inside of the Environment, update PIP and install the requirements for Piper Voice Studio
pip install --upgrade pip
pip install -r requirements.txt

# The requirements.txt, for some reason, is missing a few packages... we also need
pip install numpy onnxruntime

# Run the Piper Recording Studio
python3 -m piper_recording_studio

Piper Recording Studio is now running. You should see something like this in your console

[2025-04-06 21:30:42 +0200] [837] [INFO] Running on http://127.0.0.1:8000 (CTRL + C to quit)
INFO:hypercorn.error:Running on http://127.0.0.1:8000 (CTRL + C to quit)

In any browser, open up localhost:8000 or 127.0.0.1:8000. What you will see is this interface:
(Screenshot: the Piper Recording Studio web interface)
You can choose a language, and then begin reading the sentences that are prompted. Each recorded line can be re-recorded or saved. Whenever you’re done, close the tab.

Kill the Recording Studio by hitting CTRL + C in the console.
You can now jump into the directory of the recorded files to export them as a dataset. CD back into the piper-recording-studio main folder. Since that is right under my home directory, I’ll do

cd ~/piper-recording-studio

Here you can run the command to export the dataset. This step is only necessary if you use Piper Recording Studio. If you create your own dataset from something else, we follow a different approach, which I personally find easier.

python3 -m export_dataset ~/piper-recording-studio/output/de-DE/0000000001_0300000050_General/ ~/dataprep/my_voice/

The output path I chose puts the exported dataset into dataprep/my_voice in the home directory. The dataprep folder is where I handle all the dataset creation. A little spoiler: in the end we will have the following main folders for the different stages.

  • piper-recording-studio (Optional)
  • dataprep
  • training

Running export_dataset lists all the audio files we created and where they now exist. They have also been converted to .wav.
We now have a bunch of audio files in the folder dataprep/my_voice/, as well as a metadata.csv.

This file lists all audio files and the text that was spoken in them. This is what Piper will use to actually know what is being said in which file.
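
For reference, each line of that file is simply the file ID (the wav filename without its extension) and the spoken text, separated by a pipe. The filenames and sentences here are made up, just to show the shape:

0001|Good evening, this is your personal assistant speaking.
0002|The front door has been locked.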

Once the dataset export is finished, we can deactivate the virtual environment

deactivate

Custom Dataset

This is where I spent the most time. Data preparation for AI is always a field in itself. Piper Recording Studio automates the creation of audio files and their transcription. But if you want to use another voice, let’s say a game character, we need to handle a few things ourselves. Not that that’s difficult, don’t worry.

How you acquire the custom dataset is entirely up to you. I can’t guide you through that, since it will be different for every source.
For the sake of my Guide, let’s say I took all audio files of the Cyberpunk 2077 character Delamain. This voice will be used for a private project only. I would advise you not to use any of this commercially anyway. Unless it is your own voice, then do whatever you want.

Your audio files all need to follow the same format. When you convert them is up to you, but they have to be in the correct format by the time you hand them over to Piper (see the conversion sketch after this list).

  • Sample Rate: 22050 Hz
  • Encoding: Signed 16-bit PCM
  • Channels: mono
  • Format: wav
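
If your source files are not in that format yet, a quick batch conversion does the trick. Here is a minimal sketch, assuming ffmpeg is installed and your raw files sit in a folder called raw (adjust both paths to your own setup):

# Convert everything in ./raw to 22050 Hz, mono, signed 16-bit PCM wav
mkdir -p ~/dataprep/delamain/wav
for f in ./raw/*; do
    ffmpeg -i "$f" -ar 22050 -ac 1 -c:a pcm_s16le "$HOME/dataprep/delamain/wav/$(basename "${f%.*}").wav"
done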

You should also do your best to keep the files as clean as possible. Background noises such as music, humming or other people will confuse the AI, and it will try to replicate them. Trust me, the results will not be stellar… That’s why a data-mined game character is often the best choice, since those audio files are perfectly clean.

With your custom voice lines at hand, put them into a folder. I recommend the same path we used for the Piper Recording Studio export, but a different subfolder. Since I use Delamain, my path is

~/dataprep/delamain/wav/

You want all your .wav files to be in there, in their own wav folder.

Transcribing

You can either write the metadata.csv by hand, which I don’t recommend, or you create it automatically. Then you just have to manually fix some mistakes, instead of writing it all yourself.

Installing Whisper

This can be done without any Python virtual environment. It might be a bit overkill to isolate this single pip install into its own venv.

You want to CD into the dataprep folder, and run the following commands from there

# Installing Whisper
pip install git+https://github.com/openai/whisper.git

# Creating a new python script to run it
nano runwhisper.py

Paste this script into runwhisper.py. A quick heads-up about the audio_dir and output_csv variables: replace YOUR_VOICE_NAME with your own folder name! Python needs to know where all your audio files are.

import os
import whisper

# Initialize Whisper model (you can choose between 'tiny', 'base', 'small', 'medium', 'large')
model = whisper.load_model("base")

# Path to the directory containing the audio files
audio_dir = "./YOUR_VOICE_NAME/wav"
# Write metadata.csv next to the wav folder, since that is where Piper's preprocessor expects it
output_csv = "./YOUR_VOICE_NAME/metadata.csv"

# List all .wav files in the directory
audio_files = [f for f in os.listdir(audio_dir) if f.endswith(".wav")]
audio_files.sort()  # Sort the files alphabetically (optional)

# Open the CSV file for writing
with open(output_csv, "w") as f:
    for audio_file in audio_files:
        # Full path to the audio file
        audio_path = os.path.join(audio_dir, audio_file)

        # Transcribe the audio file
        result = model.transcribe(audio_path)

        # Extract the transcription text
        transcription = result["text"].strip()

        # Write the filename (without .wav extension) and transcription to the CSV
        file_id = os.path.splitext(audio_file)[0]  # Get file name without extension
        f.write(f"{file_id}|{transcription}\n")

print(f"Transcriptions complete! Metadata saved to {output_csv}")

Run it

python3 runwhisper.py

This will download the base whisper model, iterate through your .wav files and create a metadata.csv for them. Once it is done, head back into your main directory for the whole project.
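
Before moving on, a quick sanity check I like to do (paths assumed as in the script above): the number of lines in metadata.csv should match the number of .wav files.

# Both numbers should be identical
ls ./YOUR_VOICE_NAME/wav/*.wav | wc -l
wc -l < ./YOUR_VOICE_NAME/metadata.csv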

3. Prepare Training

We now have the dataset, no matter whether it came from Piper Recording Studio, a data-mined game character or anything else: a folder with a metadata.csv and a bunch of .wav files in its wav subfolder.

Let’s create a new directory for the actual training. I am once again working from my home directory.

# Create the new directory for training
mkdir ~/training
cd ~/training

# Clone the Piper Repo
git clone https://github.com/rhasspy/piper.git

# You might need to run this before creating a venv.
# Linux will tell you so, when you run the venv creation
sudo apt update
sudo apt install python3.10-venv

# Create a python venv for it
python3 -m venv .piper_training
source .piper_training/bin/activate

Very specific Libraries

Piper needs very specific libraries to function, so instead of a requirements.txt, we will manually install the following packages, as well as a specific pip version.

# Jump into the Piper directory
cd piper/src/python

# Install a very specific version of PIP
pip install pip==23.3.1

# Install a very specific version of numpy
pip install numpy==1.24.4

# Install a very specific version of torchmetrics
pip install torchmetrics==0.11.4
# If it cannot be found, use this
pip install torchmetrics==0.10.3

If you have an RTX 4090

Do the following two Blocks. If you have another GPU, ignore them.

# IF YOU HAVE A RTX 4090
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

And change the requirements.txt in the folder to the following:

cython>=0.29.0,<1
librosa>=0.9.2,<1
piper-phonemize~=1.1.0
numpy>=1.19.0
onnxruntime>=1.11.0
pytorch-lightning~=1.9.0
onnx

Now let’s continue with commands that apply to everyone.

# Setup wheel
pip install --upgrade wheel setuptools

# Install other requirements
pip install -e .

Now that we have the requirements set up, there is one more script that needs to be executed.

./build_monotonic_align.sh

If this runs without any errors, congrats. It did not for me 🙂 I was told that the file does not exist. Should you hit the same error, make the file executable and install two more packages:

# Make the file executable
chmod +x build_monotonic_align.sh

# Install the system packages needed to build it
sudo apt update && sudo apt install -y build-essential python3-dev

Once that is done and build_monotonic_align.sh runs through without errors, we are ready.

Preprocessing Dataset

While our dataset with its metadata.csv and wav folder is nice, Piper needs the dataset in a specific format. This format is produced by running a preprocessor. Make sure you are still in the /piper/src/python folder. Keep in mind that I am training the Delamain voice, so replace the paths accordingly. You also want to change the language shortcut to whatever language you train on.

python3 -m piper_train.preprocess \
	--language en \
	--input-dir ~/dataprep/delamain \
	--output-dir ~/training/delamain \
	--dataset-format ljspeech \
	--single-speaker \
	--sample-rate 22050

If we now CD into our training folder, we will find a folder named after our voice, or whatever you specified in --output-dir.
In there, we have files like config.json, dataset.jsonl and a folder named cache.
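
If you want to double check, a quick ls of the output directory should show exactly those items:

ls ~/training/delamain
# cache  config.json  dataset.jsonl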

4. Training

Running the Training

We are finally here. We can start training. But we also want to make it easy for ourselves. Training a speech model from scratch would take ages, and why would we do that? We can take an already trained model that has good speech, pronunciation and sound, and have Piper fine-tune it: train just a bit of it to change how it sounds, instead of teaching it how to speak from scratch.

Download a High, Medium or Low checkpoint from here. Choose your language, select a voice, and select the quality. https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main

You can also listen to some previews of those voices here https://rhasspy.github.io/piper-samples/

I will get a German voice, since my dataset is in German. Hence my checkpoint is piper-checkpoints/de/de_DE/thorsten/high/epoch=2665-step=1182078.ckpt

We can download the model right from the terminal. I usually save them in the piper/src/python folder.

wget https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/de/de_DE/thorsten/high/epoch%3D2665-step%3D1182078.ckpt

With the checkpoint in our folder, we can start the actual training. But first, check that you have an NVIDIA GPU driver installed.

# Try to run the nvidia info panel
nvidia-smi

# If the previous command resulted in an error, install a fitting NVIDIA driver
sudo apt update && sudo apt install -y nvidia-driver-535

# Check if we have CUDA installed
nvcc --version

# If it could not be found, install the cuda toolkit
sudo apt install -y nvidia-cuda-toolkit
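
Before kicking off the run, it is also worth checking that the PyTorch build inside our venv can actually see the GPU. This is just a small extra sanity check of mine, not part of Piper itself:

# Should print True; if it prints False, PyTorch was installed without CUDA support
python3 -c "import torch; print(torch.cuda.is_available())"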

Now that our Linux system is actually able to use the GPU, let’s start the training.

python3 -m piper_train \
--dataset-dir ~/training/delamain \
--accelerator 'gpu' \
--gpus 1 \
--batch-size 32 \
--validation-split 0.0 \
--num-test-examples 0 \
--max_epochs 6000 \
--resume_from_checkpoint "./epoch=2665-step=1182078.ckpt" \
--checkpoint-epochs 1 \
--precision 32 \
--strategy ddp \
--max-phoneme-ids 400 \
--quality high
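
Training will take a while. Since piper_train is built on PyTorch Lightning, which writes its logs in TensorBoard format by default, you can optionally watch the loss curves while it runs. This is an extra step of mine, not required for the training itself:

# Optional: monitor training progress in the browser at localhost:6006
pip install tensorboard
tensorboard --logdir ~/training/delamain/lightning_logs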

After Training

We remain in the .piper_training Environment. Once the training has finished, or once you manually stop it, you will have a checkpoint file. It will be in the training directory you set for the dataset.
Given that we are still in the piper/src/python folder, let’s go back into the main training folder and into our voice directory. From there, we go into the checkpoint folder.

The version_x number changes automatically, depending on how many training sessions you have run. This gives a nice automated sorting of the different training runs. If this is your first run, it will be version_0.

# Get back into our voice training folder
cd ~/training/delamain/lightning_logs/version_0/checkpoints/

In here we have the last checkpoint that was created during the training!
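
Run ls in there and note the exact checkpoint filename; we will need it for the export command in a moment.

# Note the filename for the export step, it looks something like epoch=XXXX-step=XXXXXX.ckpt
ls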

Exporting the Model

To actually use the model to generate speech, we need to export it to the onnx format. You can run this command from anywhere, as long as you are in the .piper_training Environment.

Note that we point to the model .ckpt, and set the output to be a .onnx.

# Export a specific checkpoint to a specific location
python3 -m piper_train.export_onnx "/path/to/your/model/model.ckpt" \
"~/path/to/output/model.onnx"

In my case, the command looks like this (use the exact checkpoint filename you noted above). You can use quotation marks, but if your path has no spaces, it’s not necessary.

python3 -m piper_train.export_onnx \
~/training/delamain/lightning_logs/version_0/checkpoints/epoch=XXXX-step=XXXXXX.ckpt \
~/training/delamain.onnx

You will probably get some warnings and messages, but the last lines should be

INFO:piper_train.export_onnx:Exported model to /home/cliffford/training/delamain.onnx

Almost there

We need one more thing. Inside our voice training folder is a config.json. We need to copy it into the same directory as the exported .onnx model, and both files need to stay together wherever you put the model.
Note that we also rename the file to voicename.onnx.json; this is important.

cp ~/training/delamain/config.json ~/training/delamain.onnx.json

What you have in the end are two files: delamain.onnx and delamain.onnx.json. Or whatever you named yours.

5. Using the Model

We finally did it. We can use the model. Showing how to fully use the model in something like Home Assistant would be too much for this guide, but here is how you can generate some quick test lines to listen to. We are still in the .piper_training Environment.

# Install Piper TTS to actually run a finished model
pip install piper-tts

# Run a echo command and pipe that to piper, pointing it to the onnx model
# We also specify an output file
echo "Type any sentence that should be said here" | piper -m delamain.onnx --output_file test.wav

A sound file will be generated that you can listen to through the file explorer. We made an AI voice!
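
If you are running inside WSL, a handy trick is to open the current Linux folder directly in the Windows Explorer:

# Opens the current WSL directory in Windows Explorer (WSL only)
explorer.exe .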

6. My Voice sounds shit

You will not need this part if your model sounds good.
If your model sounds unclear, messy, or not like your voice at all, there are a few things we can check.

  1. Make sure your dataset only contains that one voice, with no mix of different people
  2. Remove background noises if there are any, including microphone humming.
  3. If you only have a few seconds of audio in total, you will need more recordings
  4. Check the metadata.csv and verify that the spoken lines are actually written down correctly
  5. Increase the epochs to train longer. This will only help with making the voice clearer, not with removing background noises etc.

7. Summary

You’ve successfully trained a custom Piper TTS voice! This guide covered the setup, dataset preparation, training, and exporting processes. You can now integrate your model into systems like Home Assistant for a personalized voice experience.

