Generate audiobooks from plain EPUB files in EPUB3 Media Overlays format using high-quality TTS (Text-to-Speech) engines like Azure and Kokoro, now with an intuitive Web GUI.
You can read or listen to the generated EPUB using any ebook reader that supports EPUB 3 Media Overlays, such as Thorium Reader. The generated MP3 files can also be played with any standard audio player.
- Convert plain EPUB books into audiobooks compliant with EPUB 3 Media Overlays specification.
- Supports TTS engines:
- Azure TTS (high-quality cloud service)
- Kokoro-82M (offline open-source model, currently supports English text alignment only)
- Automatic sentence segmentation and force alignment
- Parallel multi-process generation
- Gradio-based Web GUI for easy interaction without command line
- Docker-ready architecture for easy deployment
git clone https://github.com/funway/audible-epub3-maker.git
cd audible-epub3-maker
pip install -r requirements.txt
Depending on the engine you plan to use, follow the steps below:
-
Azure:
- You must configure the following two environment variables:
AZURE_TTS_KEY=your_azure_speech_key AZURE_TTS_REGION=your_speech_region
- You can define them:
- In a
.env
file in the project root (recommended) - Or
export
them manually in your shell or.bashrc
/.zshrc
file:export AZURE_TTS_KEY=your_azure_speech_key export AZURE_TTS_REGION=your_speech_region
- In a
- How to get Microsoft Azure Text-to-Speech API key
- You must configure the following two environment variables:
-
Kokoro:
- No environment configuration is required.
- The model file will automatically download on first use.
We provide pre-built Docker images hosted at:
-
Docker Hub:
funway/audible-epub3-maker
-
GitHub Container Registry (GHCR):
ghcr.io/funway/audible-epub3-maker
The image includes all dependencies and runs the Web GUI by default.
A sample configuration file docker-compose.example.yml
is included in the repository:
services:
aem-web:
image: ghcr.io/funway/audible-epub3-maker
ports:
- "7860:7860"
volumes:
- ./output:/app/output
environment:
- AZURE_TTS_KEY=your_azure_speech_key
- AZURE_TTS_REGION=your_speech_region
restart: unless-stopped
docker pull ghcr.io/funway/audible-epub3-maker
docker run -d \
-p 7860:7860 \
-v ./output:/app/output \
-e AZURE_TTS_KEY=your_azure_speech_key \
-e AZURE_TTS_REGION=your_speech_region \
ghcr.io/funway/audible-epub3-maker
-
Using Azure TTS? Make sure you set the
AZURE_TTS_KEY
andAZURE_TTS_REGION
environment variables before starting the container. -
Using Kokoro TTS? Keep an eye on your system’s memory usage — the model runs locally and can consume several GB of RAM. On low-memory systems, this may cause OOM (out-of-memory) errors.
python main.py <input_file.epub> [options]
input_file
: The path to the source EPUB file.
Option | Description | Default |
---|---|---|
-d , --output_dir |
Output directory | <input_file_stem>_audible |
--log_level |
Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) | INFO |
--tts_engine |
TTS engine (azure or kokoro ) |
azure |
--tts_lang |
Language code | en-US |
--tts_voice |
Voice name | en-US-AvaMultilingualNeural |
--tts_speed |
Playback speed (e.g., 1.0 = normal) | 1.0 |
--tts_chunk_len |
Max chars per TTS chunk | auto |
--newline_mode |
How to detect paragraph breaks from newlines (none , single , multi ) |
multi |
-m , --max_workers |
Number of worker processes | 3 |
--align_threshold |
Force alignment fuzzy match threshold (0–100) | 95.0 |
-f , --force |
Force all prompts (non-interactive mode) | false |
--cleanup |
Remove temp files (.mp3) after generation | false |
python main.py mybook.epub \
--tts_engine azure \
--tts_lang zh-CN \
--tts_voice zh-CN-XiaoxiaoNeural \
-d ./output_dir \
-m 4 \
--log_level DEBUG
python web_gui.py
Argument | Description | Default |
---|---|---|
--host |
Host to bind the Gradio web server | 127.0.0.1 |
--port |
Port to bind the Gradio web server | 7860 |
Then open your browser and interact with the friendly interface!
*.mp3
: Generated audio for each chapter*.epub
: A new EPUB file with embedded mp3 audio and synchronized smil overlays
This project is licensed under the MIT License.
- Add CPU/GPU selection for offline TTS models (note: pip install on amd64 arch defaults to pull a 4GB NVIDIA library ("▔□▔) )
- Support more TTS models
- Implement voice preview and cost estimation for commercial models
- Integrate WhisperX for audio-text alignment in TTS models without native word boundary output