XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs

Overview 🔍

XY-Tokenizer is a novel speech codec designed to bridge the gap between speech signals and large language models by jointly modeling semantic and acoustic information. It operates at a bitrate of 1 kbps (1000 bps), using 8-layer Residual Vector Quantization (RVQ8) at a 12.5 Hz frame rate.
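
The bitrate, frame rate, and layer count above determine how many bits each code carries. The short sketch below works through that arithmetic. It assumes every RVQ layer uses a codebook of the same size, which this README does not state; the helper names are illustrative and are not part of the repository.

```python
import math

FRAME_RATE_HZ = 12.5   # frames per second (from the overview)
NUM_RVQ_LAYERS = 8     # RVQ8: one code per layer per frame
BITRATE_BPS = 1000     # 1 kbps (from the overview)


def bits_per_code(bitrate_bps, frame_rate_hz, num_layers):
    """Bits carried by each code, assuming all layers share one codebook size."""
    return bitrate_bps / (frame_rate_hz * num_layers)


def tokens_for_duration(seconds, frame_rate_hz=FRAME_RATE_HZ,
                        num_layers=NUM_RVQ_LAYERS):
    """Return (frames, total codes) for a clip.

    Rounding partial frames up is an assumption, not documented behavior.
    """
    frames = math.ceil(seconds * frame_rate_hz)
    return frames, frames * num_layers


bits = bits_per_code(BITRATE_BPS, FRAME_RATE_HZ, NUM_RVQ_LAYERS)
implied_codebook_size = 2 ** round(bits)  # entries per layer under the assumption
frames, codes = tokens_for_duration(10.0)
print(bits, implied_codebook_size, frames, codes)  # 10.0 1024 125 1000
```

Under the equal-codebook assumption, each code carries 10 bits, so a 10-second clip maps to 125 frames and 1000 codes in total.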

At this ultra-low bitrate, XY-Tokenizer performs strongly on both semantic and acoustic measures, with results comparable to state-of-the-art speech codecs that focus on only one of the two. For detailed information about the model and demos, please refer to our Blog. You can also find the model on Hugging Face.

Highlights ✨

Low frame rate, low bitrate with high fidelity and text alignment: Achieves strong semantic alignment and acoustic quality at 12.5 Hz and 1 kbps.
Multilingual training on the full Emilia dataset: Trained on a large-scale multilingual dataset, supporting robust performance across diverse languages.
Designed for Speech LLMs: Can be used for zero-shot TTS, dialogue TTS (e.g., MOSS-TTSD), and speech large language models.

News 📢

[2025-06-28] We released the code and checkpoints of XY-Tokenizer. Check out our paper and demo!
[2025-07-11] We released the XY-Tokenizer blog with detailed technical insights and experimental results!

Installation 🛠️

To use XY-Tokenizer, install the required dependencies first. The steps below create a conda environment and install the packages into it with pip.

Using conda

# Clone repository (this SSH URL requires an SSH key registered with GitHub)
git clone git@github.com:gyt1145028706/XY-Tokenizer.git && cd XY-Tokenizer

# Create and activate conda environment
conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer

# Install dependencies
pip install -r requirements.txt

Available Models 🗂️

Model Name	Hugging Face	Training Data
XY-Tokenizer	🤗	Emilia
XY-Tokenizer-TTSD-V0 (used in MOSS-TTSD)	🤗	Emilia + Internal Data (containing general audio)

Usage 🚀

Download XY-Tokenizer

Download the XY-Tokenizer model weights from the fdugyt/XY_Tokenizer repository on Hugging Face:

mkdir -p ./weights && huggingface-cli download fdugyt/XY_Tokenizer xy_tokenizer.ckpt --local-dir ./weights/
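
If you prefer to fetch the checkpoint from Python rather than the CLI, the sketch below wraps `hf_hub_download` from the `huggingface_hub` package with the same repository ID, filename, and target directory as the command above. The wrapper name `download_checkpoint` and its `downloader` parameter are illustrative, not part of this repository; the parameter exists only so the function can be exercised without network access.

```python
def download_checkpoint(repo_id="fdugyt/XY_Tokenizer",
                        filename="xy_tokenizer.ckpt",
                        local_dir="./weights",
                        downloader=None):
    """Download one file from a Hugging Face repo and return its local path.

    `downloader` is a hypothetical hook for testing; by default the real
    `huggingface_hub.hf_hub_download` is used.
    """
    if downloader is None:
        # Imported lazily so this module loads even without huggingface_hub.
        from huggingface_hub import hf_hub_download
        downloader = hf_hub_download
    return downloader(repo_id=repo_id, filename=filename, local_dir=local_dir)
```

Calling `download_checkpoint()` with no arguments should place the checkpoint under `./weights/`, matching the CLI command.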

Local Inference

First, set the Python path to include this repository:

export PYTHONPATH=$PYTHONPATH:./

Then you can tokenize audio to speech tokens and generate reconstructed audio from these tokens by running:

python inference.py

The reconstructed audio files will be available in the output_wavs/ directory.
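
To sanity-check the reconstructed files, a small script like the one below can list each output's sample rate and duration. It is a sketch that uses only the Python standard library and assumes the outputs are PCM-encoded `.wav` files; the `wave` module cannot read other encodings (such as floating-point WAV), so those files are skipped. The function name `summarize_wavs` is illustrative.

```python
import wave
from pathlib import Path


def summarize_wavs(directory="output_wavs"):
    """Return (filename, sample_rate_hz, duration_s) for each readable PCM .wav file."""
    summary = []
    for path in sorted(Path(directory).glob("*.wav")):
        try:
            with wave.open(str(path), "rb") as wav_file:
                rate = wav_file.getframerate()
                duration = wav_file.getnframes() / rate
        except wave.Error:
            # Not a PCM WAV file that the standard library can parse; skip it.
            continue
        summary.append((path.name, rate, duration))
    return summary


for name, rate, duration in summarize_wavs():
    print(f"{name}: {rate} Hz, {duration:.2f} s")
```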

Demos 🎮

See our Blog for more demos.

License 📜

XY-Tokenizer is released under the Apache 2.0 license.

Citation 📚

@misc{gong2025xytokenizermitigatingsemanticacousticconflict,
      title={XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs}, 
      author={Yitian Gong and Luozhijie Jin and Ruifan Deng and Dong Zhang and Xin Zhang and Qinyuan Cheng and Zhaoye Fei and Shimin Li and Xipeng Qiu},
      year={2025},
      eprint={2506.23325},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2506.23325}, 
}
