AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment

Mengyu Bu, Shaolei Zhang, Zhongjun He, Hua Wu, Yang Feng

This is the official repository for EMNLP 2025 Main Conference paper "AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment".

In this paper, we propose AlignX, a two-stage and representation-level framework for enhancing the align-then-diverge pattern of LLMs and thus improves multilingual performance of pre-trained LLMs.

Install

1. Clone this repository

git clone https://github.com/ictnlp/AlignX

2. Prepare training environment

conda create -n alignx python=3.9.12
conda activate alignx
pip install -r requirements.txt

3. Prepare evaluation environment.

For evaluation, we use:

MMT-LLM for translation task
lm-evaluation-harness for general task.

git clone https://github.com/NJUNLP/MMT-LLM.git
git clone https://github.com/EleutherAI/lm-evaluation-harness.git

Dataset Preparation

We construct multilingual translation instruction data based on OPUS-100 and build multilingual general instruction data from Bactrian-X. Please refer to the paper for detailed data construction procedures.

Training

AlignX improves multilingual performance in two stages:

Stage 1: Continual pre-training with multilingual representation alignment.
Stage 2: Standard SFT on multilingual instruction data.

Below is an example training script.

# Stage 1 training
finetune=/path/to/your/script/finetune_ctr_lm_within_inst_full_parameter.py
tokenizer=/path/to/your/model
base_model=/path/to/your/model
data_path=/path/to/your/data
output=/path/to/your/checkpoints

CUDA_VISIBLE_DEVICES=0,1,2,3 python $finetune \
    --tokenizer $tokenizer --base_model $base_model \
    --data_path $data_path \
    --output_dir $output \
    --num_epochs=2 \
    --cutoff_len=512 \
    --group_by_length \
    --batch_size=128 --micro_batch_size=16 \
    --learning_rate=2e-6 \
    --output_hidden_states=True \
    --align_layer=16 \
    --contrastive_lambda=0.3 --contrastive_temperature=0.1 \
    --language_matching_intermediate_size=128 \
    --num_languages=10 \
    --language_matching_lambda=0.4


# Stage 2 training
finetune=/path/to/your/script/finetune_full_parameter.py
tokenizer=/path/to/your/model
base_model=/path/to/your/model
data_path=/path/to/your/data
output=/path/to/your/checkpoints

CUDA_VISIBLE_DEVICES=0,1,2,3 python $finetune \
    --tokenizer_path $tokenizer --base_model $base_model \
    --data_path $data_path \
    --output_dir $output \
    --num_epochs=2 \
    --cutoff_len=512 \
    --group_by_length \
    --batch_size=128 --micro_batch_size=16 \
    --learning_rate=2e-6

Citation

If you find this repository useful, please cite:

@misc{bu2025alignxadvancingmultilinguallarge,
      title={AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment}, 
      author={Mengyu Bu and Shaolei Zhang and Zhongjun He and Hua Wu and Yang Feng},
      year={2025},
      eprint={2509.24338},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24338}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
figures		figures
utils		utils
README.md		README.md
alpaca.json		alpaca.json
configuration_llama_with_contrastive_learning_and_langauge_matching.py		configuration_llama_with_contrastive_learning_and_langauge_matching.py
finetune_ctr_lm_within_inst_full_parameter.py		finetune_ctr_lm_within_inst_full_parameter.py
finetune_full_parameter.py		finetune_full_parameter.py
modeling_llama_with_contrastive_learning_and_language_matching_within_inst.py		modeling_llama_with_contrastive_learning_and_language_matching_within_inst.py
requirements.txt		requirements.txt
trainer_with_src_tgt_index.py		trainer_with_src_tgt_index.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment

Install

1. Clone this repository

2. Prepare training environment

3. Prepare evaluation environment.

Dataset Preparation

Training

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment

Install

1. Clone this repository

2. Prepare training environment

3. Prepare evaluation environment.

Dataset Preparation

Training

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages