
AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment

Mengyu Bu, Shaolei Zhang, Zhongjun He, Hua Wu, Yang Feng


This is the official repository for EMNLP 2025 Main Conference paper "AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment".

In this paper, we propose AlignX, a two-stage, representation-level framework that strengthens the align-then-diverge pattern of LLMs and thereby improves the multilingual performance of pre-trained LLMs.

(Figure: AlignX architecture overview)

Install

1. Clone this repository

git clone https://github.com/ictnlp/AlignX

2. Prepare training environment

conda create -n alignx python=3.9.12
conda activate alignx
pip install -r requirements.txt

3. Prepare evaluation environment

For evaluation, we use:

  • MMT-LLM for the translation task
  • lm-evaluation-harness for general tasks
git clone https://github.com/NJUNLP/MMT-LLM.git
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
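After installing lm-evaluation-harness, general-task evaluation is typically run through its `lm_eval` CLI. The invocation below is only an illustration: the checkpoint path is a placeholder and the task name is an assumed example, not the paper's exact evaluation configuration.

```shell
# Illustrative lm-evaluation-harness invocation; the model path and
# task selection here are placeholders, not the paper's exact setup.
lm_eval --model hf \
    --model_args pretrained=/path/to/your/checkpoints \
    --tasks xnli \
    --batch_size 8
```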

Dataset Preparation

We construct multilingual translation instruction data based on OPUS-100 and build multilingual general instruction data from Bactrian-X. Please refer to the paper for detailed data construction procedures.
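As a rough illustration of the first step, a parallel sentence pair from OPUS-100 can be wrapped into an instruction-tuning record. The prompt template below is an assumption for illustration only; the actual construction procedure is described in the paper.

```python
# Hypothetical sketch: wrap an OPUS-100-style parallel pair into an
# instruction record. The exact prompt template AlignX uses is defined
# in the paper; this format is an illustrative assumption.

def make_translation_instruction(src_text, tgt_text, src_lang, tgt_lang):
    """Build one instruction-tuning record from a parallel sentence pair."""
    return {
        "instruction": f"Translate the following {src_lang} sentence into {tgt_lang}.",
        "input": src_text,
        "output": tgt_text,
    }

example = make_translation_instruction(
    "Guten Morgen.", "Good morning.", "German", "English"
)
print(example)
```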

Training

AlignX improves multilingual performance in two stages:

  • Stage 1: Continual pre-training with multilingual representation alignment.
  • Stage 2: Standard SFT on multilingual instruction data.

Below are example training commands for both stages.

# Stage 1 training
finetune=/path/to/your/script/finetune_ctr_lm_within_inst_full_parameter.py
tokenizer=/path/to/your/model
base_model=/path/to/your/model
data_path=/path/to/your/data
output=/path/to/your/checkpoints

CUDA_VISIBLE_DEVICES=0,1,2,3 python $finetune \
    --tokenizer $tokenizer --base_model $base_model \
    --data_path $data_path \
    --output_dir $output \
    --num_epochs=2 \
    --cutoff_len=512 \
    --group_by_length \
    --batch_size=128 --micro_batch_size=16 \
    --learning_rate=2e-6 \
    --output_hidden_states=True \
    --align_layer=16 \
    --contrastive_lambda=0.3 --contrastive_temperature=0.1 \
    --language_matching_intermediate_size=128 \
    --num_languages=10 \
    --language_matching_lambda=0.4


# Stage 2 training
finetune=/path/to/your/script/finetune_full_parameter.py
tokenizer=/path/to/your/model
base_model=/path/to/your/model
data_path=/path/to/your/data
output=/path/to/your/checkpoints

CUDA_VISIBLE_DEVICES=0,1,2,3 python $finetune \
    --tokenizer_path $tokenizer --base_model $base_model \
    --data_path $data_path \
    --output_dir $output \
    --num_epochs=2 \
    --cutoff_len=512 \
    --group_by_length \
    --batch_size=128 --micro_batch_size=16 \
    --learning_rate=2e-6
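The Stage-1 flags above (`--align_layer`, `--contrastive_lambda`, `--contrastive_temperature`) suggest an InfoNCE-style contrastive objective over intermediate-layer representations of parallel sentences. The actual loss is implemented in `finetune_ctr_lm_within_inst_full_parameter.py`; the NumPy sketch below only illustrates that kind of objective, and the pooling choice and function names are assumptions made here.

```python
import numpy as np

def contrastive_alignment_loss(h_src, h_tgt, temperature=0.1):
    """InfoNCE-style loss pulling parallel sentences' representations together.

    h_src, h_tgt: (batch, dim) pooled hidden states taken from the alignment
    layer (cf. --align_layer=16); row i of each matrix is a translation pair.
    """
    # Cosine similarity matrix between source and target representations.
    a = h_src / np.linalg.norm(h_src, axis=1, keepdims=True)
    b = h_tgt / np.linalg.norm(h_tgt, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Cross-entropy with the diagonal (true translation) as the positive.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
aligned = contrastive_alignment_loss(h, h)                        # matched pairs
shuffled = contrastive_alignment_loss(h, rng.normal(size=(4, 8)))  # unrelated pairs
print(f"aligned={aligned:.3f} shuffled={shuffled:.3f}")
```

In training, a term like this would be added to the language-modeling loss with the weight given by `--contrastive_lambda`, alongside the language-matching objective weighted by `--language_matching_lambda`.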

Citation

If you find this repository useful, please cite:

@misc{bu2025alignxadvancingmultilinguallarge,
      title={AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment}, 
      author={Mengyu Bu and Shaolei Zhang and Zhongjun He and Hua Wu and Yang Feng},
      year={2025},
      eprint={2509.24338},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24338}, 
}
