If you want to
- build a new bn-en training dataset from a noisy parallel corpus (by filtering / cleaning pairs based on our heuristics) along with the corresponding vocabulary models, or
- normalize a new dataset before evaluating it on the model, or
- remove all evaluation pairs from the training pairs to create a new set of training / test datasets,

refer to here.
Note: This code has been refactored to support OpenNMT-py 2.0
$ cd seq2seq/
$ conda create python==3.7.9 pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ pip install --upgrade -r requirements.txt

Note: For newer NVIDIA GPUs such as the A100 or RTX 3090, use `cudatoolkit=11.0`.
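As an optional sanity check (not part of the pipeline itself), you can confirm that the installed PyTorch build can see your GPU(s) before starting a long training run:

$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"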
$ cd seq2seq/
$ python pipeline.py -h
usage: pipeline.py [-h] --input_dir PATH --output_dir PATH --src_lang SRC_LANG
--tgt_lang TGT_LANG
[--validation_samples VALIDATION_SAMPLES]
[--src_seq_length SRC_SEQ_LENGTH]
[--tgt_seq_length TGT_SEQ_LENGTH]
[--model_prefix MODEL_PREFIX] [--eval_model PATH]
[--train_steps TRAIN_STEPS]
[--train_batch_size TRAIN_BATCH_SIZE]
[--eval_batch_size EVAL_BATCH_SIZE]
[--gradient_accum GRADIENT_ACCUM]
[--warmup_steps WARMUP_STEPS]
[--learning_rate LEARNING_RATE] [--layers LAYERS]
[--rnn_size RNN_SIZE] [--word_vec_size WORD_VEC_SIZE]
[--transformer_ff TRANSFORMER_FF] [--heads HEADS]
[--valid_steps VALID_STEPS]
[--save_checkpoint_steps SAVE_CHECKPOINT_STEPS]
[--average_last AVERAGE_LAST] [--world_size WORLD_SIZE]
[--gpu_ranks [GPU_RANKS [GPU_RANKS ...]]]
[--train_from TRAIN_FROM] [--do_train] [--do_eval]
[--nbest NBEST] [--alpha ALPHA]
optional arguments:
-h, --help show this help message and exit
--input_dir PATH, -i PATH
Input directory
--output_dir PATH, -o PATH
Output directory
--src_lang SRC_LANG Source language
--tgt_lang TGT_LANG Target language
--validation_samples VALIDATION_SAMPLES
no. of validation samples to take out from train
dataset when no validation data is present
--src_seq_length SRC_SEQ_LENGTH
maximum source sequence length
--tgt_seq_length TGT_SEQ_LENGTH
maximum target sequence length
--model_prefix MODEL_PREFIX
Prefix of the model to save
--eval_model PATH Path to the specific model to evaluate
--train_steps TRAIN_STEPS
no of training steps
--train_batch_size TRAIN_BATCH_SIZE
training batch size (in tokens)
--eval_batch_size EVAL_BATCH_SIZE
evaluation batch size (in sentences)
--gradient_accum GRADIENT_ACCUM
gradient accum
--warmup_steps WARMUP_STEPS
warmup steps
--learning_rate LEARNING_RATE
learning rate
--layers LAYERS layers
--rnn_size RNN_SIZE rnn size
--word_vec_size WORD_VEC_SIZE
word vector size
--transformer_ff TRANSFORMER_FF
transformer feed forward size
--heads HEADS no of heads
--valid_steps VALID_STEPS
validation interval
--save_checkpoint_steps SAVE_CHECKPOINT_STEPS
model saving interval
--average_last AVERAGE_LAST
average last X models
--world_size WORLD_SIZE
world size
--gpu_ranks [GPU_RANKS [GPU_RANKS ...]]
gpu ranks
--train_from TRAIN_FROM
start training from this checkpoint
--do_train Run training
--do_eval Run evaluation
--nbest NBEST sentencepiece nbest size
  --alpha ALPHA         sentencepiece alpha

- Sample `input_dir` structure for bn2en training and evaluation:

      input_dir/
      |---> data/
      |     |---> corpus.train.bn
      |     |---> corpus.train.en
      |     |---> RisingNews.valid.bn
      |     |---> RisingNews.valid.en
      |     |---> RisingNews.test.bn
      |     |---> RisingNews.test.en
      |     |---> sipc.test.bn
      |     |---> sipc.test.en.0
      |     |---> sipc.test.en.1
      |     ...
      |---> vocab/
      |     |---> bn.model
      |     |---> en.model
- Input data files inside the `data/` subdirectory must have the following format: `X.type.lang(.count)`, where `X` is any common file prefix, `type` is one of `{train, valid, test}` and `count` is an optional integer (only applicable for the `target_lang`, when there are multiple reference files). There can be multiple `.train.` / `.valid.` file pairs. In the absence of `.valid.` files, `validation_samples` no. of example pairs will be randomly sampled from the training files during training.
- The `vocab` subdirectory must hold two sentencepiece vocabulary models, formatted as `src_lang.model` and `tgt_lang.model` (a sketch for building these is given after this list).
- After training / evaluation, the `output_dir` will have the following subdirectories with these contents:
  - `Models`: All the saved models
  - `Reports`: BLEU and SacreBLEU scores on the validation files for all saved models with the given `model_prefix`, and the scores on the test files for the given `eval_model` (if the corresponding reference files are present)
  - `Outputs`: Detokenized model predictions
  - `data`: Merged training files after applying subword regularization
  - `Preprocessed`: Training and validation data shards
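If you need to build the two vocabulary models (`bn.model` / `en.model`) mentioned above yourself, a minimal sketch using the `spm_train` CLI that ships with the sentencepiece package is shown below. The vocabulary size, character coverage, and model type are illustrative assumptions, not necessarily the settings used for our released models:

$ spm_train --input=input_dir/data/corpus.train.bn --model_prefix=bn \
    --vocab_size=32000 --character_coverage=1.0 --model_type=unigram
$ spm_train --input=input_dir/data/corpus.train.en --model_prefix=en \
    --vocab_size=32000 --character_coverage=1.0 --model_type=unigram
$ mv bn.model en.model input_dir/vocab/

Note that sampling-based subword regularization (the `--nbest` / `--alpha` options of the pipeline) relies on a unigram sentencepiece model.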
To reproduce our results on an AWS p3.8xlarge EC2 instance equipped with 4 Tesla V100 GPUs, run the script with the default hyperparameters. For example, for bn2en training:
$ export CUDA_VISIBLE_DEVICES=0,1,2,3
# for training
$ python pipeline.py \
--src_lang bn --tgt_lang en \
-i inputFolder/ -o outputFolder/ \
--model_prefix bn2en --do_train --do_eval

For single-GPU training, additionally provide the flags `--world_size 1` and `--gpu_ranks 0`, and adjust the effective batch size to the available GPU VRAM using `--train_batch_size X` and `--gradient_accum X`, as shown in the example below.
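For illustration, a single-GPU run might look like the following; the batch size and gradient accumulation values here are placeholders that should be tuned to your GPU memory:

$ export CUDA_VISIBLE_DEVICES=0
$ python pipeline.py \
--src_lang bn --tgt_lang en \
-i inputFolder/ -o outputFolder/ \
--model_prefix bn2en --do_train --do_eval \
--world_size 1 --gpu_ranks 0 \
--train_batch_size 4096 --gradient_accum 8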
To evaluate trained models on new test files using a single GPU, use the following snippet with the appropriate arguments:
$ python pipeline.py \
--src_lang bn --tgt_lang en \
-i inputFolder/ -o outputFolder/ \
--eval_model <path/to/model> \
--do_eval