punctuation-restoration-for-structure-understanding

This repository contains the code accompanying the RepL4NLP paper "Punctuation restoration improves structure understanding without supervision", in which we apply punctuation restoration as an unsupervised structure-learning objective that complements masked language modeling in T5 models. Additional training on punctuation restoration improves in-distribution and out-of-distribution performance on various structure-related NLP tasks, such as named entity recognition, open information extraction, and semantic role labeling.

Important: Make sure you are at the repo root before running any Python files!

Data-prepping

Run a file under data/ to generate training data from the corresponding dataset. Files for shared-task datasets generate a separate JSONL file for each implemented task. The generated JSONL files are written to outputs/datasets/.

For example, the following loads the CoNLL 2003 dataset and generates data for POS and NER.

python -m data.conll_2003

For now, the file must be run as a module with the -m flag.

OIE datasets may require additional local data files. Refer to the implementation of the corresponding file under data/ for details.
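The JSONL files written to outputs/datasets/ can be inspected with a few lines of Python. The following is a minimal sketch, not part of the repository: the helper name read_jsonl is hypothetical, and it only assumes the standard JSONL layout of one JSON object per line. The fields inside each record depend on the task and are not described here.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return the records of a JSONL file as a list of Python objects.

    Hypothetical helper for illustration; assumes one JSON value per line.
    """
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Pass it the path of whichever file the data script wrote under outputs/datasets/, then look at the first few records to see the fields used for that task.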

Pre-training and Fine-tuning

python train.py TASK -n CKPT_NAME

where TASK is one of

chunking: Chunking
mlm: Masked Language Modelling
ner: Named Entity Recognition
oie: Open Information Extraction
pos: Part-of-speech Tagging
pr: Punctuation Restoration
re: Relation Extraction
srl: Semantic Role Labelling

and CKPT_NAME is a string that will be prefixed to the name of each generated checkpoint. Checkpoints are saved under outputs/checkpoints/.

Optional arguments:

-d: path to a JSONL file containing training data. By default, a file associated with the task is used.
-e: number of epochs to run. Use -e 30 to run exactly 30 epochs, or -e 1-3 to run at least 1 and at most 3 epochs.
-k: number of epochs to save. Saves the k epochs with the lowest validation losses.
-r: path to a checkpoint to resume training on.
-s: index of an epoch to save. Starts at 0 (0 is first epoch). Can be provided multiple times.
--save-last-epoch: save the last epoch.
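To make the semantics of -e and -k concrete, here is a minimal sketch of how an epoch specification such as 30 or 1-3 could be parsed and how the k epochs with the lowest validation losses could be selected. This is an illustration only, not the code in train.py: the function names parse_epoch_range and best_epochs are hypothetical, and the sketch does not cover how training decides when to stop between the minimum and maximum epoch counts.

```python
import heapq


def parse_epoch_range(spec):
    """Parse '30' into (30, 30) and '1-3' into (1, 3).

    Hypothetical helper; the actual parsing in train.py may differ.
    """
    low, sep, high = spec.partition("-")
    min_epochs = int(low)
    max_epochs = int(high) if sep else min_epochs
    if min_epochs < 0 or max_epochs < min_epochs:
        raise ValueError(f"invalid epoch range: {spec!r}")
    return min_epochs, max_epochs


def best_epochs(val_losses, k):
    """Return the 0-based indices of the k epochs with the lowest
    validation loss, ordered from lowest loss to highest."""
    return heapq.nsmallest(k, range(len(val_losses)), key=val_losses.__getitem__)
```

For example, with validation losses [0.9, 0.5, 0.7, 0.6] and k = 2, best_epochs returns [1, 3], the second and fourth epochs under the 0-based indexing that -s also uses.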

Evaluating

python eval.py TASK -c CKPT -n MODEL_NAME

where TASK is one of

chunking: Chunking
ner: Named Entity Recognition
oie: Open Information Extraction
pos: Part-of-speech Tagging
pr: Punctuation Restoration
re: Relation Extraction
sbd: Sentence Boundary Detection
srl: Semantic Role Labelling

CKPT is the path to the checkpoint used to generate outputs, and MODEL_NAME is a string used to name the generation result cache and to label printed information. Generation result cache files are saved under outputs/generated/.

Optional arguments:

-d: path to a JSONL file containing evaluation data. By default, a file associated with the task is used.
--strict: use a stricter, task-dependent evaluation metric. This may have no effect for some tasks.

Logs

TensorBoard logs can be viewed at localhost:6006 after running:

tensorboard --logdir=outputs/logs --host=0.0.0.0

If you're on GU CLI, refer to the GU CLI remote dev guide to set up a tunnel to view the logs on your local machine.
