Onboard Project: Usage

Onboarding Projects

This page details the silnlp.common.onboard_project script's usage and the configuration options available.

onboard_project

Cleans and uploads a Paratext project from a local machine to the MinIO bucket. Optionally performs other Onboarding tasks.

usage: python -m silnlp.common.onboard_project main-project [--ref-projects projects] [--copy-from [local_dir]]
[--config path_to_config] [--extract-corpora] [--collect-verse-counts] [--no-clean] [--datestamp]
[--wildebeest] [--stats] [--align] [--align-isos isos] [--clearml-queue queue] [--clearml-tag tag]

Arguments:

Argument	Purpose	Description
`main-project`	The Main Paratext project name for onboarding.	(Required) The project will be stored on the bucket at `Paratext/projects`.
`--ref-projects`	The Reference Paratext project name(s) for onboarding the main project.	The projects will be stored on the bucket at `Paratext/projects`.
`--copy-from [local_dir]`	Path to a directory with a Paratext project.	The local project(s) will be copied to the bucket. If the flag is given without a local_dir, the user's `Downloads` folder is used.
`--config path_to_config`	Path to a config.yml file	This is used to configure what optional Onboarding tasks will run.
`--extract-corpora`	Runs silnlp.common.extract_corpora	Extracts corpora. See here for more information.
`--collect-verse-counts`	Runs silnlp.common.collect_verse_counts	Collects verse counts. Stores results in MT/experiments/OnboardingRequests/project_name/vers_counts.
`--no-clean`	Skips running silnlp.common.clean_projects	Does not clean the local project before uploading to the bucket.
`--datestamp`	Appends a current datestamp to the project name	Adds a datestamp to the project folder name when creating a new Paratext project folder.
`--wildebeest`	Runs a Wildebeest analysis on the extracted corpora.	Produces a Wildebeest report for the project. Stores results in MT/experiments/OnboardingRequests/project_name/wildebeest.
`--stats`	Compute tokenization statistics	Compute tokenization statistics. Stores results in MT/experiments/OnboardingRequests/project_name/stats.
`--align`	Compute Alignments	Runs silnlp.common.analyze to compute alignments. Stores results in MT/experiments/OnboardingRequests/project_name/alignments.
`--align-isos`	List of iso codes to determine standard alignment projects. Uses assets/standard_alignments.yml	Runs silnlp.common.analyze to compute alignments. Stores results in MT/experiments/OnboardingRequests/project_name/alignments.
`--clearml-queue queue`	ClearML queue	Runs remotely on the given ClearML queue. Default: None (do not register with ClearML). The queue 'local' runs the task locally and registers it with ClearML.
`--clearml-tag tag`	ClearML tag	Tag to add to the ClearML Task. Possible tags: 'research', 'dev', 'eitl', 'onboarding'
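
To make the flag behavior in the table concrete, here is a minimal `argparse` sketch that mirrors the documented options. It is not the script's actual source: the function name `build_parser`, the use of `nargs="*"` for `--ref-projects` and `--align-isos`, and the restriction of `--clearml-tag` to the four listed tags are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the documented flags (not the real script)."""
    parser = argparse.ArgumentParser(prog="onboard_project")
    parser.add_argument("main_project", help="Main Paratext project name")
    # Assumption: reference projects are given as space-separated names.
    parser.add_argument("--ref-projects", nargs="*", default=[])
    # nargs="?" lets the flag appear with or without a directory;
    # const is the value used when the flag is given without one.
    parser.add_argument(
        "--copy-from", nargs="?", type=Path, default=None,
        const=Path.home() / "Downloads",
    )
    parser.add_argument("--config", type=Path, default=None)
    # Simple on/off switches from the table.
    for flag in ("--extract-corpora", "--collect-verse-counts", "--no-clean",
                 "--datestamp", "--wildebeest", "--stats", "--align"):
        parser.add_argument(flag, action="store_true")
    # Assumption: ISO codes are given as space-separated values.
    parser.add_argument("--align-isos", nargs="*", default=[])
    parser.add_argument("--clearml-queue", default=None)
    # Assumption: only the tags listed in the table are accepted.
    parser.add_argument("--clearml-tag", default=None,
                        choices=["research", "dev", "eitl", "onboarding"])
    return parser


# Example: "MyProject" and the reference names are placeholders.
args = build_parser().parse_args(
    ["MyProject", "--ref-projects", "Ref1", "Ref2", "--copy-from", "--extract-corpora"]
)
```

In this sketch, `--copy-from` given without a directory resolves to the `Downloads` folder under the user's home directory, and omitting the flag entirely leaves it as `None`.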

config file

The config file contains the parameters for all of the optional onboarding tasks this script can execute.

Below is an example of an onboarding config:

extract_corpora:
  include: NT
  exclude: OT
verse_counts:
  output_folder: verse_counts/test_onboard_project
  files: "*.txt"
  deutero: false
  recount: false
wildebeest:
  x: 500
  n: 500
  r: vref.txt
stats:
  data:
    corpus_pairs:
      type: train
      src: project_extract_file
      trg:
      - <reference_project_extract_files>
      lang_codes:
        iso_code: nllb_tag
align:
  data:
    aligner: eflomal
    corpus_pairs:
    - mapping: many_to_many
      src:
      - <onboarded_project>
      trg:
      - <back_translation_project>
      - <reference_project(s)>
      - <other_alignment_project(s)>
      type: train
    tokenize: false
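
A misspelled section name in the config is easy to miss. The following sketch checks the top-level keys of an already-parsed config against the section names used in the example above. It assumes the config has been loaded into a plain dict (as a YAML loader would return); `KNOWN_SECTIONS` and `unknown_sections` are hypothetical names and are not part of silnlp.

```python
from typing import Any, Dict, List

# Section names taken from the example config above.
KNOWN_SECTIONS = {"extract_corpora", "verse_counts", "wildebeest", "stats", "align"}


def unknown_sections(config: Dict[str, Any]) -> List[str]:
    """Return the top-level keys that are not among the documented sections."""
    return sorted(set(config) - KNOWN_SECTIONS)


# A parsed config, represented here as a plain dict.
config = {
    "extract_corpora": {"include": "NT", "exclude": "OT"},
    "wildebest": {"x": 500, "n": 500, "r": "vref.txt"},  # deliberate typo
}
typos = unknown_sections(config)
```

Here `typos` contains the misspelled `wildebest` key, which can be reported before any onboarding task starts.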

Parameter Definitions

extract_corpora

include=[]: A list of books to include; e.g., 'NT', 'OT', 'GEN'.
exclude=[]: A list of books to exclude; e.g., 'NT', 'OT', 'GEN'.
markers=False: If true, include USFM markers in extraction.
lemmas=False: If true, extract lemmas.
project_vrefs=False: If true, extract project_vrefs.
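
The defaults above can be pictured as a small options object. The sketch below is an illustration only: `ExtractCorporaOptions`, `from_section`, and `as_list` are hypothetical names, and the handling of a single string such as `include: NT` (as written in the example config) by wrapping it in a list is an assumption, not confirmed behavior of the script.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


def as_list(value: Union[str, List[str], None]) -> List[str]:
    """Normalize a missing value, a single string, or a list into a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


@dataclass
class ExtractCorporaOptions:
    # Defaults follow the parameter definitions listed above.
    include: List[str] = field(default_factory=list)
    exclude: List[str] = field(default_factory=list)
    markers: bool = False
    lemmas: bool = False
    project_vrefs: bool = False

    @classmethod
    def from_section(cls, section: Dict[str, Any]) -> "ExtractCorporaOptions":
        """Build options from an extract_corpora config section, filling in defaults."""
        return cls(
            include=as_list(section.get("include")),
            exclude=as_list(section.get("exclude")),
            markers=bool(section.get("markers", False)),
            lemmas=bool(section.get("lemmas", False)),
            project_vrefs=bool(section.get("project_vrefs", False)),
        )


# The extract_corpora section from the example config.
options = ExtractCorporaOptions.from_section({"include": "NT", "exclude": "OT"})
```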

collect_verse_counts

output_folder=path_to_output_folder: Folder to store the verse counts.
files=*.txt: Semicolon-delimited list of patterns of extract file names to count (e.g. 'en-*.txt;fr-NT.txt').
deutero=False: If true, include counts for Deuterocanon books.
recount=False: If true, force recount of verse counts.
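
As a rough illustration of how a semicolon-delimited `files` value selects extract files, the sketch below splits the value and tests each pattern with the standard library's shell-style matcher. The helper name `matches_any` is hypothetical, and the script itself may match files differently.

```python
from fnmatch import fnmatchcase


def matches_any(file_name: str, files: str) -> bool:
    """Return True if file_name matches at least one semicolon-delimited pattern."""
    patterns = [p.strip() for p in files.split(";") if p.strip()]
    # fnmatchcase is case-sensitive on every platform, so results are consistent.
    return any(fnmatchcase(file_name, pattern) for pattern in patterns)


# Placeholder file names checked against the pattern from the example above.
selected = [
    name
    for name in ["en-NT.txt", "en-OT.txt", "fr-NT.txt", "fr-OT.txt"]
    if matches_any(name, "en-*.txt;fr-NT.txt")
]
```

With these inputs, `selected` keeps both `en-` files and `fr-NT.txt` but not `fr-OT.txt`.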

wildebeest

x=500: Maximum number of examples per line.
n=500: Maximum number of cases per group.
r=vref.txt: File with sentence reference IDs.
See the Wildebeest Repo for more information.

stats

Configures the tokenizer used for tokenization statistics.
The example shows the default used when the stats section is empty.
See Configure A Model for a full list of parameter definitions.

align

Configures which projects to align with the onboarded project.
