
Onboard Project: Usage

Matthew Beech edited this page Apr 9, 2026 · 24 revisions

Onboarding Projects

This page details the silnlp.common.onboard_project script's usage and the configuration options available.

onboard_project

Cleans and uploads a Paratext project from a local machine to the MinIO bucket. Optionally performs other Onboarding tasks.

usage: python -m silnlp.common.onboard_project main-project [--ref-projects projects] [--copy-from [local_dir]]
[--config path_to_config] [--extract-corpora] [--collect-verse-counts] [--no-clean] [--datestamp]
[--wildebeest] [--stats] [--align] [--align-isos isos] [--clearml-queue queue] [--clearml-tag tag]
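
As a rough illustration, the command line above could be assembled from Python as sketched below. The helper `build_onboard_command`, the project names `MyProject` and `RefProjectA`, and the local path are hypothetical; only the flag names come from the usage string. The sketch also assumes that multiple reference projects are passed space-separated, which this page does not confirm.

```python
def build_onboard_command(main_project, ref_projects=(), copy_from=None, config=None, flags=()):
    """Return an argv list for silnlp.common.onboard_project.

    Hypothetical helper: only the flag names are taken from the usage string.
    """
    cmd = ["python", "-m", "silnlp.common.onboard_project", main_project]
    if ref_projects:
        # Assumption: multiple reference projects are passed space-separated.
        cmd += ["--ref-projects", *ref_projects]
    if copy_from is not None:
        cmd += ["--copy-from", copy_from]
    if config is not None:
        cmd += ["--config", config]
    # Boolean switches such as --extract-corpora or --stats are appended as-is.
    cmd += list(flags)
    return cmd


command = build_onboard_command(
    "MyProject",                             # hypothetical main project
    ref_projects=["RefProjectA"],            # hypothetical reference project
    copy_from="/path/to/paratext/projects",  # placeholder local directory
    config="config.yml",
    flags=["--extract-corpora", "--collect-verse-counts"],
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` to launch the script.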

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| main-project | The main Paratext project name for onboarding (required). | The project will be stored on the bucket at Paratext/projects. |
| --ref-projects | The reference Paratext project name(s) for onboarding the main project. | The projects will be stored on the bucket at Paratext/projects. |
| --copy-from [local_dir] | Path to a directory with a Paratext project. | The local project(s) will be copied to the bucket. If given without a local_dir, defaults to the user's Downloads folder. |
| --config path_to_config | Path to a config.yml file. | Configures which optional onboarding tasks will run. |
| --extract-corpora | Runs silnlp.common.extract_corpora. | Extracts corpora. See here for more information. |
| --collect-verse-counts | Runs silnlp.common.collect_verse_counts. | Collects verse counts. Stores results in MT/experiments/OnboardingRequests/project_name/vers_counts. |
| --no-clean | Skips running silnlp.common.clean_projects. | Does not clean the local project before uploading to the bucket. |
| --datestamp | Appends the current datestamp to the project name. | Adds a datestamp to the project folder name when creating a new Paratext project folder. |
| --wildebeest | Runs a Wildebeest analysis on the extracted corpora. | Produces a Wildebeest report for the project. Stores results in MT/experiments/OnboardingRequests/project_name/wildebeest. |
| --stats | Computes tokenization statistics. | Computes tokenization statistics. Stores results in MT/experiments/OnboardingRequests/project_name/stats. |
| --align | Computes alignments. | Runs silnlp.common.analyze to compute alignments. Stores results in MT/experiments/OnboardingRequests/project_name/alignments. |
| --align-isos isos | List of ISO codes used to determine standard alignment projects. Uses assets/standard_alignments.yml. | Runs silnlp.common.analyze to compute alignments. Stores results in MT/experiments/OnboardingRequests/project_name/alignments. |
| --clearml-queue queue | ClearML queue. | Runs remotely on the given ClearML queue. Default: None (do not register with ClearML). The queue 'local' runs the task locally and registers it with ClearML. |
| --clearml-tag tag | ClearML tag. | Tag to add to the ClearML Task. Possible tags: 'research', 'dev', 'eitl', 'onboarding'. |
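
The result locations listed in the table share one layout: a per-project folder under MT/experiments/OnboardingRequests with one subfolder per task. The sketch below only restates that layout; the helper `results_folder` and the project name `MyProject` are hypothetical, and the subfolder names (including `vers_counts`) are copied as written on this page rather than checked against the script.

```python
from pathlib import PurePosixPath

# Root folder for onboarding results on the bucket, as listed in the table.
RESULTS_ROOT = PurePosixPath("MT/experiments/OnboardingRequests")

# Subfolder per optional task, spelled exactly as in the table.
TASK_SUBFOLDERS = {
    "--collect-verse-counts": "vers_counts",
    "--wildebeest": "wildebeest",
    "--stats": "stats",
    "--align": "alignments",
}


def results_folder(project_name, flag):
    """Return the documented results folder for one task flag (hypothetical helper)."""
    return RESULTS_ROOT / project_name / TASK_SUBFOLDERS[flag]


for flag in TASK_SUBFOLDERS:
    print(flag, "->", results_folder("MyProject", flag))
```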

config file

The config file contains the parameters for all of the optional onboarding tasks this script can execute.

Below is an example of an onboarding config:

extract_corpora:
  include: NT
  exclude: OT
verse_counts:
  output_folder: verse_counts/test_onboard_project
  files: "*.txt"
  deutero: false
  recount: false
wildebeest:
  x: 500
  n: 500
  r: vref.txt
stats:
  data:
    corpus_pairs:
      type: train
      src: project_extract_file
      trg:
      - <reference_project_extract_files>
      lang_codes:
        iso_code: nllb_tag
align:
  data:
    aligner: eflomal
    corpus_pairs:
    - mapping: many_to_many
      src:
      - <onboarded_project>
      trg:
      - <back_translation_project>
      - <reference_project(s)>
      - <other_alignment_project(s)>
      type: train
    tokenize: false

Parameter Definitions

extract_corpora

  • include=[]: A list of books to include; e.g., 'NT', 'OT', 'GEN'.
  • exclude=[]: A list of books to exclude; e.g., 'NT', 'OT', 'GEN'.
  • markers=False: If true, include USFM markers in extraction.
  • lemmas=False: If true, extract lemmas.
  • project_vrefs=False: If true, extract project_vrefs.

collect_verse_counts

  • output_folder=path_to_output_folder: Folder to store the verse counts.
  • files=*.txt: Semicolon-delimited list of patterns of extract file names to count (e.g., 'en-*.txt;fr-NT.txt').
  • deutero=False: If true, include counts for Deuterocanon books.
  • recount=False: If true, force recount of verse counts.
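
To make the semicolon-delimited `files` value concrete, here is a minimal sketch of how such a pattern list could be matched against extract file names. It assumes shell-style glob semantics and uses Python's standard `fnmatch` module; the helper `matches_any` and the file names are hypothetical, and the script's actual matching logic may differ.

```python
from fnmatch import fnmatchcase


def matches_any(filename, patterns):
    """Return True if filename matches any pattern in a semicolon-delimited list.

    Assumption: patterns use shell-style globs and are matched case-sensitively.
    """
    return any(
        fnmatchcase(filename, pattern.strip())
        for pattern in patterns.split(";")
        if pattern.strip()  # ignore empty entries such as a trailing ';'
    )


patterns = "en-*.txt;fr-NT.txt"
for name in ["en-KJV.txt", "fr-NT.txt", "fr-OT.txt"]:  # hypothetical extract files
    print(name, matches_any(name, patterns))
```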

wildebeest

  • x=500: Maximum number of examples per line.
  • n=500: Maximum number of cases per group.
  • r=vref.txt: File with sentence reference IDs.
  • See the Wildebeest Repo for more info.
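
The `name=value` notation above suggests that each option falls back to the listed value when omitted from the config. A minimal sketch of that overlay is shown below, under that assumption; the helper `wildebeest_options` is hypothetical and does not reflect how the script actually reads the section.

```python
# Defaults as listed on this page.
WILDEBEEST_DEFAULTS = {"x": 500, "n": 500, "r": "vref.txt"}


def wildebeest_options(section=None):
    """Overlay user-supplied values on the documented defaults (hypothetical helper)."""
    return {**WILDEBEEST_DEFAULTS, **(section or {})}


# Only x is overridden; n and r keep their documented defaults.
options = wildebeest_options({"x": 100})
print(options)
```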

stats

  • Configures the Tokenizer used for tokenization statistics.
  • The example shows the default used when the stats section is empty.
  • See Configure A Model for a full list of parameter definitions.

align

  • Configures which projects to align with the onboarded project.
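
As an illustration of the shape of the align section, the sketch below builds the same structure as the example config as a Python dict. The helper `align_section` and the project names are hypothetical, and placing `tokenize` under `data` is an assumption based on the example's layout.

```python
def align_section(onboarded_project, target_projects, aligner="eflomal"):
    """Build an align section mirroring the example config (hypothetical helper)."""
    return {
        "data": {
            "aligner": aligner,
            "corpus_pairs": [
                {
                    "mapping": "many_to_many",
                    "src": [onboarded_project],
                    # Back translation, reference, and other alignment projects.
                    "trg": list(target_projects),
                    "type": "train",
                }
            ],
            # Assumption: tokenize sits alongside aligner and corpus_pairs.
            "tokenize": False,
        }
    }


section = align_section("MyProject", ["MyBackTranslation", "RefProjectA"])
print(section["data"]["corpus_pairs"][0]["trg"])
```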
