Onboard Project: Usage
Matthew Beech edited this page Apr 9, 2026 · 24 revisions
This page details the silnlp.common.onboard_project script's usage and the configuration options available.
Cleans and uploads a Paratext project from a local machine to the MinIO bucket. Optionally performs other Onboarding tasks.
    usage: python -m silnlp.common.onboard_project main-project [--ref-projects projects] [--copy-from [local_dir]]
           [--config path_to_config] [--extract-corpora] [--collect-verse-counts] [--no-clean] [--datestamp]
           [--wildebeest] [--stats] [--align] [--align-isos isos] [--clearml-queue queue] [--clearml-tag tag]
Arguments:
| Argument | Purpose | Description |
|---|---|---|
| `main-project` | The main Paratext project name for onboarding. | (Required) The project will be stored on the bucket at Paratext/projects. |
| `--ref-projects` | The reference Paratext project name(s) for onboarding the main project. | The projects will be stored on the bucket at Paratext/projects. |
| `--copy-from [local_dir]` | Path to a directory with a Paratext project. | The local project(s) will be copied to the bucket. If included without a local_dir, defaults to the user's Downloads folder. |
| `--config path_to_config` | Path to a config.yml file. | Configures which optional onboarding tasks will run. |
| `--extract-corpora` | Runs silnlp.common.extract_corpora. | Extracts corpora. See here for more information. |
| `--collect-verse-counts` | Runs silnlp.common.collect_verse_counts. | Collects verse counts. Stores results in MT/experiments/OnboardingRequests/project_name/vers_counts. |
| `--no-clean` | Skips running silnlp.common.clean_projects. | Does not clean the local project before uploading to the bucket. |
| `--datestamp` | Appends the current datestamp to the project name. | Adds a datestamp to the project folder name when creating a new Paratext project folder. |
| `--wildebeest` | Runs a Wildebeest analysis on the extracted corpora. | Produces a Wildebeest report for the project. Stores results in MT/experiments/OnboardingRequests/project_name/wildebeest. |
| `--stats` | Computes tokenization statistics. | Computes tokenization statistics. Stores results in MT/experiments/OnboardingRequests/project_name/stats. |
| `--align` | Computes alignments. | Runs silnlp.common.analyze to compute alignments. Stores results in MT/experiments/OnboardingRequests/project_name/alignments. |
| `--align-isos` | List of ISO codes used to determine the standard alignment projects. Uses assets/standard_alignments.yml. | Runs silnlp.common.analyze to compute alignments. Stores results in MT/experiments/OnboardingRequests/project_name/alignments. |
| `--clearml-queue queue` | ClearML queue. | Runs remotely on the given ClearML queue. Default: None (do not register with ClearML). The queue 'local' runs the task locally and registers it with ClearML. |
| `--clearml-tag tag` | ClearML tag. | Tag to add to the ClearML Task. Possible tags: 'research', 'dev', 'eitl', 'onboarding'. |
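
As a rough illustration of how these arguments combine, the sketch below assembles a command line in Python and prints it. The helper `build_onboard_command` and the names `MyProject` and `RefProject` are hypothetical placeholders; only the module name and the flags come from the usage text. It also assumes that several reference projects are passed as separate values after `--ref-projects`, which the usage text does not spell out.

```python
import shlex


def build_onboard_command(main_project, ref_projects=(), config=None, flags=()):
    """Assemble an argument list for silnlp.common.onboard_project (hypothetical helper)."""
    cmd = ["python", "-m", "silnlp.common.onboard_project", main_project]
    if ref_projects:
        # Assumption: several reference projects are given as separate values.
        cmd += ["--ref-projects", *ref_projects]
    if config is not None:
        cmd += ["--config", str(config)]
    # Boolean switches such as --extract-corpora or --stats take no value.
    cmd += [f"--{flag}" for flag in flags]
    return cmd


cmd = build_onboard_command(
    "MyProject",  # placeholder main project name
    ref_projects=["RefProject"],  # placeholder reference project name
    config="config.yml",
    flags=["extract-corpora", "collect-verse-counts", "datestamp"],
)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` to launch the script. Keeping the arguments as a list rather than a single string avoids quoting problems with project names that contain spaces.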
The config file contains the parameters for all of the optional onboarding tasks this script can execute.
Below is an example of an onboarding config:
    extract_corpora:
      include: NT
      exclude: OT
    verse_counts:
      output_folder: verse_counts/test_onboard_project
      files: "*.txt"
      deutero: false
      recount: false
    wildebeest:
      x: 500
      n: 500
      r: vref.txt
    stats:
      data:
        corpus_pairs:
          type: train
          src: project_extract_file
          trg:
            - <reference_project_extract_files>
        lang_codes:
          iso_code: nllb_tag
    align:
      data:
        aligner: eflomal
        corpus_pairs:
          - mapping: many_to_many
            src:
              - <onboarded_project>
            trg:
              - <back_translation_project>
              - <reference_project(s)>
              - <other_alignment_project(s)>
            type: train
        tokenize: false
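
Because each top-level key of the config corresponds to one optional task, a quick check of the section names can catch typos before a run. The following is a minimal sketch, not part of silnlp: `summarize_config` is a hypothetical helper, and the dictionary stands in for a config.yml that has already been parsed (for example with PyYAML's `yaml.safe_load`). How the script itself treats unrecognized sections is not documented here.

```python
# Section names taken from the example config on this page.
KNOWN_SECTIONS = ("extract_corpora", "verse_counts", "wildebeest", "stats", "align")


def summarize_config(config):
    """Split a parsed config mapping into recognized and unrecognized section names."""
    known = [name for name in KNOWN_SECTIONS if name in config]
    unknown = sorted(set(config) - set(KNOWN_SECTIONS))
    return known, unknown


# Stand-in for the parsed contents of config.yml.
example_config = {
    "extract_corpora": {"include": "NT", "exclude": "OT"},
    "wildebeest": {"x": 500, "n": 500, "r": "vref.txt"},
    "stat": {},  # deliberate typo for "stats"
}

known, unknown = summarize_config(example_config)
print("configured:", known)
print("unrecognized:", unknown)
```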
`extract_corpora`:
- `include=[]`: A list of books to include; e.g., 'NT', 'OT', 'GEN'.
- `exclude=[]`: A list of books to exclude; e.g., 'NT', 'OT', 'GEN'.
- `markers=False`: If true, include USFM markers in extraction.
- `lemmas=False`: If true, extract lemmas.
- `project_vrefs=False`: If true, extract project_vrefs.
`verse_counts`:
- `output_folder=path_to_output_folder`: Folder to store the verse counts.
- `files=*.txt`: Semicolon-delimited list of patterns of extract file names to count (e.g. 'en-*.txt;fr-NT.txt').
- `deutero=False`: If true, include counts for Deuterocanon books.
- `recount=False`: If true, force a recount of verse counts.
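
To make the semicolon-delimited `files` syntax concrete, the sketch below shows how such a pattern string might select extract files. It is an illustration using Python's standard `fnmatch` module, not the matching code used by silnlp.common.collect_verse_counts; the helper `select_extract_files` and the file names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_extract_files(file_names, patterns):
    """Return the names matching any pattern in a semicolon-delimited pattern string."""
    pattern_list = [p.strip() for p in patterns.split(";") if p.strip()]
    return [name for name in file_names if any(fnmatchcase(name, p) for p in pattern_list)]


# Hypothetical extract file names.
names = ["en-KJV.txt", "en-NIV.txt", "fr-NT.txt", "fr-OT.txt", "notes.md"]
selected = select_extract_files(names, "en-*.txt;fr-NT.txt")
print(selected)  # ['en-KJV.txt', 'en-NIV.txt', 'fr-NT.txt']
```

Under this reading, the default `*.txt` would select every `.txt` extract file in the folder.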
`wildebeest`:
- `x=500`: Maximum number of examples per line.
- `n=500`: Maximum number of cases per group.
- `r=vref.txt`: File with sentence reference IDs.
- See the Wildebeest Repo for more info.
`stats`:
- Configures the tokenizer used for tokenization statistics.
- The example shows the default used when the `stats` section is empty.
- See Configure A Model for a full list of parameter definitions.
`align`:
- Configures which projects to align with the onboarded project.