
bicleaner-hardrules options description needs update in the README #12

@OrianeN

I just ran bicleaner-hardrules successfully. Thanks for sharing and maintaining this repo!

However, I noticed that the options described in README.md are not up to date.

Here are the new options I got when running bicleaner-hardrules -h:

usage: bicleaner-hardrules [-h] [--annotated_output] [-c RULES_CONFIG]
                           [--tmp_dir TMP_DIR] [-b BLOCK_SIZE] [-p PROCESSES]
                           [--score_only] [-A] [--disable_lang_ident]
                           [--disable_minimal_length] [--disable_porn_removal]
                           [-s SOURCE_LANG] [-t TARGET_LANG] [--scol SCOL]
                           [--tcol TCOL] [-S SOURCE_TOKENIZER_COMMAND]
                           [-T TARGET_TOKENIZER_COMMAND] [--disable_lm_filter]
                           [--metadata METADATA] [--lm_threshold LM_THRESHOLD]
                           [-q] [--debug] [--logfile LOGFILE] [-v]
                           [input] [output]

positional arguments:
  input                 Tab-separated bilingual tagged file (default:
                        <_io.TextIOWrapper name='<stdin>' encoding='UTF-8'>)
  output                Output of the classification (default:
                        <_io.TextIOWrapper name='<stdout>' mode='w'
                        encoding='utf-8'>)

options:
  -h, --help            show this help message and exit
  --annotated_output    Adds an extra column with each sentence's evaluation
                        ("keep" if the sentence is good, otherwise the reason
                        for rejecting (default: False)

Optional:
  -c RULES_CONFIG, --rules_config RULES_CONFIG
                        Rules configuration file (default: None)
  --tmp_dir TMP_DIR     Temporary directory where creating the temporary files
                        of this program (default: /tmp)
  -b BLOCK_SIZE, --block_size BLOCK_SIZE
                        Sentence pairs per block (default: 10000)
  -p PROCESSES, --processes PROCESSES
                        Number of processes to use (default: 19)
  --score_only          Only output one column which is the hardrule tag:
                        0(keep) 1(discard) (default: False)
  -A, --run_all_rules   Run all rules for each sentence instead of stopping at
                        first discard (default: False)
  --disable_lang_ident  Don't apply rules that use language detecting
                        (default: False)
  --disable_minimal_length
                        Don't apply minimal length rule (default: False)
  --disable_porn_removal
                        Don't apply porn removal (default: False)
  -s SOURCE_LANG, --source_lang SOURCE_LANG
                        Source language (SL) of the input (default: None)
  -t TARGET_LANG, --target_lang TARGET_LANG
                        Target language (TL) of the input (default: None)
  --scol SCOL           Source sentence column (starting in 1) (default: 1)
  --tcol TCOL           Target sentence column (starting in 1) (default: 2)
  -S SOURCE_TOKENIZER_COMMAND, --source_tokenizer_command SOURCE_TOKENIZER_COMMAND
                        Source language (SL) tokenizer full command (default:
                        None)
  -T TARGET_TOKENIZER_COMMAND, --target_tokenizer_command TARGET_TOKENIZER_COMMAND
                        Target language (TL) tokenizer full command (default:
                        None)
  --disable_lm_filter   Don't apply LM filtering (default: False)
  --metadata METADATA   Bicleaner metadata (YAML file) (default: None)
  --lm_threshold LM_THRESHOLD
                        Threshold for language model fluency scoring.
                        (default: 0.5)

Logging:
  -q, --quiet           Silent logging mode (default: False)
  --debug               Debug logging mode (default: False)
  --logfile LOGFILE     Store log to a file (default: <_io.TextIOWrapper
                        name='<stderr>' mode='w' encoding='utf-8'>)
  -v, --version         show version of this script and exit
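
To show how the options above fit together, here is a minimal Python sketch that assembles a bicleaner-hardrules command line and tallies the contents of the extra column produced by --annotated_output. It only uses flags that appear in the help text. The helper names (`build_command`, `tally_annotations`), the file names, and the `en`/`fr` language codes are hypothetical, and the sketch assumes that the annotation is the last tab-separated column of each output line, which the help text does not state explicitly.

```python
from collections import Counter


def build_command(src_lang, tgt_lang, input_path, output_path,
                  annotated=True, run_all=False):
    """Assemble an argument list using only flags listed in the help output."""
    cmd = ["bicleaner-hardrules", "-s", src_lang, "-t", tgt_lang]
    if annotated:
        # Adds an extra column: "keep" or the reason for rejecting.
        cmd.append("--annotated_output")
    if run_all:
        # Run every rule instead of stopping at the first discard.
        cmd.append("--run_all_rules")
    # Positional arguments come last: [input] [output].
    cmd += [input_path, output_path]
    return cmd


def tally_annotations(lines):
    """Count the values found in the last tab-separated column.

    Assumption: the annotation column is appended as the last column.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        counts[line.split("\t")[-1]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical file names and language pair, for illustration only.
    cmd = build_command("en", "fr", "corpus.en-fr.tsv", "corpus.annotated.tsv")
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the tool is installed, and the lines of the annotated output file can then be fed to `tally_annotations` to see how many pairs were kept versus rejected per reason.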

I don't need it right away, but I'd be curious to know how to pass a custom rules configuration file with the option -c RULES_CONFIG (its format and possible options) :-)

Labels: documentation (Improvements or additions to documentation)
