Open
Labels
documentation: Improvements or additions to documentation
Description
I just ran bicleaner-hardrules successfully. Thanks for sharing and maintaining this repo!
However, I noticed that the options described in README.md are out of date.
Here are the current options I got when running bicleaner-hardrules -h:
usage: bicleaner-hardrules [-h] [--annotated_output] [-c RULES_CONFIG]
                           [--tmp_dir TMP_DIR] [-b BLOCK_SIZE] [-p PROCESSES]
                           [--score_only] [-A] [--disable_lang_ident]
                           [--disable_minimal_length] [--disable_porn_removal]
                           [-s SOURCE_LANG] [-t TARGET_LANG] [--scol SCOL]
                           [--tcol TCOL] [-S SOURCE_TOKENIZER_COMMAND]
                           [-T TARGET_TOKENIZER_COMMAND] [--disable_lm_filter]
                           [--metadata METADATA] [--lm_threshold LM_THRESHOLD]
                           [-q] [--debug] [--logfile LOGFILE] [-v]
                           [input] [output]

positional arguments:
  input                 Tab-separated bilingual tagged file (default:
                        <_io.TextIOWrapper name='<stdin>' encoding='UTF-8'>)
  output                Output of the classification (default:
                        <_io.TextIOWrapper name='<stdout>' mode='w'
                        encoding='utf-8'>)

options:
  -h, --help            show this help message and exit
  --annotated_output    Adds an extra column with each sentence's evaluation
                        ("keep" if the sentence is good, otherwise the reason
                        for rejecting (default: False)

Optional:
  -c RULES_CONFIG, --rules_config RULES_CONFIG
                        Rules configuration file (default: None)
  --tmp_dir TMP_DIR     Temporary directory where creating the temporary files
                        of this program (default: /tmp)
  -b BLOCK_SIZE, --block_size BLOCK_SIZE
                        Sentence pairs per block (default: 10000)
  -p PROCESSES, --processes PROCESSES
                        Number of processes to use (default: 19)
  --score_only          Only output one column which is the hardrule tag:
                        0(keep) 1(discard) (default: False)
  -A, --run_all_rules   Run all rules for each sentence instead of stopping at
                        first discard (default: False)
  --disable_lang_ident  Don't apply rules that use language detecting
                        (default: False)
  --disable_minimal_length
                        Don't apply minimal length rule (default: False)
  --disable_porn_removal
                        Don't apply porn removal (default: False)
  -s SOURCE_LANG, --source_lang SOURCE_LANG
                        Source language (SL) of the input (default: None)
  -t TARGET_LANG, --target_lang TARGET_LANG
                        Target language (TL) of the input (default: None)
  --scol SCOL           Source sentence column (starting in 1) (default: 1)
  --tcol TCOL           Target sentence column (starting in 1) (default: 2)
  -S SOURCE_TOKENIZER_COMMAND, --source_tokenizer_command SOURCE_TOKENIZER_COMMAND
                        Source language (SL) tokenizer full command (default:
                        None)
  -T TARGET_TOKENIZER_COMMAND, --target_tokenizer_command TARGET_TOKENIZER_COMMAND
                        Target language (TL) tokenizer full command (default:
                        None)
  --disable_lm_filter   Don't apply LM filtering (default: False)
  --metadata METADATA   Bicleaner metadata (YAML file) (default: None)
  --lm_threshold LM_THRESHOLD
                        Threshold for language model fluency scoring.
                        (default: 0.5)

Logging:
  -q, --quiet           Silent logging mode (default: False)
  --debug               Debug logging mode (default: False)
  --logfile LOGFILE     Store log to a file (default: <_io.TextIOWrapper
                        name='<stderr>' mode='w' encoding='utf-8'>)
  -v, --version         show version of this script and exit
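
In case it helps with updating the README, here is a minimal sketch of how these options might be combined from a Python script. It only assembles and prints the command line; it does not run the tool. The helper name `build_hardrules_command` and the file names are hypothetical, and I have not checked that the tool accepts this exact combination of flags. Every flag used is taken from the help output above.

```python
import shlex


def build_hardrules_command(input_path, output_path, source_lang, target_lang,
                            annotated_output=False, run_all_rules=False,
                            disable_lm_filter=False, processes=None):
    """Assemble an argv list for bicleaner-hardrules (hypothetical helper).

    Only flags that appear in the `bicleaner-hardrules -h` output are used.
    """
    cmd = ["bicleaner-hardrules", "-s", source_lang, "-t", target_lang]
    if annotated_output:
        # Adds an extra column with "keep" or the rejection reason.
        cmd.append("--annotated_output")
    if run_all_rules:
        # Run every rule instead of stopping at the first discard.
        cmd.append("--run_all_rules")
    if disable_lm_filter:
        # Skip LM filtering.
        cmd.append("--disable_lm_filter")
    if processes is not None:
        cmd += ["-p", str(processes)]
    # Positional arguments come last: input file, then output file.
    cmd += [input_path, output_path]
    return cmd


# Hypothetical file names, for illustration only.
command = build_hardrules_command(
    "corpus.en-fr.tsv", "corpus.en-fr.hardrules.tsv", "en", "fr",
    annotated_output=True, disable_lm_filter=True, processes=4,
)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` once bicleaner-hardrules is installed. Keeping the command as a list avoids shell quoting issues with file names.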
I don't need it right away, but I'd be curious to know how to pass a custom rules configuration file with the -c RULES_CONFIG option (its format and the possible options) :-)