README.md (2 additions, 0 deletions)
@@ -178,6 +178,8 @@ The resulting CoNLL-U vertical is output to the standard output (STDOUT) by defa
The option `-v` reports some basic information about the progress to the standard error output (STDERR).
+By default, the script doubles all line breaks (newline characters) before sending the contents to the API, since UDPipe requires two consecutive line breaks (i.e. an empty line) to indicate a paragraph break. This behavior can be disabled with the option `-nd` (or `--no-double-lb`; configuration option `no_double_lb`).
+
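The line-break doubling described in the added README paragraph can be illustrated with a minimal sketch. The function name `double_line_breaks` and its signature are hypothetical and not taken from the script; the sketch also assumes plain `\n` line endings, since the diff does not show how the script treats `\r\n`.

```python
def double_line_breaks(text: str, no_double_lb: bool = False) -> str:
    """Double every newline so that each line break becomes an empty line.

    Hypothetical helper for illustration only; the real script may
    implement this step differently.
    """
    if no_double_lb:
        # Mirrors the effect of -nd / --no-double-lb: send the text unchanged.
        return text
    return text.replace("\n", "\n\n")


sample = "First paragraph.\nSecond paragraph."
print(repr(double_line_breaks(sample)))
# 'First paragraph.\n\nSecond paragraph.'
```

Note that input which already separates paragraphs with empty lines would end up with four consecutive newlines under this sketch, which is one plausible reason to offer a switch that turns doubling off.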
The script supports [all documented features](http://lindat.mff.cuni.cz/services/udpipe/api-reference.php) of the LINDAT UDPipe REST API: any analysis of the input can be suppressed by using the option `-na` (`--no-analysis`) and then only segmentation and tokenization will be performed; syntactic (dependency) parsing can be suppressed using the option `-np` (`--no-parsing`); input or output format can be set by the options `-if <format>` (`--input-format`) and `-of <format>` (`--output-format`); additional options may be passed to the tokenizer, tagger and syntactic parser using the corresponding options `--tokenizer`, `--tagger` and `--parser` or configuration options `tokenizer_options`, `tagger_options` and `parser_options` respectively.
The script can also call the [NameTag](https://lindat.mff.cuni.cz/services/nametag/) NER tool to enrich the CoNLL-U output with named entity recognition. Use the option `-ner` (configuration option `named_entities`), optionally followed by the NER model to use. If no model is specified on the command line (or the value `auto` is used in the configuration), the same model specification is requested as for the UDPipe tagger, which may fail: while NameTag accepts some basic language specifications shared with the UDPipe tagger (such as `cs`, `en` or `de`), it does not recognize others. Currently, only a Czech model and a universal multilingual model are available, so languages other than `cs` are automatically analyzed with the latter (if they are recognized at all). See also the corresponding [documentation on models](https://lindat.mff.cuni.cz/services/nametag/api-reference.php#models).
'-nd' Do not double line breaks (as required for UDPipe to enforce paragraph breaks)
+
'--tokenizer' Additional options for the tokenizer.
'--tagger' Additional options for the tagger (analysis).
@@ -184,6 +187,7 @@ if __name__ == '__main__':
parser.add_argument("-b", "--batch", help="batch size in lines (default=1000)", type=int, dest="tagger_batch")
parser.add_argument("-na", "--no-analysis", help="do NOT perform any analysis (lemmatization, tagging or parsing); perform just segmentation and tokenization", action="store_true")
parser.add_argument("-np", "--no-parsing", help="do NOT perform dependency parsing (syntax)", action="store_true")
+parser.add_argument("-nd", "--no-double-lb", help="do NOT double line breaks (UDPipe requires a sequence of two as paragraph break)", action="store_true")
parser.add_argument("--tokenizer", help="additional options for the tokenizer", type=str, dest="tokenizer_options")
parser.add_argument("--tagger", help="additional options for the tagger (ignored if --no-analysis is used)", type=str, dest="tagger_options")
parser.add_argument("--parser", help="additional options for the syntactic parser (ignored if --no-analysis or --no-parsing is used)", type=str, dest="parser_options")
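For reviewers who want to check how the new flag surfaces after parsing, the following standalone sketch reproduces two of the `add_argument` calls from this hunk (help texts omitted). `build_parser` is a hypothetical wrapper added only to keep the example self-contained, and `"ranges"` is just a placeholder value. With `argparse`, `--no-double-lb` is exposed as `args.no_double_lb`, which matches the configuration option name given in the README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical wrapper around two arguments copied from the diff."""
    parser = argparse.ArgumentParser()
    # store_true: the attribute is False unless the flag is given.
    parser.add_argument("-nd", "--no-double-lb", action="store_true")
    # dest renames the attribute from "tokenizer" to "tokenizer_options".
    parser.add_argument("--tokenizer", type=str, dest="tokenizer_options")
    return parser


args = build_parser().parse_args(["-nd", "--tokenizer", "ranges"])
print(args.no_double_lb, args.tokenizer_options)
# True ranges
```

Because the flag uses `store_true`, omitting `-nd` leaves `args.no_double_lb` as `False`, so line-break doubling stays the default.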