Commit c2c7fcd

tag_ud: add option to disable doubling of line breaks.
1 parent 9746f51 commit c2c7fcd

3 files changed

Lines changed: 11 additions & 1 deletion

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,5 +1,9 @@
 # Changelog
 
+### tag_ud
+
+- Added option to disable doubling of line breaks
+
 ## 1.2 - 2025-10-07
 
 _Warning: defaults were changed and some options renamed (esp. the long option/configuration option names)._

README.md

Lines changed: 2 additions & 0 deletions
@@ -178,6 +178,8 @@ The resulting CoNLL-U vertical is output to the standard output (STDOUT) by defa
 
 The option `-v` reports some basic information about the progress to the standard error output (STDERR).
 
+By default, the script doubles all line breaks (new line characters) before sending the contents to the API since UD Pipe requires two consecutive line breaks (i.e. an empty line) to indicate a paragraph break. This feature can be disabled using the option `-nd` (or `--no-double-lb`, configuration option `no_double_lb`).
+
 The script supports [all documented features](http://lindat.mff.cuni.cz/services/udpipe/api-reference.php) of the LINDAT UDPipe REST API: any analysis of the input can be suppressed by using the option `-na` (`--no-analysis`) and then only segmentation and tokenization will be performed; syntactic (dependency) parsing can be suppressed using the option `-np` (`--no-parsing`); input or output format can be set by the options `-if <format>` (`--input-format`) and `-of <format>` (`--output-format`); additional options may be passed to the tokenizer, tagger and syntactic parser using the corresponding options `--tokenizer`, `--tagger` and `--parser` or configuration options `tokenizer_options`, `tagger_options` and `parser_options` respectively.
 
 The script can also call the [NameTag](https://lindat.mff.cuni.cz/services/nametag/) NER tool to enrich the CoNLL-U output with recognition of named entities. Use the option `-ner` (configuration option `named_entities`) with an optional specification of the NER model to use. If no model is specified with the commandline option (or the value of `auto` is used in the configuration), the same specification of model will be requested as for the UDPipe tagger, which may result into a potential failure: while NameTag accepts some basic language specifications common with the UD tagger (such as `cs`, `en` or `de`), it does not recognize others. Currently, only a model for Czech is available together with a universal multilingual model, so that languages other than `cs` are automatically analyzed using the latter one (if they are recognized at all). See also the corresponding [documentation on models](https://lindat.mff.cuni.cz/services/nametag/api-reference.php#models).
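
The behaviour described in the added README paragraph can be sketched in a few lines of Python. This is only an illustration, not the script's own code: `build_text` is a hypothetical helper name, and the sketch assumes (as the change to `tag_ud` suggests) that every buffered line still ends with its own newline character.

```python
def build_text(buffer, no_double_lb=False):
    """Join buffered input lines into the text sent to the API.

    Assumption: every element of `buffer` still ends with a newline character.
    """
    if no_double_lb:
        # Keep the line breaks exactly as they appear in the input.
        return "".join(buffer)
    # Add one extra newline after every line, so that each line break
    # becomes an empty line (a paragraph break for UDPipe).
    return "\n".join(buffer) + "\n"


lines = ["First paragraph.\n", "Second paragraph.\n"]
print(repr(build_text(lines)))                     # doubled line breaks
print(repr(build_text(lines, no_double_lb=True)))  # line breaks left untouched
```

With doubling enabled, every input line becomes its own paragraph; with `-nd`, only empty lines already present in the input separate paragraphs.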

tag_ud

Lines changed: 5 additions & 1 deletion
@@ -91,6 +91,7 @@ def udpipe_lindat(model, text, ner=None, tokenizer="", tagger="", parser="", inp
     return result
 
 def process(config, infile, outfile, verbose=False):
+    nodblb = config.getboolean('no_double_lb')
     output = outfile.open('w', encoding='utf-8') if outfile else sys.stdout
     buffer = []
 
@@ -103,7 +104,7 @@ def process(config, infile, outfile, verbose=False):
 model= config.get('model', DEFAULT_MODEL),
 ner = config.get('named_entities'),
 # UD Pipe needs double "\n\n" to enforce paragraph breaks!
-text= "\n".join(buffer)+"\n",
+text= "".join(buffer) if nodblb else "\n".join(buffer)+"\n",
 tokenizer= config.get('tokenizer_options', ''),
 tagger= None if config.getboolean('no_analysis') else config.get('tagger_options', ''),
 parser= None if config.getboolean('no_parsing') or config.getboolean('no_analysis') else config.get('parser_options', ''),
@@ -158,6 +159,8 @@ if __name__ == '__main__':
 
 '-np' Do not perform syntactic parsing.
 
+'-nd' Do not double line breaks (as required for UDPipe to enforce paragraph breaks)
+
 '--tokenizer' Additional options for the tokenizer.
 
 '--tagger' Additional options for the tagger (analysis).
@@ -184,6 +187,7 @@ if __name__ == '__main__':
     parser.add_argument("-b", "--batch", help="batch size in lines (default=1000)", type=int, dest="tagger_batch")
     parser.add_argument("-na", "--no-analysis", help="do NOT perform any analysis (lemmatization, tagging or parsing); perform just segmentation and tokenization", action="store_true")
     parser.add_argument("-np", "--no-parsing", help="do NOT perform dependency parsing (syntax)", action="store_true")
+    parser.add_argument("-nd", "--no-double-lb", help="do NOT double line breaks (UDPipe requires a sequence of two as paragraph break)", action="store_true")
     parser.add_argument("--tokenizer", help="additional options for the tokenizer", type=str, dest="tokenizer_options")
     parser.add_argument("--tagger", help="additional options for the tagger (ignored if --no-analysis is used)", type=str, dest="tagger_options")
     parser.add_argument("--parser", help="additional options for the syntactic parser (ignored if --no-analysis or --no-parsing is used)", type=str, dest="parser_options")
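
As a side note on the new command-line flag, the following minimal sketch shows how `argparse` derives the attribute name `no_double_lb` from the long option `--no-double-lb`, which matches the configuration option named in the README. It reproduces only two of the script's options (help texts omitted). How `tag_ud` transfers the parsed arguments into its `config` object is not visible in this diff, so that step is left out.

```python
import argparse

# Illustrative subset of the options; not the script's full parser.
parser = argparse.ArgumentParser(description="illustrative subset of tag_ud options")
parser.add_argument("-np", "--no-parsing", action="store_true")
parser.add_argument("-nd", "--no-double-lb", action="store_true")

# argparse takes the first long option string, strips the leading "--"
# and replaces "-" with "_" to build the attribute name.
args = parser.parse_args(["-nd"])
print(args.no_double_lb)  # True: the flag was given
print(args.no_parsing)    # False: store_true defaults to False
```

Because the flag uses `action="store_true"`, line-break doubling stays enabled unless `-nd` is given explicitly.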
