tag_ud: add option to disable doubling of line breaks.
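
The effect of the new option on the text sent to UDPipe can be sketched as follows. This is a minimal illustration rather than the script's actual code: the helper name `build_payload` is hypothetical, and it assumes that the buffered lines still carry their trailing newline characters.

```python
def build_payload(buffer, no_double_lb=False):
    """Join buffered input lines into the text sent to the API.

    Hypothetical helper; assumes each item in `buffer` still ends with "\n".
    """
    if no_double_lb:
        # Pass the lines through unchanged: single line breaks stay single.
        return "".join(buffer)
    # Default: put an extra "\n" after every line, so each original line
    # break becomes an empty line, which UDPipe reads as a paragraph break.
    return "\n".join(buffer) + "\n"


lines = ["First paragraph.\n", "Second paragraph.\n"]
doubled = build_payload(lines)
plain = build_payload(lines, no_double_lb=True)
```

With the default setting every input line ends up as its own paragraph; with `-nd` the input reaches the API exactly as it was read.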

wanthalf · wanthalf · commit c2c7fcd73b75 · 2025-10-08T19:45:59.000+02:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,9 @@
 # Changelog
 
+### tag_ud
+
+- Added option to disable doubling of line breaks
+
 ##  1.2 - 2025-10-07
 
 _Warning: defaults were changed and some options renamed (esp. the long option/configuration option names)._
diff --git a/README.md b/README.md
@@ -178,6 +178,8 @@ The resulting CoNLL-U vertical is output to the standard output (STDOUT) by defa
 
 The option `-v` reports some basic information about the progress to the standard error output (STDERR).
 
+By default, the script doubles every line break (newline character) before sending the contents to the API, because UDPipe requires two consecutive line breaks (i.e. an empty line) to mark a paragraph break. This doubling can be disabled with the option `-nd` (or `--no-double-lb`, configuration option `no_double_lb`).
+
 The script supports [all documented features](http://lindat.mff.cuni.cz/services/udpipe/api-reference.php) of the LINDAT UDPipe REST API: any analysis of the input can be suppressed by using the option `-na` (`--no-analysis`) and then only segmentation and tokenization will be performed; syntactic (dependency) parsing can be suppressed using the option `-np` (`--no-parsing`); input or output format can be set by the options `-if <format>` (`--input-format`) and `-of <format>` (`--output-format`); additional options may be passed to the tokenizer, tagger and syntactic parser using the corresponding options `--tokenizer`, `--tagger` and `--parser` or configuration options `tokenizer_options`, `tagger_options` and `parser_options` respectively.
 
 The script can also call the [NameTag](https://lindat.mff.cuni.cz/services/nametag/) NER tool to enrich the CoNLL-U output with recognition of named entities. Use the option `-ner` (configuration option `named_entities`) with an optional specification of the NER model to use. If no model is specified with the commandline option (or the value of `auto` is used in the configuration), the same specification of model will be requested as for the UDPipe tagger, which may result into a potential failure: while NameTag accepts some basic language specifications common with the UD tagger (such as `cs`, `en` or `de`), it does not recognize others. Currently, only a model for Czech is available together with a universal multilingual model, so that languages other than `cs` are automatically analyzed using the latter one (if they are recognized at all). See also the corresponding [documentation on models](https://lindat.mff.cuni.cz/services/nametag/api-reference.php#models).
diff --git a/tag_ud b/tag_ud
@@ -91,6 +91,7 @@ def udpipe_lindat(model, text, ner=None, tokenizer="", tagger="", parser="", inp
     return result
 
 def process(config, infile, outfile, verbose=False):
+    nodblb = config.getboolean('no_double_lb')
     output = outfile.open('w', encoding='utf-8') if outfile else sys.stdout
     buffer = []
 
@@ -103,7 +104,7 @@ def process(config, infile, outfile, verbose=False):
             model= config.get('model', DEFAULT_MODEL),
             ner = config.get('named_entities'),
             # UD Pipe needs double "\n\n" to enforce paragraph breaks!
-            text= "\n".join(buffer)+"\n",
+            text= "".join(buffer) if nodblb else "\n".join(buffer)+"\n",
             tokenizer= config.get('tokenizer_options', ''),
             tagger= None if config.getboolean('no_analysis') else config.get('tagger_options', ''),
             parser= None if config.getboolean('no_parsing') or config.getboolean('no_analysis') else config.get('parser_options', ''),
@@ -158,6 +159,8 @@ if __name__ == '__main__':
 
              '-np' Do not perform syntactic parsing.
 
+             '-nd' Do not double line breaks (UDPipe requires two consecutive line breaks to mark a paragraph break).
+
              '--tokenizer' Additional options for the tokenizer.
 
              '--tagger' Additional options for the tagger (analysis).
@@ -184,6 +187,7 @@ if __name__ == '__main__':
     parser.add_argument("-b", "--batch", help="batch size in lines (default=1000)", type=int, dest="tagger_batch")
     parser.add_argument("-na", "--no-analysis", help="do NOT perform any analysis (lemmatization, tagging or parsing); perform just segmentation and tokenization", action="store_true")
     parser.add_argument("-np", "--no-parsing", help="do NOT perform dependency parsing (syntax)", action="store_true")
+    parser.add_argument("-nd", "--no-double-lb", help="do NOT double line breaks (UDPipe requires two consecutive line breaks to mark a paragraph break)", action="store_true")
     parser.add_argument("--tokenizer", help="additional options for the tokenizer", type=str, dest="tokenizer_options")
     parser.add_argument("--tagger", help="additional options for the tagger (ignored if --no-analysis is used)", type=str, dest="tagger_options")
     parser.add_argument("--parser", help="additional options for the syntactic parser (ignored if --no-analysis or --no-parsing is used)", type=str, dest="parser_options")