Fix configuration of punctuation symbol elements.

wanthalf · wanthalf · commit d2fe07480d19 · 2025-10-08T20:45:50.000+02:00
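
Note on the rename: `argparse` derives the attribute name of an option from its long flag, so `--punctuation-element` yields `punctuation_element` and `--punct-element` yields `punct_element`. The sketch below illustrates that derivation and one way the command line and the configuration file could be reconciled once both use the same key. It is a minimal illustration, not the code in `ann2standoff` or `xml2vrt`: the helper names `argparse_dest` and `resolve_punct_element`, and the lookup order (command line, then config, then `pc`), are assumptions.

```python
import argparse
import configparser


def argparse_dest(long_flag):
    """Return the attribute name argparse derives from a long option flag."""
    parser = argparse.ArgumentParser()
    action = parser.add_argument(long_flag, type=str)
    return action.dest


def resolve_punct_element(argv, ini_text, profile="DEFAULT"):
    """Pick the punctuation element name (hypothetical helper).

    Assumed precedence: command line first, then the config profile,
    then the documented default 'pc'.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("-pe", "--punct-element", type=str)
    args = parser.parse_args(argv)

    config = configparser.ConfigParser()
    config.read_string(ini_text)

    if args.punct_element is not None:
        return args.punct_element
    # The same key name is used for the config lookup and the CLI attribute.
    return config[profile].get("punct_element", fallback="pc")


if __name__ == "__main__":
    print(argparse_dest("--punctuation-element"))
    print(argparse_dest("--punct-element"))
    print(resolve_punct_element(["-pe", "punct"], "[DEFAULT]\n#punct_element = pc\n"))
```

With a single key name, a value set in the configuration file and a value passed via `-pe` end up in the same place, whichever lookup order the scripts actually use.
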
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,8 +1,15 @@
 # Changelog
 
+## 1.2.1 - 2025-10-08
+
 ### tag_ud
 
 - Added option to disable doubling of line breaks
+- Fixed configuration of punctuation symbol elements
+
+### xml2vrt
+
+- Fixed configuration of punctuation symbol elements
 
 ##  1.2 - 2025-10-07
 
diff --git a/README.md b/README.md
@@ -102,7 +102,7 @@ If you want to *exclude* some attributes provided by the analysis from the resul
 
 Names of the elements for tokens (`w`) and sentences (`s`) may be specified using the options `-te <element_name>` and `-se <element_name>` (or configuration options `token_element` and `sentence_element`).
 
-A separate element for punctuation symbols can be applied instead of the common token element, such as the `pc` element defined by TEI Guidelines. This requires a specification of a rule to identify such special tokens. The rule can be specified using the option `-pc <rule>` (configuration option `punctuation`) and must have the form `<attribute_name>=<value>`. If the corresponding attribute with the provided value is found, the token will be annotated using the element `pc` instead of the common token element name. The value is tested (matched) as a regular expression. For UD annotation, the rule can be defined as `upos=PUNCT`. Non-compliant rule specification (e.g. not containing the symbol `=`) will be silently ignored. An element name for the punctuation symbols other than `pc` can also be specified using the option `-pe <element_name>` or configuration option `punctuation_element`.
+A separate element for punctuation symbols can be applied instead of the common token element, such as the `pc` element defined by TEI Guidelines. This requires a specification of a rule to identify such special tokens. The rule can be specified using the option `-pc <rule>` (configuration option `punctuation`) and must have the form `<attribute_name>=<value>`. If the corresponding attribute with the provided value is found, the token will be annotated using the element `pc` instead of the common token element name. The value is tested (matched) as a regular expression. For UD annotation, the rule can be defined as `upos=PUNCT`. Non-compliant rule specification (e.g. not containing the symbol `=`) will be silently ignored. An element name for the punctuation symbols other than `pc` can also be specified using the option `-pe <element_name>` or configuration option `punct_element`.
 
 For annotation in the **CoNLL-U format** (Universal Dependencies), the number and role of the attributes is fixed. Their standard names are therefore already configured in the provided `xmlanntools.ini` configuration in the form of the profile called `conllu`. In addition, a special preprocessor (`conllu`) is applied to deal with the two-level tokenization generated by the UD parser. Since the virtual "syntactic words" do not really occur in the original text file, they cannot be annotated separately. Therefore, two options are available:
 
@@ -198,7 +198,7 @@ By default, the script will automatically remove tags within the token string it
 
 By default, the script will also **flatten any nested XML structures**, since nesting of elements of the same name is usually not supported by the search engines based on vertical format. At the beginning of any nested element with the same name as one of its parents, the parent element will be closed and a new element will be opened, merging its own attributes with the attributes of its parent: new attributes of the child will be appended and values of identical attributes will be concatenated. In addition, the child will get a new attribute `nesting_level` set to the level of nesting (starting with 1 for the first nested child level; the attribute name can be changed using the configuration option `vrt_flat_level_attribute`) - only the top-most parent will keep its original attributes only. At the end of the nested child element, its immediate parent will be reopened with its own attributes again. The default separator used for concatenation of attribute values (a single space by default) can be specified using the configuration option `vrt_flat_separator`, or more specifically `vrt_flat_separator_X_Y` for any particular attribute `Y` of any element `X`. Instead of concatenation, the values of children attributes may also override the values of the corresponding attributes of their parent completely. This can be activated generally by setting the configuration option `vrt_flat_override`, or specifically by the option `vrt_flat_override_X_Y` just for particular attributes `Y` of particular elements `X`. The flattening can also be completely deactivated using the option `-nf/--no-flattening` (configuration option `vrt_no_flattening`).
 
-If there are text contents found within elements other than the specified token element (`w` by default, can be specified using the option `-te <name>`, configuration option `token_element`), the whole text fragments are output as single line "tokens" by default. Using the option `-df` (or `--discard-freetext`, configuration option `vrt_discard_freetext`) they will be completely discarded from the output.
+If text content is found within elements other than the specified token element (`w` by default, can be specified using the option `-te <name>` or configuration option `token_element`) or the specified punctuation element (`pc` by default, can be specified using the option `-pe <name>` or configuration option `punct_element`), each whole text fragment is output as a single-line "token" by default. With the option `-df` (or `--discard-freetext`, configuration option `vrt_discard_freetext`), such fragments are discarded from the output completely.
 
 By default, the whole root element of the XML file will be extracted into the vertical. If just some particular subelements should be extracted, they can be specified using the option `-i <element_names>` (where element names are again listed as a single, comma separated list without spaces) or the configuration option `vrt_include_elements` (where whitespace is allowed too). These elements are *not* expected to be nested within each other.
 
diff --git a/ann2standoff b/ann2standoff
@@ -376,7 +376,7 @@ if __name__ == '__main__':
              Default: 'w'.
 
              '-pe <element_name>' Name of the resulting XML elemenent for punctuation symbols.
-             Config setting: 'punctuation_element'.
+             Config setting: 'punct_element'.
              Default: 'pc'.
 
              '-se <element_name>' Name of the resulting XML elemenent for sentences.
@@ -419,7 +419,7 @@ if __name__ == '__main__':
     parser.add_argument("-a", "--attributes", help="attribute names (except first position, separated by comma)", type=str)
     parser.add_argument("-ea", "--exclude-attributes", help="attribute names (except first position, separated by comma)", type=str)
     parser.add_argument("-te", "--token-element", help="name of token element", type=str)
-    parser.add_argument("-pe", "--punctuation-element", help="name of punctuation element", type=str)
+    parser.add_argument("-pe", "--punct-element", help="name of punctuation element", type=str)
     parser.add_argument("-pc", "--punctuation", help="identification of punctuation", type=str)
     parser.add_argument("-se", "--sentence-element", help="name of sentence element", type=str)
     parser.add_argument("-ne", "--ne-element", help="name of element marking named entities", type=str)
diff --git a/xml2vrt b/xml2vrt
@@ -376,6 +376,10 @@ if __name__ == "__main__":
              Default: 'w'.
              Config setting: 'token_element'.
 
+             '-pe <element_name>' Name of the XML element for punctuation symbols.
+             Config setting: 'punct_element'.
+             Default: 'pc'.
+
              '-i <element_names>' Comma separated list of (sub)element names to be extracted into the output vertical
              (no spaces!). By default the whole root element of the XML document will be extracted.
              Config setting: 'vrt_include_elements' (may also be separated by spaces, commas, linebreaks or
@@ -403,6 +407,7 @@ if __name__ == "__main__":
     parser.add_argument("-p", "--profile", help="config profile to use", type=str, default='DEFAULT')
     parser.add_argument("-a", "--attributes", help="attribute names (except first position, separated by comma)", type=str)
     parser.add_argument("-te", "--token-element", help="name of token element", type=str)
+    parser.add_argument("-pe", "--punct-element", help="name of punctuation element", type=str)
     parser.add_argument("-i", "--include-elements", help="(sub)elements to extract (default: document root)", type=str, dest='vrt_include_elements')
     parser.add_argument("-e", "--exclude-elements", help="elements to skip (exclude from the extraction)", type=str, dest='vrt_exclude_elements')
     parser.add_argument("-kt", "--keep-token-tags", help="keep tags within tokens", action="store_true", dest="vrt_keep_token_tags")
diff --git a/xmlanntools.ini b/xmlanntools.ini
@@ -18,7 +18,7 @@ virtual_tokens = dtok
 punctuation = upos=PUNCT
 #virtual_token_attr = form
 #token_element = w
-#punctuation_element = pc
+#punct_element = pc
 #sentence_element = s
 #ne_element = ne
 #ne_type_attr = type