Commit d2fe074

Fix configuration of punctuation symbol elements.
1 parent c2c7fcd commit d2fe074

5 files changed

Lines changed: 17 additions & 5 deletions

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -1,8 +1,15 @@
 # Changelog

+## 1.2.1 - 2025-10-08
+
 ### tag_ud

 - Added option to disable doubling of line breaks
+- Fix configuration of punctuation symbol elements
+
+### xml2vrt
+
+- Fix configuration of punctuation symbol elements

 ## 1.2 - 2025-10-07

README.md

Lines changed: 2 additions & 2 deletions
@@ -102,7 +102,7 @@ If you want to *exclude* some attributes provided by the analysis from the resul

 Names of the elements for tokens (`w`) and sentences (`s`) may be specified using the options `-te <element_name>` and `-se <element_name>` (or configuration options `token_element` and `sentence_element`).

-A separate element for punctuation symbols can be applied instead of the common token element, such as the `pc` element defined by TEI Guidelines. This requires a specification of a rule to identify such special tokens. The rule can be specified using the option `-pc <rule>` (configuration option `punctuation`) and must have the form `<attribute_name>=<value>`. If the corresponding attribute with the provided value is found, the token will be annotated using the element `pc` instead of the common token element name. The value is tested (matched) as a regular expression. For UD annotation, the rule can be defined as `upos=PUNCT`. Non-compliant rule specification (e.g. not containing the symbol `=`) will be silently ignored. An element name for the punctuation symbols other than `pc` can also be specified using the option `-pe <element_name>` or configuration option `punctuation_element`.
+A separate element for punctuation symbols can be applied instead of the common token element, such as the `pc` element defined by TEI Guidelines. This requires a specification of a rule to identify such special tokens. The rule can be specified using the option `-pc <rule>` (configuration option `punctuation`) and must have the form `<attribute_name>=<value>`. If the corresponding attribute with the provided value is found, the token will be annotated using the element `pc` instead of the common token element name. The value is tested (matched) as a regular expression. For UD annotation, the rule can be defined as `upos=PUNCT`. Non-compliant rule specification (e.g. not containing the symbol `=`) will be silently ignored. An element name for the punctuation symbols other than `pc` can also be specified using the option `-pe <element_name>` or configuration option `punct_element`.

 For annotation in the **CoNLL-U format** (Universal Dependencies), the number and role of the attributes is fixed. Their standard names are therefore already configured in the provided `xmlanntools.ini` configuration in the form of the profile called `conllu`. In addition, a special preprocessor (`conllu`) is applied to deal with the two-level tokenization generated by the UD parser. Since the virtual "syntactic words" do not really occur in the original text file, they cannot be annotated separately. Therefore, two options are available:
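The punctuation rule described in the changed README paragraph can be illustrated with a minimal Python sketch. The function names `parse_punct_rule` and `element_for` are hypothetical and not taken from the scripts. The use of `re.match` (anchored at the start of the value only) is an assumption, since the README only says the value is "matched" as a regular expression.

```python
import re


def parse_punct_rule(rule):
    """Split '<attribute_name>=<value>' into (name, compiled regex).

    Returns None for a non-compliant rule (e.g. one without '='),
    mirroring the "silently ignored" behaviour the README describes.
    """
    if not rule or "=" not in rule:
        return None
    name, _, pattern = rule.partition("=")
    if not name:
        return None
    return name, re.compile(pattern)


def element_for(token_attrs, rule, token_element="w", punct_element="pc"):
    """Pick the element name for one token (hypothetical helper)."""
    parsed = parse_punct_rule(rule)
    if parsed is None:
        return token_element  # no usable rule: every token stays a plain token
    name, regex = parsed
    value = token_attrs.get(name)
    # Assumption: the pattern is anchored at the start of the value (re.match).
    if value is not None and regex.match(value):
        return punct_element
    return token_element
```

With the UD rule from the README, `element_for({"upos": "PUNCT"}, "upos=PUNCT")` selects the punctuation element, while a token with `upos` set to `NOUN` keeps the common token element.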

@@ -198,7 +198,7 @@ By default, the script will automatically remove tags within the token string it

 By default, the script will also **flatten any nested XML structures**, since nesting of elements of the same name is usually not supported by the search engines based on vertical format. At the beginning of any nested element with the same name as one of its parents, the parent element will be closed and a new element will be opened, merging its own attributes with the attributes of its parent: new attributes of the child will be appended and values of identical attributes will be concatenated. In addition, the child will get a new attribute `nesting_level` set to the level of nesting (starting with 1 for the first nested child level; the attribute name can be changed using the configuration option `vrt_flat_level_attribute`) - only the top-most parent will keep its original attributes only. At the end of the nested child element, its immediate parent will be reopened with its own attributes again. The default separator used for concatenation of attribute values (a single space by default) can be specified using the configuration option `vrt_flat_separator`, or more specifically `vrt_flat_separator_X_Y` for any particular attribute `Y` of any element `X`. Instead of concatenation, the values of children attributes may also override the values of the corresponding attributes of their parent completely. This can be activated generally by setting the configuration option `vrt_flat_override`, or specifically by the option `vrt_flat_override_X_Y` just for particular attributes `Y` of particular elements `X`. The flattening can also be completely deactivated using the option `-nf/--no-flattening` (configuration option `vrt_no_flattening`).

-If there are text contents found within elements other than the specified token element (`w` by default, can be specified using the option `-te <name>`, configuration option `token_element`), the whole text fragments are output as single line "tokens" by default. Using the option `-df` (or `--discard-freetext`, configuration option `vrt_discard_freetext`) they will be completely discarded from the output.
+If there are text contents found within elements other than the specified token element (`w` by default, can be specified using the option `-te <name>`, configuration option `token_element`) or the specified punctuation element (`pc` by default, can be specified using the option `-pe <name>` or configuration option `punct_element`), the whole text fragments are output as single line "tokens" by default. By using the option `-df` (or `--discard-freetext`, configuration option `vrt_discard_freetext`), they will be completely discarded from the output.

 By default, the whole root element of the XML file will be extracted into the vertical. If just some particular subelements should be extracted, they can be specified using the option `-i <element_names>` (where element names are again listed as a single, comma separated list without spaces) or the configuration option `vrt_include_elements` (where whitespace is allowed too). These elements are *not* expected to be nested within each other.
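The attribute merging that the flattening paragraph in this hunk describes can be sketched as below. This is only an illustration under stated assumptions: `merge_attrs` is a hypothetical function, the parent value is assumed to come first when values are concatenated, and the per-attribute options (`vrt_flat_separator_X_Y`, `vrt_flat_override_X_Y`) are reduced to two plain parameters.

```python
def merge_attrs(parent, child, level, separator=" ", override=False,
                level_attr="nesting_level"):
    """Merge a nested child's attributes with those of its same-named parent.

    parent, child: dicts of attribute name -> string value.
    level: nesting depth of the child (1 for the first nested level).
    """
    merged = dict(parent)
    for name, value in child.items():
        if name in merged and not override:
            # Identical attribute: concatenate (assumed order: parent, then child).
            merged[name] = merged[name] + separator + value
        else:
            # New attribute, or override mode: the child's value is used as-is.
            merged[name] = value
    merged[level_attr] = str(level)
    return merged
```

For a parent with `type="quote" lang="en"` and a nested child with `type="verse" n="2"`, the default mode yields `type="quote verse"`, while `override=True` yields `type="verse"`; both results carry `nesting_level="1"`.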

ann2standoff

Lines changed: 2 additions & 2 deletions
@@ -376,7 +376,7 @@ if __name__ == '__main__':
 Default: 'w'.

 '-pe <element_name>' Name of the resulting XML elemenent for punctuation symbols.
-Config setting: 'punctuation_element'.
+Config setting: 'punct_element'.
 Default: 'pc'.

 '-se <element_name>' Name of the resulting XML elemenent for sentences.
@@ -419,7 +419,7 @@ if __name__ == '__main__':
 parser.add_argument("-a", "--attributes", help="attribute names (except first position, separated by comma)", type=str)
 parser.add_argument("-ea", "--exclude-attributes", help="attribute names (except first position, separated by comma)", type=str)
 parser.add_argument("-te", "--token-element", help="name of token element", type=str)
-parser.add_argument("-pe", "--punctuation-element", help="name of punctuation element", type=str)
+parser.add_argument("-pe", "--punct-element", help="name of punctuation element", type=str)
 parser.add_argument("-pc", "--punctuation", help="identification of punctuation", type=str)
 parser.add_argument("-se", "--sentence-element", help="name of sentence element", type=str)
 parser.add_argument("-ne", "--ne-element", help="name of element marking named entities", type=str)
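Why the option rename matters can be shown with standard `argparse` behaviour: when no `dest` is given, the attribute name is derived from the first long option, with hyphens replaced by underscores. So `--punct-element` is stored as `args.punct_element`, the same name as the configuration key. The sketch below also shows one plausible way a name mismatch could go unnoticed. The helper `effective_setting` is hypothetical and is not a claim about how the scripts actually combine command-line and configuration values.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-pe", "--punct-element", help="name of punctuation element", type=str)


def effective_setting(args, config, key, default):
    """Hypothetical lookup: command line first, then config, then default."""
    cli_value = getattr(args, key, None)
    if cli_value is not None:
        return cli_value
    return config.get(key, default)


# Explicit argument list so the sketch does not read sys.argv.
args = parser.parse_args(["-pe", "punct"])

# The derived attribute name matches the configuration key 'punct_element'.
chosen = effective_setting(args, {}, "punct_element", "pc")

# Looking the value up under a different name finds neither the command-line
# value nor a config entry, so the default is returned without any error.
missed = effective_setting(args, {}, "punctuation_element", "pc")
```

In this sketch `chosen` is the value given on the command line, while `missed` silently falls back to the default.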

xml2vrt

Lines changed: 5 additions & 0 deletions
@@ -376,6 +376,10 @@ if __name__ == "__main__":
 Default: 'w'.
 Config setting: 'token_element'.

+'-pe <element_name>' Name of the XML elemenent for punctuation symbols.
+Config setting: 'punct_element'.
+Default: 'pc'.
+
 '-i <element_names>' Comma separated list of (sub)element names to be extracted into the output vertical
 (no spaces!). By default the whole root element of the XML document will be extracted.
 Config setting: 'vrt_include_elements' (may also be separated by spaces, commas, linebreaks or
@@ -403,6 +407,7 @@ if __name__ == "__main__":
 parser.add_argument("-p", "--profile", help="config profile to use", type=str, default='DEFAULT')
 parser.add_argument("-a", "--attributes", help="attribute names (except first position, separated by comma)", type=str)
 parser.add_argument("-te", "--token-element", help="name of token element", type=str)
+parser.add_argument("-pe", "--punct-element", help="name of punctuation element", type=str)
 parser.add_argument("-i", "--include-elements", help="(sub)elements to extract (default: document root)", type=str, dest='vrt_include_elements')
 parser.add_argument("-e", "--exclude-elements", help="elements to skip (exclude from the extraction)", type=str, dest='vrt_exclude_elements')
 parser.add_argument("-kt", "--keep-token-tags", help="keep tags within tokens", action="store_true", dest="vrt_keep_token_tags")
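The README text changed in this commit says that text inside the token element or the punctuation element is token text, while any other text is output as a single-line "token" or dropped with `-df`. A heavily simplified sketch of that distinction follows. `vertical_lines` is a hypothetical function, attributes and flattening are ignored, and the real `xml2vrt` output format may differ.

```python
import xml.etree.ElementTree as ET


def vertical_lines(xml_text, token_element="w", punct_element="pc",
                   discard_freetext=False):
    """Return a simplified vertical: one line per tag, token or free-text run."""
    token_names = {token_element, punct_element}
    lines = []

    def free_text(text):
        # Text outside token/punctuation elements: kept as one line or dropped.
        text = (text or "").strip()
        if text and not discard_freetext:
            lines.append(text)

    def walk(elem):
        if elem.tag in token_names:
            # Token or punctuation: inner tags are removed, text is kept.
            lines.append("".join(elem.itertext()).strip())
            return
        lines.append("<%s>" % elem.tag)
        free_text(elem.text)
        for child in elem:
            walk(child)
            free_text(child.tail)
        lines.append("</%s>" % elem.tag)

    walk(ET.fromstring(xml_text))
    return lines
```

For the input `<s><w>Hello</w><pc>,</pc> stray <w>world</w></s>`, the fragment `stray` appears as its own line by default and disappears when `discard_freetext=True`.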

xmlanntools.ini

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ virtual_tokens = dtok
 punctuation = upos=PUNCT
 #virtual_token_attr = form
 #token_element = w
-#punctuation_element = pc
+#punct_element = pc
 #sentence_element = s
 #ne_element = ne
 #ne_type_attr = type
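How a commented-out key such as `#punct_element = pc` behaves when read can be sketched with Python's standard `configparser`. The section layout below is an assumption for illustration: the hunk does not show which profile these lines belong to, and the scripts' own loading code is not part of this commit view.

```python
import configparser

# Assumed layout: a 'conllu' profile holding the lines shown in the hunk.
INI_TEXT = """
[DEFAULT]

[conllu]
punctuation = upos=PUNCT
#token_element = w
#punct_element = pc
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)
profile = config["conllu"]

# A commented-out key is simply absent, so the fallback value is used.
punct_element = profile.get("punct_element", fallback="pc")

# Only the first '=' separates key and value, so the rule keeps its own '='.
punct_rule = profile.get("punctuation")
```

Uncommenting the line and setting, for example, `punct_element = c` would make the same lookup return `c` instead of the fallback.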
