README.md: 2 additions & 2 deletions
@@ -102,7 +102,7 @@ If you want to *exclude* some attributes provided by the analysis from the resul
Names of the elements for tokens (`w`) and sentences (`s`) may be specified using the options `-te <element_name>` and `-se <element_name>` (or configuration options `token_element` and `sentence_element`).
-
A separate element for punctuation symbols can be applied instead of the common token element, such as the `pc` element defined by TEI Guidelines. This requires a specification of a rule to identify such special tokens. The rule can be specified using the option `-pc <rule>` (configuration option `punctuation`) and must have the form `<attribute_name>=<value>`. If the corresponding attribute with the provided value is found, the token will be annotated using the element `pc` instead of the common token element name. The value is tested (matched) as a regular expression. For UD annotation, the rule can be defined as `upos=PUNCT`. Non-compliant rule specification (e.g. not containing the symbol `=`) will be silently ignored. An element name for the punctuation symbols other than `pc` can also be specified using the option `-pe <element_name>` or configuration option `punctuation_element`.
+
A separate element for punctuation symbols can be applied instead of the common token element, such as the `pc` element defined by TEI Guidelines. This requires a specification of a rule to identify such special tokens. The rule can be specified using the option `-pc <rule>` (configuration option `punctuation`) and must have the form `<attribute_name>=<value>`. If the corresponding attribute with the provided value is found, the token will be annotated using the element `pc` instead of the common token element name. The value is tested (matched) as a regular expression. For UD annotation, the rule can be defined as `upos=PUNCT`. Non-compliant rule specification (e.g. not containing the symbol `=`) will be silently ignored. An element name for the punctuation symbols other than `pc` can also be specified using the option `-pe <element_name>` or configuration option `punct_element`.
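The rule handling described in this line can be illustrated with a minimal Python sketch. It is not taken from the tool's source: the function names `parse_punct_rule` and `element_for_token` are hypothetical, and it assumes that the regular expression must match the whole attribute value (the README only says the value is "tested (matched) as a regular expression").

```python
import re
from typing import Dict, Optional, Tuple


def parse_punct_rule(rule: str) -> Optional[Tuple[str, re.Pattern]]:
    """Split '<attribute_name>=<value>' into (name, compiled regex).

    Returns None for a non-compliant rule (no '=' or empty attribute name),
    mirroring the documented behaviour of silently ignoring such rules.
    """
    name, sep, value = rule.partition("=")
    if not sep or not name:
        return None
    return name, re.compile(value)


def element_for_token(
    attrs: Dict[str, str],
    rule: Optional[str],
    token_element: str = "w",
    punct_element: str = "pc",
) -> str:
    """Choose the element name for a token based on its annotation attributes."""
    parsed = parse_punct_rule(rule) if rule else None
    if parsed is None:
        return token_element
    name, pattern = parsed
    value = attrs.get(name)
    # Assumption: the whole attribute value must match the pattern.
    if value is not None and pattern.fullmatch(value):
        return punct_element
    return token_element


print(element_for_token({"upos": "PUNCT"}, "upos=PUNCT"))
print(element_for_token({"upos": "NOUN"}, "upos=PUNCT"))
```

With the rule `upos=PUNCT`, a token annotated as `PUNCT` is mapped to the punctuation element and every other token keeps the common token element. A rule without `=` leaves all tokens as the common token element.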
For annotation in the **CoNLL-U format** (Universal Dependencies), the number and role of the attributes is fixed. Their standard names are therefore already configured in the provided `xmlanntools.ini` configuration in the form of the profile called `conllu`. In addition, a special preprocessor (`conllu`) is applied to deal with the two-level tokenization generated by the UD parser. Since the virtual "syntactic words" do not really occur in the original text file, they cannot be annotated separately. Therefore, two options are available:
@@ -198,7 +198,7 @@ By default, the script will automatically remove tags within the token string it
By default, the script will also **flatten any nested XML structures**, since nesting of elements of the same name is usually not supported by the search engines based on vertical format. At the beginning of any nested element with the same name as one of its parents, the parent element will be closed and a new element will be opened, merging its own attributes with the attributes of its parent: new attributes of the child will be appended and values of identical attributes will be concatenated. In addition, the child will get a new attribute `nesting_level` set to the level of nesting (starting with 1 for the first nested child level; the attribute name can be changed using the configuration option `vrt_flat_level_attribute`) - only the top-most parent will keep its original attributes only. At the end of the nested child element, its immediate parent will be reopened with its own attributes again. The default separator used for concatenation of attribute values (a single space by default) can be specified using the configuration option `vrt_flat_separator`, or more specifically `vrt_flat_separator_X_Y` for any particular attribute `Y` of any element `X`. Instead of concatenation, the values of children attributes may also override the values of the corresponding attributes of their parent completely. This can be activated generally by setting the configuration option `vrt_flat_override`, or specifically by the option `vrt_flat_override_X_Y` just for particular attributes `Y` of particular elements `X`. The flattening can also be completely deactivated using the option `-nf/--no-flattening` (configuration option `vrt_no_flattening`).
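The attribute merging described above can be sketched roughly as follows. This is an illustration only, not the script's actual code: `merge_flattened_attrs` is a hypothetical helper, it assumes that parent values come first when values are concatenated, and it does not model the per-attribute options `vrt_flat_separator_X_Y` and `vrt_flat_override_X_Y`.

```python
from typing import Dict


def merge_flattened_attrs(
    parent: Dict[str, str],
    child: Dict[str, str],
    level: int,
    separator: str = " ",
    override: bool = False,
    level_attribute: str = "nesting_level",
) -> Dict[str, str]:
    """Build the attributes of a flattened child element.

    New child attributes are appended; attributes shared with the parent are
    concatenated using `separator`, or replaced by the child's value when
    `override` is set. The nesting level is recorded in `level_attribute`.
    """
    merged = dict(parent)
    for name, value in child.items():
        if name in merged and not override:
            # Assumption: the parent value precedes the child value.
            merged[name] = merged[name] + separator + value
        else:
            merged[name] = value
    merged[level_attribute] = str(level)
    return merged


parent = {"type": "chapter", "n": "1"}
child = {"type": "section", "rend": "italic"}
print(merge_flattened_attrs(parent, child, level=1))
print(merge_flattened_attrs(parent, child, level=1, override=True))
```

In the first call the shared attribute `type` becomes `chapter section`; in the second call, with overriding enabled, it becomes `section`. In both cases `n` is inherited from the parent, `rend` is appended, and `nesting_level` is set to `1`.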
-
If there are text contents found within elements other than the specified token element (`w` by default, can be specified using the option `-te <name>`, configuration option `token_element`), the whole text fragments are output as single line "tokens" by default. Using the option `-df` (or `--discard-freetext`, configuration option `vrt_discard_freetext`) they will be completely discarded from the output.
+
If there are text contents found within elements other than the specified token element (`w` by default, can be specified using the option `-te <name>`, configuration option `token_element`) or the specified punctuation element (`pc` by default, can be specified using the option `-pe <name>` or configuration option `punct_element`), the whole text fragments are output as single line "tokens" by default. By using the option `-df` (or `--discard-freetext`, configuration option `vrt_discard_freetext`), they will be completely discarded from the output.
By default, the whole root element of the XML file will be extracted into the vertical. If just some particular subelements should be extracted, they can be specified using the option `-i <element_names>` (where element names are again listed as a single, comma separated list without spaces) or the configuration option `vrt_include_elements` (where whitespace is allowed too). These elements are *not* expected to be nested within each other.