Description
Summary
Add a new stemming algorithm for the Ukrainian language.
Implementation
The stemmer follows the general design of the existing Russian stemmer, adapted for Ukrainian morphology:
- Regions: pV (the region after the first vowel) and R2 (standard Snowball definition), used to guard against aggressive stripping
- Suffix removal order: perfective gerund → reflexive → verbal noun → (professional / diminutive) → adjective / verb / noun → diminutive stem → professional → derivational → tidy-up
- Compound suffix patterns: Single-step removal of derivational+inflectional combinations (e.g. -аційний, -ічний, -ійний, -уальний, -онний), protected by R2
- Ukrainian-specific features: apostrophe handling (U+0027, U+02BC), soft sign cleanup, superlative най- prefix, comparative -іш- suffix
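As an illustration of the region guards listed above, here is a minimal Python sketch, not the Snowball source from this PR. It assumes the regions are marked the same way as in the existing Russian stemmer (pV after the first vowel, R2 per the standard Snowball definition) and that the vowel set is `аеєиіїоуюя`. The names `mark_regions` and `strip_if_in_r2` are hypothetical helpers introduced only for this example.

```python
# Ukrainian vowel letters (an assumption for this sketch).
VOWELS = frozenset("аеєиіїоуюя")


def _go_past(word, start, want_vowel):
    """Return the index just past the first character at or after `start`
    whose vowel-ness equals `want_vowel`, or None if there is none."""
    for i in range(start, len(word)):
        if (word[i] in VOWELS) == want_vowel:
            return i + 1
    return None


def mark_regions(word):
    """Return (pV, p2): the offsets at which the pV and R2 regions begin.

    Both default to the end of the word when the region is empty."""
    pv = p2 = len(word)
    pos = _go_past(word, 0, True)          # past the first vowel
    if pos is None:
        return pv, p2
    pv = pos
    # Past a non-vowel (start of R1), then a vowel, then a non-vowel (start of R2).
    for want_vowel in (False, True, False):
        pos = _go_past(word, pos, want_vowel)
        if pos is None:
            return pv, p2
    p2 = pos
    return pv, p2


def strip_if_in_r2(word, suffix):
    """Remove `suffix` only when it lies entirely inside R2."""
    _, p2 = mark_regions(word)
    if word.endswith(suffix) and len(word) - len(suffix) >= p2:
        return word[: -len(suffix)]
    return word


print(mark_regions("акація"))                  # (1, 4)
print(strip_if_in_r2("акація", "ація"))        # акація (suffix starts before R2)
print(strip_if_in_r2("організація", "ація"))   # організ
```

In this sketch the short word is left alone because -ація begins at offset 2 while R2 begins at offset 4; the longer word has R2 starting at offset 5, so the suffix qualifies. The real stemmer applies further steps afterwards, so these are not its final stems.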
Design decisions
- Single-char adjective endings removed: Unlike some approaches that include -а, -е, -і, etc. in the adjective section, these are handled by the noun section to avoid stealing endings from verb/noun patterns (matches the Russian stemmer approach).
- Professional before diminutive: Professional suffixes (-ар, -ник, -ельник, etc.) are tried before diminutive suffixes (-ик, -ок, etc.) to ensure будівельник → будів, rather than the diminutive -ик intercepting first.
- R2 guards compound suffixes: Long derivational suffixes like -ація, -ічний only fire in R2 to prevent over-stemming short words (e.g. акація → акаці, not ак).
- Verbal noun with R2: The verbal noun suffixes (-ання, -іння, -ення) use R2 to protect short words (e.g. мигтіння, боління) from over-stemming.
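The ordering decision for professional and diminutive suffixes can be sketched in Python as follows. This is a simplified illustration, not the stemmer's code: the suffix lists contain only the examples named above, region checks are ignored, and `strip_longest` and `stem_first_match` are hypothetical names.

```python
# Only the suffixes named in the description; the real lists are longer.
PROFESSIONAL = ("ельник", "ник", "ар")
DIMINUTIVE = ("ик", "ок")


def strip_longest(word, suffixes):
    """Strip the longest matching suffix from one group.

    Returns (stem, True) on a match, otherwise (word, False)."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], True
    return word, False


def stem_first_match(word, groups):
    """Try each suffix group in order and stop at the first group that matches."""
    for suffixes in groups:
        stem, changed = strip_longest(word, suffixes)
        if changed:
            return stem
    return word


# Professional group first: the long suffix -ельник is removed.
print(stem_first_match("будівельник", [PROFESSIONAL, DIMINUTIVE]))  # будів
# Diminutive group first: -ик intercepts and leaves a longer stem.
print(stem_first_match("будівельник", [DIMINUTIVE, PROFESSIONAL]))  # будівельн
```

Swapping the group order changes the result for the same word, which is the interception effect the ordering is meant to avoid.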
Testing
- 216 hand-crafted test cases covering nouns, adjectives, verbs, diminutives, professionals, verbal nouns, compound derivational patterns, reflexives, gerunds, participles, and negative tests
- 57,868-word vocabulary from Ukrainian Wikipedia (frequency threshold 300) — all pass
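A check of this kind could look like the Python sketch below. It assumes the vocabulary and the expected stems are available as two parallel lists, one entry per word, which is how snowball-data pairs its `voc.txt` and `output.txt` files. The function `find_mismatches` and the placeholder stemmer are hypothetical and stand in for the compiled Ukrainian stemmer.

```python
def find_mismatches(stem, vocabulary, expected):
    """Return (word, expected_stem, actual_stem) for every word whose stem differs."""
    if len(vocabulary) != len(expected):
        raise ValueError("vocabulary and expected output differ in length")
    mismatches = []
    for word, want in zip(vocabulary, expected):
        got = stem(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches


# Placeholder stemmer for demonstration only: it knows a single word.
def toy_stem(word):
    return {"акація": "акаці"}.get(word, word)


vocabulary = ["акація", "будівельник"]
expected = ["акаці", "будів"]
print(find_mismatches(toy_stem, vocabulary, expected))
# [('будівельник', 'будів', 'будівельник')]
```

An empty result list corresponds to the "all pass" outcome reported above.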
References
- Vysotska, V. (2024). "Methods and means of NLP for Ukrainian morphological analysis" — used for comparison of compound derivational stems
- Based on structural patterns of the existing Russian and Serbian Snowball stemmers
PRs
Matching PRs will be submitted to the snowball-data and snowball-website repositories, using the same branch name as this repository's PR branch.