Add Ukrainian stemmer #265

@polaz

Summary

Add a new stemming algorithm for the Ukrainian language.

Implementation

The stemmer follows the general design of the existing Russian stemmer, adapted for Ukrainian morphology:

  • Regions: pV (everything after the first vowel) and R2 (standard Snowball definition), used to guard aggressive stripping
  • Suffix removal order: perfective gerund → reflexive → verbal noun → (professional / diminutive) → adjective / verb / noun → diminutive stem → professional → derivational → tidy-up
  • Compound suffix patterns: single-step removal of combined derivational + inflectional suffixes (e.g. -аційний, -ічний, -ійний, -уальний, -онний), guarded by R2
  • Ukrainian-specific features: apostrophe handling (U+0027, U+02BC), soft sign cleanup, superlative най- prefix, comparative -іш- suffix
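
The region guards described above can be illustrated with a small Python sketch. It is not the Snowball source: the vowel set (the ten Ukrainian vowel letters), the helper names, and the example word організація are assumptions made for illustration. R2 follows the standard Snowball definition (the R1 construction applied a second time), and pV is taken literally as "after the first vowel".

```python
# Assumed vowel set: the ten Ukrainian vowel letters. The stemmer's own
# grouping may differ.
VOWELS = set("аеєиіїоуюя")


def after_vowel(word: str) -> int:
    """Start of pV: the position just past the first vowel (len(word) if none)."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return i + 1
    return len(word)


def next_region(word: str, start: int) -> int:
    """Position just past the first non-vowel that follows a vowel, from start."""
    i, n = start, len(word)
    while i < n and word[i] not in VOWELS:  # skip to the first vowel
        i += 1
    while i < n and word[i] in VOWELS:      # skip to the following non-vowel
        i += 1
    return min(i + 1, n)                    # empty region at the end if none


def r2_start(word: str) -> int:
    """R2 is the R1 construction applied again, starting inside R1."""
    return next_region(word, next_region(word, 0))


def suffix_in_r2(word: str, suffix: str) -> bool:
    """True if word ends with suffix and the whole suffix lies inside R2."""
    return word.endswith(suffix) and len(word) - len(suffix) >= r2_start(word)


for word, suffix in [("акація", "ація"), ("організація", "ація"), ("мигтіння", "іння")]:
    print(word, after_vowel(word), r2_start(word), suffix_in_r2(word, suffix))
```

Under these assumptions, -ація lies outside R2 in акація (so it is left alone) but inside R2 in організація, and -іння lies outside R2 in мигтіння.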

Design decisions

  • Single-char adjective endings removed: Unlike some approaches that include single-character endings in the adjective section, this stemmer handles them in the noun section, so they do not steal endings from verb/noun patterns (this matches the Russian stemmer's approach).
  • Professional before diminutive: Professional suffixes (-ар, -ник, -ельник, etc.) are tried before diminutive suffixes (-ик, -ок, etc.), so that будівельник → будів; otherwise the diminutive -ик would match first.
  • R2 guards compound suffixes: Long derivational suffixes like -ація, -ічний only fire in R2 to prevent over-stemming short words (e.g. акація → акаці, not ак).
  • Verbal noun with R2: The verbal noun suffixes (-ання, -іння, -ення) use R2 to protect short words (e.g. мигтіння, боління) from over-stemming.
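
The effect of trying professional suffixes before diminutive ones can be sketched as follows. This is a toy model, not the stemmer itself: the suffix lists contain only the endings named above, `strip_first_match` is a hypothetical helper, and region guards are ignored.

```python
PROFESSIONAL = ["ельник", "ник", "ар"]  # only the suffixes named in the description
DIMINUTIVE = ["ик", "ок"]               # only the suffixes named in the description


def strip_first_match(word: str, groups: list[list[str]]) -> str:
    """Remove the longest matching suffix from the first group that matches."""
    for group in groups:
        # Within a group, prefer the longest suffix (e.g. -ельник over -ник).
        for suffix in sorted(group, key=len, reverse=True):
            if word.endswith(suffix) and len(word) > len(suffix):
                return word[: -len(suffix)]
    return word


print(strip_first_match("будівельник", [PROFESSIONAL, DIMINUTIVE]))  # будів
print(strip_first_match("будівельник", [DIMINUTIVE, PROFESSIONAL]))  # будівельн
```

With the diminutive group first, -ик is removed before -ельник gets a chance to match, which is the interception the ordering is meant to avoid.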

Testing

  • 216 hand-crafted test cases covering nouns, adjectives, verbs, diminutives, professionals, verbal nouns, compound derivational patterns, reflexives, gerunds, participles, and negative tests
  • 57,868-word vocabulary from Ukrainian Wikipedia (frequency threshold 300) — all pass

References

  • Vysotska, V. (2024). "Methods and means of NLP for Ukrainian morphological analysis" — used for comparison of compound derivational stems
  • Based on structural patterns of the existing Russian and Serbian Snowball stemmers

PRs

Matching PRs will be submitted to the snowball-data and snowball-website repositories, using the same branch name as this repository's PR branch.
