Khmer Neural Segmenter

A fast Khmer word segmentation library.

Installation

pip install khmerns

Usage

from khmerns import tokenize, normalize

# Returns a list of words
words = tokenize("សួស្តីបងប្អូន")
# => ['សួស្តី', 'បង', 'ប្អូន']

# Normalize and reorder Khmer characters before tokenizing
words = tokenize(normalize("សួស្តីបងប្អូន"))
# => ['សួស្តី', 'បង', 'ប្អូន']

You can also use the class-based API if you prefer:

from khmerns import KhmerSegmenter

segmenter = KhmerSegmenter()

words = segmenter.tokenize("សួស្តីបងប្អូន")
# or call the instance directly
words = segmenter("សួស្តីបងប្អូន")

Training

The training pipeline lives in the training/ directory. It trains a BiGRU + CRF model on character-level BIO tags, then converts the result to GGUF for the C++ inference backend.

Data format

Training data is a plain text file at training/data/train.txt with one word per line. Words on consecutive lines are treated as part of the same sentence, and the model learns word boundaries from these line breaks.

Example training/data/train.txt:

សួស្តី
បង
ប្អូន
ខ្ញុំ
ទៅ
ផ្សារ

Non-Khmer tokens (spaces, punctuation, numbers, Latin text) are tagged as NON-KHMER. Khmer tokens get B-WORD on the first character and I-WORD on the rest.
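The tagging scheme above can be illustrated with a short sketch. This is not the code used in training/; the function names are hypothetical, and the Khmer check (every code point in U+1780 to U+17D3, which covers Khmer letters, vowels and signs but not Khmer punctuation or digits) is an assumption chosen to match the description above.

```python
# Assumed range: Khmer consonants, vowels and signs. Khmer punctuation
# (from U+17D4) and digits (U+17E0..U+17E9) fall outside it on purpose.
KHMER_START, KHMER_END = 0x1780, 0x17D3


def is_khmer(token):
    """Return True if every character of a non-empty token is in the assumed range."""
    return bool(token) and all(KHMER_START <= ord(ch) <= KHMER_END for ch in token)


def tag_token(token):
    """Tag one token character by character (one tag per Unicode code point)."""
    if not is_khmer(token):
        return [(ch, "NON-KHMER") for ch in token]
    return [(ch, "B-WORD" if i == 0 else "I-WORD") for i, ch in enumerate(token)]


def tag_tokens(tokens):
    """Concatenate the character-level tags of a list of tokens."""
    tagged = []
    for token in tokens:
        tagged.extend(tag_token(token))
    return tagged


if __name__ == "__main__":
    for ch, tag in tag_tokens(["បង", " ", "ok"]):
        print(repr(ch), tag)
```

Note that a character here means a single code point, so a word with stacked consonants or vowel signs yields one tag per code point, not one per visible cluster.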

Steps

cd training
pip install -r requirements.txt

1. Prepare training data

Place your segmented text in data/train.txt (one word per line). If you have raw unsegmented Khmer text, you can use the generation script to pre-segment it:

python generate.py

This requires khmersegment and a source text file. Edit the path in generate.py to point to your raw text.

2. Train

python train.py

This trains for 20 epochs with AdamW (lr=1e-5) and a ReduceLROnPlateau scheduler. It saves best_model.pt (best eval loss) and model.pt (final), and uses CUDA if available.

3. Convert to GGUF

python convert_to_gguf.py best_model.pt model.gguf

This produces a GGUF file (about 3.3 MB) containing all model weights.
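Before embedding the file, it can be worth a quick sanity check that the conversion wrote a GGUF container. The sketch below is not part of this repository; it relies only on the GGUF format itself, where a file starts with the 4-byte magic "GGUF" followed by a little-endian 32-bit version number. The function name is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the GGUF version of a file, or raise ValueError if the magic is missing."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Bytes 4..7 hold the format version as an unsigned little-endian integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print("GGUF version:", read_gguf_version(sys.argv[1]))
```

This only inspects the header; it does not validate the tensors or metadata inside the file.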

4. Embed in the C++ binary

To use the new model in the library, convert the GGUF file to a C header and replace src/model_data.h, then rebuild:

xxd -i model.gguf > ../src/model_data.h
pip install -e ..
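On systems without xxd, a header in roughly the same shape can be produced with a few lines of Python. This is a sketch under assumptions, not a script shipped in the repository: it mimics the usual xxd -i layout (an unsigned char array named after the input file, plus a _len variable), and the function name is hypothetical. Check the generated identifiers against the existing src/model_data.h before rebuilding.

```python
import re
from pathlib import Path


def to_c_header(data, filename):
    """Render bytes as a C array, following the naming convention of xxd -i."""
    # xxd -i derives the identifier from the file name, replacing every
    # character that is not a letter or digit with an underscore.
    ident = re.sub(r"[^0-9A-Za-z]", "_", filename)
    rows = []
    for i in range(0, len(data), 12):  # 12 bytes per line, as xxd -i does
        chunk = data[i:i + 12]
        rows.append("  " + ", ".join(f"0x{b:02x}" for b in chunk))
    body = ",\n".join(rows)
    return (
        f"unsigned char {ident}[] = {{\n{body}\n}};\n"
        f"unsigned int {ident}_len = {len(data)};\n"
    )


if __name__ == "__main__":
    src = Path("model.gguf")
    if src.exists():
        # Same destination as the xxd command above, relative to training/.
        Path("../src/model_data.h").write_text(to_c_header(src.read_bytes(), src.name))
```

For a 3.3 MB model the whole file is held in memory while the text is built, which is fine at this size.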

License

MIT
