Lrec orc recipe pr by psychias · Pull Request #2 · impresso/ocr-robust-multilingual-embeddings

psychias · 2026-04-02T12:26:54Z

This PR adds the resources and code for the LREC OCR-robust multilingual embeddings recipe.

What's included

Datasets

ACL — clean CLSD and STS17 evaluation sets
ACL — noisy CLSD evaluation sets (BLDS, MN, SNP) + Luxembourgish bitext mining tasks
ACL — TED and X-News noisy training data
LREC — Luxembourgish parallel pairs (lb↔de/en/fr) and historical OCR-noised documents (de, fr)
Reorganized noisy evaluation CSVs into ACL/ subdirectory

Code

generate_random_character_noise_latin_alphabet — multi-script synthetic OCR noise generator (Latin, Cyrillic, Greek, Arabic, Hebrew, Georgian)
ocr_simulator — scripts for applying OCR simulation to MIRACL and MLDR datasets
sample_training.ipynb — end-to-end two-stage fine-tuning notebook (Stage A: cross-lingual alignment, Stage B: OCR-noise robustness)
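
To make the idea behind the random character noise generator concrete, here is a minimal sketch, not the script from this PR. It assumes noise is applied per letter as a random substitution, deletion, or insertion drawn from the letter's own script. The names `ALPHABETS`, `script_of`, and `add_char_noise` are hypothetical, and only three of the listed scripts are covered for brevity.

```python
import random

# Hypothetical per-script alphabets (lowercase only, for brevity); the
# repository's generator may use different character inventories.
ALPHABETS = {
    "latin": "abcdefghijklmnopqrstuvwxyz",
    "cyrillic": "абвгдежзийклмнопрстуфхцчшщъыьэюя",
    "greek": "αβγδεζηθικλμνξοπρστυφχψω",
}


def script_of(ch):
    """Return the name of the alphabet containing ch, or None."""
    lower = ch.lower()
    for name, alphabet in ALPHABETS.items():
        if lower in alphabet:
            return name
    return None


def add_char_noise(text, rate=0.1, seed=None):
    """Substitute, delete, or insert letters with probability `rate` per letter."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        script = script_of(ch)
        if script is None or rng.random() >= rate:
            out.append(ch)  # non-letters and unselected letters pass through
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        replacement = rng.choice(ALPHABETS[script])
        if op == "substitute":
            out.append(replacement)
        elif op == "insert":
            out.extend((ch, replacement))
        # "delete": append nothing
    return "".join(out)
```

Calling `add_char_noise("historical newspaper", rate=0.2, seed=0)` returns a corrupted copy of the input; digits, spaces, and punctuation are left untouched, and replacement characters stay within the script of the letter they replace.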

- Two-stage training script (adapt_model.py) with JSON config
- Evaluation script (evaluate_embedding_model.py) for CLSD, STS, bitext mining
- Sample training notebook and config
- Noisy finetuning data (sampled/concat variants, LREC historical articles)
- Clean and noisy CLSD evaluation datasets
- STS-17 cross-lingual evaluation datasets
- Multi-script random character noise generator
- Large files (bitext JSONL, full TED/X-News CSVs) excluded - see READMEs for download links

… rename noise script folder
- Migrate sample_training.ipynb and adapt_model.py to SentenceTransformerTrainer
- Replace deprecated InputExample/DataLoader/model.fit with HFDataset/Trainer
- Use bf16 only on Ampere+ GPUs (fixes CUDA assert on T4)
- Replace warmup_ratio with warmup_steps (Transformers v5+ deprecation)
- Rename generate_random_character_noise_latin_alphabet -> generate_random_character_noise
- Remove torch CPU pins from requirements.txt (Colab compatibility)
- Add accelerate>=0.26.0 to requirements.txt
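
The bf16 and warmup changes in this commit can be illustrated with a small sketch. It is not taken from adapt_model.py: the helper names `bf16_supported` and `warmup_steps_from_ratio` are hypothetical, the Ampere check assumes the usual rule that Ampere and newer GPUs report a CUDA compute capability major version of 8 or higher (a T4 reports 7.5), and rounding the warmup count up is a choice made here.

```python
import math


def bf16_supported(compute_capability):
    """Return True if a (major, minor) CUDA compute capability is Ampere or newer.

    Pass None when no GPU is available.
    """
    if compute_capability is None:
        return False
    major, _minor = compute_capability
    return major >= 8


def warmup_steps_from_ratio(num_examples, batch_size, epochs, ratio):
    """Convert a warmup ratio into an absolute number of optimizer steps.

    Assumes one optimizer step per batch (no gradient accumulation).
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return math.ceil(total_steps * ratio)


# A T4 is Turing (7, 5); an A100 is Ampere (8, 0).
use_bf16_on_t4 = bf16_supported((7, 5))
use_bf16_on_a100 = bf16_supported((8, 0))

# 1000 examples at batch size 32 for 3 epochs is 96 steps in total.
warmup_steps = warmup_steps_from_ratio(1000, 32, 3, 0.1)
```

With PyTorch installed, the capability tuple can be read from `torch.cuda.get_device_capability()` when a GPU is present. The resulting boolean and step count would then be passed as the `bf16` and `warmup_steps` training arguments.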

psychias force-pushed the lrec_orc_recipe_pr branch from 43822b5 to 354bf49 Compare April 8, 2026 13:11

psychias wants to merge 2 commits into impresso:main from psychias:lrec_orc_recipe_pr

2 participants
