# Noisy Finetuning Data

## ACL/

Parallel texts from TED2020 and X-News with synthetically injected random character noise at varying Character Error Rates (CER). Used in Stage B of the training pipeline.

### Files

| File | Rows | Columns | Description |
|---|---|---|---|
| `TED_data_random_noise_10k_sampled.csv` | 10 000 | `german`, `french`, `german_noise_random05`, `german_noise_random10`, `german_noise_random15`, `french_noise_random05`, `french_noise_random10`, `french_noise_random15` | 10K stratified sample of TED with both languages in separate columns and noise at CER ≈ 5 %, 10 %, 15 % |
| `TED_data_random_noise_concat.csv` | 348 658 | `deu`, `deu_04` | Full TED corpus with German and French rows stacked vertically into a single column pair at CER ≈ 4 % (~2× the original 174K rows) |
| `X-News_data_random_noise_10k_sampled.csv` | 10 000 | `german`, `french`, `german_noise_random05`, …, `french_noise_random15` | 10K stratified sample of X-News with both languages in separate columns and noise at CER ≈ 5 %, 10 %, 15 % |
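The concat file stores all (clean, noisy) pairs in a single column pair rather than one wide row per sentence. A minimal sketch of how the wide sampled layout could be restacked into that shape — the column names follow the table above, but the `stack_pairs` helper and the `clean`/`noisy` output names are illustrative, not part of the repository:

```python
import csv
import io

def stack_pairs(rows, langs=("german", "french"), noise_suffix="_noise_random05"):
    """Stack per-language clean/noisy column pairs vertically into a
    single (clean, noisy) column pair, as in the *_concat layout."""
    stacked = []
    for lang in langs:
        for row in rows:
            stacked.append({"clean": row[lang], "noisy": row[lang + noise_suffix]})
    return stacked

# Tiny inline example standing in for the real 10k CSV.
sample = io.StringIO(
    "german,french,german_noise_random05,french_noise_random05\n"
    "Hallo Welt,Bonjour le monde,Hallo Wclt,Bonjour le mondc\n"
)
rows = list(csv.DictReader(sample))
pairs = stack_pairs(rows)
print(len(pairs))  # 2: one German pair and one French pair per input row
```

Stacking this way roughly doubles the row count, which matches the 348 658 rows (~2× 174K) reported for `TED_data_random_noise_concat.csv`.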

## LREC/

Data supporting both stages of the LREC 2026 recipe. Stage A uses cross-lingual parallel Luxembourgish pairs (`lb_de_training_set.jsonl`, `lb_en_training_set.jsonl`, `lb_fr_training_set.jsonl`). Stage B uses OCR-noised historical newspaper articles (`de_docs_random_noise.csv`, `fr_docs_random_noise.csv`) and a noised sample from MLSum (`query_doc_dataset_random_noise.csv`).

The Luxembourgish parallel training data originates from the impresso/histlux_emb repository; download it from the `prepared_training_sentences/` folder in that repository or from the HistLuxAlign dataset on Hugging Face.

### Files

| File | Rows | Format | Description |
|---|---|---|---|
| `lb_de_training_set.jsonl` | 2 105 | JSONL | Luxembourgish ↔ German parallel sentence pairs from historical newspapers (GPT-4o segmented & translated) |
| `lb_en_training_set.jsonl` | 2 105 | JSONL | Luxembourgish ↔ English parallel sentence pairs |
| `lb_fr_training_set.jsonl` | 2 105 | JSONL | Luxembourgish ↔ French parallel sentence pairs |
| `de_docs_random_noise.csv` | 10 000 | CSV | German historical newspaper articles with random noise (columns `deu`, `deu_04`) |
| `fr_docs_random_noise.csv` | 10 000 | CSV | French historical newspaper articles with random noise (columns `fra`, `fra_04`) |
| `query_doc_dataset_random_noise.csv` | 10 000 | CSV | 10K sample from MLSum with random noise (columns `text`, `summary`, `query`, `text_04`, `summary_04`, `query_04`) |

### JSONL schema (parallel pairs)

Each line contains a `custom_id` and a `translation` array of sentence-aligned pairs:

```json
{
  "custom_id": "task-0_deletz1893-1893-01-01-a-i0003",
  "translation": [
    {"lb": "Briefkasten.", "de": "Briefkasten."},
    {"lb": "bie crftc Mummet ...", "de": "die erste Nummer ..."}
  ]
}
```
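The schema above can be flattened into plain sentence tuples with the standard `json` module. A minimal sketch — the `load_parallel_pairs` helper name is illustrative, not part of the repository:

```python
import json

def load_parallel_pairs(lines, src="lb", tgt="de"):
    """Flatten JSONL records of the schema above into (src, tgt) tuples."""
    pairs = []
    for line in lines:
        record = json.loads(line)
        for item in record["translation"]:
            pairs.append((item[src], item[tgt]))
    return pairs

# One record matching the schema shown above.
example = json.dumps({
    "custom_id": "task-0_deletz1893-1893-01-01-a-i0003",
    "translation": [{"lb": "Briefkasten.", "de": "Briefkasten."}],
})
print(load_parallel_pairs([example]))  # [('Briefkasten.', 'Briefkasten.')]
```

For `lb_en_training_set.jsonl` and `lb_fr_training_set.jsonl`, pass `tgt="en"` or `tgt="fr"` accordingly.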

## Noise Generation

- Random character noise was generated with `generate_random_character_noise/` using script-specific confusable character tables (substitution, insertion, deletion, swap) at configurable CER levels.
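The four edit operations can be sketched as follows. This is a simplified illustration, not the actual script in `generate_random_character_noise/`: it draws replacement characters from a generic alphabet instead of the script-specific confusable tables, and the function name and signature are assumptions:

```python
import random

def add_random_noise(text, cer=0.04, alphabet="abcdefghijklmnopqrstuvwxyz", seed=0):
    """Apply substitution/insertion/deletion/swap edits at roughly the
    target character error rate (CER). Simplified: a uniform alphabet
    stands in for script-specific confusable character tables."""
    rng = random.Random(seed)
    chars = list(text)
    n_edits = round(len(chars) * cer)  # edits scale with text length
    for _ in range(n_edits):
        if not chars:
            break
        i = rng.randrange(len(chars))
        op = rng.choice(["substitute", "insert", "delete", "swap"])
        if op == "substitute":
            chars[i] = rng.choice(alphabet)
        elif op == "insert":
            chars.insert(i, rng.choice(alphabet))
        elif op == "delete":
            del chars[i]
        elif op == "swap" and i + 1 < len(chars):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

clean = "die erste Nummer der neuen Zeitung"
noisy = add_random_noise(clean, cer=0.10)
print(noisy)
```

Each edit changes the length by at most one character, so the noised output stays close in length to the input; at CER ≈ 4 % this yields corruptions comparable to the `*_04` columns described above.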