Skip to content

Translation infrastructure: style guides, glossary audit, and robustness improvements#12401

Merged
zairro merged 9 commits intodevelopfrom
ae-translation-styleguides
Mar 3, 2026
Merged

Translation infrastructure: style guides, glossary audit, and robustness improvements#12401
zairro merged 9 commits intodevelopfrom
ae-translation-styleguides

Conversation

@atom-evens
Copy link
Contributor

@atom-evens atom-evens commented Mar 2, 2026

This PR makes changes to the auto-translation workflow.

Summary

This PR adds several improvements to the automated translation infrastructure: per-language style guides, a glossary audit system that compares our glossaries against the Braze platform UI source of truth, and robustness improvements for handling large files.

Changes

Per-language style guides (scripts/styleguides/*.md)

  • 6 new style guide files (de, es, fr, ja, ko, pt-br) with language-specific translation rules: grammatical gender conventions, register/tone, terminology preferences
  • Moves language-specific rules out of the shared translation_prompt.md into individual files so feedback from regional partners can be incorporated per-language without bloating the shared prompt
  • Wired into both translate and review passes via load_styleguide() in auto_translate.py
  • translation_prompt.md updated to reference appended style guides instead of embedding per-language rules inline

Glossary audit system

  • New script: scripts/audit_glossaries.py — Compares glossary entries against source-of-truth localization files from:
    • Platform dashboard (~24K string pairs per language)
    • Android SDK, Swift SDK, and GrapesJS locale files
    • Reports mismatches (glossary differs from UI) and missing high-value terms (appear in docs but not in glossary)
    • Outputs JSON + markdown reports
  • New workflow: .github/workflows/audit-glossaries.yml — Runs weekly (Sundays) or on-demand, clones reference repos, runs the audit, and creates a GitHub Issue with findings
    • Requires REFERENCE_REPO_TOKEN secret (fine-grained PAT with Contents read access to platform/SDK repos)

Glossary updates (sourced from platform UI)

  • "Everyone Else" added to all 6 glossaries with correct platform translations (e.g., ja: その他のユーザー, de: Alle anderen, ko: 다른 모든 사용자)
  • 33 clear-cut corrections where glossary translations differed from the platform UI (e.g., de: Aktions-Pfade → Aktionspfade, ko: 소프트웨어 개발 키트 → SDK, fr: custom events → Événements personnalisés)
  • ~1,030 new terms added across all 6 glossaries — high-frequency terms from the platform UI that appear in docs but were previously missing (e.g., Message, Custom, Required, Description, Export, Content)

Robustness improvements to auto_translate.py

  • Increased max output tokens from 16,384 to 65,536 (configurable via TRANSLATION_MAX_TOKENS) to prevent silent truncation of long files
  • Truncation detection — logs a warning when the API response hits the token limit
  • File-size safeguard — files exceeding 130 KB (configurable via TRANSLATION_MAX_FILE_KB) are skipped with clear logging and included in the PR summary
  • Switched to streaming API — uses client.messages.stream() instead of client.messages.create() to comply with Anthropic's requirements for large max_tokens operations
  • Temporarily disabled orphan cleanup step in the workflow during testing

Other

  • Added glossary_audit_report.* to .gitignore

Files changed

File Change
scripts/styleguides/*.md (6 files) New: per-language style guides
scripts/audit_glossaries.py New: glossary audit script
.github/workflows/audit-glossaries.yml New: weekly audit workflow
scripts/auto_translate.py Updated: streaming, tokens, style guides, file-size safeguard
scripts/translation_prompt.md Updated: references appended style guides
scripts/glossaries/*.json (6 files) Updated: +1,063 terms added, 33 corrected
.github/workflows/auto-translate.yml Updated: orphan cleanup temporarily disabled
.gitignore Updated: exclude audit reports

Setup required

After merging, add a REFERENCE_REPO_TOKEN secret to the repo (fine-grained GitHub PAT with Contents: Read-only access to Appboy/platform, braze-inc/grapesjs, braze-inc/braze-android-sdk, braze-inc/braze-swift-sdk).

Create scripts/styleguides/*.md with language-specific rules (gender
conventions, register, terminology preferences). Move per-language
rules out of the shared prompt into individual files. Wire style guides
into both translate and review passes via load_styleguide().

This makes it easy to add language-specific feedback (e.g., from the
Japan team) without bloating the shared translation prompt.

Made-with: Cursor
Fixes silent truncation of long files (e.g., integrations.md at 738 lines)
by quadrupling the token limit and logging a warning when the limit is hit.

Made-with: Cursor
Files exceeding 130 KB (configurable via TRANSLATION_MAX_FILE_KB) are
skipped with a warning in the workflow logs and listed in the PR summary
so they don't silently go untranslated.

Made-with: Cursor
The Anthropic SDK requires streaming for operations with high max_tokens
that may exceed 10 minutes. Replaces client.messages.create() with
client.messages.stream() to fix all-tasks-failing error.

Made-with: Cursor
…m UI

- New: scripts/audit_glossaries.py compares glossary entries against
  platform dashboard, Android SDK, Swift SDK, and GrapesJS locale files
  to detect mismatches and missing high-value terms.
- New: .github/workflows/audit-glossaries.yml runs the audit weekly
  (Sundays) or on-demand, creating a GitHub Issue with findings.
- Fix: Add "Everyone Else" to all 6 glossaries from platform source
  (e.g., ja: その他のユーザー, de: Alle anderen).
- Fix: 33 clear-cut glossary corrections where platform UI translations
  differed (e.g., de: Aktions-Pfade→Aktionspfade, ko: SDK kept as SDK).
- Add glossary_audit_report.* to .gitignore.

Made-with: Cursor
Terms sourced from the platform dashboard locale files that appear
frequently in docs content but were missing from the translation
glossaries. Deduplicated against existing entries (case-insensitive).

Per language: de +173, es +172, fr +171, ja +172, ko +169, pt-br +173.

Made-with: Cursor
@atom-evens atom-evens added status: done Work is done and ready to be merged. and removed do not merge status: in progress Work in progress. labels Mar 3, 2026
@atom-evens atom-evens marked this pull request as ready for review March 3, 2026 21:46
@atom-evens atom-evens requested a review from a team as a code owner March 3, 2026 21:46
Copilot AI review requested due to automatic review settings March 3, 2026 21:46
@github-actions github-actions bot requested a review from ats-91 March 3, 2026 21:47
@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

🤖 Automated Reviewer Assignment: I have automatically added reviewers based on the following:

  • ⚖️ @ats-91 - PR diff contains privacy/legal/GDPR-related content

@atom-evens atom-evens changed the title Per-language style guides and translation rule updates Translation infrastructure: style guides, glossary audit, and robustness improvements Mar 3, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the translation prompting system to support per-language style guidance and expands glossary/automation tooling to improve translation consistency across docs.

Changes:

  • Added per-language style guide markdown files under scripts/styleguides/ and updated the shared translation prompt to defer language-specific rules to the appended style guide.
  • Updated scripts/auto_translate.py to load and append language style guides (and to skip translating oversized files with reporting in the PR summary).
  • Expanded multiple language glossaries and introduced a new glossary-audit script + scheduled workflow to detect drift vs source UI localization files.

Reviewed changes

Copilot reviewed 14 out of 18 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
scripts/translation_prompt.md Removes embedded language-specific rules and references appended language style guides.
scripts/styleguides/pt-br.md Adds pt-BR-specific gender/register rules.
scripts/styleguides/ko.md Adds Korean register/tone guidance.
scripts/styleguides/ja.md Adds Japanese register/tone guidance.
scripts/styleguides/fr.md Adds French brand-article guidance + register/tone.
scripts/styleguides/es.md Adds Spanish brand-article guidance + terminology + register/tone.
scripts/styleguides/de.md Adds German brand-article guidance + register/tone.
scripts/glossaries/pt-br.json Adds many new pt-BR glossary entries for UI/term consistency.
scripts/glossaries/ko.json Adds many new Korean glossary entries for UI/term consistency.
scripts/glossaries/ja.json Adds many new Japanese glossary entries for UI/term consistency.
scripts/glossaries/fr.json Adds many new French glossary entries for UI/term consistency.
scripts/glossaries/es.json Adds many new Spanish glossary entries for UI/term consistency.
scripts/glossaries/de.json Adds many new German glossary entries for UI/term consistency.
scripts/auto_translate.py Appends style guides to translate/review prompts; adds file-size skipping and summary reporting; switches Claude calls to streaming.
scripts/audit_glossaries.py New script to compare doc glossaries against platform/SDK/GrapesJS locale sources and generate JSON/MD reports.
.gitignore Ignores generated glossary audit reports.
.github/workflows/auto-translate.yml Comments out orphaned-translation cleanup step in the auto-translate workflow.
.github/workflows/audit-glossaries.yml New scheduled workflow to run glossary audits, then open/rotate issues when findings exist.

Introduces structural quality checks that run after translation and
before build verification. Auto-repairs front matter, code blocks, and
URLs; flags Liquid tag mismatches, glossary compliance, completeness,
and untranslated blocks. Results are summarized in the PR body.

Lazy-loads the Anthropic SDK so qc and summary commands work without it.

Made-with: Cursor
Remove non-breaking spaces (U+00A0) from glossary values in fr.json
(4 entries), de.json (1 entry), and es.json (1 trailing space). Add
explanatory comment for disabled orphan cleanup step in workflow.

Made-with: Cursor
Copy link
Contributor

@zairro zairro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm!

@zairro zairro merged commit 7fdb1dc into develop Mar 3, 2026
10 checks passed
@zairro zairro deleted the ae-translation-styleguides branch March 3, 2026 23:14
bre-fitzgerald pushed a commit that referenced this pull request Mar 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status: done Work is done and ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants