
[Tracking Issue] Integrate ICU4X to improve Unicode text handling in satori #743

@Vizards

Description

Background

satori's current text processing stack relies on several independent libraries and runtime APIs, each with known limitations:

  • the linebreak package: Unicode support stalled at Unicode 13 (2020); the package has had no npm releases in over four years and does not correctly handle Emoji ZWJ sequences or complex-script line breaking (see issues [Proposal] Replacing linebreak with icu4x and a Possible Implementation #687, Update Emojis to include Unicode 15.0+ #621)
  • Intl.Segmenter: Relies on the JavaScript runtime implementation, which behaves inconsistently across V8, JSC, SpiderMonkey, and various Edge Runtimes — and is entirely unsupported in some environments
  • emoji-regex-xs: Requires manual updates to track new Unicode versions; ICU4X's Emoji property query can cover the same functionality while eliminating this external dependency — happy to discuss with maintainers whether a replacement is worthwhile
  • Script detection in language.ts: Hardcoded regexes per language, which does not scale well
  • Some capabilities are entirely absent (e.g., BiDi / right-to-left text layout)

ICU4X is the Unicode Consortium's next-generation internationalization library, already adopted by Firefox, Chrome, and Android. It provides a WebAssembly build that can address all of the above issues in a deterministic, runtime-agnostic way.

This issue tracks the incremental integration of ICU4X into satori via a companion icu4satori package.

Proposed Approach

The core design principle is: no breaking changes, fully backward compatible.

An optional textEngine?: TextEngine field is added to SatoriOptions, allowing ICU4X to be injected as a plugin. Users who do not pass textEngine see no behavior change whatsoever.

```tsx
import { init, createTextEngine } from 'icu4satori'
import satori from 'satori'

await init(wasmInput)
const textEngine = createTextEngine(new Uint8Array(dataBlob))

const svg = await satori(<div>สวัสดีชาวโลก</div>, {
  width: 600,
  height: 400,
  fonts: [...],
  textEngine, // opt in to ICU4X; omit to fall back to existing behavior
})
```

The ICU4X WASM binary (~96 KB) and Unicode data blob (~348 KB) are distributed as separate subpath exports within the icu4satori package and loaded on demand at runtime.

Work Breakdown

🚧 Phase 1: Line Breaking Replacement (Work In Progress)

A draft implementation is ready — see #744

Problem: the linebreak package is frozen at Unicode 13 and does not support:

  • Emoji ZWJ sequences (e.g. 👨‍👩‍👧 cannot break correctly at line end)
  • Word-level line breaking for Thai, Burmese, and Khmer (SA-class characters degrade to character-level breaking)
  • Correct behavior of CSS line-break: strict/loose for CJK text

Proposed solution: Introduce the icu4satori package wrapping ICU4X LineSegmenter (UAX#14 v15.1+). Add an optional textEngine?: TextEngine field to SatoriOptions and route splitByBreakOpportunities() through textEngine.getLineBreaks() when provided.
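
A minimal sketch of how this routing could look. The interface and the break-index convention are illustrative assumptions, not the final icu4satori API, and the no-engine branch is a stand-in for the existing linebreak-based path:

```typescript
// Hypothetical shape of the injected engine; names are illustrative.
interface TextEngine {
  // Indices in `text` where a line break opportunity exists.
  getLineBreaks(text: string): number[]
}

// Sketch of routing splitByBreakOpportunities() through the engine when
// provided, with unchanged fallback behavior otherwise.
function splitByBreakOpportunities(text: string, engine?: TextEngine): string[] {
  if (!engine) {
    // Placeholder for the current (linebreak-based) code path
    return text.split(/(?<=\s)/)
  }
  const parts: string[] = []
  let prev = 0
  for (const i of engine.getLineBreaks(text)) {
    parts.push(text.slice(prev, i))
    prev = i
  }
  if (prev < text.length) parts.push(text.slice(prev))
  return parts
}
```

Because the engine is consulted only when present, users who omit textEngine keep byte-identical output.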

Planned deliverables:

  • icu4satori package: init() + createTextEngine() + TextEngine interface definition
  • satori: end-to-end threading of the textEngine? option (SatoriOptions → splitByBreakOpportunities())
  • ICU4X CodePointMapData8<LineBreak> for mandatory break detection (UAX#14 LB4/LB5)
  • Support for CSS line-break (Loose/Normal/Strict/Anywhere) and word-break (Normal/BreakAll/KeepAll)
  • New tests covering Thai LSTM line breaking, CJK keep-all, and forced line breaks

Roadmap (Blueprint)

The following phases are directional proposals. Implementation details, sequencing, and scope are open for discussion — feedback and suggestions from anyone in the community are welcome.

Phase 2: Word & Grapheme Segmentation Replacement

Problem: The segment() function relies on Intl.Segmenter, which has the following issues:

  • Incomplete or inconsistent Intl.Segmenter support in Cloudflare Workers, Deno Deploy, Bun, and other Edge Runtimes
  • Thai word boundary results differ across engines, affecting text-overflow: ellipsis truncation positions
  • capitalize text-transform depends on the accuracy of grapheme segmentation

Proposed solution:

  • textEngine.segmentWords?(text, locale) → ICU4X WordSegmenter (UAX#29, LSTM model)
  • textEngine.segmentGraphemes?(text) → ICU4X GraphemeClusterSegmenter (UAX#29)

Design note: segmentWords? and segmentGraphemes? are designed as optional methods on the TextEngine interface, sharing the same injection point as getLineBreaks — no new API surface is required.

Affected areas:

  • src/utils.ts: segment() function
  • src/text/index.ts: grapheme enumeration during missing-font detection
  • src/text/processor.ts: word/grapheme splitting under capitalize mode
  • WASM build: WordSegmenter and GraphemeClusterSegmenter symbols must be retained
  • Data blob: Word/Grapheme markers need to be included (size impact to be measured)

Phase 3: Unicode Properties Replacement

Problems:

  1. wordSeparators hardcodes 8 code points ([0x0020, 0x00a0, 0x1361, ...]) — the set is incomplete and missing several Unicode whitespace characters
  2. Emoji detection relies on emoji-regex-xs (requires manual updates per Unicode release) — ICU4X CodePointSetData.loadEmoji() can cover the same functionality and eliminate this external dependency, though whether to replace it is open for discussion
  3. Symbol/Math detection uses JS regex \p{Symbol} / \p{Math}, which depends on engine implementation

Proposed solution (all via ICU4X Property API, zero additional WASM exports):

  • wordSeparators → CodePointSetData.loadWhiteSpace(), or superseded by WordSegmenter boundary detection from Phase 2
  • Emoji → CodePointSetData.loadEmoji() property query
  • Symbol/Math → GeneralCategory enum + CodePointSetData.loadMath()
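
The coverage gap in problem 1 can be checked without ICU4X, using the JS \p{White_Space} property escape (the hardcoded set below is truncated to the code points quoted above):

```typescript
// Per the issue text: a few of the 8 hardcoded separators (truncated).
const hardcodedSeparators = new Set([0x0020, 0x00a0, 0x1361])

// Unicode White_Space property test via a property escape.
const isWhiteSpace = (cp: number): boolean =>
  /\p{White_Space}/u.test(String.fromCodePoint(cp))

// U+3000 IDEOGRAPHIC SPACE carries White_Space=Yes but is absent
// from the hardcoded set.
isWhiteSpace(0x3000)            // true
hardcodedSeparators.has(0x3000) // false
```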

Phase 4: Script Detection Refactor

Problem: detectLanguageCode() manually maintains \p{scx=...} regexes for each language, currently covering 14 languages/scripts:

  • Adding a new language requires manually adding a regex — this does not scale
  • No priority handling for multi-script characters (e.g. a Hiragana character matched against both Japanese and Han)
  • The Unicode ScriptExtensions property (a character can belong to multiple scripts) is not accounted for at all

Proposed solution:

  • CodePointMapData16.loadScript() to retrieve a character's primary script
  • ScriptExtensionsSet to obtain the full script membership for multi-script characters
  • A Script → Locale mapping table (Han → zh/ja ambiguity resolved with ScriptExtensions assistance)

Benefits: Support for 200+ scripts with no manual regex maintenance; improved locale code accuracy for loadAdditionalAsset.
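
A rough sketch of the mapping-table approach, with a hypothetical getScript() standing in for the CodePointMapData16.loadScript() lookup (all names here are illustrative assumptions):

```typescript
// Hypothetical engine surface for script lookup; illustrative only.
interface ScriptEngine {
  getScript(codePoint: number): string // short script codes, e.g. 'Hira', 'Hani'
}

// Script → Locale table replacing the per-language regexes.
const scriptToLocale: Record<string, string> = {
  Hira: 'ja', // Hiragana unambiguously implies Japanese
  Kana: 'ja',
  Hang: 'ko',
  Thai: 'th',
  Hani: 'zh', // ambiguous zh/ja; ScriptExtensions would refine this choice
}

// Return the locale of the first character whose script is mapped.
function detectLanguageCode(text: string, engine: ScriptEngine): string | undefined {
  for (const ch of text) {
    const locale = scriptToLocale[engine.getScript(ch.codePointAt(0)!)]
    if (locale) return locale
  }
  return undefined
}
```

Adding a script then means adding one table entry rather than a new regex.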

Phase 5: Case Mapping Replacement

Problem: processTextTransform() uses toLocaleUpperCase(locale) / toLocaleLowerCase(locale):

  • Special casing rules such as Turkish ı ↔ I and Greek σ/ς depend on the runtime Intl implementation
  • capitalize mode currently requires word segmentation followed by per-grapheme uppercasing (segment(word, 'grapheme') → grapheme[0].toLocaleUpperCase()), which could be simplified

Proposed solution:

  • CaseMapper.lowercase(locale, text) / CaseMapper.uppercase(locale, text)
  • TitlecaseMapper.titlecaseSegment(locale, text) to implement capitalize directly

Phase 6 (Long-term / Optional): New Capabilities

The following are capabilities that ICU4X can provide but satori currently lacks entirely — these are net-new features rather than replacements:

6.1 BiDi (Bidirectional Text) Support

  • Current satori code: Yoga.calculateLayout(..., Yoga.DIRECTION_LTR) is hardcoded to LTR; text/index.ts contains a @TODO: Support RTL languages comment
  • ICU4X provides BidiClass properties + Bidi API (UAX#9)
  • This is high-complexity work: it requires cooperation at the layout engine (Yoga) level, not just ICU4X integration
  • Suggest tracking in a dedicated issue

6.2 Text Normalization

  • satori currently performs no Unicode normalization; decomposed character sequences (e.g. e + combining acute instead of precomposed é) may cause font glyph misses
  • ICU4X ComposingNormalizer (NFC) can normalize text before font lookup
  • Suggest introducing this only in response to actual bug reports rather than proactively

6.3 Locale Enhancement

  • normalizeLocale() is currently implemented as a simple prefix match (e.g. "zh" → "zh-CN")
  • ICU4X LocaleCanonicalizer + LocaleFallbacker can provide standards-compliant BCP 47 handling
  • Would improve locale code accuracy for loadAdditionalAsset
  • Suggest completing this alongside Phase 4 as a complementary improvement
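
A rough sketch of what the current prefix match amounts to (illustrative, not satori's exact implementation):

```typescript
// Pick the first supported locale that equals the requested tag or
// extends it with a subtag, e.g. 'zh' matches 'zh-CN'.
function normalizeLocale(locale: string, supported: string[]): string | undefined {
  return supported.find(s => s === locale || s.startsWith(locale + '-'))
}
```

LocaleCanonicalizer and LocaleFallbacker would replace this with standards-compliant BCP 47 canonicalization and fallback chains.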

Priority Ordering and Rationale

| Priority | Phase | Rationale |
| --- | --- | --- |
| P1 | Phase 1 — Line Breaking | Known bugs (#621, #687) affecting all users of Thai, Emoji, and CJK text |
| P2 | Phase 2 — Word/Grapheme | Affects Edge Runtime compatibility; optional methods already reserved in the TextEngine interface, making this a natural continuation of Phase 1 |
| P3 | Phase 3 — Unicode Properties | Correctness improvements, but no known bugs in the current implementation; qualifies as technical debt cleanup |
| P3 | Phase 5 — Case Mapping | Same as above |
| P4 | Phase 4 — Script Detection | Broader impact (language classification for loadAdditionalAsset), but high refactor complexity |
| Long-term | Phase 6 — BiDi / Normalization | New capabilities requiring broader architectural discussion |
| Long-term | Phase 6.3 — Locale Enhancement | Complementary to Phase 4; LocaleCanonicalizer improves locale code accuracy |

Bundle Size Impact

| Phase | WASM | Data Blob | Notes |
| --- | --- | --- | --- |
| Phase 1 | ~96 KB | ~348 KB (auto) / ~29 KB (simple) | 3 datagen markers |
| After Phase 2 | incremental, TBD | TBD (Word + Grapheme markers) | rough estimate: +50–100 KB blob; to be measured |
| Full implementation | TBD | TBD | Depends on final set of enabled features |

All sizes are fully controllable via ld.py export symbol pruning + icu4x-datagen --markers-for-bin minimization.
