
[Tracking Issue] Integrate ICU4X to improve Unicode text handling in satori #743

@Vizards

Description

Background

satori's current text processing stack relies on several independent libraries and runtime APIs, each with known limitations:

  • the linebreak package: Unicode support stalled at Unicode 13 (2020); the package has had no npm releases in over four years and does not correctly handle Emoji ZWJ sequences or complex-script line breaking (see issues [Proposal] Replacing linebreak with icu4x and a Possible Implementation #687, Update Emojis to include Unicode 15.0+ #621)
  • Intl.Segmenter: Relies on the JavaScript runtime implementation, which behaves inconsistently across V8, JSC, SpiderMonkey, and various Edge Runtimes — and is entirely unsupported in some environments
  • emoji-regex-xs: Requires manual updates to track new Unicode versions; ICU4X's Emoji property query can cover the same functionality while eliminating this external dependency — happy to discuss with maintainers whether a replacement is worthwhile
  • Script detection in language.ts: Hardcoded regexes per language, which does not scale well
  • Some capabilities are entirely absent (e.g., BiDi / right-to-left text layout)

ICU4X is the Unicode Consortium's next-generation internationalization library, already adopted by Firefox, Chrome, and Android. It provides a WebAssembly build that can address all of the above issues in a deterministic, runtime-agnostic way.

This issue tracks the incremental integration of ICU4X into satori via a companion icu4satori package.

Proposed Approach

The core design principle is: no breaking changes, fully backward compatible.

An optional textEngine?: TextEngine field is added to SatoriOptions, allowing ICU4X to be injected as a plugin. Users who do not pass textEngine see no behavior change whatsoever.

```tsx
import { init, createTextEngine } from 'icu4satori'
import satori from 'satori'

await init(wasmInput)
const textEngine = createTextEngine(new Uint8Array(dataBlob))

const svg = await satori(<div>สวัสดีชาวโลก</div>, {
  width: 600,
  height: 400,
  fonts: [...],
  textEngine, // opt in to ICU4X; omit to fall back to existing behavior
})
```

The ICU4X WASM binary (~96 KB) and Unicode data blob (~348 KB) are distributed as separate subpath exports within the icu4satori package and loaded on demand at runtime.

Work Breakdown

🚧 Phase 1: Line Breaking Replacement (Work In Progress)

A draft implementation is ready — see #744

Problem: the linebreak package is frozen at Unicode 13 and does not support:

  • Emoji ZWJ sequences (e.g. 👨‍👩‍👧 cannot break correctly at line end)
  • Word-level line breaking for Thai, Burmese, and Khmer (SA-class characters degrade to character-level breaking)
  • Correct behavior of CSS line-break: strict/loose for CJK text

Proposed solution: Introduce the icu4satori package wrapping ICU4X LineSegmenter (UAX#14 v15.1+). Add an optional textEngine?: TextEngine field to SatoriOptions and route splitByBreakOpportunities() through textEngine.getLineBreaks() when provided.
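
A minimal sketch of how this routing could look. The interface and the break-index convention are illustrative assumptions, not the final icu4satori API, and the no-engine branch is a stand-in for the existing linebreak-based path:

```typescript
// Hypothetical shape of the injected engine; names are illustrative.
interface TextEngine {
  // Indices in `text` where a line break opportunity exists.
  getLineBreaks(text: string): number[]
}

// Sketch of routing splitByBreakOpportunities() through the engine when
// provided, with unchanged fallback behavior otherwise.
function splitByBreakOpportunities(text: string, engine?: TextEngine): string[] {
  if (!engine) {
    // Placeholder for the current (linebreak-based) code path
    return text.split(/(?<=\s)/)
  }
  const parts: string[] = []
  let prev = 0
  for (const i of engine.getLineBreaks(text)) {
    parts.push(text.slice(prev, i))
    prev = i
  }
  if (prev < text.length) parts.push(text.slice(prev))
  return parts
}
```

Because the engine is consulted only when present, users who omit textEngine keep byte-identical output.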

Planned deliverables:

  • icu4satori package: init() + createTextEngine() + TextEngine interface definition
  • satori: end-to-end threading of the textEngine? option (SatoriOptions → splitByBreakOpportunities())
  • ICU4X CodePointMapData8<LineBreak> for mandatory break detection (UAX#14 LB4/LB5)
  • Support for CSS line-break (Loose/Normal/Strict/Anywhere) and word-break (Normal/BreakAll/KeepAll)
  • New tests covering Thai LSTM line breaking, CJK keep-all, and forced line breaks

Roadmap (Blueprint)

The following phases are directional proposals. Implementation details, sequencing, and scope are open for discussion — feedback and suggestions from anyone in the community are welcome.

Phase 2: Word & Grapheme Segmentation Replacement

Problem: The segment() function relies on Intl.Segmenter, which has the following issues:

  • Incomplete or inconsistent Intl.Segmenter support in Cloudflare Workers, Deno Deploy, Bun, and other Edge Runtimes
  • Thai word boundary results differ across engines, affecting text-overflow: ellipsis truncation positions
  • capitalize text-transform depends on the accuracy of grapheme segmentation

Proposed solution:

  • textEngine.segmentWords?(text, locale) → ICU4X WordSegmenter (UAX#29, LSTM model)
  • textEngine.segmentGraphemes?(text) → ICU4X GraphemeClusterSegmenter (UAX#29)

Design note: segmentWords? and segmentGraphemes? are designed as optional methods on the TextEngine interface, sharing the same injection point as getLineBreaks — no new API surface is required.

Affected areas:

  • src/utils.ts: segment() function
  • src/text/index.ts: grapheme enumeration during missing-font detection
  • src/text/processor.ts: word/grapheme splitting under capitalize mode
  • WASM build: WordSegmenter and GraphemeClusterSegmenter symbols must be retained
  • Data blob: Word/Grapheme markers need to be included (size impact to be measured)

Phase 3: Unicode Properties Replacement

Problems:

  1. wordSeparators hardcodes 8 code points ([0x0020, 0x00a0, 0x1361, ...]) — the set is incomplete and missing several Unicode whitespace characters
  2. Emoji detection relies on emoji-regex-xs (requires manual updates per Unicode release) — ICU4X CodePointSetData.loadEmoji() can cover the same functionality and eliminate this external dependency, though whether to replace it is open for discussion
  3. Symbol/Math detection uses JS regex \p{Symbol} / \p{Math}, which depends on engine implementation

Proposed solution (all via ICU4X Property API, zero additional WASM exports):

  • wordSeparators → CodePointSetData.loadWhiteSpace(), or superseded by WordSegmenter boundary detection from Phase 2
  • Emoji → CodePointSetData.loadEmoji() property query
  • Symbol/Math → GeneralCategory enum + CodePointSetData.loadMath()
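
The coverage gap in problem 1 can be checked without ICU4X, using the JS \p{White_Space} property escape (the hardcoded set below is truncated to the code points quoted above):

```typescript
// Per the issue text: a few of the 8 hardcoded separators (truncated).
const hardcodedSeparators = new Set([0x0020, 0x00a0, 0x1361])

// Unicode White_Space property test via a property escape.
const isWhiteSpace = (cp: number): boolean =>
  /\p{White_Space}/u.test(String.fromCodePoint(cp))

// U+3000 IDEOGRAPHIC SPACE carries White_Space=Yes but is absent
// from the hardcoded set.
isWhiteSpace(0x3000)            // true
hardcodedSeparators.has(0x3000) // false
```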

Phase 4: Script Detection Refactor

Problem: detectLanguageCode() manually maintains \p{scx=...} regexes for each language, currently covering 14 languages/scripts:

  • Adding a new language requires manually adding a regex — this does not scale
  • No priority handling for multi-script characters (e.g. a Hiragana character matched against both Japanese and Han)
  • The Unicode ScriptExtensions property (a character can belong to multiple scripts) is not accounted for at all

Proposed solution:

  • CodePointMapData16.loadScript() to retrieve a character's primary script
  • ScriptExtensionsSet to obtain the full script membership for multi-script characters
  • A Script → Locale mapping table (Han → zh/ja ambiguity resolved with ScriptExtensions assistance)

Benefits: Support for 200+ scripts with no manual regex maintenance; improved locale code accuracy for loadAdditionalAsset.
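
A rough sketch of the mapping-table approach, with a hypothetical getScript() standing in for the CodePointMapData16.loadScript() lookup (all names here are illustrative assumptions):

```typescript
// Hypothetical engine surface for script lookup; illustrative only.
interface ScriptEngine {
  getScript(codePoint: number): string // short script codes, e.g. 'Hira', 'Hani'
}

// Script → Locale table replacing the per-language regexes.
const scriptToLocale: Record<string, string> = {
  Hira: 'ja', // Hiragana unambiguously implies Japanese
  Kana: 'ja',
  Hang: 'ko',
  Thai: 'th',
  Hani: 'zh', // ambiguous zh/ja; ScriptExtensions would refine this choice
}

// Return the locale of the first character whose script is mapped.
function detectLanguageCode(text: string, engine: ScriptEngine): string | undefined {
  for (const ch of text) {
    const locale = scriptToLocale[engine.getScript(ch.codePointAt(0)!)]
    if (locale) return locale
  }
  return undefined
}
```

Adding a script then means adding one table entry rather than a new regex.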

Phase 5: Case Mapping Replacement

Problem: processTextTransform() uses toLocaleUpperCase(locale) / toLocaleLowerCase(locale):

  • Special casing rules such as Turkish ı ↔ I and Greek σ/ς depend on the runtime Intl implementation
  • capitalize mode currently requires word segmentation followed by per-grapheme uppercasing (segment(word, 'grapheme') → grapheme[0].toLocaleUpperCase()), which could be simplified

Proposed solution:

  • CaseMapper.lowercase(locale, text) / CaseMapper.uppercase(locale, text)
  • TitlecaseMapper.titlecaseSegment(locale, text) to implement capitalize directly

Phase 6 (Long-term / Optional): New Capabilities

The following are capabilities that ICU4X can provide but satori currently lacks entirely — these are net-new features rather than replacements:

6.1 BiDi (Bidirectional Text) Support

  • Current satori code: Yoga.calculateLayout(..., Yoga.DIRECTION_LTR) is hardcoded to LTR; text/index.ts contains a @TODO: Support RTL languages comment
  • ICU4X provides BidiClass properties + Bidi API (UAX#9)
  • This is high-complexity work: it requires cooperation at the layout engine (Yoga) level, not just ICU4X integration
  • Suggest tracking in a dedicated issue

6.2 Text Normalization

  • satori currently performs no Unicode normalization; decomposed character sequences (e.g. e + combining acute instead of precomposed é) may cause font glyph misses
  • ICU4X ComposingNormalizer (NFC) can normalize text before font lookup
  • Suggest introducing this only in response to actual bug reports rather than proactively

6.3 Locale Enhancement

  • normalizeLocale() is currently implemented as a simple prefix match (e.g. "zh" → "zh-CN")
  • ICU4X LocaleCanonicalizer + LocaleFallbacker can provide standards-compliant BCP 47 handling
  • Would improve locale code accuracy for loadAdditionalAsset
  • Suggest completing this alongside Phase 4 as a complementary improvement
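
A rough sketch of what the current prefix match amounts to (illustrative, not satori's exact implementation):

```typescript
// Pick the first supported locale that equals the requested tag or
// extends it with a subtag, e.g. 'zh' matches 'zh-CN'.
function normalizeLocale(locale: string, supported: string[]): string | undefined {
  return supported.find(s => s === locale || s.startsWith(locale + '-'))
}
```

LocaleCanonicalizer and LocaleFallbacker would replace this with standards-compliant BCP 47 canonicalization and fallback chains.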

Priority Ordering and Rationale

| Priority | Phase | Rationale |
| --- | --- | --- |
| P1 | Phase 1 — Line Breaking | Known bugs (#621, #687) affecting all users of Thai, Emoji, and CJK text |
| P2 | Phase 2 — Word/Grapheme | Affects Edge Runtime compatibility; optional methods already reserved in the TextEngine interface, making this a natural continuation of Phase 1 |
| P3 | Phase 3 — Unicode Properties | Correctness improvements, but no known bugs in the current implementation; qualifies as technical debt cleanup |
| P3 | Phase 5 — Case Mapping | Same as above |
| P4 | Phase 4 — Script Detection | Broader impact (language classification for loadAdditionalAsset), but high refactor complexity |
| Long-term | Phase 6 — BiDi / Normalization | New capabilities requiring broader architectural discussion |
| Long-term | Phase 6.3 — Locale Enhancement | Complementary to Phase 4; LocaleCanonicalizer improves locale code accuracy |

Bundle Size Impact

| Phase | WASM | Data Blob | Notes |
| --- | --- | --- | --- |
| Phase 1 | ~96 KB | ~348 KB (auto) / ~29 KB (simple) | 3 datagen markers |
| After Phase 2 | incremental, TBD | TBD (Word + Grapheme markers) | rough estimate: +50–100 KB blob; to be measured |
| Full implementation | TBD | TBD | Depends on final set of enabled features |

All sizes are fully controllable via ld.py export symbol pruning + icu4x-datagen --markers-for-bin minimization.
