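Background
satori's current text processing stack relies on several independent libraries and runtime APIs, each with known limitations:
linebreak: Unicode support stalled at Unicode 13 (2020); the package has had no npm releases in over 4 years and does not correctly handle Emoji ZWJ sequences or complex script line breaking (see issues "[Proposal] Replacing linebreak with icu4x and a Possible Implementation" #687 and "Update Emojis to include Unicode 15.0+" #621)
Intl.Segmenter: Relies on the JavaScript runtime implementation, which behaves inconsistently across V8, JSC, SpiderMonkey, and various Edge Runtimes, and is unsupported in some environments entirely
emoji-regex-xs: Requires manual updates to track new Unicode versions; ICU4X's Emoji property query can cover the same functionality while eliminating this external dependency (happy to discuss with maintainers whether a replacement is worthwhile)
Script detection in language.ts: Hardcoded regexes per language, which does not scale well
Some capabilities are entirely absent (e.g., BiDi / right-to-left text layout)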
ICU4X is the Unicode Consortium's next-generation internationalization library, already adopted by Firefox, Chrome, and Android. It provides a WebAssembly build that can address all of the above issues in a deterministic, runtime-agnostic way.
This issue tracks the incremental integration of ICU4X into satori via a companion icu4satori package.
Proposed Approach
The core design principle is: no breaking changes, fully backward compatible.
An optional textEngine?: TextEngine field is added to SatoriOptions, allowing ICU4X to be injected as a plugin. Users who do not pass textEngine see no behavior change whatsoever.
    import { init, createTextEngine } from 'icu4satori'
    import satori from 'satori'

    await init(wasmInput)
    const textEngine = createTextEngine(new Uint8Array(dataBlob))

    const svg = await satori(<div>สวัสดีชาวโลก</div>, {
      width: 600,
      height: 400,
      fonts: [...],
      textEngine, // opt in to ICU4X; omit to fall back to existing behavior
    })
The ICU4X WASM binary (~96 KB) and Unicode data blob (~348 KB) are distributed as separate subpath exports within the icu4satori package and loaded on demand at runtime.
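To make the opt-in wiring concrete, here is a minimal sketch of how an optional textEngine could be consulted with a fallback. Only the names TextEngine, textEngine, getLineBreaks and splitByBreakOpportunities come from this proposal; the parameter and return types, the SketchOptions type, and the space-based fallback are assumptions made for illustration, not satori's actual implementation.

```typescript
// Hypothetical interface: the proposal names getLineBreaks(), but its exact
// signature is an assumption made for this sketch.
interface TextEngine {
  /** Returns UTF-16 offsets at which a line break is allowed. */
  getLineBreaks(text: string): number[]
}

// Stand-in for the relevant part of SatoriOptions.
interface SketchOptions {
  textEngine?: TextEngine
}

// Illustrative fallback only: allows a break after each run of spaces.
// satori's existing behavior is more involved than this.
function fallbackLineBreaks(text: string): number[] {
  const offsets: number[] = []
  for (let i = 1; i < text.length; i++) {
    if (text[i - 1] === ' ' && text[i] !== ' ') offsets.push(i)
  }
  return offsets
}

function splitByBreakOpportunities(
  text: string,
  options: SketchOptions = {}
): string[] {
  // Use the injected engine when present; otherwise keep the old path.
  const offsets = options.textEngine
    ? options.textEngine.getLineBreaks(text)
    : fallbackLineBreaks(text)

  const parts: string[] = []
  let start = 0
  for (const end of offsets) {
    // Ignore offsets that are out of range or not strictly increasing.
    if (end > start && end <= text.length) {
      parts.push(text.slice(start, end))
      start = end
    }
  }
  if (start < text.length) parts.push(text.slice(start))
  return parts
}
```

Because the engine is only read when it is present, callers that never pass textEngine go through the same code path as before, which is the backward-compatibility property this section relies on.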
Work Breakdown
🚧 Phase 1: Line Breaking Replacement (Work In Progress)
Problem: linebreak is frozen at Unicode 13 and does not support:
Emoji ZWJ sequences (e.g. 👨👩👧 cannot break correctly at line end)
Word-level line breaking for Thai, Burmese, and Khmer (SA-class characters degrade to character-level breaking)
Correct behavior of CSS line-break: strict/loose for CJK text
Proposed solution: Introduce the icu4satori package wrapping ICU4X LineSegmenter (UAX#14 v15.1+). Add an optional textEngine?: TextEngine field to SatoriOptions and route splitByBreakOpportunities() through textEngine.getLineBreaks() when provided.
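Planned deliverables:
icu4satori package: init() + createTextEngine() + TextEngine interface definition
satori: end-to-end threading of the textEngine? option (SatoriOptions → splitByBreakOpportunities())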
ICU4X CodePointMapData8<LineBreak> for mandatory break detection (UAX#14 LB4/LB5)
Support for CSS line-break (Loose/Normal/Strict/Anywhere) and word-break (Normal/BreakAll/KeepAll)
New tests covering Thai LSTM line breaking, CJK keep-all, and forced line breaks
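The mandatory break item (UAX#14 LB4/LB5) can be sketched without any library. The version below hardcodes the code points of the BK, CR, LF and NL line break classes; in the proposal that classification would come from ICU4X CodePointMapData8<LineBreak> instead. The function names are hypothetical.

```typescript
// Code points whose UAX#14 line break class is BK, CR, LF or NL.
// Hardcoded here for illustration; the proposal would query ICU4X instead.
function isHardBreak(codeUnit: number): boolean {
  return (
    codeUnit === 0x000a || // LF
    codeUnit === 0x000b || // BK (line tabulation)
    codeUnit === 0x000c || // BK (form feed)
    codeUnit === 0x000d || // CR
    codeUnit === 0x0085 || // NL (next line)
    codeUnit === 0x2028 || // BK (line separator)
    codeUnit === 0x2029 // BK (paragraph separator)
  )
}

// Returns the UTF-16 offsets immediately after each mandatory break.
function mandatoryBreakOffsets(text: string): number[] {
  const offsets: number[] = []
  for (let i = 0; i < text.length; i++) {
    const codeUnit = text.charCodeAt(i)
    if (!isHardBreak(codeUnit)) continue
    // LB5: never break between CR and LF; the break comes after the LF.
    if (codeUnit === 0x000d && text.charCodeAt(i + 1) === 0x000a) continue
    offsets.push(i + 1)
  }
  return offsets
}
```

All of these code points are in the Basic Multilingual Plane, so scanning UTF-16 code units is sufficient for this particular check.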
Roadmap (Blueprint)
The following phases are directional proposals. Implementation details, sequencing, and scope are open for discussion — feedback and suggestions from anyone in the community are welcome.
Phase 2: Word & Grapheme Segmentation Replacement
Problem: The segment() function relies on Intl.Segmenter, which has the following issues:
Incomplete or inconsistent Intl.Segmenter support in Cloudflare Workers, Deno Deploy, Bun, and other Edge Runtimes
Thai word boundary results differ across engines, affecting text-overflow: ellipsis truncation positions
capitalize text-transform depends on the accuracy of grapheme segmentation
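Proposed solution:
textEngine.segmentWords?(text, locale) → ICU4X WordSegmenter (UAX#29, LSTM model)
textEngine.segmentGraphemes?(text) → ICU4X GraphemeClusterSegmenter (UAX#29)
Design note: segmentWords? and segmentGraphemes? are designed as optional methods on the TextEngine interface, sharing the same injection point as getLineBreaks, so no new API surface is required.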
Affected areas:
src/utils.ts: segment() function
src/text/index.ts: grapheme enumeration during missing-font detection
src/text/processor.ts: word/grapheme splitting under capitalize mode
WASM build: WordSegmenter and GraphemeClusterSegmenter symbols must be retained
Data blob: Word/Grapheme markers need to be included (size impact to be measured)
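A rough sketch of how grapheme segmentation could prefer an injected engine and degrade gracefully where Intl.Segmenter is missing. The GraphemeEngine type and the function shape are assumptions; only the method name segmentGraphemes comes from this proposal, and satori's real segment() function may be structured differently.

```typescript
// Hypothetical slice of the TextEngine interface; only the method name
// segmentGraphemes comes from the proposal.
interface GraphemeEngine {
  segmentGraphemes?(text: string): string[]
}

function segmentGraphemes(text: string, engine?: GraphemeEngine): string[] {
  // 1. Prefer the injected engine when it implements the optional method.
  if (engine && engine.segmentGraphemes) return engine.segmentGraphemes(text)

  // 2. Otherwise use Intl.Segmenter if the runtime provides it.
  //    Accessed through `any` so the sketch compiles without newer lib typings.
  const Segmenter = (Intl as any).Segmenter
  if (typeof Segmenter === 'function') {
    const segmenter = new Segmenter(undefined, { granularity: 'grapheme' })
    return Array.from(
      segmenter.segment(text) as Iterable<{ segment: string }>,
      (part) => part.segment
    )
  }

  // 3. Last resort: split by code point, which breaks up combining sequences.
  return Array.from(text)
}
```

The third branch is the kind of degraded behavior the proposal aims to avoid: it keeps rendering working, but its output can differ from the other two for emoji sequences and combining marks.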
Phase 3: Unicode Properties Replacement
Problems:
wordSeparators hardcodes 8 code points ([0x0020, 0x00a0, 0x1361, ...]) — the set is incomplete and missing several Unicode whitespace characters
Emoji detection relies on emoji-regex-xs (requires manual updates per Unicode release) — ICU4X CodePointSetData.loadEmoji() can cover the same functionality and eliminate this external dependency, though whether to replace it is open for discussion
Symbol/Math detection uses JS regex \p{Symbol} / \p{Math}, which depends on engine implementation
Proposed solution (all via ICU4X Property API, zero additional WASM exports):
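wordSeparators → CodePointSetData.loadWhiteSpace(), or superseded by WordSegmenter boundary detection from Phase 2
Emoji detection → CodePointSetData.loadEmoji() property query
Symbol/Math detection → GeneralCategory enum + CodePointSetData.loadMath()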
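To illustrate the gap between a hardcoded separator list and a property query, the sketch below uses the JavaScript engine's own \p{White_Space} tables as a stand-in for CodePointSetData.loadWhiteSpace(). That is exactly the engine dependency this phase wants to remove, so treat it as a way to inspect the problem rather than as the proposed fix. The sample list and helper names are hypothetical.

```typescript
// Built with the RegExp constructor so the pattern is evaluated at runtime.
const WHITE_SPACE = new RegExp('^\\p{White_Space}$', 'u')

function isUnicodeWhiteSpace(codePoint: number): boolean {
  return WHITE_SPACE.test(String.fromCodePoint(codePoint))
}

// Hypothetical sample standing in for a hardcoded separator list.
const hardcodedSample: number[] = [0x0020, 0x00a0]

// Returns the candidates that are White_Space but absent from the list.
function missingFromList(candidates: number[], list: number[]): number[] {
  return candidates.filter(
    (cp) => isUnicodeWhiteSpace(cp) && list.indexOf(cp) === -1
  )
}
```

One detail worth checking during migration: U+1361 (Ethiopic wordspace) from the existing list is, to my knowledge, classified as punctuation rather than White_Space, so a purely property-based replacement might still need it as an explicit extra.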
Phase 4: Script Detection Refactor
Problem: detectLanguageCode() manually maintains \p{scx=...} regexes for each language, currently covering 14 languages/scripts:
ScriptExtensions property (a character can belong to multiple scripts) is not accounted for at all
Proposed solution:
CodePointMapData16.loadScript() to retrieve a character's primary script
ScriptExtensionsSet to obtain the full script membership for multi-script characters
Script → Locale mapping table (Han → zh/ja ambiguity resolved with ScriptExtensions assistance)
Benefits: Support for 200+ scripts with no manual regex maintenance; improved locale code accuracy for loadAdditionalAsset.
Phase 5: Case Mapping Replacement
Problem: processTextTransform() uses toLocaleUpperCase(locale) / toLocaleLowerCase(locale):
Special casing rules such as Turkish ı ↔ I and Greek σ/ς depend on the runtime Intl implementation
capitalize mode currently requires word segmentation followed by per-grapheme uppercasing (segment(word, 'grapheme') → grapheme[0].toLocaleUpperCase()), which could be simplified
Proposed solution:
CaseMapper.lowercase(locale, text) / CaseMapper.uppercase(locale, text)
TitlecaseMapper.titlecaseSegment(locale, text) to implement capitalize directly
Phase 6 (Long-term / Optional): New Capabilities
The following are capabilities that ICU4X can provide but satori currently lacks entirely — these are net-new features rather than replacements:
6.1 BiDi (Bidirectional Text) Support
Yoga.calculateLayout(..., Yoga.DIRECTION_LTR) is hardcoded to LTR; text/index.ts contains a @TODO: Support RTL languages comment
BidiClass properties + Bidi API (UAX#9)
6.2 Text Normalization
… e + ́ = é) may cause font glyph misses
ComposingNormalizer (NFC) can normalize text before font lookup
6.3 Locale Enhancement
normalizeLocale() is currently implemented as a simple prefix match (e.g. "zh" → "zh-CN")
LocaleCanonicalizer + LocaleFallbacker can provide standards-compliant BCP 47 handling
… loadAdditionalAsset
Priority Ordering and Rationale
… TextEngine interface, making this a natural continuation of Phase 1
… loadAdditionalAsset), but high refactor complexity
… LocaleCanonicalizer improves locale code accuracy
Bundle Size Impact
All sizes are fully controllable via ld.py export symbol pruning + icu4x-datagen --markers-for-bin minimization.
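Returning to the text normalization item in 6.2, the glyph-miss problem can be reproduced with the built-in String.prototype.normalize, used here as a stand-in for ICU4X ComposingNormalizer. The font coverage set and the helper name are hypothetical; real font lookup in satori works on parsed font data rather than a plain set.

```typescript
// "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form of é).
const decomposed = 'e\u0301'
// NFC composes the pair into the single code point U+00E9.
const composed = decomposed.normalize('NFC')

// Hypothetical coverage of a font that only ships the precomposed letter.
const fontCoverage = new Set<number>([0x00e9])

// Lists the code points in `text` that the font cannot render.
function missingCodePoints(text: string, coverage: Set<number>): number[] {
  return Array.from(text)
    .map((ch) => ch.codePointAt(0) as number)
    .filter((cp) => !coverage.has(cp))
}
```

With this coverage, the decomposed string reports two missing code points while the NFC form reports none, which is why normalizing before font lookup can avoid unnecessary fallback requests.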