Refactor formula preservation into unified service layer by xiaocang · Pull Request #119 · xiaocang/easydict_win32

xiaocang · 2026-04-05T13:41:28Z

Summary

This PR consolidates formula detection, protection, restoration, and fallback logic into a new IContentPreservationService interface and FormulaPreservationService implementation. This refactoring extracts scattered heuristics from LongDocumentTranslationService into a reusable, testable service that separates evidence-based detection from policy decisions.

Key Changes

New Content Preservation Service Layer
- Added IContentPreservationService interface with three core operations: Analyze(), Protect(), and Restore()
- Implemented FormulaPreservationService consolidating all formula detection heuristics (block type, font-based, character-based, subscript density)
- Created BlockContext and ProtectionPlan models to decouple evidence from policy
Unified Math Pattern Detection
- Extracted MathPatterns static class as single source of truth for math font and Unicode regex patterns
- Fixed word-boundary anchoring on short abbreviations (BL, RM, EU, LA, RS) to prevent false positives in common text fonts (e.g., "Lato-Regular", "TimesNewRoman")
- Updated CharacterParagraphBuilder, LongDocumentTranslationService, and WinUI extraction to use shared patterns
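
The word-boundary fix can be illustrated with a minimal sketch, written in Python for brevity rather than in the project's own language. The pattern below contains only the five abbreviations named above, the case-insensitive flag is an assumption, and the letter-based lookarounds are one possible way to anchor; the actual MathPatterns regex is not shown in this PR description and may differ. The font name `ABCDEF+RM10` is a made-up example.

```python
import re

# Only the abbreviations named in the PR; the real pattern likely has more alternatives.
ABBREVIATIONS = r"BL|RM|EU|LA|RS"

# Unanchored (assumed case-insensitive): matches the abbreviation anywhere in the name.
UNANCHORED = re.compile(ABBREVIATIONS, re.IGNORECASE)

# Anchored: the abbreviation must not touch another letter on either side.
ANCHORED = re.compile(
    rf"(?<![A-Za-z])(?:{ABBREVIATIONS})(?![A-Za-z])", re.IGNORECASE
)


def looks_like_math_font(font_name: str, pattern: re.Pattern) -> bool:
    """Return True when the pattern finds a math-font abbreviation in the name."""
    return pattern.search(font_name) is not None


if __name__ == "__main__":
    for name in ("Lato-Regular", "ABCDEF+RM10"):
        print(name,
              looks_like_math_font(name, UNANCHORED),
              looks_like_math_font(name, ANCHORED))
```

With the unanchored pattern, the "La" in "Lato-Regular" is enough to trigger a match; the anchored variant rejects it while still accepting an abbreviation that stands on its own between non-letter characters.
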
Two-Tier Formula Protection
- Added ProtectTwoTier() method to FormulaProtector for confidence-based output:
  - High-confidence matches → {vN} hard placeholders
  - Low-confidence matches → $...$ inline LaTeX, leaving the LLM to decide whether to preserve or translate the span
- Implemented FormulaDetector.IsHighConfidence() to classify token types
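
As a rough illustration of the two-tier output, the sketch below (Python, not the project's code) takes pre-classified spans and emits hard placeholders or soft `$...$` wrapping. The `Match` type, the function name, and the zero-based placeholder numbering are assumptions made for this example; the real ProtectTwoTier() signature is not shown in this description.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical detected span: [start, end) plus its confidence tier."""
    start: int
    end: int
    high_confidence: bool


def protect_two_tier(text: str, matches: list[Match]) -> tuple[str, dict[str, str]]:
    """Replace high-confidence spans with {vN}; wrap low-confidence spans in $...$."""
    out: list[str] = []
    token_map: dict[str, str] = {}
    cursor = 0
    for m in sorted(matches, key=lambda m: m.start):
        out.append(text[cursor:m.start])       # untouched text before the span
        span = text[m.start:m.end]
        if m.high_confidence:
            placeholder = f"{{v{len(token_map)}}}"  # numbering from 0 is an assumption
            token_map[placeholder] = span
            out.append(placeholder)
        else:
            out.append(f"${span}$")            # soft protection, visible to the LLM
        cursor = m.end
    out.append(text[cursor:])
    return "".join(out), token_map
```

Only hard placeholders enter the token map, since soft spans stay readable in the prompt and need no restoration step.
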
Improved Restoration Fallback
- Enhanced FormulaRestorer with graduated fallback policy:
  - All placeholders present → full restore with validation
  - ≥50% present → partial restore (replace available, omit missing)
  - <50% present → fall back to original text
- Prevents silent data loss when the LLM drops formula placeholders
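
The graduated policy above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the FormulaRestorer implementation: the function name and status strings are hypothetical, and the validation step mentioned for full restores is omitted.

```python
from __future__ import annotations


def restore(translated: str, original: str, token_map: dict[str, str]) -> tuple[str, str]:
    """Apply the full / partial / fallback policy based on surviving placeholders."""
    if not token_map:
        return translated, "full"              # nothing was protected

    present = [p for p in token_map if p in translated]
    ratio = len(present) / len(token_map)

    if ratio < 0.5:
        return original, "fallback"            # too much lost: keep the source text

    restored = translated
    for placeholder in present:                # missing placeholders are simply omitted
        restored = restored.replace(placeholder, token_map[placeholder])
    return restored, ("full" if ratio == 1.0 else "partial")
```

For three protected formulas, a translation that keeps two of them (about 67%) is partially restored, while one that keeps a single placeholder (about 33%) falls back to the original text.
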
Character-Level Formula Detection
- Added FormulaLatexReconstructor for reconstructing LaTeX from character-level PDF data
- Detects subscripts/superscripts from font size and baseline position
- Reverse-maps Greek/operator Unicode to LaTeX commands
- Integrated character-level protection into SourceDocumentBlock and IR building
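
A hedged sketch of the character-level reconstruction idea, in Python: characters that are noticeably smaller than the base character and shifted below or above its baseline become subscripts or superscripts, and known Unicode symbols are mapped back to LaTeX commands. The `CharInfo` fields, the 0.8 size ratio, the 1.0 baseline shift, and the three-entry symbol table are all assumptions for illustration; FormulaLatexReconstructor's actual thresholds and mappings are not given here.

```python
from __future__ import annotations

from dataclasses import dataclass

# Tiny illustrative table; the real reverse map is presumably much larger.
UNICODE_TO_LATEX = {"α": r"\alpha", "β": r"\beta", "∈": r"\in"}


@dataclass
class CharInfo:
    text: str
    font_size: float
    baseline_y: float  # PDF coordinates: larger y is higher on the page


def reconstruct_latex(chars: list[CharInfo], size_ratio: float = 0.8,
                      shift: float = 1.0) -> str:
    """Rebuild inline LaTeX from per-character size and baseline data."""
    if not chars:
        return ""
    base = chars[0]  # assumption: the first character sits on the main baseline
    parts: list[str] = []
    for ch in chars:
        command = UNICODE_TO_LATEX.get(ch.text)
        symbol = command if command is not None else ch.text
        smaller = ch.font_size <= base.font_size * size_ratio
        if smaller and ch.baseline_y < base.baseline_y - shift:
            parts.append(f"_{{{symbol}}}")     # small and lowered: subscript
        elif smaller and ch.baseline_y > base.baseline_y + shift:
            parts.append(f"^{{{symbol}}}")     # small and raised: superscript
        else:
            # A trailing space keeps a command from fusing with the next letter.
            parts.append((symbol + " ") if command is not None else symbol)
    return "".join(parts).rstrip()
```
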
Refactored LongDocumentTranslationService
- Changed BuildIrAsync() and ApplyFormulaProtectionAsync() from static to instance methods
- Injected IContentPreservationService dependency
- Replaced inline detection logic with _preservation.Analyze() and _preservation.Protect() calls
- Preserved character-level tokens through IR pipeline
Comprehensive Test Coverage
- Added FormulaPreservationServiceTests covering character-level preference, fallback behavior, and opaque block detection
- Added FormulaConfidenceTests for two-tier protection validation
- Added MathPatternsTests verifying math font/Unicode detection and false-positive prevention
- Added FormulaLatexReconstructorTests for subscript/superscript and Unicode→LaTeX mapping
- Updated CharacterParagraphBuilderTests and FormulaDetectionTests with false-positive prevention cases

Notable Implementation Details

Preference Hierarchy: Character-level evidence (from PDF character analysis) is preferred over regex-based detection when available
Backward Compatibility: FormulaProtector.Protect() overload maintains existing behavior; new ProtectTwoTier() is opt-in
Single Source of Truth: Math patterns centralized in MathPatterns to prevent regex drift across multiple files
Graduated Fallback: Restoration no longer requires 100% placeholder presence; partial restoration at ≥50% threshold reduces data loss risk

https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc

… eliminate full-block fallback

Root causes addressed:
1. FormulaRestorer fell back to untranslated originalText when ANY placeholder was missing; it now uses graduated fallback (full/partial/original based on the fraction of placeholders present)
2. CharacterParagraphBuilder MathFontRegex lacked word boundaries on BL|RM|EU|LA|RS, causing common fonts like Lato-Regular and TimesNewRoman to false-positive
3. Vertical text matrix (tm.A==0 && tm.D==0) was unconditionally classified as formula; it now requires a math font or math Unicode signal
4. MathUnicodeRegex included U+2000-U+200B general spaces as formula signals; narrowed to U+200B-U+200D (ZWSP/ZWNJ/ZWJ only)
5. Token map from the formula protection phase was discarded and re-generated during restoration; it is now stored in DocumentBlockIr.FormulaTokenMap

Architecture:
- Extract ContentPreservation abstraction layer (IContentPreservationService) separating evidence (detection) from policy (skip/protect/restore/fallback)
- FormulaPreservationService consolidates detection heuristics that were scattered across LongDocumentTranslationService
- LongDocumentTranslationService delegates to IContentPreservationService, keeping only orchestration logic
- Old internal static methods kept as thin wrappers for test compatibility

https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc

…ider (Phase 2)

Phase 2 of the formula detection overhaul:
1. Extract shared MathPatterns constant (MathFontPattern + MathUnicodePattern)
   - Single source of truth used by all 3 consumers
   - Fixes MathFontRegex bug in WinUI LongDocumentTranslationService (missing word boundaries on BL|RM|EU|LA|RS, the 3rd copy of the regex)
2. Add character-level protection fields to the data pipeline:
   - SourceDocumentBlock: CharacterLevelProtectedText, CharacterLevelTokens
   - DocumentBlockIr: same fields for flow through IR
   - BlockContext: same fields for ContentPreservation service
3. Wire CharacterParagraphBuilder into WinUI block extraction:
   - New BuildCharacterLevelProtection() helper converts PdfPig Letters to CharInfo[], runs CharacterParagraphBuilder.Build(), and extracts protected text with {v*} placeholders + FormulaToken list
   - Called at both block extraction sites (ML-detected and heuristic)
4. FormulaPreservationService now prefers character-level evidence:
   - When CharacterLevelProtectedText is set with tokens, uses it directly
   - Falls back to regex-based FormulaProtector when character-level is null
   - This makes character-level the primary detection, regex the fallback
5. Tests:
   - FormulaPreservationServiceTests: character-level preference, regex fallback, formula-only detection, analyze/restore pipeline
   - MathPatternsTests: shared regex constant validates against math fonts, common text fonts, math Unicode, and general spaces

https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc

…ruction (Phase 3)

Introduces a confidence-based two-tier system for formula protection:

**Tier 1, hard protection ({vN})**: High-confidence formulas detected by explicit LaTeX delimiters, named commands, math fonts, or math Unicode. LLM must preserve these exactly.

**Tier 2, soft protection ($...$)**: Low-confidence detections (simple equations like "x = value", sequence tokens like "hidden_state", subscript-by-size-ratio only). Content is reconstructed as LaTeX inline math and the LLM decides whether to preserve or translate.

Key changes:
1. FormulaDetector.IsHighConfidence(): classifies token types into high/low confidence tiers based on detection signal strength
2. FormulaLatexReconstructor: reconstructs LaTeX from character-level data:
   - Detects subscripts/superscripts from font size + baseline position
   - Reverse-maps Greek Unicode (α→\alpha) and math operators (∈→\in)
   - CharTextInfo lightweight struct (no PdfPig dependency)
3. FormulaProtector.ProtectTwoTier(): new overload that produces:
   - {vN} placeholders for high-confidence matches (hard tokens)
   - $original_text$ inline LaTeX for low-confidence matches (soft)
   - Existing Protect() kept as backward-compatible wrapper
4. CharacterParagraphBuilder.GetFormulaConfidence(): per-character confidence, High (math font, math Unicode, layout excluded) vs Low (subscript ratio only, U+FFFD only)
5. BuildCharacterLevelProtection(): now uses two-tier output; high-confidence groups → {vN}, low-confidence → $reconstructed_latex$
6. LLM prompt updated with three variants:
   - Hard only: "Keep all {vN} placeholders exactly as-is"
   - Soft only: "$...$ is likely math — keep if math, translate if not"
   - Both: Combined instructions for {vN} and $...$
7. Tests: FormulaConfidenceTests (IsHighConfidence, ProtectTwoTier, backward compat), FormulaLatexReconstructorTests (subscript, superscript, Greek mapping, underscore escaping, mixed groups)

https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc

…se 4)

Addresses broken long-document translation output where tuples like (x1, ..., xn) got eaten, entire sentences disappeared, and literal "sequence1" appeared in the Chinese output.

Three root causes fixed:
1. Implicit-subscript tuples were high-confidence {vN} hard placeholders, so the LLM saw opaque markers instead of math-in-context and occasionally rewrote them (e.g. {v1} -> "sequence1").
2. FormulaPreservationService.Restore hardcoded MissingTokenCount = 0, giving the retry loop no signal to react to dropped placeholders.
3. The retry loop only re-ran on exceptions, never on silent content loss.

Fix A: implicit tuples become low-confidence
- New FormulaTokenType.ImplicitTuple; Classify() returns it for (x1, ..., xn) instead of MathSubscript, so IsHighConfidence reports false and the ProtectTwoTier output wraps the span in $...$ for LLM-in-context handling.
- Explicit subscripts (h_{t-1}, W_Q) remain MathSubscript/high-confidence.

Fix B: quality-feedback retry
- New FormulaRestoreResult record + RestoreWithDiagnostics method report the per-status dropped count and missing indices; the legacy Restore() method becomes a shim. FormulaPreservationService.Restore now populates RestoreOutcome.MissingTokenCount from the real diagnostics.
- FormulaProtector.ProtectTwoTier gains an optional demoteLevel parameter; at level 1, MathSubscript/MathSuperscript/Fraction/SquareRoot are demoted to soft $...$ protection. Unambiguous formulas (Greek letters, operators, display math, environments) are never demoted.
- BlockContext gains RetryAttempt; when >= 1, FormulaPreservationService bypasses the character-level preemption path and calls ProtectTwoTier with demoteLevel = RetryAttempt to widen soft protection.
- DocumentBlockIr carries PreservationContext so the retry loop can rebuild the context for re-protection without re-reading parser signals.
- LongDocumentTranslationOptions.EnableQualityFeedbackRetry (default off) gates the new branch. When enabled, TranslateSingleBlockAsync detects PartialRestore / FallbackToOriginal outcomes, re-protects the block with RetryAttempt++ and prepends a reinforcing instruction to the LLM prompt, then continues within the shared MaxRetriesPerBlock budget.

Tests:
- FormulaDetectorTests: ImplicitTuple classification + confidence.
- FormulaConfidenceTests: ProtectTwoTier_ImplicitTuple soft-wraps (x1, ..., xn).
- FormulaProtectorTests: demoteLevel 0 vs 1 on MathSubscript; Greek letters stay hard at level 1.
- FormulaRestorerTests: RestoreWithDiagnostics reports Full/Partial/Fallback status with correct dropped counts and missing indices.
- FormulaPreservationServiceTests: MissingTokenCount wiring for partial/all missing; Protect_RetryAttempt1 demotes subscripts and skips character-level.
- LongDocumentTranslationServiceTests: QualityFeedbackRetry re-runs the translator when a placeholder is dropped; the disabled (default) path does not retry.

https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc
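
The quality-feedback retry described in Fix B can be sketched roughly as below. This is a Python illustration with injected callables, not the TranslateSingleBlockAsync code: the function name, the callable signatures, the status strings, and the default budget of 2 are all assumptions.

```python
from __future__ import annotations

from typing import Callable


def translate_block(
    text: str,
    protect: Callable[..., tuple[str, dict[str, str]]],
    translate: Callable[..., str],
    restore: Callable[[str, str, dict[str, str]], tuple[str, str]],
    max_retries: int = 2,
) -> tuple[str, str, int]:
    """Retry on silent placeholder loss, widening soft protection each attempt."""
    attempt = 0
    while True:
        # demote_level follows the retry attempt, as described for demoteLevel.
        protected, token_map = protect(text, demote_level=attempt)
        # On retries, ask the translator to reinforce the placeholder instruction.
        translated = translate(protected, reinforce=attempt > 0)
        result, status = restore(translated, text, token_map)
        if status == "full" or attempt >= max_retries:
            return result, status, attempt
        attempt += 1  # partial restore or fallback: re-protect and try again
```

The loop returns the last outcome once the budget is exhausted, so a block that never restores fully still yields the partial or fallback text rather than raising.
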

Eliminate per-span Regex.Replace in StripSyntheticDelimiters by using the existing SoftProtectedSpan.WrappedText for a literal string.Replace. Early-exit ValidateSoftProtectedSpans on empty soft spans and collapse GroupBy + double Count() into a single-pass expected-count dictionary. Extract shared TupleSequenceBody const so the master FormulaRegex and the anchored exact-preservation validators can't drift. Drop the legacy Contains('$') fallback so SoftProtectedSpans is the sole source of truth for soft-math prompting, and trim WHAT-narration comments that duplicated variable names or referenced external pdf2zh line numbers that will rot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…d TranslationManagerService

…lbackText retry

When FormulaAwareTextReconstructor produces text with merged words (e.g., "Mostcompetitiveneural" instead of "Most competitive neural"), the quality gate detects this via 3-layer checks (space density, long-word detection, longest-word ratio) and falls back to PdfPig's original text.

Key changes:
- Add IsReconstructionQualityAcceptable() with adaptive wordGapScale retry
- Add FallbackText field on SourceDocumentBlock/DocumentBlockIr for retry
- Add TryPrepareFallbackText() for full re-protection on fallback text
- Add PdfExportCheckpointTextResolver for source fallback rendering in PDF
- Add word annotation system for difficult words in source-fallback blocks
- Add Page2 integration tests using 1706.03762v7.pdf fixture

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
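
A minimal sketch of what a three-layer quality gate of this kind might look like, in Python. The three checks mirror the ones named above, but every threshold (0.08, 25, 0.5) is a made-up value for illustration, and the adaptive wordGapScale retry is not modeled; the real IsReconstructionQualityAcceptable() may work differently.

```python
def is_reconstruction_quality_acceptable(
    text: str,
    min_space_density: float = 0.08,   # hypothetical threshold
    max_word_len: int = 25,            # hypothetical threshold
    max_longest_ratio: float = 0.5,    # hypothetical threshold
) -> bool:
    """Reject reconstructed text that looks like words were merged together."""
    stripped = text.strip()
    if not stripped:
        return True
    words = stripped.split()
    longest = max(len(w) for w in words)

    # Layer 1: multi-word text should contain a reasonable share of spaces.
    if len(words) > 1 and stripped.count(" ") / len(stripped) < min_space_density:
        return False
    # Layer 2: a single very long "word" usually means several words were fused.
    if longest > max_word_len:
        return False
    # Layer 3: one word should not dominate a multi-word block.
    if len(words) > 1 and longest / len(stripped) > max_longest_ratio:
        return False
    return True
```

When the gate returns False, the caller would fall back to the parser's original text, as the commit message describes.
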

…ectionTests

- Add ScrollHelper with ScrollToPercent and ScrollToFind (percentage-based scanning with incremental search) to replace fragile Mouse.Scroll calls
- Add PopButtonSelectionFixture (IClassFixture) so Easydict + Notepad launch once instead of 8 times (once per test method)
- Update DarkModeTests, SettingsPageTests, SettingsPageScrollTests to use ScrollHelper instead of private scroll methods

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…port

Integrate FontFitSolver into MuPdfExportService to auto-shrink and truncate translated text blocks when they overflow their bounding boxes. Use per-line render rects derived from source baseline positions so translated text follows the original layout geometry.

Key changes:
- FontFitRequest: add MaxLineCount and MaxHeight constraints for line-width mode
- FontFitSolver: enforce MaxHeight and MaxLineCount in line-width fits
- MuPdfExportService: replace monolithic per-block rendering with PrepareBlockForRendering (baseline → line rects) + SolveFontFit + per-line AppendLineTextOperations pipeline
- Add erase padding and per-line background erase rects to prevent source text bleed-through
- Return structured PageRenderResult / BlockTextRenderResult with shrink/truncate diagnostics and per-page metrics
- TranslatedBlockData: add ChunkIndex, PageNumber, SourceBlockId, SourceText, RenderLineRects, BackgroundLineRects fields

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… export

When retried or fallback blocks grow beyond their original bounding box, the new page-level layout planner detects overlap and pushes neighboring blocks downward. Also propagates RetryCount through the checkpoint pipeline and extracts shared rect test helpers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…xport

claude added 2 commits April 5, 2026 13:49

xiaocang force-pushed the claude/fix-formula-detection-CF2cm branch from 098913c to 6bf4e16 Compare April 5, 2026 13:50

xiaocang force-pushed the claude/fix-formula-detection-CF2cm branch from 6bf4e16 to 19a1731 Compare April 5, 2026 14:00

claude and others added 8 commits April 5, 2026 15:09

feat: implement cache clearing functionality in TranslationManager an…

12af4f5

…d TranslationManagerService

feat: enhance formula rendering by normalizing LaTeX markup for PDF e…

b62065e

…xport

xiaocang wants to merge 11 commits into master from claude/fix-formula-detection-CF2cm
