Refactor formula preservation into unified service layer #119

Open
xiaocang wants to merge 11 commits into master from claude/fix-formula-detection-CF2cm

xiaocang commented Apr 5, 2026

Summary

This PR consolidates formula detection, protection, restoration, and fallback logic into a new IContentPreservationService interface and FormulaPreservationService implementation. This refactoring extracts scattered heuristics from LongDocumentTranslationService into a reusable, testable service that separates evidence-based detection from policy decisions.

Key Changes

  • New Content Preservation Service Layer

    • Added IContentPreservationService interface with three core operations: Analyze(), Protect(), and Restore()
    • Implemented FormulaPreservationService consolidating all formula detection heuristics (block type, font-based, character-based, subscript density)
    • Created BlockContext and ProtectionPlan models to decouple evidence from policy
  • Unified Math Pattern Detection

    • Extracted MathPatterns static class as single source of truth for math font and Unicode regex patterns
    • Fixed word-boundary anchoring on short abbreviations (BL, RM, EU, LA, RS) to prevent false positives in common text fonts (e.g., "Lato-Regular", "TimesNewRoman")
    • Updated CharacterParagraphBuilder, LongDocumentTranslationService, and WinUI extraction to use shared patterns
  • Two-Tier Formula Protection

    • Added ProtectTwoTier() method to FormulaProtector for confidence-based output:
      • High-confidence matches → {vN} hard placeholders
      • Low-confidence matches → $...$ inline LaTeX for LLM to decide
    • Implemented FormulaDetector.IsHighConfidence() to classify token types
  • Improved Restoration Fallback

    • Enhanced FormulaRestorer with graduated fallback policy:
      • All placeholders present → full restore with validation
      • ≥50% present → partial restore (surviving placeholders are replaced; formulas whose placeholders are missing are omitted)
      • <50% present → fall back to original text
    • Prevents silent data loss when LLM drops formula placeholders
  • Character-Level Formula Detection

    • Added FormulaLatexReconstructor for reconstructing LaTeX from character-level PDF data
    • Detects subscripts/superscripts from font size and baseline position
    • Reverse-maps Greek/operator Unicode to LaTeX commands
    • Integrated character-level protection into SourceDocumentBlock and IR building
  • Refactored LongDocumentTranslationService

    • Changed BuildIrAsync() and ApplyFormulaProtectionAsync() from static to instance methods
    • Injected IContentPreservationService dependency
    • Replaced inline detection logic with _preservation.Analyze() and _preservation.Protect() calls
    • Preserved character-level tokens through IR pipeline
  • Comprehensive Test Coverage

    • Added FormulaPreservationServiceTests covering character-level preference, fallback behavior, and opaque block detection
    • Added FormulaConfidenceTests for two-tier protection validation
    • Added MathPatternsTests verifying math font/Unicode detection and false-positive prevention
    • Added FormulaLatexReconstructorTests for subscript/superscript and Unicode→LaTeX mapping
    • Updated CharacterParagraphBuilderTests and FormulaDetectionTests with false-positive prevention cases
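
The two-tier output described under "Two-Tier Formula Protection" can be illustrated with a small sketch. It is written in Python for brevity (the PR itself is C#), and the two detection rules, the `TwoTierResult` type, and the function name are assumptions made for illustration, not the patterns used by `FormulaProtector.ProtectTwoTier()`.

```python
import re
from dataclasses import dataclass, field

# Hypothetical rules: (pattern, is_high_confidence).
# An explicit LaTeX command counts as high confidence; a bare
# "x = value" equation counts as low confidence.
RULES = [
    (re.compile(r"\\[A-Za-z]+(?:\{[^{}]*\})*"), True),
    (re.compile(r"\b[A-Za-z]\s*=\s*\w+"), False),
]


@dataclass
class TwoTierResult:
    text: str
    hard_tokens: list = field(default_factory=list)  # index N holds the text behind {vN}


def protect_two_tier(text: str) -> TwoTierResult:
    # Collect every match, then walk them left to right, skipping overlaps.
    matches = []
    for pattern, high in RULES:
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), high))
    matches.sort()

    out, tokens, pos = [], [], 0
    for start, end, high in matches:
        if start < pos:
            continue  # overlaps a span that was already protected
        out.append(text[pos:start])
        span = text[start:end]
        if high:
            out.append("{v%d}" % len(tokens))  # hard placeholder
            tokens.append(span)
        else:
            out.append("$" + span + "$")  # soft inline-math wrapper
        pos = end
    out.append(text[pos:])
    return TwoTierResult("".join(out), tokens)


result = protect_two_tier(r"Let \alpha be small and x = 5 here.")
print(result.text)
```

Only the hard tier needs a token map for restoration; the soft tier keeps the original characters in the text so the LLM can judge them in context.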

Notable Implementation Details

  • Preference Hierarchy: Character-level evidence (from PDF character analysis) is preferred over regex-based detection when available
  • Backward Compatibility: FormulaProtector.Protect() overload maintains existing behavior; new ProtectTwoTier() is opt-in
  • Single Source of Truth: Math patterns centralized in MathPatterns to prevent regex drift across multiple files
  • Graduated Fallback: Restoration no longer requires 100% placeholder presence; partial restoration at ≥50% threshold reduces data loss risk
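
A minimal sketch of the graduated fallback policy, written in Python rather than the project's C#. The function name, the return shape, and the handling of out-of-range placeholders are assumptions; only the full / ≥50% partial / <50% fallback thresholds come from the description above.

```python
import re
from enum import Enum


class RestoreStatus(Enum):
    FULL = "full"
    PARTIAL = "partial"
    FALLBACK = "fallback"


def restore_graduated(translated, original, tokens, threshold=0.5):
    """Restore {vN} placeholders; tokens[N] is the formula behind {vN}.

    Returns (text, status, missing_indices).
    """
    if not tokens:
        return translated, RestoreStatus.FULL, []

    missing = [i for i in range(len(tokens)) if "{v%d}" % i not in translated]
    present_ratio = (len(tokens) - len(missing)) / len(tokens)

    # Too many placeholders were dropped: keep the untranslated source.
    if present_ratio < threshold:
        return original, RestoreStatus.FALLBACK, missing

    def substitute(match):
        index = int(match.group(1))
        # Leave unknown placeholders untouched instead of raising.
        return tokens[index] if index < len(tokens) else match.group(0)

    restored = re.sub(r"\{v(\d+)\}", substitute, translated)
    status = RestoreStatus.PARTIAL if missing else RestoreStatus.FULL
    return restored, status, missing


print(restore_graduated("A {v0} B", "source text", ["E=mc^2", r"\alpha"]))
```

Returning the missing indices alongside the status gives a caller enough information to decide whether a retry is worthwhile.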

https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc

claude added 2 commits April 5, 2026 13:49
… eliminate full-block fallback

Root causes addressed:
1. FormulaRestorer fell back to untranslated originalText when ANY placeholder
   was missing — now uses graduated fallback (full/partial/original based on
   the fraction of placeholders present)
2. CharacterParagraphBuilder MathFontRegex lacked word boundaries on BL|RM|EU|LA|RS,
   causing common fonts like Lato-Regular and TimesNewRoman to false-positive
3. Vertical text matrix (tm.A==0 && tm.D==0) unconditionally classified as formula —
   now requires math font or math Unicode signal
4. MathUnicodeRegex included U+2000-U+200B general spaces as formula signals —
   narrowed to U+200B-U+200D (ZWSP/ZWNJ/ZWJ only)
5. Token map from formula protection phase was discarded and re-generated
   during restoration — now stored in DocumentBlockIr.FormulaTokenMap
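
Root causes 2 and 4 can be reproduced with a short Python sketch. The patterns below are assumptions: they contain only the five abbreviations named above, use letter-based lookarounds as one possible form of anchoring, and assume case-insensitive matching. The real MathFontRegex has more alternatives and may anchor differently. The font name `ABCDEF+RM-10` is made up.

```python
import re

ABBREVIATIONS = "BL|RM|EU|LA|RS"

# Unanchored: the abbreviation may sit inside a longer word.
LOOSE_FONT = re.compile(f"(?:{ABBREVIATIONS})", re.IGNORECASE)

# Anchored (one possible form): the abbreviation must not be
# surrounded by other letters.
ANCHORED_FONT = re.compile(
    f"(?<![A-Za-z])(?:{ABBREVIATIONS})(?![A-Za-z])", re.IGNORECASE
)

# Root cause 4: general spaces no longer count as a math signal.
OLD_INVISIBLE = re.compile(r"[\u2000-\u200B]")  # includes en/em/thin spaces
NEW_INVISIBLE = re.compile(r"[\u200B-\u200D]")  # ZWSP, ZWNJ, ZWJ only

for font in ("Lato-Regular", "ABCDEF+RM-10"):
    print(font, bool(LOOSE_FONT.search(font)), bool(ANCHORED_FONT.search(font)))
```

In this sketch "Lato-Regular" matches the loose pattern through the "La" prefix but is rejected once the abbreviation has to stand on its own.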

Architecture:
- Extract ContentPreservation abstraction layer (IContentPreservationService)
  separating evidence (detection) from policy (skip/protect/restore/fallback)
- FormulaPreservationService consolidates detection heuristics that were
  scattered across LongDocumentTranslationService
- LongDocumentTranslationService delegates to IContentPreservationService,
  keeping only orchestration logic
- Old internal static methods kept as thin wrappers for test compatibility
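
The evidence/policy split can be pictured roughly as follows. This is a Python sketch of the shape only: the field names on `BlockContext` and `ProtectionPlan`, the method signatures, and the `NoFormulaService` stand-in are all hypothetical, and the real C# interface may differ.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass
class BlockContext:
    """Evidence about one block (illustrative fields only)."""
    text: str
    block_type: str = "paragraph"


@dataclass
class ProtectionPlan:
    """Policy decision derived from the evidence."""
    skip_translation: bool


class ContentPreservationService(Protocol):
    def analyze(self, ctx: BlockContext) -> ProtectionPlan: ...
    def protect(self, ctx: BlockContext, plan: ProtectionPlan) -> Tuple[str, List[str]]: ...
    def restore(self, translated: str, tokens: List[str], original: str) -> str: ...


def translate_block(ctx, service, translate):
    """Orchestration only: no detection heuristics live here."""
    plan = service.analyze(ctx)
    if plan.skip_translation:
        return ctx.text
    protected, tokens = service.protect(ctx, plan)
    return service.restore(translate(protected), tokens, ctx.text)


class NoFormulaService:
    """Trivial stand-in so the sketch runs; it protects nothing."""

    def analyze(self, ctx):
        return ProtectionPlan(skip_translation=ctx.block_type == "formula")

    def protect(self, ctx, plan):
        return ctx.text, []

    def restore(self, translated, tokens, original):
        return translated


print(translate_block(BlockContext("hello"), NoFormulaService(), str.upper))
```

Because the orchestrator depends only on the three operations, a test can swap in a stub service without touching PDF parsing.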

https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc
…ider (Phase 2)

Phase 2 of the formula detection overhaul:

1. Extract shared MathPatterns constant (MathFontPattern + MathUnicodePattern)
   - Single source of truth used by all 3 consumers
   - Fixes MathFontRegex bug in WinUI LongDocumentTranslationService (missing
     word boundaries on BL|RM|EU|LA|RS — 3rd copy of the regex)

2. Add character-level protection fields to the data pipeline:
   - SourceDocumentBlock: CharacterLevelProtectedText, CharacterLevelTokens
   - DocumentBlockIr: same fields for flow through IR
   - BlockContext: same fields for ContentPreservation service

3. Wire CharacterParagraphBuilder into WinUI block extraction:
   - New BuildCharacterLevelProtection() helper converts PdfPig Letters
     to CharInfo[], runs CharacterParagraphBuilder.Build(), and extracts
     protected text with {v*} placeholders + FormulaToken list
   - Called at both block extraction sites (ML-detected and heuristic)

4. FormulaPreservationService now prefers character-level evidence:
   - When CharacterLevelProtectedText is set with tokens, uses it directly
   - Falls back to regex-based FormulaProtector when character-level is null
   - This makes character-level the primary detection, regex the fallback

5. Tests:
   - FormulaPreservationServiceTests: character-level preference, regex fallback,
     formula-only detection, analyze/restore pipeline
   - MathPatternsTests: shared regex constant validates against math fonts,
     common text fonts, math Unicode, and general spaces
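
The preference order in item 4 amounts to a short branch. The sketch below is Python rather than C#, the snake_case field names mirror the fields listed in item 2, and `regex_protect` is a hypothetical callable standing in for the regex-based FormulaProtector.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BlockContext:
    text: str
    character_level_protected_text: Optional[str] = None
    character_level_tokens: list = field(default_factory=list)


def protect(ctx, regex_protect):
    """Return (protected_text, tokens, source_of_evidence)."""
    # Primary path: character-level evidence from the PDF parser.
    if ctx.character_level_protected_text is not None and ctx.character_level_tokens:
        return (
            ctx.character_level_protected_text,
            list(ctx.character_level_tokens),
            "character-level",
        )
    # Fallback path: regex-based detection on the plain text.
    protected, tokens = regex_protect(ctx.text)
    return protected, tokens, "regex"


def no_op_regex_protect(text):
    # Placeholder for the regex protector; it detects nothing.
    return text, []


ctx = BlockContext("where W_Q is", "where {v0} is", ["W_Q"])
print(protect(ctx, no_op_regex_protect))
```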

https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc
@xiaocang xiaocang force-pushed the claude/fix-formula-detection-CF2cm branch from 098913c to 6bf4e16 Compare April 5, 2026 13:50
…ruction (Phase 3)

Introduces a confidence-based two-tier system for formula protection:

**Tier 1 — Hard protection ({vN})**:
High-confidence formulas detected by explicit LaTeX delimiters, named commands,
math fonts, or math Unicode. LLM must preserve these exactly.

**Tier 2 — Soft protection ($...$)**:
Low-confidence detections (simple equations like "x = value", sequence tokens
like "hidden_state", subscript-by-size-ratio only). Content is reconstructed
as LaTeX inline math and the LLM decides whether to preserve or translate.

Key changes:

1. FormulaDetector.IsHighConfidence() — classifies token types into high/low
   confidence tiers based on detection signal strength

2. FormulaLatexReconstructor — reconstructs LaTeX from character-level data:
   - Detects subscripts/superscripts from font size + baseline position
   - Reverse-maps Greek Unicode (α→\alpha) and math operators (∈→\in)
   - CharTextInfo lightweight struct (no PdfPig dependency)

3. FormulaProtector.ProtectTwoTier() — new overload that produces:
   - {vN} placeholders for high-confidence matches (hard tokens)
   - $original_text$ inline LaTeX for low-confidence matches (soft)
   - Existing Protect() kept as backward-compatible wrapper

4. CharacterParagraphBuilder.GetFormulaConfidence() — per-character
   confidence: High (math font, math Unicode, layout excluded) vs
   Low (subscript ratio only, U+FFFD only)

5. BuildCharacterLevelProtection() — now uses two-tier output:
   high-confidence groups → {vN}, low-confidence → $reconstructed_latex$

6. LLM prompt updated with three variants:
   - Hard only: "Keep all {vN} placeholders exactly as-is"
   - Soft only: "$...$ is likely math — keep if math, translate if not"
   - Both: Combined instructions for {vN} and $...$

7. Tests: FormulaConfidenceTests (IsHighConfidence, ProtectTwoTier,
   backward compat), FormulaLatexReconstructorTests (subscript, superscript,
   Greek mapping, underscore escaping, mixed groups)
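
The reconstruction step in item 2 can be sketched like this, in Python for brevity. The `CharTextInfo` fields, the 0.8 size ratio, the coordinate convention (larger y is higher on the page), and the tiny mapping table are assumptions, not the values used by FormulaLatexReconstructor.

```python
from dataclasses import dataclass


@dataclass
class CharTextInfo:
    text: str
    font_size: float
    baseline_y: float  # assumed: larger y means higher on the page


# Small illustrative subset of a reverse mapping table.
UNICODE_TO_LATEX = {"α": r"\alpha", "β": r"\beta", "∈": r"\in", "_": r"\_"}


def _to_latex(ch):
    latex = UNICODE_TO_LATEX.get(ch, ch)
    # A command such as \alpha needs a space before a following letter.
    if latex.startswith("\\") and latex[-1].isalpha():
        return latex + " "
    return latex


def reconstruct_latex(chars, size_ratio=0.8):
    if not chars:
        return ""
    base = max(chars, key=lambda c: c.font_size)  # largest glyph sets the baseline
    runs = []  # each entry is [mode, body]; mode is "", "_" or "^"
    for c in chars:
        mode = ""
        if c.font_size <= base.font_size * size_ratio:
            if c.baseline_y < base.baseline_y:
                mode = "_"
            elif c.baseline_y > base.baseline_y:
                mode = "^"
        piece = _to_latex(c.text)
        if runs and runs[-1][0] == mode:
            runs[-1][1] += piece
        else:
            runs.append([mode, piece])
    parts = []
    for mode, body in runs:
        body = body.rstrip()  # trailing spaces carry no meaning in math mode
        parts.append(f"{mode}{{{body}}}" if mode else body)
    return "".join(parts)


h = [CharTextInfo("h", 10, 100), CharTextInfo("t", 7, 98),
     CharTextInfo("-", 7, 98), CharTextInfo("1", 7, 98)]
print(reconstruct_latex(h))
```

Grouping consecutive small glyphs into one run keeps a multi-character subscript such as `t-1` inside a single pair of braces.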

https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc
@xiaocang xiaocang force-pushed the claude/fix-formula-detection-CF2cm branch from 6bf4e16 to 19a1731 Compare April 5, 2026 14:00
claude and others added 8 commits April 5, 2026 15:09
…se 4)

Addresses broken long-document translation output where tuples like
(x1, ..., xn) got eaten, entire sentences disappeared, and literal
"sequence1" appeared in the Chinese output. Three root causes fixed:

1. Implicit-subscript tuples were high-confidence {vN} hard placeholders,
   so the LLM saw opaque markers instead of math-in-context and occasionally
   rewrote them (e.g. {v1} -> "sequence1").
2. FormulaPreservationService.Restore hardcoded MissingTokenCount = 0,
   giving the retry loop no signal to react to dropped placeholders.
3. The retry loop only re-ran on exceptions, never on silent content loss.

Fix A — implicit tuples become low-confidence:
- New FormulaTokenType.ImplicitTuple; Classify() returns it for (x1, ..., xn)
  instead of MathSubscript, so IsHighConfidence reports false and the
  ProtectTwoTier output wraps the span in $...$ for LLM-in-context handling.
- Explicit subscripts (h_{t-1}, W_Q) remain MathSubscript/high-confidence.

Fix B — quality-feedback retry:
- New FormulaRestoreResult record + RestoreWithDiagnostics method report
  the per-status dropped count and missing indices; the legacy Restore()
  method becomes a shim. FormulaPreservationService.Restore now populates
  RestoreOutcome.MissingTokenCount from the real diagnostics.
- FormulaProtector.ProtectTwoTier gains an optional demoteLevel parameter;
  at level 1, MathSubscript/MathSuperscript/Fraction/SquareRoot are demoted
  to soft $...$ protection. Unambiguous formulas (Greek letters, operators,
  display math, environments) are never demoted.
- BlockContext gains RetryAttempt; when >= 1, FormulaPreservationService
  bypasses the character-level preemption path and calls ProtectTwoTier
  with demoteLevel = RetryAttempt to widen soft protection.
- DocumentBlockIr carries PreservationContext so the retry loop can rebuild
  the context for re-protection without re-reading parser signals.
- LongDocumentTranslationOptions.EnableQualityFeedbackRetry (default off)
  gates the new branch. When enabled, TranslateSingleBlockAsync detects
  PartialRestore / FallbackToOriginal outcomes, re-protects the block with
  RetryAttempt++ and prepends a reinforcing instruction to the LLM prompt,
  then continues within the shared MaxRetriesPerBlock budget.

Tests:
- FormulaDetectorTests: ImplicitTuple classification + confidence.
- FormulaConfidenceTests: ProtectTwoTier_ImplicitTuple soft-wraps (x1, ..., xn).
- FormulaProtectorTests: demoteLevel 0 vs 1 on MathSubscript; Greek letters
  stay hard at level 1.
- FormulaRestorerTests: RestoreWithDiagnostics reports Full/Partial/Fallback
  status with correct dropped counts and missing indices.
- FormulaPreservationServiceTests: MissingTokenCount wiring for partial/all
  missing; Protect_RetryAttempt1 demotes subscripts and skips character-level.
- LongDocumentTranslationServiceTests: QualityFeedbackRetry re-runs the
  translator when a placeholder is dropped; the disabled (default) path
  does not retry.
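
Fix A and Fix B together can be sketched as a demotion rule plus a retry loop. This is Python, not the project's C#. The token type names MathSubscript, MathSuperscript, Fraction, SquareRoot and ImplicitTuple come from the message above; `GreekLetter`, the callables `protect`, `translate`, `restore`, and the reinforcing prompt text are hypothetical.

```python
DEMOTABLE_AT_LEVEL_1 = {"MathSubscript", "MathSuperscript", "Fraction", "SquareRoot"}
ALWAYS_SOFT = {"ImplicitTuple"}

REINFORCE = "Keep every {vN} placeholder exactly as written.\n"  # illustrative wording


def is_hard(token_type, demote_level=0):
    """True if the token gets a {vN} placeholder, False if it is wrapped in $...$."""
    if token_type in ALWAYS_SOFT:
        return False
    if demote_level >= 1 and token_type in DEMOTABLE_AT_LEVEL_1:
        return False
    return True


def translate_block_with_feedback(source, protect, translate, restore, max_retries=2):
    """Re-protect with a higher demote level whenever placeholders were dropped."""
    attempt = 0
    while True:
        protected, tokens = protect(source, demote_level=attempt)
        prefix = REINFORCE if attempt > 0 else ""
        translated = translate(prefix, protected)
        result, missing_count = restore(translated, tokens, source)
        # Stop on a clean restore or when the retry budget is spent.
        if missing_count == 0 or attempt >= max_retries:
            return result, attempt
        attempt += 1


print(is_hard("MathSubscript", 0), is_hard("MathSubscript", 1), is_hard("GreekLetter", 1))
```

The loop reacts to the reported missing count rather than to exceptions, which is the difference from the earlier retry behavior described in root cause 3.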

https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc
- Eliminate per-span Regex.Replace in StripSyntheticDelimiters by using the
  existing SoftProtectedSpan.WrappedText for a literal string.Replace.
- Early-exit ValidateSoftProtectedSpans on empty soft spans and collapse
  GroupBy + double Count() into a single-pass expected-count dictionary.
- Extract shared TupleSequenceBody const so the master FormulaRegex and the
  anchored exact-preservation validators can't drift.
- Drop the legacy Contains('$') fallback so SoftProtectedSpans is the sole
  source of truth for soft-math prompting.
- Trim WHAT-narration comments that duplicated variable names or referenced
  external pdf2zh line numbers that will rot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lbackText retry

When FormulaAwareTextReconstructor produces text with merged words (e.g.,
"Mostcompetitiveneural" instead of "Most competitive neural"), the quality
gate detects this via 3-layer checks (space density, long-word detection,
longest-word ratio) and falls back to PdfPig's original text.

Key changes:
- Add IsReconstructionQualityAcceptable() with adaptive wordGapScale retry
- Add FallbackText field on SourceDocumentBlock/DocumentBlockIr for retry
- Add TryPrepareFallbackText() for full re-protection on fallback text
- Add PdfExportCheckpointTextResolver for source fallback rendering in PDF
- Add word annotation system for difficult words in source-fallback blocks
- Add Page2 integration tests using 1706.03762v7.pdf fixture
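
A rough Python sketch of a three-layer quality gate of the kind described above. All thresholds, the order in which gap scales are tried, and the `reconstruct` callable are assumptions; the actual IsReconstructionQualityAcceptable() logic is not shown in this PR description.

```python
def is_reconstruction_quality_acceptable(
    text, min_space_density=0.05, max_word_len=20, max_longest_ratio=0.6
):
    stripped = text.strip()
    if not stripped:
        return True
    words = stripped.split()

    # Layer 1: space density. Merged words leave too few spaces.
    if len(stripped) >= 20 and stripped.count(" ") / len(stripped) < min_space_density:
        return False

    # Layer 2: long-word detection.
    longest = max(len(w) for w in words)
    if longest > max_word_len:
        return False

    # Layer 3: longest-word ratio. One word should not dominate the block.
    letters = sum(len(w) for w in words)
    if len(words) >= 3 and longest / letters > max_longest_ratio:
        return False
    return True


def choose_block_text(reconstruct, pdf_text, gap_scales=(1.0, 0.8, 0.6)):
    """Try each word-gap scale in turn; fall back to the parser's own text."""
    for scale in gap_scales:
        candidate = reconstruct(scale)
        if is_reconstruction_quality_acceptable(candidate):
            return candidate
    return pdf_text


print(is_reconstruction_quality_acceptable("Mostcompetitiveneural"))
```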

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ectionTests

- Add ScrollHelper with ScrollToPercent and ScrollToFind (percentage-based
  scanning with incremental search) to replace fragile Mouse.Scroll calls
- Add PopButtonSelectionFixture (IClassFixture) so Easydict + Notepad launch
  once per test class instead of once per test method (8 launches)
- Update DarkModeTests, SettingsPageTests, SettingsPageScrollTests to use
  ScrollHelper instead of private scroll methods

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port

Integrate FontFitSolver into MuPdfExportService to auto-shrink and
truncate translated text blocks when they overflow their bounding boxes.
Use per-line render rects derived from source baseline positions so
translated text follows the original layout geometry.

Key changes:
- FontFitRequest: add MaxLineCount and MaxHeight constraints for
  line-width mode
- FontFitSolver: enforce MaxHeight and MaxLineCount in line-width fits
- MuPdfExportService: replace monolithic per-block rendering with
  PrepareBlockForRendering (baseline → line rects) + SolveFontFit +
  per-line AppendLineTextOperations pipeline
- Add erase padding and per-line background erase rects to prevent
  source text bleed-through
- Return structured PageRenderResult / BlockTextRenderResult with
  shrink/truncate diagnostics and per-page metrics
- TranslatedBlockData: add ChunkIndex, PageNumber, SourceBlockId,
  SourceText, RenderLineRects, BackgroundLineRects fields

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… export

When retried or fallback blocks grow beyond their original bounding box,
the new page-level layout planner detects overlap and pushes neighboring
blocks downward. Also propagates RetryCount through the checkpoint
pipeline and extracts shared rect test helpers.
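
The push-down behavior could look roughly like the Python sketch below. The `Rect` type, the top-down coordinate convention (y grows downward), and the single-pass strategy are assumptions; the planner in this PR may handle columns and page overflow differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rect:
    left: float
    top: float  # assumed: y grows downward
    width: float
    height: float

    @property
    def right(self):
        return self.left + self.width

    @property
    def bottom(self):
        return self.top + self.height


def push_down_overlaps(rects, gap=0.0):
    """Shift blocks down until none overlaps a block placed before it."""
    placed = []
    for r in sorted(rects, key=lambda rect: rect.top):
        top = r.top
        for p in placed:
            shares_columns = r.left < p.right and p.left < r.right
            if shares_columns and top < p.bottom + gap:
                top = p.bottom + gap  # move below the earlier block
        placed.append(replace(r, top=top))
    return placed


grown = Rect(0, 0, 100, 50)      # block that grew after a retry
below = Rect(0, 40, 100, 20)     # neighbor it now overlaps
other = Rect(200, 40, 100, 20)   # different column, left alone
for rect in push_down_overlaps([grown, below, other]):
    print(rect)
```

Blocks are only ever moved downward, so reading order within a column is preserved. This sketch does not check whether a pushed block still fits on the page.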

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>