Refactor formula preservation into unified service layer#119
Open
… eliminate full-block fallback
Root causes addressed:
1. FormulaRestorer fell back to untranslated originalText when ANY placeholder
was missing; it now uses graduated fallback (full/partial/original based on
the fraction of placeholders present)
2. CharacterParagraphBuilder MathFontRegex lacked word boundaries on
BL|RM|EU|LA|RS, causing common fonts like Lato-Regular and TimesNewRoman to
false-positive
3. Vertical text matrix (tm.A==0 && tm.D==0) was unconditionally classified as
formula; it now requires a math font or math Unicode signal
4. MathUnicodeRegex included U+2000-U+200B general spaces as formula signals;
narrowed to U+200B-U+200D (ZWSP/ZWNJ/ZWJ only)
5. Token map from the formula protection phase was discarded and re-generated
during restoration; it is now stored in DocumentBlockIr.FormulaTokenMap
Architecture:
- Extract ContentPreservation abstraction layer (IContentPreservationService)
separating evidence (detection) from policy (skip/protect/restore/fallback)
- FormulaPreservationService consolidates detection heuristics that were
scattered across LongDocumentTranslationService
- LongDocumentTranslationService delegates to IContentPreservationService,
keeping only orchestration logic
- Old internal static methods kept as thin wrappers for test compatibility
https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc
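The graduated fallback from root cause 1 could look roughly like the sketch below. It is a Python illustration, not the project's C# code. The function name restore_graduated, the shape of the token map, and the 0.5 cut-off for partial restoration are all assumptions; the commit only states that the outcome depends on the fraction of placeholders present.

```python
import re

PLACEHOLDER = re.compile(r"\{v(\d+)\}")


def restore_graduated(translated, original, token_map, partial_threshold=0.5):
    """Return (text, status), where status is "full", "partial" or "original".

    token_map maps a placeholder index to the formula source it replaced.
    partial_threshold is a hypothetical cut-off, not a value from the PR.
    """
    if not token_map:
        return translated, "full"

    present = {int(m.group(1)) for m in PLACEHOLDER.finditer(translated)}
    expected = set(token_map)
    fraction = len(present & expected) / len(expected)

    def substitute(match):
        index = int(match.group(1))
        # Placeholders with no known formula are left untouched, not guessed.
        return token_map.get(index, match.group(0))

    if fraction == 1.0:
        return PLACEHOLDER.sub(substitute, translated), "full"
    if fraction >= partial_threshold:
        # Keep the translation even though some formulas were dropped.
        return PLACEHOLDER.sub(substitute, translated), "partial"
    # Too few placeholders survived: fall back to the untranslated source.
    return original, "original"
```

With this shape, a single dropped placeholder no longer discards the whole translated block; only a mostly damaged translation falls back to the source text.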
…ider (Phase 2)
Phase 2 of the formula detection overhaul:
1. Extract shared MathPatterns constant (MathFontPattern + MathUnicodePattern)
- Single source of truth used by all 3 consumers
- Fixes MathFontRegex bug in WinUI LongDocumentTranslationService (missing
word boundaries on BL|RM|EU|LA|RS — 3rd copy of the regex)
2. Add character-level protection fields to the data pipeline:
- SourceDocumentBlock: CharacterLevelProtectedText, CharacterLevelTokens
- DocumentBlockIr: same fields for flow through IR
- BlockContext: same fields for ContentPreservation service
3. Wire CharacterParagraphBuilder into WinUI block extraction:
- New BuildCharacterLevelProtection() helper converts PdfPig Letters
to CharInfo[], runs CharacterParagraphBuilder.Build(), and extracts
protected text with {v*} placeholders + FormulaToken list
- Called at both block extraction sites (ML-detected and heuristic)
4. FormulaPreservationService now prefers character-level evidence:
- When CharacterLevelProtectedText is set with tokens, uses it directly
- Falls back to regex-based FormulaProtector when character-level is null
- This makes character-level the primary detection, regex the fallback
5. Tests:
- FormulaPreservationServiceTests: character-level preference, regex fallback,
formula-only detection, analyze/restore pipeline
- MathPatternsTests: shared regex constant validates against math fonts,
common text fonts, math Unicode, and general spaces
https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc
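To illustrate why the word boundaries and the narrowed Unicode range matter, here is a small Python sketch. The patterns are simplified stand-ins, not the contents of MathPatterns: they cover only the five short font tags and the zero-width range named in these commits, case-insensitive matching is an assumption, and the font name "ABCDEF+RM" is a made-up example.

```python
import re

# Simplified stand-ins: only the five short font tags named in the commit.
# Case-insensitive matching is an assumption made for this illustration.
UNBOUNDED_FONT = re.compile(r"BL|RM|EU|LA|RS", re.IGNORECASE)
BOUNDED_FONT = re.compile(r"\b(?:BL|RM|EU|LA|RS)\b", re.IGNORECASE)

# Only the zero-width fragment of the Unicode pattern: ZWSP, ZWNJ, ZWJ.
ZERO_WIDTH = re.compile("[\u200B-\u200D]")


def looks_like_math_font(font_name, pattern=BOUNDED_FONT):
    """True when the font name contains one of the tags as a whole word."""
    return pattern.search(font_name) is not None


def has_zero_width_signal(text):
    """True when the text contains ZWSP, ZWNJ or ZWJ (not general spaces)."""
    return ZERO_WIDTH.search(text) is not None
```

Without boundaries, the "La" in "Lato-Regular" is enough to trigger a match; with them, the tag has to stand alone. Likewise, an em space (U+2003) no longer counts as a formula signal, while a zero-width space (U+200B) still does.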
098913c to 6bf4e16
…ruction (Phase 3)
Introduces a confidence-based two-tier system for formula protection:
**Tier 1 — Hard protection ({vN})**:
High-confidence formulas detected by explicit LaTeX delimiters, named commands,
math fonts, or math Unicode. LLM must preserve these exactly.
**Tier 2 — Soft protection ($...$)**:
Low-confidence detections (simple equations like "x = value", sequence tokens
like "hidden_state", subscript-by-size-ratio only). Content is reconstructed
as LaTeX inline math and the LLM decides whether to preserve or translate.
Key changes:
1. FormulaDetector.IsHighConfidence() — classifies token types into high/low
confidence tiers based on detection signal strength
2. FormulaLatexReconstructor — reconstructs LaTeX from character-level data:
- Detects subscripts/superscripts from font size + baseline position
- Reverse-maps Greek Unicode (α→\alpha) and math operators (∈→\in)
- CharTextInfo lightweight struct (no PdfPig dependency)
3. FormulaProtector.ProtectTwoTier() — new overload that produces:
- {vN} placeholders for high-confidence matches (hard tokens)
- $original_text$ inline LaTeX for low-confidence matches (soft)
- Existing Protect() kept as backward-compatible wrapper
4. CharacterParagraphBuilder.GetFormulaConfidence() — per-character
confidence: High (math font, math Unicode, layout excluded) vs
Low (subscript ratio only, U+FFFD only)
5. BuildCharacterLevelProtection() — now uses two-tier output:
high-confidence groups → {vN}, low-confidence → $reconstructed_latex$
6. LLM prompt updated with three variants:
- Hard only: "Keep all {vN} placeholders exactly as-is"
- Soft only: "$...$ is likely math — keep if math, translate if not"
- Both: Combined instructions for {vN} and $...$
7. Tests: FormulaConfidenceTests (IsHighConfidence, ProtectTwoTier,
backward compat), FormulaLatexReconstructorTests (subscript, superscript,
Greek mapping, underscore escaping, mixed groups)
https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc
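The two-tier output described in this commit can be sketched as follows. This is a Python illustration under stated assumptions, not the actual ProtectTwoTier() implementation: detection is taken as given (spans arrive pre-classified), spans are assumed not to overlap, the DetectedSpan type is hypothetical, and zero-based placeholder numbering is a guess.

```python
from dataclasses import dataclass


@dataclass
class DetectedSpan:
    """Hypothetical detector output: a character range plus its confidence tier."""
    start: int
    end: int
    high_confidence: bool


def protect_two_tier(text, spans):
    """Return (protected_text, hard_tokens).

    High-confidence spans become opaque {vN} placeholders whose source is kept
    in hard_tokens; low-confidence spans stay visible, wrapped in $...$.
    """
    parts, hard_tokens, cursor = [], [], 0
    for span in sorted(spans, key=lambda s: s.start):
        parts.append(text[cursor:span.start])
        source = text[span.start:span.end]
        if span.high_confidence:
            parts.append("{v%d}" % len(hard_tokens))
            hard_tokens.append(source)
        else:
            parts.append("$" + source + "$")
        cursor = span.end
    parts.append(text[cursor:])
    return "".join(parts), hard_tokens
```

For example, in "Let \alpha be x = 1 here" a high-confidence \alpha and a low-confidence "x = 1" would yield "Let {v0} be $x = 1$ here", with only \alpha stored for later restoration.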
6bf4e16 to 19a1731
…se 4)
Addresses broken long-document translation output where tuples like
(x1, ..., xn) got eaten, entire sentences disappeared, and literal
"sequence1" appeared in the Chinese output. Three root causes fixed:
1. Implicit-subscript tuples were high-confidence {vN} hard placeholders,
so the LLM saw opaque markers instead of math-in-context and occasionally
rewrote them (e.g. {v1} -> "sequence1").
2. FormulaPreservationService.Restore hardcoded MissingTokenCount = 0,
giving the retry loop no signal to react to dropped placeholders.
3. The retry loop only re-ran on exceptions, never on silent content loss.
Fix A — implicit tuples become low-confidence:
- New FormulaTokenType.ImplicitTuple; Classify() returns it for (x1, ..., xn)
instead of MathSubscript, so IsHighConfidence reports false and the
ProtectTwoTier output wraps the span in $...$ for LLM-in-context handling.
- Explicit subscripts (h_{t-1}, W_Q) remain MathSubscript/high-confidence.
Fix B — quality-feedback retry:
- New FormulaRestoreResult record + RestoreWithDiagnostics method report
the per-status dropped count and missing indices; the legacy Restore()
method becomes a shim. FormulaPreservationService.Restore now populates
RestoreOutcome.MissingTokenCount from the real diagnostics.
- FormulaProtector.ProtectTwoTier gains an optional demoteLevel parameter;
at level 1, MathSubscript/MathSuperscript/Fraction/SquareRoot are demoted
to soft $...$ protection. Unambiguous formulas (Greek letters, operators,
display math, environments) are never demoted.
- BlockContext gains RetryAttempt; when >= 1, FormulaPreservationService
bypasses the character-level preemption path and calls ProtectTwoTier
with demoteLevel = RetryAttempt to widen soft protection.
- DocumentBlockIr carries PreservationContext so the retry loop can rebuild
the context for re-protection without re-reading parser signals.
- LongDocumentTranslationOptions.EnableQualityFeedbackRetry (default off)
gates the new branch. When enabled, TranslateSingleBlockAsync detects
PartialRestore / FallbackToOriginal outcomes, re-protects the block with
RetryAttempt++ and prepends a reinforcing instruction to the LLM prompt,
then continues within the shared MaxRetriesPerBlock budget.
Tests:
- FormulaDetectorTests: ImplicitTuple classification + confidence.
- FormulaConfidenceTests: ProtectTwoTier_ImplicitTuple soft-wraps (x1, ..., xn).
- FormulaProtectorTests: demoteLevel 0 vs 1 on MathSubscript; Greek letters
stay hard at level 1.
- FormulaRestorerTests: RestoreWithDiagnostics reports Full/Partial/Fallback
status with correct dropped counts and missing indices.
- FormulaPreservationServiceTests: MissingTokenCount wiring for partial/all
missing; Protect_RetryAttempt1 demotes subscripts and skips character-level.
- LongDocumentTranslationServiceTests: QualityFeedbackRetry re-runs the
translator when a placeholder is dropped; the disabled (default) path
does not retry.
https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc
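The quality-feedback retry in Fix B can be sketched like this. It is a simplified Python illustration, not the code in TranslateSingleBlockAsync: the protect and translate callables, their signatures, and the default retry budget of 2 are assumptions, and the real loop reacts to PartialRestore / FallbackToOriginal outcomes rather than a raw list of missing indices.

```python
import re

PLACEHOLDER = re.compile(r"\{v(\d+)\}")


def missing_indices(translated, token_count):
    """Indices of hard placeholders that did not survive translation."""
    present = {int(m.group(1)) for m in PLACEHOLDER.finditer(translated)}
    return [i for i in range(token_count) if i not in present]


def translate_with_feedback(block, protect, translate, max_retries=2):
    """Retry on silent placeholder loss, not only on exceptions.

    protect(block, demote_level) -> (protected_text, hard_tokens)
    translate(protected_text, reinforce) -> translated_text
    Returns (translated_text, missing, attempts_used).
    """
    attempt = 0
    while True:
        # The retry attempt doubles as the demote level, widening soft protection.
        protected, tokens = protect(block, attempt)
        # On retries, ask the translator to add a reinforcing instruction.
        translated = translate(protected, attempt > 0)
        missing = missing_indices(translated, len(tokens))
        if not missing or attempt >= max_retries:
            return translated, missing, attempt + 1
        attempt += 1
```

The point of the design is that a dropped placeholder becomes an observable signal: the second attempt sees fewer opaque markers and more math in context.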
Eliminate per-span Regex.Replace in StripSyntheticDelimiters by using the
existing SoftProtectedSpan.WrappedText for a literal string.Replace. Early-exit
ValidateSoftProtectedSpans on empty soft spans and collapse GroupBy + double
Count() into a single-pass expected-count dictionary. Extract shared
TupleSequenceBody const so the master FormulaRegex and the anchored
exact-preservation validators can't drift. Drop the legacy Contains('$')
fallback so SoftProtectedSpans is the sole source of truth for soft-math
prompting, and trim WHAT-narration comments that duplicated variable names or
referenced external pdf2zh line numbers that will rot.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
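The two optimizations in this commit (literal replace instead of a per-span regex, and a single-pass expected-count dictionary) might look like this in outline. This is a Python sketch with assumed semantics: the (wrapped, inner) tuple shape stands in for SoftProtectedSpan, and the rule that each wrapped span must appear exactly as many times as expected is a guess at what the validator checks.

```python
from collections import Counter


def strip_synthetic_delimiters(text, soft_spans):
    """soft_spans: list of (wrapped_text, inner_text), e.g. ("$x_1$", "x_1").

    The wrapped text is already known, so a literal replace is enough and no
    regex has to be built per span.
    """
    for wrapped, inner in soft_spans:
        text = text.replace(wrapped, inner)
    return text


def validate_soft_spans(translated, soft_spans):
    """True when every wrapped span occurs as often as it was emitted."""
    if not soft_spans:
        return True  # early exit: nothing to validate
    # One pass builds the expected counts; no grouping plus repeated counting.
    expected = Counter(wrapped for wrapped, _ in soft_spans)
    return all(translated.count(wrapped) == n for wrapped, n in expected.items())
```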
…d TranslationManagerService
…lbackText retry
When FormulaAwareTextReconstructor produces text with merged words (e.g.,
"Mostcompetitiveneural" instead of "Most competitive neural"), the quality gate
detects this via 3-layer checks (space density, long-word detection,
longest-word ratio) and falls back to PdfPig's original text.
Key changes:
- Add IsReconstructionQualityAcceptable() with adaptive wordGapScale retry
- Add FallbackText field on SourceDocumentBlock/DocumentBlockIr for retry
- Add TryPrepareFallbackText() for full re-protection on fallback text
- Add PdfExportCheckpointTextResolver for source fallback rendering in PDF
- Add word annotation system for difficult words in source-fallback blocks
- Add Page2 integration tests using 1706.03762v7.pdf fixture
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
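A three-layer quality gate of the kind described above could be sketched as follows. This Python illustration is not IsReconstructionQualityAcceptable() itself: all three thresholds and the minimum length of 20 characters are assumptions chosen for the example, and the adaptive wordGapScale retry is not shown.

```python
def is_reconstruction_quality_acceptable(
    text,
    min_space_density=0.08,      # assumed threshold
    max_word_length=25,          # assumed threshold
    max_longest_word_ratio=0.5,  # assumed threshold
):
    """Heuristically detect reconstructed text whose words were merged."""
    stripped = text.strip()
    if len(stripped) < 20:
        return True  # too short to judge reliably

    words = stripped.split()
    longest = max(len(word) for word in words)

    # Layer 1: merged words lower the share of spaces in the text.
    if stripped.count(" ") / len(stripped) < min_space_density:
        return False
    # Layer 2: a single implausibly long word.
    if longest > max_word_length:
        return False
    # Layer 3: one word dominating the whole block.
    if longest / len(stripped) > max_longest_word_ratio:
        return False
    return True
```

When the gate rejects a block, the caller would switch to the parser's original text (the FallbackText field) and re-run protection on it.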
…ectionTests
- Add ScrollHelper with ScrollToPercent and ScrollToFind (percentage-based
scanning with incremental search) to replace fragile Mouse.Scroll calls
- Add PopButtonSelectionFixture (IClassFixture) so Easydict + Notepad launch
once instead of 8 times (once per test method)
- Update DarkModeTests, SettingsPageTests, SettingsPageScrollTests to use
ScrollHelper instead of private scroll methods
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port
Integrate FontFitSolver into MuPdfExportService to auto-shrink and truncate
translated text blocks when they overflow their bounding boxes. Use per-line
render rects derived from source baseline positions so translated text follows
the original layout geometry.
Key changes:
- FontFitRequest: add MaxLineCount and MaxHeight constraints for line-width mode
- FontFitSolver: enforce MaxHeight and MaxLineCount in line-width fits
- MuPdfExportService: replace monolithic per-block rendering with
PrepareBlockForRendering (baseline → line rects) + SolveFontFit + per-line
AppendLineTextOperations pipeline
- Add erase padding and per-line background erase rects to prevent source text
bleed-through
- Return structured PageRenderResult / BlockTextRenderResult with
shrink/truncate diagnostics and per-page metrics
- TranslatedBlockData: add ChunkIndex, PageNumber, SourceBlockId, SourceText,
RenderLineRects, BackgroundLineRects fields
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
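The shrink-then-truncate behavior under MaxLineCount and MaxHeight constraints can be illustrated with a toy solver. This Python sketch is not FontFitSolver: it assumes every character is 0.5 × font size wide, a line height of 1.2 × font size, a fixed line width, and a 0.5 pt shrink step, none of which come from the PR.

```python
import math
from dataclasses import dataclass


@dataclass
class FitResult:
    font_size: float
    line_count: int
    truncated: bool


def solve_font_fit(char_count, line_width, max_line_count, max_height,
                   start_size=12.0, min_size=6.0, step=0.5,
                   char_width_factor=0.5, line_height_factor=1.2):
    """Shrink the font until the text fits; truncate if it never does."""
    size = start_size
    while True:
        chars_per_line = max(1, int(line_width / (size * char_width_factor)))
        needed = math.ceil(char_count / chars_per_line)
        # Both constraints cap the number of lines that may be rendered.
        allowed = min(max_line_count, int(max_height / (size * line_height_factor)))
        if needed <= allowed:
            return FitResult(size, needed, False)
        if size - step < min_size:
            # Cannot shrink further: render the allowed lines, drop the rest.
            return FitResult(size, max(allowed, 0), True)
        size -= step
```

Returning the truncated flag alongside the final size mirrors the idea of structured shrink/truncate diagnostics, so the caller can report which blocks lost text.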
… export
When retried or fallback blocks grow beyond their original bounding box, the
new page-level layout planner detects overlap and pushes neighboring blocks
downward. Also propagates RetryCount through the checkpoint pipeline and
extracts shared rect test helpers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
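The push-down step of such a layout planner can be sketched in a few lines. This Python illustration makes several assumptions that the commit does not state: a single column (horizontal extent is ignored), a y axis that grows downward, a (block_id, top, height) tuple per block, and a fixed 2-unit gap between blocks.

```python
def push_down_overlaps(blocks, gap=2.0):
    """Resolve vertical overlaps by moving later blocks downward.

    blocks: list of (block_id, top, height) with y growing downward.
    Returns {block_id: new_top}.
    """
    placed = {}
    bottom = None  # bottom edge of the previously placed block
    for block_id, top, height in sorted(blocks, key=lambda b: b[1]):
        if bottom is not None and top < bottom + gap:
            top = bottom + gap  # overlap (or too close): push below the neighbor
        placed[block_id] = top
        bottom = top + height
    return placed
```

Because each block is placed relative to the already adjusted block above it, one grown block can cascade: pushing its neighbor down may in turn push the next one.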
Summary
This PR consolidates formula detection, protection, restoration, and fallback logic into a new IContentPreservationService interface and FormulaPreservationService implementation. This refactoring extracts scattered heuristics from LongDocumentTranslationService into a reusable, testable service that separates evidence-based detection from policy decisions.
Key Changes
New Content Preservation Service Layer
- IContentPreservationService interface with three core operations: Analyze(), Protect(), and Restore()
- FormulaPreservationService consolidating all formula detection heuristics (block type, font-based, character-based, subscript density)
- BlockContext and ProtectionPlan models to decouple evidence from policy
Unified Math Pattern Detection
- MathPatterns static class as single source of truth for math font and Unicode regex patterns
- CharacterParagraphBuilder, LongDocumentTranslationService, and WinUI extraction updated to use shared patterns
Two-Tier Formula Protection
- ProtectTwoTier() method on FormulaProtector for confidence-based output:
  - High confidence: {vN} hard placeholders
  - Low confidence: $...$ inline LaTeX for LLM to decide
- FormulaDetector.IsHighConfidence() to classify token types
Improved Restoration Fallback
- FormulaRestorer with graduated fallback policy: full, partial, or original, based on the fraction of placeholders present
Character-Level Formula Detection
- FormulaLatexReconstructor for reconstructing LaTeX from character-level PDF data
- Character-level protection fields carried through SourceDocumentBlock and IR building
Refactored LongDocumentTranslationService
- BuildIrAsync() and ApplyFormulaProtectionAsync() changed from static to instance methods
- IContentPreservationService dependency
- Delegation through _preservation.Analyze() and _preservation.Protect() calls
Comprehensive Test Coverage
- FormulaPreservationServiceTests covering character-level preference, fallback behavior, and opaque block detection
- FormulaConfidenceTests for two-tier protection validation
- MathPatternsTests verifying math font/Unicode detection and false-positive prevention
- FormulaLatexReconstructorTests for subscript/superscript and Unicode→LaTeX mapping
- CharacterParagraphBuilderTests and FormulaDetectionTests with false-positive prevention cases
Notable Implementation Details
- FormulaProtector.Protect() overload maintains existing behavior; new ProtectTwoTier() is opt-in
- Patterns centralized in MathPatterns to prevent regex drift across multiple files
https://claude.ai/code/session_01SogYdoYtZv6gUQbBSQPoxc