Add an option to produce fine-grained HTML renderers based on CommonMark's grammar #635

Open
herley-shaori wants to merge 29 commits into Witiko:main from herley-shaori:feature/issue-606-parse-html-block-types

herley-shaori commented Mar 11, 2026

Summary

Implements the first task of #606 by exposing CommonMark's HTML block type differentiation through individual renderers.

A new boolean option parseHtmlBlocks (default: false) is added. When enabled alongside the html option, the parser produces type-specific renderers instead of the generic inputBlockHtmlElement and inlineHtmlTag renderers:

Block HTML renderers (each receives a filename of a file containing the HTML block contents):

| Renderer | CommonMark type | Matches |
| --- | --- | --- |
| `inputBlockHtmlCommentElement` | Type 2 | `<!-- ... -->` |
| `inputBlockHtmlInstructionElement` | Type 3 | `<? ... ?>` |
| `inputBlockHtmlDeclarationElement` | Type 4 | `<! ... >` |
| `inputBlockHtmlCdataElement` | Type 5 | `<![CDATA[ ... ]]>` |
| `inputBlockHtmlSpecialElement` | Type 1 | `<script>`, `<pre>`, `<style>`, `<textarea>` |
| `inputBlockHtmlRegularElement` | Type 6 | `<div>`, `<table>`, `<form>`, etc. |
| `inputBlockHtmlAnyElement` | Type 7 | Any other complete tag on its own line |

Inline HTML renderers (each receives the tag contents as a string):

| Renderer | Matches |
| --- | --- |
| `inlineHtmlInstruction` | `<? ... ?>` |
| `inlineHtmlDeclaration` | `<! ... >` |
| `inlineHtmlCdataSection` | `<![CDATA[ ... ]]>` |
| `inlineHtmlOpenTag` | `<tag>` |
| `inlineHtmlCloseTag` | `</tag>` |
| `inlineHtmlEmptyTag` | `<tag/>` |

The existing inlineHtmlComment renderer remains unchanged. When parseHtmlBlocks is false (default), behavior is fully backward compatible.
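
As a rough illustration of the description above, the option might be used from LaTeX as follows. This is only a sketch: it assumes the option and renderer names exactly as proposed in this PR description (`parseHtmlBlocks`, `inlineHtmlOpenTag`, `inputBlockHtmlCommentElement`), which may change during review, and the renderer definitions themselves are illustrative.

```latex
\documentclass{article}
\usepackage{markdown}
\markdownSetup{
  html,             % the new option only takes effect alongside `html`
  parseHtmlBlocks,  % name as proposed in this PR; assumed, not released
  renderers = {
    % Inline renderers receive the tag text as a string.
    inlineHtmlOpenTag = {\texttt{#1}},
    % Block renderers receive the name of a file holding the block's
    % contents; here we only log it instead of typesetting anything.
    inputBlockHtmlCommentElement = {\typeout{Skipping HTML comment in #1}},
  },
}
\begin{document}
\begin{markdown}
Some <b>bold</b> text.

<!-- A block-level comment. -->
\end{markdown}
\end{document}
```

Renderers that are not redefined keep whatever their prototypes do, so a document can opt into handling just one or two of the new types.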

Changes

  • markdown.dtx: Added parseHtmlBlocks option definition, 13 new renderer registrations, 13 new Lua writer functions, conditional parser routing in DisplayHtml and InlineHtml, and documentation for all new renderers.
  • tests/support/keyval-setup.tex: Added test renderer prototypes for all 13 new renderers.
  • tests/testfiles/regression/github/issue-606-block-html-types.test: New regression test verifying type-specific renderers are produced when parseHtmlBlocks is enabled.

Test plan

  • New regression test passes across all non-ConTeXt TeX formats (luatex, lualatex, pdflatex, pdftex)
  • All 44 existing CommonMark_0.30/html_blocks tests pass (backward compatibility)
  • Both existing CommonMark_0.31.2/raw_html tests pass (backward compatibility)
  • ConTeXt tests (skipped locally due to missing rename utility — CI should cover this)

Note

This PR addresses Task 1 of #606. Task 2 (renderers corresponding to HTML nodes) would require a more substantial HTML parser and is left for a follow-up.

Continues #606.

herley and others added 2 commits March 11, 2026 18:15
…TML types

Implement the first task of issue Witiko#606: expose CommonMark's HTML block
type differentiation through individual renderers.

When the new `parseHtmlBlocks` option is enabled (default: false), the
parser produces type-specific renderers instead of the generic
`inputBlockHtmlElement` and `inlineHtmlTag` renderers:

Block HTML renderers (by CommonMark type):
- inputBlockHtmlCommentElement (type 2: HTML comments)
- inputBlockHtmlInstructionElement (type 3: processing instructions)
- inputBlockHtmlDeclarationElement (type 4: declarations)
- inputBlockHtmlCdataElement (type 5: CDATA sections)
- inputBlockHtmlSpecialElement (type 1: script/pre/style/textarea)
- inputBlockHtmlRegularElement (type 6: div/table/form etc.)
- inputBlockHtmlAnyElement (type 7: any other complete tag)

Inline HTML renderers:
- inlineHtmlInstruction (processing instructions)
- inlineHtmlDeclaration (declarations)
- inlineHtmlCdataSection (CDATA sections)
- inlineHtmlOpenTag (opening tags)
- inlineHtmlCloseTag (closing tags)
- inlineHtmlEmptyTag (self-closing tags)

The existing inlineHtmlComment renderer remains unchanged.
When parseHtmlBlocks is false (default), behavior is fully backward
compatible.

Closes Witiko#606 (task 1)

Witiko commented Mar 11, 2026

Hi @herley-shaori, this looks great, especially for a first-time contribution. Thanks for putting in the effort!

Below, I reviewed the code. If you'd like to make the necessary changes yourself, that would be great; otherwise, I'm happy to take over the PR from here.

Witiko added the "lua" (Related to the Lua interface and implementation) and "conversion output" (Related to the output format of the Markdown-to-TeX conversion) labels Mar 11, 2026
…option

Apply all changes requested in the code review:

- Rename `parseHtmlBlocks` (boolean) option to `htmlOutput` (string) with
  values `basic` (default) and `commonmark`, making the design more
  future-proof for potential additional values like `nodes`.
- Simplify option documentation to not list individual renderers (consistent
  with other option descriptions in the codebase).
- Fix documentation markup: replace `\Mdef` with `\mref` for cross-references
  to renderers defined elsewhere.
- Add `[raw-html]` link reference for inline HTML construct types.
- Fix Lua code indentation in InlineHtml and DisplayHtml parser sections.
- Remove unnecessary `if: format != 'context'` condition from test file.
- Update .gitignore entry from `venv/` to `tests/test-virtualenv`.
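
For readers following the rename, the change in user-facing setup could look roughly like this. It is a sketch based only on the commit message above; the option names and values (`parseHtmlBlocks`, `htmlOutput`, `basic`, `commonmark`) are taken from this PR and are not part of a released version.

```latex
% Before the review: boolean option, as first proposed in this PR.
% \markdownSetup{html, parseHtmlBlocks}

% After the review: string option with named values.
\markdownSetup{
  html,
  htmlOutput = commonmark, % `basic` (the default) keeps the generic renderers
}
```

A string-valued option leaves room for further values such as the `nodes` value mentioned in the commit message, without adding another boolean.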

Witiko commented Mar 11, 2026

Tasks for myself in addition to the comments from the code review:

  • Set htmlOutput = "commonmark" when the experimental option has been enabled.
  • Improve the naming of writers: why block_html_comment_element when a comment is an HTML node, not an element?
    • The same goes for renderers and renderer prototypes: why \markdownRendererInputBlockHtmlCommentElement rather than just \markdownRendererInputBlockHtmlComment?
  • Add a code example redefining \markdownRendererInputBlockHtmlComment to the user manual, similar to the existing code example for \markdownRendererInlineComment.
  • Update CHANGES.md.

herley-shaori (Author) commented

Hi @Witiko, thanks so much for taking the time to review this! Your feedback is really valuable. I realize I still have a lot to learn — I'll step back and leave this PR to you. If there's anything I can do to help in the future, happy to contribute! 😊

herley and others added 3 commits March 12, 2026 07:07
- Add \mref{markdownRendererInlineHtmlComment} to the `basic` option
  description for completeness.
- Add [html-blocks] and [raw-html] link references to the option
  documentation fragment (each \begin{markdown} fragment needs its own
  link references).
- Split renderer documentation into separate sections:
  - Rename "HTML Tag and Element Renderers" to "Basic HTML Tag and
    Element Renderers" for the generic renderers.
  - Add "CommonMark Block HTML Element Renderers" section for
    type-specific block renderers.
  - Add "CommonMark Inline HTML Renderers" section for type-specific
    inline renderers.
- Fix Lua code style: add space after Cs( in else branches of
  InlineHtml and DisplayHtml parsers for consistency.
Witiko previously approved these changes Mar 17, 2026
Witiko requested a review from lostenderman March 17, 2026 13:39
Move issue-606-block-html-types.test from regression/github/ to
unit/lunamark-markdown/html-output-commonmark.test as requested
in review feedback.
Witiko changed the title from "Add parseHtmlBlocks option for type-specific HTML renderers" to "Add htmlOutput option to produce fine-grained HTML renderers based on CommonMark's grammar" Mar 18, 2026
Witiko changed the title from "Add htmlOutput option to produce fine-grained HTML renderers based on CommonMark's grammar" to "Add an option to produce fine-grained HTML renderers based on CommonMark's grammar" Mar 18, 2026
Witiko added this to the 3.15.0 milestone Mar 18, 2026

Witiko commented Mar 23, 2026

Everything seems OK. I will hold off on merging until after the v3.14.1 bugfix release, likely at the end of this week, so that we don't ship new features with it.

lostenderman self-requested a review March 29, 2026 17:01

Witiko left a comment

@herley-shaori: I found a few other issues that still need to be addressed. If you have the time and would like to, please feel free to take care of them; otherwise, I can finish them myself.

Comment on lines +16184 to +16185
containing the corresponding HTML text. Their prototypes fall back on
\mref{markdownRendererInputBlockHtmlElementPrototype}.

Witiko commented Mar 29, 2026

I don't think this is currently implemented, i.e. the prototypes produce an empty expansion at the moment.

Default definitions that fall back onto \markdownRendererInputBlockHtmlElementPrototype should be added to the section ### Token Renderer Prototypes {#tex-token-renderer-prototypes}.

Here is an easy way to verify that this was implemented correctly: after this change, the test file no-html-output.test should pass even when htmlOutput = commonmark is added to the \markdownSetup command at the top of the file and all newly added definitions are removed from keyval-setup.tex.

Witiko commented Apr 2, 2026

Furthermore, blockHtmlStandaloneTag can't really fall back onto the inputBlockHtmlElement due to the argument mismatch. Instead, it may need to fall back onto e.g. the inlineHtmlTag prototype.

Comment on lines +16347 to +16348
Each of these macros receives a single argument with the HTML text. Their
prototypes fall back on \mref{markdownRendererInlineHtmlTagPrototype}.

Witiko commented Mar 29, 2026

I don't think this is currently implemented, i.e. the prototypes produce an empty expansion at the moment.

Default definitions that fall back onto \markdownRendererInlineHtmlTagPrototype should be added to the section ### Token Renderer Prototypes {#tex-token-renderer-prototypes}.

Here is an easy way to verify that this was implemented correctly: after this change, the test file no-html-output.test should pass even when htmlOutput = commonmark is added to the \markdownSetup command at the top of the file and all newly added definitions are removed from keyval-setup.tex.
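
The fallback definitions requested in the two review threads above could be sketched along these lines. This is a hedged illustration, not the code that ended up in `markdown.dtx`: it assumes the renderer names from the PR description (which the review proposes renaming) and uses the `rendererPrototypes` key of `\markdownSetup` to forward each new prototype to the generic one.

```latex
\markdownSetup{
  rendererPrototypes = {
    % Block renderers receive a filename, just like the generic block
    % renderer, so they can forward their argument unchanged.
    inputBlockHtmlCommentElement =
      {\markdownRendererInputBlockHtmlElementPrototype{#1}},
    inputBlockHtmlSpecialElement =
      {\markdownRendererInputBlockHtmlElementPrototype{#1}},
    inputBlockHtmlRegularElement =
      {\markdownRendererInputBlockHtmlElementPrototype{#1}},
    % Inline renderers receive the HTML text as a string, matching the
    % generic inline tag renderer.
    inlineHtmlOpenTag  = {\markdownRendererInlineHtmlTagPrototype{#1}},
    inlineHtmlCloseTag = {\markdownRendererInlineHtmlTagPrototype{#1}},
    inlineHtmlEmptyTag = {\markdownRendererInlineHtmlTagPrototype{#1}},
    % The remaining block and inline types would follow the same pattern.
    % Per the Apr 2 comment, the standalone-tag block renderer is an
    % exception: its argument does not match the generic block renderer.
  },
}
```

With such defaults in place, a document that enables the new output mode but defines none of the new renderers would behave as it does with the generic renderers, which is what the `no-html-output.test` check described above is meant to confirm.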

\markdownSetup{
htmlOutput = commonmark,
renderers = {
inputBlockHtmlComment = {\marginnote{\input{#1}}},

Witiko commented Apr 2, 2026

This still won't work, since we don't strip the leading <!-- and the trailing -->. For consistency with the inlineHtmlComment renderer, we should likely strip these, parse the content and provide it directly, not through a separate file, falling back on the inlineHtmlComment renderer prototype instead of inputBlockHtmlElement.

Then, this example should be updated as follows:

Suggested change
- inputBlockHtmlComment = {\marginnote{\input{#1}}},
+ blockHtmlComment = \marginnote{#1},
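
Put into a complete document, the suggested form might read as follows. This is a sketch under the assumptions of the comment above: `blockHtmlComment` is the proposed renderer name and, as of this comment, the behaviour it relies on (comment delimiters stripped, content passed directly as the argument) is not yet implemented. `\marginnote` comes from the `marginnote` package.

```latex
\documentclass{article}
\usepackage{markdown}
\usepackage{marginnote} % provides \marginnote
\markdownSetup{
  htmlOutput = commonmark,
  renderers = {
    % Assumed behaviour: #1 is the parsed comment text, without the
    % leading <!-- and the trailing -->.
    blockHtmlComment = \marginnote{#1},
  },
}
\begin{document}
\begin{markdown}
Some text.

<!-- A note for the margin. -->
\end{markdown}
\end{document}
```

Passing the content directly rather than through a file keeps the block renderer consistent with the existing `inlineHtmlComment` renderer, which is the motivation given in the comment.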
