ci: Automated document links and anchors validation by yinggeh · Pull Request #8638 · triton-inference-server/server

yinggeh · 2026-02-04T23:52:47Z

What does the PR do?

Added functions to validate that links are valid and that headings/anchors are present in the target markdown document. The generate-html-documentation job fails if any invalid link or missing anchor is found.

Checklist

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

ci

Related PRs:

Where should the reviewer start?

Test plan:

CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

whoisj · 2026-02-05T00:27:16Z

docs/generate_docs.py

+    Returns:
+        Raw GitHub URL string, or None if conversion is not applicable
+    """
+    if not "github.com/" in url:


In general, the fix is to stop treating the URL as an arbitrary string and instead parse it with urllib.parse.urlparse, then inspect the hostname to decide whether it is a GitHub URL. This ensures that only URLs whose actual host is github.com (or an explicitly allowed subdomain, if desired) are treated as GitHub URLs, and prevents bypasses where github.com/ appears in the path or query of some other domain.

For this specific function, the safest behavior that preserves the current intent is: if the URL’s hostname is not github.com, return None immediately. We do not need to allow arbitrary subdomains; the function’s documentation describes conversion of normal GitHub repository URLs, so restricting to github.com is appropriate. Concretely, we should:

Import urllib.parse at the top of the file (alongside the existing urllib imports).

Replace the line if not "github.com/" in url: with parsing logic:

parsed = urllib.parse.urlparse(url)
if parsed.hostname != "github.com":
    return None

This both eliminates the substring check and correctly handles URLs like https://github.com/... regardless of path, while rejecting https://evil.com/github.com/....

No other behavior in _get_github_raw_url needs to change: once we know the host is github.com, the existing .replace("github.com", "raw.githubusercontent.com") and subsequent logic work as before. The only new requirement is the additional urllib.parse import, which is standard-library and does not add external dependencies.
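The difference between the substring test and the hostname test can be seen in a small sketch. The helper names `is_github_substring` and `is_github_host` are hypothetical and are not part of generate_docs.py; only `urllib.parse.urlparse` is taken from the suggestion above.

```python
from urllib.parse import urlparse


def is_github_substring(url: str) -> bool:
    # Original style of check: matches "github.com/" anywhere in the string.
    return "github.com/" in url


def is_github_host(url: str) -> bool:
    # Host-based check: only the parsed hostname is compared.
    return urlparse(url).hostname == "github.com"


urls = [
    "https://github.com/triton-inference-server/server",
    "https://evil.com/github.com/triton-inference-server/server",
    "https://evil.com/?next=github.com/x",
]
for u in urls:
    print(u, is_github_substring(u), is_github_host(u))
```

The substring check accepts all three URLs, while the hostname check accepts only the first one.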

agreed, this should probably be r"https?://(?:www\.)?github\.com"

whoisj · 2026-02-05T00:28:08Z

docs/generate_docs.py

+
+        # Validate anchor if present (for GitHub markdown files where we can reliably validate)
+        # Other websites use various anchor generation schemes that are hard to predict
+        if anchor and raw_url_content and "github.com" in base_link:


In general, to fix incomplete URL substring sanitization, the code should parse the URL (e.g., with urllib.parse.urlparse) and inspect the hostname or netloc fields instead of searching for a host substring in the entire URL string. This avoids matching an allowed hostname that only appears in the path, query, fragment, or inside another hostname.

In this file, the problematic logic is:

if anchor and raw_url_content and "github.com" in base_link:

We should instead parse base_link with urllib.parse.urlparse and check that the hostname is exactly github.com, or possibly also a related host such as raw.githubusercontent.com if that’s desired. To make a minimal, behavior-preserving change, we can restrict ourselves to hostname == "github.com" and retain the rest of the logic unchanged. Concretely:

Add an import for urllib.parse near the existing urllib imports.

Replace the substring check with a host-based check:

from urllib.parse import urlparse  # at top
...
if anchor and raw_url_content:
    parsed = urlparse(base_link)
    if parsed.hostname == "github.com":
        # existing GitHub-specific logic

To avoid changing indentation and control flow more than necessary, we can more simply compute a boolean like is_github_host = parsed.hostname == "github.com" and use it in place of "github.com" in base_link.

No other functions need to be modified, and no new project-level configuration is required.
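A minimal sketch of the boolean-gate variant described above, assuming the link is split into base and anchor first. The function name `should_validate_anchor` is hypothetical, and `urldefrag` is used here only for illustration; the PR's own splitting logic is not shown in this thread.

```python
from typing import Optional
from urllib.parse import urldefrag, urlparse


def should_validate_anchor(link: str, raw_url_content: Optional[str]) -> bool:
    # Split "https://host/path#anchor" into the base link and the anchor.
    base_link, anchor = urldefrag(link)
    # Compare the parsed hostname instead of searching the whole string.
    is_github_host = urlparse(base_link).hostname == "github.com"
    return bool(anchor and raw_url_content and is_github_host)


print(should_validate_anchor(
    "https://github.com/triton-inference-server/server#triton-inference-server",
    "# Triton Inference Server",
))
```

Keeping the host test in a single boolean leaves the surrounding `if` at its original indentation, which is the smaller change the suggestion mentions.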

again r"https?://(?:www\.)?github\.com"

I can change to a regex match if this is the intention. But a URL may begin with github.com without the r"https?" scheme.

then r"(?:https?)?://(?:www\.)?github\.com" would work, no?
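To compare the options in this thread, the following sketch runs the pattern proposed above, a hypothetical variant, and the hostname check against a few URLs. `VARIANT` is not proposed anywhere in the thread; it is an assumption added for illustration, making the whole `https?://` prefix optional and requiring the host to end at `/` or at the end of the string.

```python
import re
from urllib.parse import urlparse

# Pattern as written in the review comment above.
PROPOSED = re.compile(r"(?:https?)?://(?:www\.)?github\.com")
# Hypothetical variant: optional scheme prefix, host must be followed by "/" or end of string.
VARIANT = re.compile(r"(?:https?://)?(?:www\.)?github\.com(?:/|$)")


def check(url: str):
    # Returns (proposed regex, variant regex, hostname check) for one URL.
    return (
        PROPOSED.match(url) is not None,
        VARIANT.match(url) is not None,
        urlparse(url).hostname == "github.com",
    )


for url in [
    "https://github.com/a",
    "github.com/a",
    "https://github.com.evil.com/a",
    "https://evil.com/github.com/a",
]:
    print(url, check(url))
```

In this sketch the proposed pattern still requires `://`, so it does not match a bare `github.com/a`, and without a boundary after the host it matches `https://github.com.evil.com/a`. The hostname check rejects both of those inputs, since `urlparse` finds no hostname in a scheme-less string.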

mc-nv · 2026-02-05T00:03:14Z

docs/exclusions.txt

 README.md
 examples/README.md
 user_guide/perf_analyzer.md
+model_navigator/CHANGELOG.md


Model Navigator is a stalled repository; as far as I know, we have not been using it for quite a while. Has something changed?
cc: @whoisj

We still have it here https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_navigator/README.html

Maybe we should remove it from the documentation?

Let's change the repository visibility to private then?

seems like a reasonable approach.

docs/Dockerfile.docs

yinggeh requested review from mc-nv, pskiran1 and whoisj February 4, 2026 23:52

yinggeh self-assigned this Feb 4, 2026

yinggeh added the PR: ci Changes to our CI configuration files and scripts label Feb 4, 2026

github-advanced-security bot found potential problems Feb 4, 2026

yinggeh changed the title ~~ci: Automated document link validation~~ ci: Automated document links and anchors validation Feb 4, 2026

mc-nv reviewed Feb 5, 2026

Validate links

5ffecd3

yinggeh force-pushed the yinggeh/tri-655-validate-links-in-generate_docspy branch from fdc2749 to 5ffecd3 Compare February 5, 2026 00:12

whoisj requested changes Feb 5, 2026

Install github-slugger

c1087d1

whoisj reviewed Feb 5, 2026

docs/Dockerfile.docs

@@ -35,6 +35,7 @@
             import time
             import urllib.error
             import urllib.request
+            import urllib.parse
             from enum import Enum
             from logging.handlers import RotatingFileHandler
             from typing import Dict, List, Optional, Tuple, Union
@@ -503,7 +504,8 @@
                 Returns:
                     Raw GitHub URL string, or None if conversion is not applicable
                 """
-                if not "github.com/" in url:
+                parsed = urllib.parse.urlparse(url)
+                if parsed.hostname != "github.com":
                     return None
                 # Case: https://github.com/triton-inference-server/server#triton-inference-server

@@ -35,6 +35,7 @@
             import time
             import urllib.error
             import urllib.request
+            from urllib.parse import urlparse
             from enum import Enum
             from logging.handlers import RotatingFileHandler
             from typing import Dict, List, Optional, Tuple, Union
@@ -816,15 +817,17 @@
                     # Validate anchor if present (for GitHub markdown files where we can reliably validate)
                     # Other websites use various anchor generation schemes that are hard to predict
-                    if anchor and raw_url_content and "github.com" in base_link:
-                        # Check if it's a markdown file or a directory (which would have README.md)
-                        is_markdown = base_link.endswith(".md") or (
-                            anchor and not _is_file_url(base_link)
-                        )
-                        if is_markdown:
-                            return _validate_anchor_in_content(
-                                anchor, raw_url_content, is_markdown=True, slugger=slugger
+                    if anchor and raw_url_content:
+                        parsed_link = urlparse(base_link)
+                        if parsed_link.hostname == "github.com":
+                            # Check if it's a markdown file or a directory (which would have README.md)
+                            is_markdown = base_link.endswith(".md") or (
+                                anchor and not _is_file_url(base_link)
                             )
+                            if is_markdown:
+                                return _validate_anchor_in_content(
+                                    anchor, raw_url_content, is_markdown=True, slugger=slugger
+                                )
                     # For other URLs with anchors, consider valid if URL exists
                     # (we can't reliably validate anchors on non-GitHub sites)
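For readers unfamiliar with anchor validation, here is a rough sketch of what checking an anchor against markdown content can look like. It is not the PR's `_validate_anchor_in_content`, whose body is not shown here, and it does not use the slugger the PR installs. The slug rule below is a simplified approximation of GitHub-style heading slugs, and all function names are hypothetical.

```python
import re
from collections import Counter

FENCE = "`" * 3  # marker that opens or closes a fenced code block


def approx_slug(heading: str, seen: Counter) -> str:
    # Approximation: lowercase, drop punctuation, turn spaces into hyphens.
    slug = re.sub(r"[^\w\- ]", "", heading.strip().lower()).replace(" ", "-")
    # Repeated headings get "-1", "-2", ... suffixes.
    count = seen[slug]
    seen[slug] += 1
    return slug if count == 0 else f"{slug}-{count}"


def anchors_in_markdown(content: str) -> set:
    seen = Counter()
    anchors = set()
    in_fence = False
    for line in content.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip "#" comments inside code blocks
            continue
        if in_fence:
            continue
        match = re.match(r"^#{1,6}\s+(.*?)\s*#*\s*$", line)
        if match:
            anchors.add(approx_slug(match.group(1), seen))
    return anchors


def anchor_exists(anchor: str, content: str) -> bool:
    return anchor.lower() in anchors_in_markdown(content)


doc = "# Triton Inference Server\n## Build\n## Build\n"
print(anchor_exists("triton-inference-server", doc))
```

Real slug generation has more edge cases (emoji, HTML in headings, explicit anchor tags), which is one reason to rely on a dedicated slugger rather than a hand-written rule like this one.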

3 participants