ci: Automated document links and anchors validation #8638

Open
yinggeh wants to merge 2 commits into main from yinggeh/tri-655-validate-links-in-generate_docspy

Conversation

Contributor

@yinggeh yinggeh commented Feb 4, 2026

What does the PR do?

Added functions that validate that links resolve and that the referenced headings/anchors exist in the target Markdown document. The generate-html-documentation job fails if any invalid link or anchor is found.
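
As a rough illustration of what anchor validation against a Markdown document can look like, here is a minimal sketch. It is not the code in this PR: `heading_slugs` and `anchor_exists` are hypothetical names, and the slug rule is a simplified approximation of GitHub's heading anchors (lowercase, drop punctuation, spaces to hyphens, numeric suffix for repeated headings).

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker


def heading_slugs(markdown_text):
    """Collect approximate GitHub-style anchors for ATX (#-style) headings."""
    slugs = set()
    seen = {}
    in_fence = False
    for line in markdown_text.splitlines():
        # Skip fenced code blocks so "# comment" lines are not treated as headings
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = re.match(r"^\s{0,3}#{1,6}\s+(.*?)\s*#*\s*$", line)
        if not match:
            continue
        text = match.group(1).strip().lower()
        # Simplified rule: keep word characters, hyphens, and spaces only
        slug = re.sub(r"[^\w\- ]", "", text).replace(" ", "-")
        count = seen.get(slug, 0)
        seen[slug] = count + 1
        # Repeated headings get a numeric suffix: slug, slug-1, slug-2, ...
        slugs.add(slug if count == 0 else f"{slug}-{count}")
    return slugs


def anchor_exists(anchor, markdown_text):
    """Return True if the anchor matches one of the approximate heading slugs."""
    return anchor.lstrip("#").lower() in heading_slugs(markdown_text)
```

A real implementation would likely need to handle more cases (Setext headings, inline formatting, explicit HTML anchors), which this sketch deliberately ignores.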

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • ci

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@yinggeh yinggeh requested review from mc-nv, pskiran1 and whoisj February 4, 2026 23:52
@yinggeh yinggeh self-assigned this Feb 4, 2026
@yinggeh yinggeh added the PR: ci Changes to our CI configuration files and scripts label Feb 4, 2026
Returns:
Raw GitHub URL string, or None if conversion is not applicable
"""
if not "github.com/" in url:

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High documentation

The string github.com/ may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI 9 days ago

In general, the fix is to stop treating the URL as an arbitrary string and instead parse it with urllib.parse.urlparse, then inspect the hostname to decide whether it is a GitHub URL. This ensures that only URLs whose actual host is github.com (or an explicitly allowed subdomain, if desired) are treated as GitHub URLs, and prevents bypasses where github.com/ appears in the path or query of some other domain.

For this specific function, the safest behavior that preserves the current intent is: if the URL’s hostname is not github.com, return None immediately. We do not need to allow arbitrary subdomains; the function’s documentation describes conversion of normal GitHub repository URLs, so restricting to github.com is appropriate. Concretely, we should:

  1. Import urllib.parse at the top of the file (alongside the existing urllib imports).

  2. Replace the line if not "github.com/" in url: with parsing logic:

    parsed = urllib.parse.urlparse(url)
    if parsed.hostname != "github.com":
        return None

    This both eliminates the substring check and correctly handles URLs like https://github.com/... regardless of path, while rejecting https://evil.com/github.com/....

No other behavior in _get_github_raw_url needs to change: once we know the host is github.com, the existing .replace("github.com", "raw.githubusercontent.com") and subsequent logic work as before. The only new requirement is the additional urllib.parse import, which is standard-library and does not add external dependencies.
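
As a minimal sketch of the difference described above, the snippet below compares a substring test with a hostname test. The helper names `substring_check` and `hostname_check` are hypothetical and are not part of docs/generate_docs.py; the example URLs are illustrative only.

```python
from urllib.parse import urlparse


def substring_check(url):
    """The original style of check, shown only for comparison."""
    return "github.com/" in url


def hostname_check(url):
    """Return True only when the parsed host is exactly github.com."""
    return urlparse(url).hostname == "github.com"


if __name__ == "__main__":
    for candidate in (
        "https://github.com/triton-inference-server/server",
        "https://evil.com/github.com/x",
        "github.com/triton-inference-server/server",
    ):
        print(candidate, substring_check(candidate), hostname_check(candidate))
```

Note that `urlparse` leaves `hostname` as `None` when the URL has no scheme (for example `github.com/org/repo`), so a strict hostname comparison also rejects such input. Whether that matters depends on how links are written in the documents being validated.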

Suggested changeset 1
docs/generate_docs.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/docs/generate_docs.py b/docs/generate_docs.py
--- a/docs/generate_docs.py
+++ b/docs/generate_docs.py
@@ -35,6 +35,7 @@
 import time
 import urllib.error
 import urllib.request
+import urllib.parse
 from enum import Enum
 from logging.handlers import RotatingFileHandler
 from typing import Dict, List, Optional, Tuple, Union
@@ -503,7 +504,8 @@
     Returns:
         Raw GitHub URL string, or None if conversion is not applicable
     """
-    if not "github.com/" in url:
+    parsed = urllib.parse.urlparse(url)
+    if parsed.hostname != "github.com":
         return None
 
     # Case: https://github.com/triton-inference-server/server#triton-inference-server
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
Contributor

agreed, this should probably be r"https?://(?:www\.)?github\.com"


# Validate anchor if present (for GitHub markdown files where we can reliably validate)
# Other websites use various anchor generation schemes that are hard to predict
if anchor and raw_url_content and "github.com" in base_link:

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High documentation

The string github.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI 10 days ago

In general, to fix incomplete URL substring sanitization, the code should parse the URL (e.g., with urllib.parse.urlparse) and inspect the hostname or netloc fields instead of searching for a host substring in the entire URL string. This avoids matching an allowed hostname that only appears in the path, query, fragment, or inside another hostname.

In this file, the problematic logic is:

if anchor and raw_url_content and "github.com" in base_link:

We should instead parse base_link with urllib.parse.urlparse and check that the hostname is exactly github.com, or possibly also a related host such as raw.githubusercontent.com if that’s desired. To make a minimal, behavior-preserving change, we can restrict ourselves to hostname == "github.com" and retain the rest of the logic unchanged. Concretely:

  1. Add an import for urllib.parse near the existing urllib imports.

  2. Replace the substring check with a host-based check:

    from urllib.parse import urlparse  # at top
    
    ...
    
    if anchor and raw_url_content:
        parsed = urlparse(base_link)
        if parsed.hostname == "github.com":
            # existing GitHub-specific logic

    To avoid changing indentation and control flow more than necessary, we can more simply compute a boolean like is_github_host = parsed.hostname == "github.com" and use it in place of "github.com" in base_link.

No other functions need to be modified, and no new project-level configuration is required.
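
The boolean variant mentioned above could look roughly like the following sketch. `should_validate_anchor` is a hypothetical wrapper introduced only to keep the example self-contained; in the actual file the condition sits inline in an existing function.

```python
from urllib.parse import urlparse


def should_validate_anchor(anchor, raw_url_content, base_link):
    """Decide whether the GitHub-specific anchor validation applies."""
    # Compare the parsed host instead of searching the whole URL string
    is_github_host = urlparse(base_link).hostname == "github.com"
    return bool(anchor and raw_url_content and is_github_host)
```

Computing the flag first keeps the original single `if` and its indentation, so the rest of the block would not need to be re-nested.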


Suggested changeset 1
docs/generate_docs.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/docs/generate_docs.py b/docs/generate_docs.py
--- a/docs/generate_docs.py
+++ b/docs/generate_docs.py
@@ -35,6 +35,7 @@
 import time
 import urllib.error
 import urllib.request
+from urllib.parse import urlparse
 from enum import Enum
 from logging.handlers import RotatingFileHandler
 from typing import Dict, List, Optional, Tuple, Union
@@ -816,15 +817,17 @@
 
         # Validate anchor if present (for GitHub markdown files where we can reliably validate)
         # Other websites use various anchor generation schemes that are hard to predict
-        if anchor and raw_url_content and "github.com" in base_link:
-            # Check if it's a markdown file or a directory (which would have README.md)
-            is_markdown = base_link.endswith(".md") or (
-                anchor and not _is_file_url(base_link)
-            )
-            if is_markdown:
-                return _validate_anchor_in_content(
-                    anchor, raw_url_content, is_markdown=True, slugger=slugger
+        if anchor and raw_url_content:
+            parsed_link = urlparse(base_link)
+            if parsed_link.hostname == "github.com":
+                # Check if it's a markdown file or a directory (which would have README.md)
+                is_markdown = base_link.endswith(".md") or (
+                    anchor and not _is_file_url(base_link)
                 )
+                if is_markdown:
+                    return _validate_anchor_in_content(
+                        anchor, raw_url_content, is_markdown=True, slugger=slugger
+                    )
 
         # For other URLs with anchors, consider valid if URL exists
         # (we can't reliably validate anchors on non-GitHub sites)
EOF
Contributor

again r"https?://(?:www\.)?github\.com"

Contributor Author

I can change this to a regex match if that is the intention. But a URL may begin with github.com, without the r"https?" prefix.

Contributor

then r"(?:https?)?://(?:www\.)?github\.com" would work, no?
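
To make the trade-off in this thread concrete, here is a small sketch that tries the pattern above next to a parse-based check. `matches_proposed` and `host_is_github` are hypothetical helpers, applying the pattern with `re.match` (anchored at the start of the string) is an assumption about how it would be used, and the `.example` hosts are made up. As written, the pattern still appears to require a literal `://`, and it has no boundary after `github\.com`.

```python
import re
from urllib.parse import urlparse

# Pattern quoted from the review comment above
PROPOSED = re.compile(r"(?:https?)?://(?:www\.)?github\.com")


def matches_proposed(url):
    """Apply the proposed pattern at the start of the URL."""
    return PROPOSED.match(url) is not None


def host_is_github(url):
    """Parse-based alternative that also tolerates scheme-less input."""
    # Prepend "//" so urlparse treats scheme-less input as host + path
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlparse(url).hostname or ""
    return host in {"github.com", "www.github.com"}


if __name__ == "__main__":
    for candidate in (
        "https://github.com/org/repo",
        "github.com/org/repo",
        "https://github.com.evil.example/org/repo",
    ):
        print(candidate, matches_proposed(candidate), host_is_github(candidate))
```

If a regex is preferred, making the `://` part optional together with the scheme and adding a delimiter after the host (such as `/`, `#`, `?`, `:`, or end of string) would be one way to close both gaps.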

@yinggeh yinggeh changed the title from "ci: Automated document link validation" to "ci: Automated document links and anchors validation" Feb 4, 2026
README.md
examples/README.md
user_guide/perf_analyzer.md
model_navigator/CHANGELOG.md
Contributor

Model Navigator is a stalled repository; as far as I know, we have not been using it for quite a while. Has something changed?
cc: @whoisj

Contributor Author

Contributor

Maybe we should remove it from the documentation?

Contributor Author

@yinggeh yinggeh Feb 5, 2026

Let's change the repository visibility to private then?

Contributor

seems like a reasonable approach.

@yinggeh yinggeh force-pushed the yinggeh/tri-655-validate-links-in-generate_docspy branch from fdc2749 to 5ffecd3 February 5, 2026 00:12
Labels

PR: ci Changes to our CI configuration files and scripts

3 participants