ci: Automated document links and anchors validation#8638
ci: Automated document links and anchors validation#8638
Conversation
| Returns: | ||
| Raw GitHub URL string, or None if conversion is not applicable | ||
| """ | ||
| if not "github.com/" in url: |
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High documentation
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 9 days ago
In general, the fix is to stop treating the URL as an arbitrary string and instead parse it with urllib.parse.urlparse, then inspect the hostname to decide whether it is a GitHub URL. This ensures that only URLs whose actual host is github.com (or an explicitly allowed subdomain, if desired) are treated as GitHub URLs, and prevents bypasses where github.com/ appears in the path or query of some other domain.
For this specific function, the safest behavior that preserves the current intent is: if the URL’s hostname is not github.com, return None immediately. We do not need to allow arbitrary subdomains; the function’s documentation describes conversion of normal GitHub repository URLs, so restricting to github.com is appropriate. Concretely, we should:
-
Import
urllib.parseat the top of the file (alongside the existingurllibimports). -
Replace the line
if not "github.com/" in url:with parsing logic:parsed = urllib.parse.urlparse(url) if parsed.hostname != "github.com": return None
This both eliminates the substring check and correctly handles URLs like
https://github.com/...regardless of path, while rejectinghttps://evil.com/github.com/....
No other behavior in _get_github_raw_url needs to change: once we know the host is github.com, the existing .replace("github.com", "raw.githubusercontent.com") and subsequent logic work as before. The only new requirement is the additional urllib.parse import, which is standard-library and does not add external dependencies.
| @@ -35,6 +35,7 @@ | ||
| import time | ||
| import urllib.error | ||
| import urllib.request | ||
| import urllib.parse | ||
| from enum import Enum | ||
| from logging.handlers import RotatingFileHandler | ||
| from typing import Dict, List, Optional, Tuple, Union | ||
| @@ -503,7 +504,8 @@ | ||
| Returns: | ||
| Raw GitHub URL string, or None if conversion is not applicable | ||
| """ | ||
| if not "github.com/" in url: | ||
| parsed = urllib.parse.urlparse(url) | ||
| if parsed.hostname != "github.com": | ||
| return None | ||
|
|
||
| # Case: https://github.com/triton-inference-server/server#triton-inference-server |
There was a problem hiding this comment.
agreed, this should probably be r"https?://(?:www\.)?github\.com"
|
|
||
| # Validate anchor if present (for GitHub markdown files where we can reliably validate) | ||
| # Other websites use various anchor generation schemes that are hard to predict | ||
| if anchor and raw_url_content and "github.com" in base_link: |
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High documentation
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 10 days ago
In general, to fix incomplete URL substring sanitization, the code should parse the URL (e.g., with urllib.parse.urlparse) and inspect the hostname or netloc fields instead of searching for a host substring in the entire URL string. This avoids matching an allowed hostname that only appears in the path, query, fragment, or inside another hostname.
In this file, the problematic logic is:
if anchor and raw_url_content and "github.com" in base_link:We should instead parse base_link with urllib.parse.urlparse and check that the hostname is exactly github.com or possibly a subdomain such as raw.githubusercontent.com if that’s desired. To make a minimal, behavior-preserving change, we can restrict ourselves to hostname == "github.com" and retain the rest of the logic unchanged. Concretely:
-
Add an import for
urllib.parsenear the existingurllibimports. -
Replace the substring check with a host-based check:
from urllib.parse import urlparse # at top ... if anchor and raw_url_content: parsed = urlparse(base_link) if parsed.hostname == "github.com": # existing GitHub-specific logic
To avoid changing indentation and control flow more than necessary, we can more simply compute a boolean like
is_github_host = parsed.hostname == "github.com"and use it in place of"github.com" in base_link.
No other functions need to be modified, and no new project-level configuration is required.
| @@ -35,6 +35,7 @@ | ||
| import time | ||
| import urllib.error | ||
| import urllib.request | ||
| from urllib.parse import urlparse | ||
| from enum import Enum | ||
| from logging.handlers import RotatingFileHandler | ||
| from typing import Dict, List, Optional, Tuple, Union | ||
| @@ -816,15 +817,17 @@ | ||
|
|
||
| # Validate anchor if present (for GitHub markdown files where we can reliably validate) | ||
| # Other websites use various anchor generation schemes that are hard to predict | ||
| if anchor and raw_url_content and "github.com" in base_link: | ||
| # Check if it's a markdown file or a directory (which would have README.md) | ||
| is_markdown = base_link.endswith(".md") or ( | ||
| anchor and not _is_file_url(base_link) | ||
| ) | ||
| if is_markdown: | ||
| return _validate_anchor_in_content( | ||
| anchor, raw_url_content, is_markdown=True, slugger=slugger | ||
| if anchor and raw_url_content: | ||
| parsed_link = urlparse(base_link) | ||
| if parsed_link.hostname == "github.com": | ||
| # Check if it's a markdown file or a directory (which would have README.md) | ||
| is_markdown = base_link.endswith(".md") or ( | ||
| anchor and not _is_file_url(base_link) | ||
| ) | ||
| if is_markdown: | ||
| return _validate_anchor_in_content( | ||
| anchor, raw_url_content, is_markdown=True, slugger=slugger | ||
| ) | ||
|
|
||
| # For other URLs with anchors, consider valid if URL exists | ||
| # (we can't reliably validate anchors on non-GitHub sites) |
There was a problem hiding this comment.
again r"https?://(?:www\.)?github\.com"
There was a problem hiding this comment.
I can change to regex match if this is the intention. But URL may begin with github.com without r"https?".
There was a problem hiding this comment.
then r"(?:https?)?://(?:www\.)?github\.com" would work, no?
| README.md | ||
| examples/README.md | ||
| user_guide/perf_analyzer.md | ||
| model_navigator/CHANGELOG.md |
There was a problem hiding this comment.
Model navigator is stalled repository we not using it for quiet a while as per my knowledge, is something has changed?
cc: @whoisj
There was a problem hiding this comment.
There was a problem hiding this comment.
May be we should remove it from documentation?
There was a problem hiding this comment.
Let's change the repository visibilty to private then?
There was a problem hiding this comment.
seems like a reasonable approach.
fdc2749 to
5ffecd3
Compare
| Returns: | ||
| Raw GitHub URL string, or None if conversion is not applicable | ||
| """ | ||
| if not "github.com/" in url: |
There was a problem hiding this comment.
agreed, this should probably be r"https?://(?:www\.)?github\.com"
|
|
||
| # Validate anchor if present (for GitHub markdown files where we can reliably validate) | ||
| # Other websites use various anchor generation schemes that are hard to predict | ||
| if anchor and raw_url_content and "github.com" in base_link: |
There was a problem hiding this comment.
again r"https?://(?:www\.)?github\.com"
What does the PR do?
Added functions to validate links are valid and headings/anchors are present in the target markdown document. Fail
generate-html-documentationjob if find any.Checklist
<commit_type>: <Title>Commit Type:
Check the conventional commit type
box here and add the label to the github PR.
Related PRs:
Where should the reviewer start?
Test plan:
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)