
Feat/pre commit #44

Merged
btotharye merged 24 commits into main from feat/pre-commit on Mar 21, 2026
Conversation

@btotharye
Owner

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Content pack addition/update
  • CI/CD or tooling change

Changes Made

Testing

  • Tests pass locally (make test)
  • Linting passes (make lint)
  • Formatting is correct (make format-check)
  • Tested manually with:

Related Issues

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing tests pass locally with my changes

Additional Notes

@codecov

codecov bot commented Mar 21, 2026

Codecov Report

❌ Patch coverage is 99.48849% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.95%. Comparing base (d06d228) to head (02b3466).
⚠️ Report is 25 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| greybeard/reporters/pdf.py | 98.41% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #44      +/-   ##
==========================================
+ Coverage   89.61%   91.95%   +2.34%     
==========================================
  Files          22       25       +3     
  Lines        2070     2362     +292     
==========================================
+ Hits         1855     2172     +317     
+ Misses        215      190      -25     
| Flag | Coverage Δ |
|---|---|
| unittests | 91.95% <99.48%> (+2.34%) ⬆️ |

Flags with carried forward coverage won't be shown.


Jarvis Bot and others added 4 commits March 21, 2026 06:27
- Add greybeard/reporters/pdf.py module with PDFReporter class
- Implement professional PDF styling with color palette
- Add convert_to_pdf() function to formatters module
- Update CLI to handle --format pdf with --output requirement
- Add pdf to OutputFormat type and SUPPORTED_FORMATS
- Add reportlab>=3.6 to project dependencies
- Update test_formatters.py for PDF format support

Features:
- Title page with metadata and generation timestamp
- Risk summary section with severity indicators (Critical/High/Medium)
- Detailed analysis: tradeoffs, questions, communication language
- Professional styling: color palette, typography, layout
- Support for all review modes and content packs
- Graceful error handling if reportlab not installed

Usage:
  git diff main | greybeard analyze --format pdf --output review.pdf
  greybeard analyze --repo . --format pdf --output review.pdf

The PDF reporter is production-ready with comprehensive error handling
and style guidelines. All existing tests updated for new format support.
Repository owner deleted a comment from github-actions bot Mar 21, 2026
@github-actions

github-actions bot commented Mar 21, 2026

🧙 Greybeard Review: staff-core

╭──────────────────────────────── 🧙 greybeard ────────────────────────────────╮
│ Mode: review | Pack: staff-core | LLM: │
│ anthropic/claude-haiku-4-5-20251001 │
╰──────────────────────────────────────────────────────────────────────────────╯
Warning: input is large (~102073 tokens estimated). Consider trimming or using --repo with a focused diff.

Staff Review: GitHub Actions + Pre-Commit Integration

Summary

This is a comprehensive feature addition enabling greybeard reviews to run in GitHub Actions (on PRs) and pre-commit hooks (on local commits). The implementation is substantial—multiple new modules, extensive test coverage, documentation, and infrastructure for risk gates, multi-pack reviews, and environment configuration. The code is well-structured and thoughtful. My concerns are operational: deployment safety, on-call burden, and the long-term cost of maintaining this complexity.


Key Risks

OPERATIONAL IMPACT

  1. Silent failures in CI/CD pipeline

    • GitHub Action workflow runs in parallel matrix jobs across multiple packs. If one pack fails to load, the entire job may fail silently or with unclear error messages.
    • Blast radius: Every PR blocked unexpectedly if a pack source is misconfigured or transient network error occurs.
    • Question: What happens if load_pack() fails mid-job? Does the action gracefully degrade or hard-fail the PR check?
  2. Unbounded LLM cost at scale

    • Diff truncation is capped at 400 KB of bytes, not tokens. An innocent 400 KB diff could be 100k+ tokens on Claude, rapidly consuming budget.
    • Blast radius: Org-wide cost spike if many PRs have large diffs.
    • Question: Is there a monthly/daily budget limit or cost monitoring in place?
  3. Coordinated deployment risk for pre-commit hook

    • The pre-commit hook (greybeard-precommit diff/check) is installed via .pre-commit-hooks.yaml. If a greybeard release breaks the hook, all developers are blocked from committing until they update.
    • Blast radius: Team productivity loss; developers may bypass hooks or revert to unmanaged commits.
    • Observation: The .pre-commit-hooks.yaml file is simple but doesn't pin greybeard versions in a way that gives developers control.
  4. Observability gaps in async parallel jobs

    • GitHub Actions matrix jobs run in parallel. If one pack times out (15-min limit), others may still be running. The check status creation happens per-job, but the PR comment deduplication relies on finding "existing comments by pack." Race conditions are possible.
    • Question: Has this been tested under high concurrency (20+ matrix jobs)?
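The byte-vs-token mismatch flagged in risk 2 could be closed with a rough token estimate before truncation. A minimal sketch, assuming a ~4-characters-per-token heuristic; `estimate_tokens`, `should_truncate`, and `MAX_TOKENS` are hypothetical names, not greybeard's actual API:

```python
# Hypothetical sketch: gate the diff on an estimated token count rather than
# raw byte size. The ~4 chars/token heuristic is a common rough estimate for
# English text and code, not an exact tokenizer.
MAX_TOKENS = 50_000

def estimate_tokens(text: str) -> int:
    """Rough heuristic: roughly 4 characters per token."""
    return len(text) // 4

def should_truncate(diff: str) -> bool:
    """True if the diff likely exceeds the token budget."""
    return estimate_tokens(diff) > MAX_TOKENS
```

A real implementation would use the provider's tokenizer for accuracy, but even this crude gate prevents a 400 KB diff from silently consuming a six-figure token budget.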

LONG-TERM OWNERSHIP

  1. Risk gates complexity without clear ownership

    • The pre-commit config allows per-file risk gates with branch skips (skip_if_branch) and pattern matching. This is powerful but creates a configuration burden.
    • Question: Who owns maintaining .greybeard-precommit.yaml as the codebase grows? Is there a way to version-control and test gate rules?
    • Assumption: Ops/platform teams will own this; but no runbook is provided for rotating or updating gates.
  2. Multiple entry points for the same logic

    • greybeard analyze (CLI), GitHub Actions workflow, pre-commit hook, and MCP server all invoke run_review(). Each has slightly different error handling, timeout, and credential management.
    • Observation: This creates maintenance burden. A bug in the analyzer affects all four surfaces.
  3. Tribal knowledge in .cover files

    • The diff includes many .py,cover files (e.g., greybeard/analyzer.py,cover). These appear to be coverage reports, but their presence in the PR is confusing. If they're meant to be committed, they add noise; if not, they shouldn't be in the diff.
    • Question: Are .cover files meant to be in the repo? If so, how are they maintained?

ON-CALL & HUMAN COST

  1. Alert fatigue from risk gate false positives

    • Risk gates can be configured to fail on "low" severity patterns (e.g., "risk", "concern"). This can block commits for innocuous changes.
    • Scenario: A developer adds a TODO comment with the word "risk" → gate fails → developer escalates or bypasses hook.
    • Question: Is there a mechanism for developers to override or temporarily skip gates without losing the safety net?
  2. Error messages are not actionable

    • When a pre-commit review fails, the output is:
      ✗ Review failed — commit blocked
      
      With only the raw LLM response as the message. Developers don't know why it failed or what to do to fix it.
    • Suggested improvement: Extract and highlight the top 3 blocking concerns in a structured format.
  3. Timeout handling is implicit

    • GitHub Actions jobs have a 15-minute timeout. If the LLM is slow or the diff is large, the job times out silently. No graceful degradation.
    • Question: What is the expected latency for a typical 1000-line diff on Anthropic? Has this been benchmarked?
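The structured-output improvement suggested in item 2 above could be sketched roughly as follows; the severity keywords and function names here are illustrative, not greybeard's actual implementation:

```python
import re

# Hypothetical sketch: pull the top blocking concerns out of the raw review
# text instead of dumping the whole LLM response on a failed commit.
CONCERN = re.compile(r"^\s*[-•*]?\s*(critical|high)\b", re.IGNORECASE)

def top_concerns(review: str, limit: int = 3) -> list[str]:
    """Return up to `limit` lines that look like blocking concerns."""
    hits = [line.strip() for line in review.splitlines() if CONCERN.match(line)]
    return hits[:limit]
```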

WHO PAYS FOR THIS LATER?

  1. Multi-pack parallelism adds debugging complexity

    • The workflow runs 3 packs in parallel. If the PR comment deduplication fails, developers see 6 comments instead of 3. Debugging the comment update logic will be painful.
    • Observation: The deduplication logic uses a simple text search (f"Greybeard Review: {pack}"). This is fragile if the format changes.
  2. Environment variable sprawl

    • The system reads from:
      • ANTHROPIC_API_KEY / OPENAI_API_KEY / GREYBEARD_LLM_*
      • GREYBEARD_RISK_THRESHOLD, GREYBEARD_MODE, GREYBEARD_PACKS
      • GitHub context env vars (GITHUB_SHA, GITHUB_REF, etc.)
    • Each new feature will add more env vars. There's no schema or validation, only best-effort defaults.
    • Question: Is there a single source of truth for which env vars are required vs optional?
  3. Rollback story is unclear

    • If a bad release of the GitHub Action workflow breaks all PRs, the rollback is: revert the workflow file. But if the greybeard package has a bug, the rollback is more complex (reverts in .pre-commit-hooks.yaml, greybeard re-release, developers update locally).
    • Observation: There's no version pin in .pre-commit-hooks.yaml. A greybeard release instantly affects all developers.
  4. Test coverage is high but integration gaps remain

    • Unit tests are comprehensive (92.96% for github_action.py), but:
      • No integration test that exercises the full GitHub Action workflow end-to-end.
      • No test of the pre-commit hook under realistic conditions (large diff, slow LLM, network error).
      • No test of comment deduplication under race conditions.
    • Observation: The tests mock load_pack() and run_review(), so real behavior is not verified.

Tradeoffs

| What's being traded | Against | Notes |
|---|---|---|
| Automated review on every commit/PR | Increased operational complexity, cost, on-call burden | Value: catches issues early. Cost: LLM per-PR pricing, failure modes, debugging |
| Multiple review perspectives (3 packs in parallel) | Longer CI time, comment noise, parallel job coordination | Value: staff + on-call + security views. Cost: ~45 sec per PR (assuming parallel), 3 potential failure points |
| Configurable risk gates per file | Configuration burden, false-positive risk | Value: ops can enforce rules (e.g., "infra/* must pass security review"). Cost: gate misconfiguration blocks commits, creates process overhead |
| Pre-commit hook integration | Developer experience (hook overhead), adoption friction | Value: shift-left, catch issues before they reach CI. Cost: 5-10 sec per commit; developers may disable it if slow or noisy |
| Environment-based configuration | No centralized config schema, hard to audit | Value: flexible, no need to edit workflow files. Cost: easy to misconfigure, hard to debug |

Recommendation: The tradeoffs are reasonable if the operational risks are mitigated. See suggestions below.


Questions to Answer Before Proceeding

Blocking questions

  1. Failure mode in GitHub Actions: What happens if a pack fails to load mid-job? Does the workflow fail the PR check, or does it continue with the other packs? If it continues, does the comment reflect which packs succeeded/failed?

  2. Cost control: Is there a monthly budget for LLM API calls via GitHub Actions? What's the expected cost per PR (for a typical 1000-line diff on each of 3 packs)? Has anyone benchmarked the token usage?

  3. Pre-commit hook adoption: Will this hook be mandatory or optional? If mandatory, what's the rollout plan to prevent developer friction (slow machines, network errors, LLM downtime)?

  4. Rollback safety: If greybeard releases a breaking change to run_review(), how quickly can it be rolled back? Is there a grace period for users to update?

Important clarifications

  1. .cover files: Are the .py,cover files (e.g., greybeard/analyzer.py,cover) meant to be in the repo? If yes, how are they maintained? If no, they should not be in this PR.

  2. Risk gate versioning: How are risk gate rules (RiskGate objects in .greybeard-precommit.yaml) tested before deployment? Can they be validated in CI?

  3. Comment deduplication: Under what conditions could the comment deduplication fail? (e.g., GitHub API rate limit, race condition if two jobs post simultaneously)

  4. Latency SLA: What is the target latency for a pre-commit hook? If it exceeds 10 seconds, developers may bypass it. Has this been measured?

Nice-to-haves

  1. Error message improvements: Can the pre-commit hook output be structured (JSON or YAML) so downstream tools (editors, CI) can parse it?

  2. Risk gate dry-run: Can risk gates be tested against a sample PR without actually blocking it? (e.g., greybeard precommit test --file infra/main.tf)

  3. Cost tracking: Can the GitHub Action log the token count for each review? This helps with cost forecasting.


Suggested Communication Language

When raising this with the author:

This is a thoughtful and comprehensive implementation. The test coverage and documentation are excellent. I have a few operational concerns that are worth addressing before merging:

  1. Failure modes in CI: Let's walk through what happens if a pack fails to load. Should we hard-fail the PR check, or continue with the other packs and report a partial result?

  2. Cost management: I'd like to see a brief analysis of the expected LLM cost per PR (for a typical diff on all 3 packs). Do we have budget controls in place?

  3. Pre-commit hook rollout: If we make this mandatory, we should have a clear rollout plan and an easy way for developers to disable/override it if it's too slow or noisy. Can we add a --skip flag?

  4. Risk gate testing: How do we ensure risk gate rules don't create false positives that block legitimate commits? Can we add a dry-run mode?

These are implementation details, not blockers. But addressing them will make the feature more robust and easier to operate in production.


Suggested Next Steps

  1. Add a failure-mode section to the GitHub Action: explicitly document what happens if a pack fails to load, with an example error message.

  2. Add cost tracking: Log the token count for each review (or at least the diff size) so we can forecast LLM costs.

  3. Add a --skip / --bypass flag to the pre-commit hook so developers can opt out without permanently disabling the hook.

  4. Clarify the .cover files: Confirm whether they're meant to be in the repo, or if they should be removed.

  5. Add an integration test that exercises the full GitHub Action workflow with a real diff and mock LLM response.

  6. Document the risk gate lifecycle: How are risk gate rules versioned, tested, and rolled out?


Assumptions Made

  • The GitHub Action is intended to run on every PR to main and develop.
  • The pre-commit hook is optional (not mandated on all repos).
  • The Anthropic API is the default backend; OpenAI is a secondary option.
  • Risk gates are managed by the platform/ops team, not individual developers.
  • The system is expected to degrade gracefully if the LLM is slow or unavailable (comment-only, not blocking).

Review generated by Greybeard on d41de5ef736a40bd53be769e8e6f711b4ecee327

@github-actions

github-actions bot commented Mar 21, 2026

⚠️ BLOCKING ISSUES DETECTED

📟 Greybeard Review: oncall-future-you

╭──────────────────────────────── 🧙 greybeard ────────────────────────────────╮
│ Mode: review | Pack: oncall-future-you | LLM: │
│ anthropic/claude-haiku-4-5-20251001 │
╰──────────────────────────────────────────────────────────────────────────────╯
Warning: input is large (~102073 tokens estimated). Consider trimming or using --repo with a focused diff.

Summary

This is a massive, complex diff touching dozens of files across the entire system—including GitHub Actions, pre-commit hooks, formatters, reporters, CLI modules, and extensive test suites. The review scope is enormous, and my assessment is necessarily high-level. I'm flagging serious operational and maintenance risks that will matter when this code hits production.


Key Risks

OPERATIONAL IMPACT — Silent Failures & Cascading Effects

1. GitHub Actions workflow has no runbook for failures

  • The workflow runs on every PR, parsing diffs and calling an LLM. If the LLM times out or returns garbage, the CI check fails silently with a vague error message in the Actions tab.
  • No alerting. No metrics. No way for an engineer to debug at 3am when it's flaky.
  • Who notices first? A developer whose PR mysteriously blocks because the check timed out. They'll retry, waste time, escalate to on-call.

2. Risk threshold pattern matching is fragile

  • Blocking logic is regex-based, pattern-matched against LLM output (detect_blocking_issues). Patterns are hardcoded: "production incident|data loss|security vulnerability|...".
  • If the LLM output changes formatting (e.g., uses "Production Incident" vs "production incident" or "PRODUCTION INCIDENT" in markdown), the regex silently fails to match.
  • Result: A dangerous change slips through because the pattern didn't fire. No alert. No audit trail.

3. Silent data loss in comment deduplication

  • find_existing_comment() searches for comments by pack name. But if a comment body is truncated or malformed, the search fails and a duplicate comment gets posted.
  • Multiple comments pile up on the same PR. Maintainers don't notice. GitHub's PR comment limit is eventually hit. Oldest comments are hidden.
  • Human cost: Conflicting advice in PR comments. Confusion about which review is current.

4. Pre-commit hooks can silently skip reviews

  • run_diff_review() catches all exceptions and returns passed=True (allowing the commit) if the LLM errors or the pack isn't found.
  • Scenario: Developer commits a risky change. The pre-commit hook tries to review it, but the LLM is down. Hook silently passes. Risky code ships.
  • No logging, no indication to the developer that the review was skipped.
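The fail-open behavior in item 4 could be made explicit and visible rather than silent. A minimal sketch of that idea; `run_review` is passed in here as a stand-in, and `fail_open` is a hypothetical flag, not greybeard's actual API:

```python
import sys

# Hypothetical sketch: fail open only when explicitly configured, and always
# tell the developer that the review was skipped rather than silently passing.
def run_diff_review(diff: str, run_review, fail_open: bool = False) -> bool:
    try:
        result = run_review(diff)
        return result.passed
    except Exception as exc:
        print(f"greybeard: review skipped ({exc})", file=sys.stderr)
        if fail_open:
            return True   # allow the commit, but the skip was logged
        return False      # strict mode: block until the failure is investigated
```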

LONG-TERM OWNERSHIP — Tribal Knowledge & Maintenance Burden

5. Anthropic model changed without migration path

  • config.py changed the default model from claude-3-5-sonnet-20241022 to claude-haiku-4-5-20251001 (Haiku instead of Sonnet).
  • Haiku is cheaper but significantly less capable. No migration guide. No deprecation notice. Existing users won't notice until their reviews become noticeably worse.
  • Who owns this? Someone will eventually ask, "Why did our reviews get worse?" and have to reverse-engineer that a model was downgraded.

6. Coverage files (*.py,cover) are cryptic

  • The diff includes coverage annotations (>, !, -) mixed into Python files. These are code coverage markers from coverage.py, not actual code.
  • New engineers will be confused: "Why are there > symbols at the start of lines? Is this valid Python?"
  • Documentation debt: The codebase now has invisible metadata embedded in file names.

7. Risk gate configuration is underspecified

  • Pre-commit config allows per-file risk gates (RiskGate), but the gate-matching logic is incomplete.
  • In run_risk_check(), gates are identified but the actual review and threshold checking is stubbed: # Placeholder: assume the gate would fail.
  • Maintenance hazard: Future engineer will implement gate checking, but it's not clear what the intended behavior is. They'll guess wrong.

ON-CALL & HUMAN COST — Pager Noise & Manual Recovery

8. No dashboard or observability for GitHub Actions

  • When the workflow fails, there's no place to look except the GitHub Actions log. No Datadog, no CloudWatch, no time-series data on success rates.
  • At 3am, an engineer pages because the PR workflow is timing out. They have to SSH into GitHub's infrastructure or... read the Actions log line by line.
  • Recovery time: 30+ minutes to understand why a review hung.

9. Blocking decisions are opaque

  • A PR is blocked because detect_blocking_issues() matched a pattern. But which pattern? The dev doesn't know.
  • The error message in the GitHub check is generic: "Blocking issues detected." No details. Dev has to read the PR comment carefully or run the review locally.
  • Incident cost: Dev escalates to staff engineer asking, "How do I unblock this?" Staff engineer has to re-review and decide if the issue is real or a false positive.

10. Manual recovery requires understanding two systems

  • If a PR is falsely blocked, recovery is manual: edit the config, change the threshold, and re-run the workflow.
  • Or: disable the entire pre-commit hook with PreCommitConfig(enabled=False) and commit that change.
  • Coordination overhead: Requires both a code change AND understanding of GitHub Actions / pre-commit config. Single-engineer teams will struggle.

WHO PAYS FOR THIS LATER?

11. Test suite is enormous but shallow

  • 600+ lines of tests covering many functions, but mostly mocking external dependencies (LLM calls, git).
  • No integration tests that run a real LLM call or generate a real diff and trace it through the full flow.
  • Result: Tests pass locally, but prod failures surface that weren't caught. Engineer has to debug with real data.

12. Configuration is distributed and inconsistent

  • GitHub Actions workflow reads from env vars.
  • Pre-commit hook reads from .greybeard-precommit.yaml AND env vars.
  • CLI reads from ~/.greybeard/config.yaml AND env vars.
  • Cognitive load: Three different config sources. Developer has to understand all three and debug config issues across all three.

13. PDF export depends on optional dependency (reportlab)

  • to_pdf() calls PDFReporter, which raises ImportError if reportlab isn't installed.
  • No clear error message or recovery path. User sees a cryptic traceback.
  • Support cost: "I tried to export to PDF and got an error. What do I do?"

14. Rollback is complex and error-prone

  • Disabling the workflow is straightforward (delete the YAML file).
  • But disabling pre-commit hooks requires merging a config change.
  • If a production incident is caused by a bad review blocking a critical fix, you need to unblock the review AND roll back the hook config at the same time.
  • Time under incident: 15+ minutes if automated recovery isn't in place.

Tradeoffs

| What's being traded | Against | Cost |
|---|---|---|
| Comprehensive risk detection (multiple packs, multiple thresholds) | Observability and debuggability | Silent failures, opaque blocking logic, no audit trail |
| Local pre-commit integration (catch issues before PR) | Reliability and error handling | Exceptions are swallowed; failed reviews silently allow commits |
| Multiple configuration sources (env vars, YAML, defaults) | Consistency and maintainability | Three config sources to understand; no single source of truth |
| LLM-based pattern matching (natural language understanding) | Predictability and simplicity | Fragile regex; output format changes break detection |
| Rich output formats (PDF, HTML, Jira, JSON) | Maintainability burden | Extra code, extra dependencies, extra test matrix |

Questions to Answer Before Proceeding

  1. How will you know when the GitHub Actions workflow is broken?

    • Who monitors the success rate of the workflow?
    • What's the SLO? (Currently: implicit "it shouldn't break PRs," but no metric.)
    • What dashboard or alert will fire if the workflow times out 10 times in a row?
  2. What's the recovery path if a PR is incorrectly blocked?

    • How long does it take to unblock a PR in an incident?
    • Can you unblock it without a code change, or must the dev/maintainer change the threshold and re-run?
    • Who is authorized to override the blocking decision?
  3. How will you handle LLM output format changes?

    • The risk detection regex assumes a specific format from the LLM. What if Claude changes its output format?
    • What's the test coverage for regex patterns against real LLM output?
    • How do you version this or detect breaking changes before they hit prod?
  4. Why does pre-commit hook swallow exceptions and return passed=True?

    • What's the intent: "Never block a commit if review fails" or "Never crash the hook"?
    • If it's the former, that's a security issue. If it's the latter, you should log and alert.
    • Can you add verbose logging or a separate "check" mode that doesn't swallow errors?
  5. How will the team handle the Anthropic model downgrade?

    • Users with the old config will be using Sonnet; new users will be using Haiku.
    • How will you coordinate the migration and explain the capability difference?
    • Is there a migration guide or deprecation notice?
  6. What's the contract for format_pr_comment() truncation?

    • It truncates at 60,000 chars, but GitHub's comment size limit is different.
    • What happens if a review is exactly 60,001 chars? Does it get silently truncated with no notice?
    • Should this be configurable?
  7. How do you test the full workflow end-to-end?

    • The test suite mocks all external calls (LLM, git, load_pack).
    • How do you catch integration issues like "the GitHub Actions output format changed" or "env var isn't being passed correctly"?
    • Should you have a CI job that runs the real workflow on a test PR?
  8. What happens if the risk threshold env var is malformed?

    • get_risk_threshold("extreme") returns the default "high".
    • Should this be an error instead? How will you know if a typo sneaked into the GitHub Actions config?
  9. Who owns the pre-commit hook configuration per repository?

    • Is .greybeard-precommit.yaml committed to the repo or shared globally?
    • If per-repo, who is responsible for maintaining it? Do all repos need it?
    • If it gets out of sync with the code, how do you catch that?
  10. How will you handle breaking changes to the workflow or config format?

    • If you change the YAML schema for risk gates, how do you migrate existing repos?
    • Is there a version field in the config?

Suggested Communication Language

To the team proposing this:

"This is ambitious and well-tested locally. I see three critical asks before we ship:

1. Observability: Add metrics for workflow success rate, LLM latency, and blocking decision frequency. We need a dashboard before this hits production.

2. Graceful degradation: Pre-commit hooks currently swallow errors and allow commits. That's a security boundary we're weakening. Either (a) add alerting when a review fails, or (b) add a --strict mode that fails the hook and lets the user decide whether to skip the review or investigate.

3. Runbook: Document the manual recovery steps for a falsely-blocked PR, an LLM outage, and workflow timeouts. Who has permission to unblock a PR? How long should it take?

Once those three are in place, I'm confident this will add value. As-is, it will surprise us at 3am."


Assumptions

  • I assume the *.py,cover files are code coverage reports and should not be committed to the repo as source files.
  • I assume GitHub Actions is the system of record for PR reviews, not pre-commit hooks (pre-commit is a convenience, GitHub Actions is the gate).
  • I assume the risk threshold patterns are documented somewhere (they're not in this diff), and the team knows what output format the LLM produces.
  • I assume reportlab is optional and a graceful error is acceptable if it's not installed (rather than a hard requirement).

Review generated by Greybeard on d41de5ef736a40bd53be769e8e6f711b4ecee327

@github-actions

github-actions bot commented Mar 21, 2026

🔒 Greybeard Review: security-reviewer

╭──────────────────────────────── 🧙 greybeard ────────────────────────────────╮
│ Mode: review | Pack: security-reviewer | LLM: │
│ anthropic/claude-haiku-4-5-20251001 │
╰──────────────────────────────────────────────────────────────────────────────╯
Warning: input is large (~102073 tokens estimated). Consider trimming or using --repo with a focused diff.

Summary

This is a massive pull request adding comprehensive GitHub Actions integration to greybeard, plus pre-commit hook support, PDF reporting, expanded test coverage, and numerous supporting modules. The scope is substantial: new files, refactored modules, test expansion from 40 tests to 600+. The changes are generally well-structured and documented, but there are several operational and ownership risks to surface before merging.


Key Risks

1. Undocumented state transition in GitHub Actions workflow (OPERATIONAL)

  • The workflow creates separate check runs for each pack (staff-core, on-call, security) but the blocking logic in .github/workflows/greybeard-review.yml does regex matching on review output to detect blocking issues.
  • Attack class: If the LLM response format changes slightly (e.g., a pack outputs "⚠️ CRITICAL ISSUE" instead of matching the hardcoded patterns), checks will silently pass when they should block.
  • Impact: False negatives on PR blocking; dangerous changes merge unreviewed.
  • Fix: Either (a) parse structured JSON output from the analyzer instead of regex on markdown, or (b) add integration tests that verify the blocking patterns match real analyzer output for each pack.
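Option (b) above could be sketched as normalizing the review text before matching, so case changes and markdown emphasis cannot cause silent false negatives. The patterns mirror the hardcoded examples mentioned in this review; the function name and normalization rules are illustrative:

```python
import re

# Hypothetical sketch: case-insensitive matching on text stripped of markdown
# emphasis, so "**Data Loss**" or "PRODUCTION INCIDENT" still triggers a block.
BLOCKING_PATTERNS = [
    "production incident",
    "data loss",
    "security vulnerability",
]
BLOCKING = re.compile("|".join(BLOCKING_PATTERNS), re.IGNORECASE)

def detect_blocking_issues(review: str) -> bool:
    normalized = review.replace("**", "").replace("`", "")
    return bool(BLOCKING.search(normalized))
```

Parsing structured JSON from the analyzer (option a) would still be more robust; this only hardens the regex path.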

2. Anthropic model downgrade in config.py (OPERATIONAL)

  • Line 25: default model changed from claude-3-5-sonnet-20241022 to claude-haiku-4-5-20251001.
  • Problem: Haiku is a much smaller model with lower quality. This breaks the implicit contract that greybeard produces "staff-level" reviews. Users running the workflow may see degraded review quality without realizing why.
  • Impact: Surprise quality drop for Anthropic users; difficult to debug why reviews are suddenly weaker.
  • Fix: Either revert to Sonnet (recommended) or make a deliberate migration with a changelog entry and user warning. Don't bury model downgrades in code changes.

3. No rate limiting or quota management in GitHub Actions (OPERATIONAL COST)

  • The workflow runs on every PR with no built-in rate limiting, quota checks, or cost estimation.
  • Failure mode: A malicious actor or mistake (e.g., force-pushing 100 commits) triggers 100 concurrent reviews × 3 packs = 300 LLM API calls, potentially costing $10-100+ depending on model and token usage.
  • Blast radius: Unbounded API costs; no circuit breaker.
  • Fix:
    • Add a max_diff_size check; skip reviews if diff > 1MB.
    • Document estimated monthly cost in setup guide.
    • Add optional GitHub secret for cost per 1K tokens so users can set spend limits upstream.

4. Missing error handling for malformed YAML configs (OPERATIONAL)

  • PreCommitConfig.load() uses yaml.safe_load() without try/catch. If .greybeard-precommit.yaml is corrupted or invalid, the workflow crashes with a cryptic YAML error instead of a helpful message.
  • Impact: On-call engineer debugging a failed commit hook sees YAML parser traceback, not "Your config is broken here."
  • Fix: Wrap yaml.safe_load() in try/except and log a clear error message with the file path and line number.
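The suggested fix could look roughly like this, assuming PyYAML is installed; `ConfigError` and `load_precommit_config` are hypothetical names, not greybeard's actual API:

```python
from pathlib import Path

import yaml  # PyYAML; assumed available since the config is YAML

# Hypothetical sketch: wrap yaml.safe_load so a corrupt config yields an
# actionable message with the file path and line, not a raw parser traceback.
class ConfigError(Exception):
    pass

def load_precommit_config(path: str) -> dict:
    text = Path(path).read_text()
    try:
        data = yaml.safe_load(text)
    except yaml.YAMLError as exc:
        # PyYAML attaches a problem_mark with the offending position
        mark = getattr(exc, "problem_mark", None)
        where = f" (line {mark.line + 1})" if mark else ""
        raise ConfigError(f"Invalid YAML in {path}{where}: {exc}") from exc
    return data or {}
```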

5. Auth/authz boundary: LLM API keys in secrets but readable in workflow logs (SECURITY)

  • The workflow sets ANTHROPIC_API_KEY and OPENAI_API_KEY from GitHub secrets. These are automatically masked in logs, but if a reviewer uses the --verbose flag or the LLM returns the key in an error message, it could leak.
  • Likelihood: Low, but not zero.
  • Fix: Don't pass API keys as environment variables into the analyzer; instead, greybeard should read them from the environment at runtime. Add a pre-check that validates the key is set without printing it.
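The pre-check can validate presence without ever echoing the value. A sketch, with an illustrative function name:

```python
import os
import sys

def require_api_key(var: str = "ANTHROPIC_API_KEY") -> None:
    """Fail fast if the key is missing, without ever printing its value."""
    if not os.environ.get(var):
        sys.exit(f"{var} is not set; add it as a repository secret.")
    # Log only the variable name and length, never the secret itself.
    print(f"{var} is present ({len(os.environ[var])} chars).")
```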

6. No pagination/truncation safety for large reviews (OPERATIONAL)

  • GitHub PR comment body limit is 65,536 characters. The workflow truncates at 60,000 chars, but a very long review from 3 packs could exceed this when combined.
  • Impact: Comments silently fail to post, confusing users who expect feedback.
  • Fix: Aggregate all three pack reviews into a single comment or use threads (GitHub's native reply feature). If using threads, handle the case where the parent comment itself is too large.

7. Tribal knowledge in pre-commit hook config format (OWNERSHIP)

  • .greybeard-precommit.yaml format with risk gates, skip patterns, and pack lists is not documented in the code or in a schema. The YAML structure is inferred from PreCommitConfig.load().
  • Impact: In 6 months, a new engineer on the team can't modify the config without reading Python code or trial-and-error.
  • Fix: Create a JSON schema or document the exact YAML format in .github/GREYBEARD_PRECOMMIT_CONFIG_SCHEMA.md with examples.
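Until a formal schema exists, even a lightweight validator documents the expected shape in code. The keys below are inferred from this review (packs, risk gates, skip patterns) and are assumptions, not the canonical schema:

```python
# Assumed top-level keys of .greybeard-precommit.yaml; illustrative only.
EXPECTED_KEYS = {"packs": list, "risk_gates": dict, "skip_patterns": list}

def validate_config(cfg: dict) -> list[str]:
    """Return human-readable problems instead of raising on first error."""
    problems = []
    for key, typ in EXPECTED_KEYS.items():
        if key in cfg and not isinstance(cfg[key], typ):
            problems.append(f"'{key}' should be a {typ.__name__}")
    return problems
```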

8. PDF reporter uses reportlab (optional dependency) but no graceful fallback (OPERATIONAL)

  • greybeard/reporters/pdf.py imports reportlab and raises ImportError if not installed. Users who run greybeard analyze --format pdf without installing reportlab see a cryptic error.
  • Impact: Feature advertised but not working; confusion during setup.
  • Fix: Either make reportlab a required dependency (add to pyproject.toml), or provide a clear error message: "PDF export requires reportlab. Install with: pip install reportlab".
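The optional-dependency guard generalizes beyond reportlab. A sketch with an illustrative helper name; the error text mirrors the message suggested above:

```python
import importlib

def require_optional(module: str, hint: str) -> None:
    """Guard an optional dependency, replacing a bare ImportError
    with an actionable install hint."""
    try:
        importlib.import_module(module)
    except ImportError:
        # 'from None' hides the noisy original traceback.
        raise ImportError(hint) from None
```

Called as `require_optional("reportlab", "PDF export requires reportlab. Install with: pip install reportlab")` at the top of the PDF reporter, this fails at the right moment with the right message.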

9. No idempotency guarantee for comment updates (OPERATIONAL)

  • The GitHub Actions script finds existing comments by checking if the body contains Greybeard Review: {pack}. If this string appears in user text, it will incorrectly match and overwrite a user comment.
  • Likelihood: Low but nonzero.
  • Fix: Use a unique bot comment marker, e.g., a hidden HTML comment: <!-- greybeard-{pack}-{commit-sha} -->.
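The marker match can be sketched as follows (marker format as suggested above; helper names are illustrative):

```python
def bot_marker(pack: str, sha: str) -> str:
    """Hidden HTML comment identifying a bot comment for one pack/commit."""
    return f"<!-- greybeard-{pack}-{sha} -->"

def is_bot_comment(body: str, pack: str) -> bool:
    """Match on the machine marker, not on visible prose, so a user comment
    that merely mentions the pack name is never overwritten."""
    return f"<!-- greybeard-{pack}-" in body
```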

10. Mock-heavy test suite may hide integration bugs (TEST QUALITY)

  • Tests use @patch extensively (e.g., mocking load_pack, run_review). This means the tests don't actually validate that the GitHub Actions workflow can integrate with real analyzer code.
  • Impact: Workflow may fail at runtime even if all tests pass.
  • Fix: Add at least one integration test that uses real (or stubbed) analyzer/pack code, not mocks. For example, test the full flow: get_staged_files() → read_diff() → load_pack("staff-core") → run_review() → format_comment().

11. Dependency on external GitHub API with no retry logic (OPERATIONAL)

  • The workflow calls github.rest.issues.listComments() and github.rest.issues.updateComment() without retry logic. If GitHub API is temporarily unavailable (rare but possible), the workflow fails and the commit is blocked.
  • Impact: False negatives on availability; users blame the tool instead of GitHub.
  • Fix: Wrap API calls in a simple retry loop with exponential backoff (3 retries, 1s→2s→4s).
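The actual workflow code is JavaScript, but the retry shape is the same in any language. A Python sketch with an illustrative wrapper name:

```python
import time

def with_retries(call, attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky call with exponential backoff (1s -> 2s -> 4s)."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the real error
            time.sleep(base_delay * (2 ** attempt))
```

In practice the except clause should catch only transient errors (timeouts, 5xx responses), not every exception.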

12. Cost implications not documented (OWNERSHIP)

  • Running 3 packs × N PRs per month × ~2K tokens per review = significant API costs. There's no cost estimate in setup docs or warnings about GPT-4o pricing.
  • Impact: Teams adopt the tool and later discover unexpected bills.
  • Fix: Add cost calculator to GREYBEARD_ACTION_SETUP.md: "At 50 PRs/month with 3 packs, expect ~$50-150/month with GPT-4o, $5-10/month with Haiku or local Ollama."
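The back-of-envelope arithmetic behind that estimate, as a sketch. The per-token price is a placeholder; plug in your provider's current rate:

```python
def monthly_cost(prs: int, packs: int = 3, tokens_per_review: int = 2000,
                 usd_per_1k_tokens: float = 0.01) -> float:
    """Rough monthly API spend: PRs x packs x tokens x price per 1K tokens.

    usd_per_1k_tokens is a placeholder rate, not any provider's actual price.
    """
    return prs * packs * tokens_per_review / 1000 * usd_per_1k_tokens
```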

Tradeoffs

| What's being traded | Against what | Verdict |
| --- | --- | --- |
| Model downgrade (Sonnet → Haiku) | Cost reduction | Not worth it without explicit user consent. Quality is the core value prop. |
| LLM timeout at 15 minutes | Preventing hung workflows | Reasonable, but should be configurable and logged. |
| 3 parallel packs (staff-core, on-call, security) | Execution time & cost | Good default, but let users override with GREYBEARD_PACKS env var. ✓ Already done. |
| Truncate diffs at 400KB | Token limits | Reasonable for cost management, but document the risk that truncated diffs may miss important context. |
| Always run on open/sync/ready_for_review | Skip reviews on WIP/draft | Good: respects GitHub's draft PR feature. ✓ |

Questions to Answer Before Proceeding

  1. Why was the Anthropic model downgraded from Sonnet to Haiku? Was this intentional, or a bug? If intentional, this needs a changelog entry and explicit user warning (e.g., a banner in the setup guide).

  2. How will you handle concurrent requests if 100 PRs are opened in rapid succession? Should there be a queue or a per-repo rate limit?

  3. What's the intended SLA for the workflow? Should it complete in <1min, <5min, <15min? If LLM is slow, will the PR be unblocked by timeout?

  4. For teams with many PRs per day, what's the monthly cost estimate? Should this be in the README or setup guide?

  5. Are there any secrets (API keys, auth tokens) that could leak in error messages or logs? Run a manual check of the LLM error responses.

  6. How will you version the GitHub Actions workflow? Should users pin to a specific release, or does @main auto-update? (Risk: auto-update could introduce breaking changes.)

  7. Pre-commit hook: if greybeard-precommit diff fails due to a missing pack, should the commit be blocked or allowed? Currently it's allowed (lines 141-146 in precommit.py). Is that the intended behavior?

  8. PDF reporter: should reportlab be a required dependency or optional? If optional, update the error message and docs.


Suggested Communication Language

For the team:

"We're adding GitHub Actions and pre-commit integration for greybeard reviews. Before merging, please note:

  1. The Anthropic default model was changed from Sonnet to Haiku. Please confirm this was intentional—it significantly impacts review quality.
  2. There's no API cost management built in yet. Running this on 50+ PRs/month could cost $50-200/month depending on model. Consider setting up cost alerts in your cloud provider.
  3. Pre-commit hook behavior: if a pack can't load, the commit is allowed (fail-open). This is intentional for CI resilience, but document it clearly.
  4. If you enable PDF export, make sure your environment has reportlab installed, or we'll need to make it required.

After merge, we should add:

  • Cost estimation docs
  • JSON schema for .greybeard-precommit.yaml
  • Integration tests (not just mocks)"

For code review:

"This is a solid feature addition with good test coverage. The main risks are operational (API costs, config docs, model downgrade) rather than code quality. I'd recommend addressing points 1, 2, and 7 before merge, and the rest as follow-up PRs."


Assumptions

  • GitHub Actions is the primary CI/CD target. No mention of GitLab CI, CircleCI, or other platforms.
  • greybeard's analyzer is stable. Tests mock it, so we don't validate end-to-end integration with real packs/LLMs.
  • Users will configure the workflow correctly. Minimal validation of GREYBEARD_RISK_THRESHOLD or other env vars.
  • API keys are managed securely by GitHub. We trust GitHub's secret masking and don't add extra encryption.

Review generated by Greybeard on d41de5ef736a40bd53be769e8e6f711b4ecee327

…n pinning docs

- Workflow: continue-on-error on analyze step, neutral check on failure,
  post 'Review unavailable' error comment with Actions log link
- Workflow: reduce diff truncation from 400KB to 200KB (~50k tokens)
- .pre-commit-hooks.yaml: strong version pinning warning + kill switch docs
- docs/guides/github-actions.md: Reliability & Cost Controls section,
  expanded troubleshooting (neutral check, pre-commit kill switch, billing limit)
…chitecture, cover files

- .gitignore: exclude *.py,cover and *.cover (coverage annotate artifacts)
- docs/guides/pre-commit.md: new guide covering risk gate ownership model,
  version control of .greybeard-precommit.yaml, testing gates, runbook
  template, emergency bypass (skip_if_branch), kill switch, version pinning
- docs/contributing.md: add architecture section mapping the four entry
  points (CLI, GitHub Actions, pre-commit, MCP) to their source files;
  explains that analyzer bugs affect all surfaces and where to make changes
- mkdocs.yml: add Pre-commit Hooks guide to nav
Workflow: wrap greybeard analyze in timeout 8m so slow API exits with
code 124; continue-on-error posts neutral check and Review unavailable
comment instead of hanging silently for 15 minutes.

precommit.py: format_review_output adds a structured What to do block
on failure with verbose flag, SKIP= per-commit bypass, and kill switch.

docs pre-commit.md: new False Positives section with threshold calibration
table, excluded_paths, non-blocking docs gate, SKIP= escape hatch, and
false positive tracking process. New Reading the Output section with
annotated example. Troubleshooting for innocuous-word blocks.

docs github-actions.md: latency table by diff size and model, explains
8m timeout behaviour and recovery. Troubleshooting for Job timed out.
…cking docs

precommit.py: wrap yaml.safe_load in try/except; malformed config now
raises SystemExit with file path, line number, and how-to-fix hint
instead of a raw YAML traceback. Test added.

workflow: add key validation step that fails fast with a clear message if
ANTHROPIC_API_KEY secret is not set.

workflow: diff size guard — if raw diff > 1MB, post a skip comment and
abort before running the LLM. Prevents unbounded API cost on large PRs.
Label-based triggering already kills the concurrent-run scenario; this
is the per-PR guard. diff step now captures orig_size and too_large flag.

workflow: idempotent comment updates now match on a unique hidden HTML
marker <!-- greybeard-bot:PACK --> instead of body.includes(pack name),
preventing false overwrites of user comments that happen to mention a
pack name. Removes duplicate const icons/reviewFailed declarations.

config.py: comment explaining the deliberate Haiku default (cost/speed)
and how to override with claude-sonnet-4-6.

CHANGELOG.md: [Unreleased] section documenting the Haiku model change
as a deliberate quality/cost trade-off, not a silent regression.

docs github-actions.md: callout noting the regex-based blocking detection
is advisory, not deterministic. Documents RISK_THRESHOLD and the roadmap
for structured JSON output.

docs pre-commit.md: Configuration Reference section — complete annotated
YAML schema for .greybeard-precommit.yaml so engineers can modify
config without reading Python source.

Skipped (already addressed): #6 comment size limit (separate per-pack
comments, each capped at 60k), #8 PDF graceful fallback (_check_reportlab
raises clear ImportError with pip install hint).
@btotharye btotharye merged commit f413aa3 into main Mar 21, 2026
7 checks passed
@btotharye btotharye deleted the feat/pre-commit branch March 21, 2026 11:44