A production-grade, job-level CI observability engine for GitHub Actions with alerting, notifications, and a centralized dashboard for monitoring CI health across workflows.
CI Sentinel addresses a critical gap in GitHub Actions observability: while GitHub shows whether workflows pass or fail, it doesn't expose job-level health patterns, architecture-specific failures, or systemic CI degradation. This tool ingests workflow run data, normalizes job states, computes metrics, and provides proactive alerting for CI regressions.
┌─────────────────────────────────────────────────────────────────────────────┐
│                                 CI SENTINEL                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────────┐    ┌──────────────┐                 │
│  │   GitHub    │───▶│   Ingestion     │───▶│   Storage    │                 │
│  │   Actions   │    │   Engine        │    │   (SQLite)   │                 │
│  │   API       │    │                 │    │              │                 │
│  └─────────────┘    └─────────────────┘    └──────┬───────┘                 │
│                                                   │                         │
│                     ┌─────────────────┐           │                         │
│                     │  Normalization  │◀──────────┘                         │
│                     │      Layer      │                                     │
│                     └────────┬────────┘                                     │
│                              │                                              │
│         ┌────────────────────┼────────────────────┐                         │
│         ▼                    ▼                    ▼                         │
│  ┌──────────────┐     ┌──────────────┐    ┌───────────────┐                 │
│  │   Metrics    │     │   Alerting   │    │   API Layer   │                 │
│  │   Engine     │     │   Engine     │    │  (HTTP/JSON)  │                 │
│  │              │     │              │    │               │                 │
│  └──────────────┘     └──────┬───────┘    └───────┬───────┘                 │
│                              │                    │                         │
│                              ▼                    ▼                         │
│                       ┌──────────────┐    ┌───────────────┐                 │
│                       │ Notification │    │   Dashboard   │                 │
│                       │    Sinks     │    │      UI       │                 │
│                       │              │    │               │                 │
│                       │  - Slack     │    └───────────────┘                 │
│                       │  - Webhook   │                                      │
│                       │  - Log       │                                      │
│                       └──────────────┘                                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Job-Level Analysis: Focuses on individual job results rather than workflow-level rollups, exposing failures in matrix builds and non-blocking jobs.
Architecture Divergence Detection: Identifies when specific variants (e.g., arm64, amd64, Windows) fail while others pass in the same run.
Health Metrics: Computes pass rate, flakiness score, and stability indicators over a configurable sliding window.
PR Attribution: Links failures to merged pull requests using GitHub's event metadata and commit lookup.
Nightly Failure Detection: Automatically detects when nightly/scheduled workflows fail after previously passing.
Regression Alerts: Triggers critical alerts when mainline branches (main/master) regress from pass to fail.
Sustained Failure Tracking: Warns when workflows fail multiple consecutive times.
Architecture Divergence Alerts: Notifies when jobs in the same logical group have inconsistent status across variants.
Webhook: POST JSON payloads to any HTTP endpoint for custom integrations.
Slack: Formatted messages with severity-based color coding and actionable links.
Log: Stdout logging for development and debugging.
Overview: Aggregated health of all tracked workflows with category and status filters.
Nightly Status: Dedicated view for scheduled/nightly workflows with regression highlighting.
Alerts: Active and historical alerts with acknowledgment workflow.
Workflow Detail: Deep-dive into individual workflow history, runs, and job matrix.
Run Detail: Job-level breakdown with architecture grouping and divergence visualization.
- Go 1.21 or higher
- GitHub personal access token with `repo` and `workflow` scopes
git clone https://github.com/your-username/ci-sentinel.git
cd ci-sentinel
go build -o ci-sentinel cmd/ci-sentinel/main.go

Create a config.yaml file:
owner: your-org
repo: your-repo
database: sentinel.db
user_agent: ci-sentinel-v1.0
ingestion:
  lookback_days: 30
  rate_limit_buffer: 100
alerts:
  cooldown_minutes: 60
  flaky_threshold: 3
  consecutive_failure_threshold: 3
notifications:
  sinks:
    - type: slack
      webhook_url: ${SLACK_WEBHOOK_URL}
      channel: "#ci-alerts"
    - type: webhook
      url: https://your-webhook-endpoint.com/ci
      headers:
        Authorization: "Bearer ${WEBHOOK_TOKEN}"
    - type: log
dashboard:
  port: 8080
  default_window: 10

Set your GitHub token:
export GITHUB_TOKEN=your_github_token

Fetch and store workflow run data from GitHub:
./ci-sentinel ingest --config config.yaml

This command also evaluates alert conditions and dispatches notifications for any triggered alerts.
Generate health metrics for a workflow:
./ci-sentinel analyze --config config.yaml --workflow <WORKFLOW_ID> --window 10

Output includes:
- Pass rate percentage
- Flakiness score (status transitions)
- Architecture divergence warnings
- Recent run status
Launch the web interface:
./ci-sentinel serve --config config.yaml --port 8080

Access at http://localhost:8080
Set workflow category and notification preferences:
./ci-sentinel configure --config config.yaml --workflow <WORKFLOW_ID> \
  --category nightly \
  --priority 1 \
  --notify

Categories: nightly, pr, release, other
List active alerts:
./ci-sentinel alerts --config config.yaml --unacknowledged

Acknowledge an alert:
./ci-sentinel ack --config config.yaml <ALERT_ID>

| Endpoint | Method | Description |
|---|---|---|
| `/api/workflows` | GET | List all workflows with health metrics |
| `/api/workflows?category=nightly` | GET | Filter by category |
| `/api/workflows?status=failing` | GET | Filter by status (failing/flaky/healthy) |
| `/api/workflow/{id}` | GET | Workflow detail with runs and config |
| `/api/nightly` | GET | Nightly workflows with regression detection |
| `/api/alerts` | GET | List alerts |
| `/api/alerts?unacknowledged=true` | GET | Unacknowledged alerts only |
| `/api/alerts/{id}` | GET | Single alert detail |
| `/api/alerts/{id}/ack` | POST | Acknowledge an alert |
| `/api/run/{id}` | GET | Run detail with job matrix |
| `/api/trends/{id}` | GET | Daily metrics for trend analysis |
| `/api/health` | GET | Health check endpoint |
Ingestion Service: Fetches workflows, runs, and jobs from GitHub Actions API with rate limit awareness.
Metrics Engine: Computes health statistics using deterministic heuristics.
Alerting Engine: Evaluates runs against alert conditions with cooldown and deduplication.
Notification Dispatcher: Routes alerts to configured sinks (Slack, Webhook, Log).
Storage Layer: SQLite database with schema for workflows, runs, jobs, alerts, and metrics.
API Server: HTTP endpoints serving dashboard and metrics data.
Workflows: Repository CI workflows with category classification (nightly/pr/release).
Workflow Runs: Individual executions triggered by push, PR, or schedule events.
Jobs: Atomic units of work within a run, including matrix variants with logical grouping.
Alerts: Generated notifications with severity, type, and acknowledgment state.
Daily Metrics: Precomputed aggregations for trend visualization.
| Type | Trigger | Severity |
|---|---|---|
| `nightly_failure` | Nightly workflow fails after passing | Critical |
| `regression` | Mainline branch (main/master) regresses | Critical |
| `sustained_failure` | 3+ consecutive failures | Warning |
| `divergence` | Architecture variants have inconsistent status | Warning |
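The run-history triggers in the table above can be sketched as a small rule function. This is only an illustration under stated assumptions: `Evaluate` and the `Alert` struct are hypothetical names, "after passing" is taken to mean that the immediately preceding run passed, and divergence is left out because it depends on job data rather than run history.

```go
package main

import "fmt"

// State is a normalized run state.
type State string

const (
	Pass State = "pass"
	Fail State = "fail"
	Skip State = "skip"
)

// Alert pairs an alert type with its severity, as listed in the table.
type Alert struct {
	Type     string
	Severity string
}

// Evaluate inspects a workflow's run history (oldest first) and returns the
// alerts triggered by the latest run. threshold is the number of consecutive
// failures required for a sustained_failure alert.
func Evaluate(category, branch string, history []State, threshold int) []Alert {
	n := len(history)
	if n == 0 || history[n-1] != Fail {
		return nil // only a failing latest run can trigger these alerts
	}
	var alerts []Alert
	// Assumption: a pass-to-fail transition means the previous run passed.
	passToFail := n >= 2 && history[n-2] == Pass
	if passToFail && category == "nightly" {
		alerts = append(alerts, Alert{"nightly_failure", "critical"})
	}
	if passToFail && (branch == "main" || branch == "master") {
		alerts = append(alerts, Alert{"regression", "critical"})
	}
	// Count the trailing streak of failures.
	streak := 0
	for i := n - 1; i >= 0 && history[i] == Fail; i-- {
		streak++
	}
	if streak >= threshold {
		alerts = append(alerts, Alert{"sustained_failure", "warning"})
	}
	return alerts
}

func main() {
	fmt.Println(Evaluate("nightly", "main", []State{Pass, Pass, Fail}, 3))
	fmt.Println(Evaluate("pr", "feature-x", []State{Fail, Fail, Fail}, 3))
}
```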
- Deduplication: One alert per (workflow, run, type) combination
- Cooldown: Configurable minimum interval between repeat alerts
- Suppression: Manual rules to silence known issues
- Priority: Critical alerts sorted before warnings
GitHub Actions statuses are normalized to three states:
- Pass: `success`
- Fail: `failure`, `timed_out`, `action_required`
- Skip: `cancelled`, `skipped`, `neutral`
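The mapping above translates directly into a switch statement. In this sketch the `State` type and `Normalize` function are illustrative names, and treating any unlisted conclusion as Skip is an assumption rather than documented behavior.

```go
package main

import "fmt"

// State is one of the three normalized job states.
type State string

const (
	Pass State = "pass"
	Fail State = "fail"
	Skip State = "skip"
)

// Normalize maps a GitHub Actions conclusion string to a normalized State.
func Normalize(conclusion string) State {
	switch conclusion {
	case "success":
		return Pass
	case "failure", "timed_out", "action_required":
		return Fail
	case "cancelled", "skipped", "neutral":
		return Skip
	default:
		// Assumption: conclusions outside the documented mapping are
		// treated as Skip so they do not count as failures.
		return Skip
	}
}

func main() {
	for _, c := range []string{"success", "timed_out", "cancelled"} {
		fmt.Printf("%s -> %s\n", c, Normalize(c))
	}
}
```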
Pass Rate: Percentage of passing runs in the analysis window
Flakiness: Number of status transitions between consecutive runs
Stability: Workflow is marked unstable if pass rate is between 40% and 90%
Architecture Divergence: Detected when jobs in the same logical group have different statuses across variants
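A minimal Go sketch of these four definitions follows. `ComputeHealth` and `HasDivergence` are hypothetical names. Two details are assumptions of the sketch: every run in the window counts toward the pass-rate denominator (including skipped runs), and the 40% and 90% bounds are treated as inclusive.

```go
package main

import "fmt"

// State is a normalized run or job state.
type State string

const (
	Pass State = "pass"
	Fail State = "fail"
	Skip State = "skip"
)

// Health summarizes a window of runs.
type Health struct {
	PassRate  float64 // percentage of passing runs, 0 to 100
	Flakiness int     // status transitions between consecutive runs
	Unstable  bool    // pass rate between 40% and 90%
}

// ComputeHealth evaluates a window of run states ordered oldest first.
func ComputeHealth(window []State) Health {
	if len(window) == 0 {
		return Health{}
	}
	passes, transitions := 0, 0
	for i, s := range window {
		if s == Pass {
			passes++
		}
		if i > 0 && s != window[i-1] {
			transitions++
		}
	}
	rate := 100 * float64(passes) / float64(len(window))
	return Health{
		PassRate:  rate,
		Flakiness: transitions,
		Unstable:  rate >= 40 && rate <= 90, // inclusive bounds are assumed
	}
}

// HasDivergence reports whether the variants of one logical job group
// (for example arm64 and amd64) ended in different states.
func HasDivergence(variants map[string]State) bool {
	var first State
	seen := false
	for _, s := range variants {
		if !seen {
			first, seen = s, true
			continue
		}
		if s != first {
			return true
		}
	}
	return false
}

func main() {
	h := ComputeHealth([]State{Pass, Fail, Pass, Pass, Fail})
	fmt.Printf("pass rate %.1f%%, flakiness %d, unstable %t\n", h.PassRate, h.Flakiness, h.Unstable)
	fmt.Println(HasDivergence(map[string]State{"amd64": Pass, "arm64": Fail}))
}
```

For the window pass, fail, pass, pass, fail, three of five runs pass (60%) and the state changes three times, so the workflow is reported as unstable with a flakiness of 3.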
docker build -t ci-sentinel .
docker run -p 8080:8080 -v $(pwd)/config.yaml:/config.yaml -e GITHUB_TOKEN=$GITHUB_TOKEN ci-sentinel serve

Deploy using the provided manifests:
kubectl apply -f k8s/

Configuration:
- `deployment.yaml`: API server deployment
- `cronjob.yaml`: Scheduled ingestion
- `configmap.yaml`: Configuration values
Nightly CI Monitoring: Track mainline health separate from PR noise
Matrix Build Validation: Ensure all platform variants pass consistently
Flakiness Detection: Identify workflows and jobs that flip between pass and fail states
Release Readiness: Verify CI stability before cutting releases
Backend-First Architecture: UI is a thin consumer of the metrics engine.
No Log Ingestion: Focuses on metadata and state, not log content. Links to GitHub for log access.
Deterministic Logic: No ML or probabilistic models, only rule-based heuristics.
API Safety: Respects GitHub rate limits with caching and incremental updates.
Signal Over Noise: Nightly failures are prioritized over PR failures; cooldowns prevent alert spam.
- Single repository per instance (multi-repo planned)
- Polling-based updates (no real-time streaming)
- No workflow re-execution capabilities
- No log analysis or test result parsing
| Section | Option | Description | Default |
|---|---|---|---|
| Root | `owner` | GitHub repository owner | Required |
| Root | `repo` | GitHub repository name | Required |
| Root | `database` | SQLite database file path | sentinel.db |
| Root | `window` | Default runs for health calculation | 10 |
| `ingestion` | `lookback_days` | Days of history to fetch | 30 |
| `ingestion` | `rate_limit_buffer` | Reserved API calls | 100 |
| `alerts` | `cooldown_minutes` | Minimum time between repeat alerts | 60 |
| `alerts` | `flaky_threshold` | Transitions to mark as flaky | 3 |
| `alerts` | `consecutive_failure_threshold` | Failures for sustained alert | 3 |
| `notifications.sinks[]` | `type` | Sink type: slack, webhook, log | - |
| `dashboard` | `port` | HTTP server port | 8080 |
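The defaults in the table above could be applied after loading with a helper like the one below. The `Config` struct layout, `ApplyDefaults`, and `Validate` are hypothetical, YAML parsing is omitted, and treating a zero or empty value as "unset" is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Config mirrors the options in the table; the struct layout is illustrative.
type Config struct {
	Owner     string
	Repo      string
	Database  string
	Window    int
	Ingestion struct {
		LookbackDays    int
		RateLimitBuffer int
	}
	Alerts struct {
		CooldownMinutes             int
		FlakyThreshold              int
		ConsecutiveFailureThreshold int
	}
	Dashboard struct {
		Port int
	}
}

// setDefault fills in value when the field was left at its zero value.
func setDefault(field *int, value int) {
	if *field == 0 {
		*field = value
	}
}

// ApplyDefaults fills unset options with the defaults from the table.
func (c *Config) ApplyDefaults() {
	if c.Database == "" {
		c.Database = "sentinel.db"
	}
	setDefault(&c.Window, 10)
	setDefault(&c.Ingestion.LookbackDays, 30)
	setDefault(&c.Ingestion.RateLimitBuffer, 100)
	setDefault(&c.Alerts.CooldownMinutes, 60)
	setDefault(&c.Alerts.FlakyThreshold, 3)
	setDefault(&c.Alerts.ConsecutiveFailureThreshold, 3)
	setDefault(&c.Dashboard.Port, 8080)
}

// Validate checks the options marked Required in the table.
func (c *Config) Validate() error {
	if c.Owner == "" || c.Repo == "" {
		return errors.New("owner and repo are required")
	}
	return nil
}

func main() {
	c := Config{Owner: "your-org", Repo: "your-repo"}
	c.Alerts.CooldownMinutes = 15 // explicit values are preserved
	c.ApplyDefaults()
	if err := c.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%+v\n", c)
}
```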
ci-sentinel/
├── cmd/ci-sentinel/ CLI entry point
├── internal/
│ ├── alerting/ Alert detection engine
│ ├── config/ Configuration loading
│ ├── github/ GitHub API client
│ ├── ingest/ Data ingestion service
│ ├── metrics/ Health computation engine
│ ├── notify/ Notification sinks (Slack, Webhook, Log)
│ ├── server/ HTTP server and API
│ │ └── static/ Dashboard UI assets
│ └── storage/ SQLite database layer
├── k8s/ Kubernetes manifests
└── Dockerfile Container image definition
go test ./...
go build -o ci-sentinel cmd/ci-sentinel/main.go