Vaibhav701161/CI-Centinal

CI Sentinel

A production-grade, job-level CI observability engine for GitHub Actions with alerting, notifications, and a centralized dashboard for monitoring CI health across workflows.

Overview

CI Sentinel addresses a critical gap in GitHub Actions observability: while GitHub shows whether workflows pass or fail, it doesn't expose job-level health patterns, architecture-specific failures, or systemic CI degradation. This tool ingests workflow run data, normalizes job states, computes metrics, and provides proactive alerting for CI regressions.

┌─────────────────────────────────────────────────────────────────────────────┐
│                              CI SENTINEL                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────────┐    ┌──────────────┐                │
│  │   GitHub    │───▶│   Ingestion     │───▶│   Storage    │                │
│  │  Actions    │    │    Engine       │    │   (SQLite)   │                │
│  │    API      │    │                 │    │              │                │
│  └─────────────┘    └─────────────────┘    └──────┬───────┘                │
│                                                    │                        │
│                     ┌─────────────────┐           │                        │
│                     │  Normalization  │◀──────────┘                        │
│                     │     Layer       │                                     │
│                     └────────┬────────┘                                     │
│                              │                                              │
│         ┌────────────────────┼────────────────────┐                        │
│         ▼                    ▼                    ▼                        │
│  ┌──────────────┐    ┌──────────────┐    ┌───────────────┐                 │
│  │   Metrics    │    │   Alerting   │    │   API Layer   │                 │
│  │   Engine     │    │    Engine    │    │   (HTTP/JSON) │                 │
│  │              │    │              │    │               │                 │
│  └──────────────┘    └──────┬───────┘    └───────┬───────┘                 │
│                             │                    │                          │
│                             ▼                    ▼                          │
│                      ┌──────────────┐    ┌───────────────┐                 │
│                      │ Notification │    │   Dashboard   │                 │
│                      │   Sinks      │    │      UI       │                 │
│                      │              │    │               │                 │
│                      │ - Slack      │    └───────────────┘                 │
│                      │ - Webhook    │                                       │
│                      │ - Log        │                                       │
│                      └──────────────┘                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Key Features

Core Capabilities

Job-Level Analysis: Focuses on individual job results rather than workflow-level rollups, exposing failures in matrix builds and non-blocking jobs.

Architecture Divergence Detection: Identifies when specific variants (e.g., arm64, amd64, Windows) fail while others pass in the same run.

Health Metrics: Computes pass rate, flakiness score, and stability indicators over a configurable sliding window.

PR Attribution: Links failures to merged pull requests using GitHub's event metadata and commit lookup.

Alerting System

Nightly Failure Detection: Automatically detects when nightly/scheduled workflows fail after previously passing.

Regression Alerts: Triggers critical alerts when mainline branches (main/master) regress from pass to fail.

Sustained Failure Tracking: Warns when workflows fail multiple consecutive times.

Architecture Divergence Alerts: Notifies when jobs in the same logical group have inconsistent status across variants.

Notification Channels

Webhook: POST JSON payloads to any HTTP endpoint for custom integrations.

Slack: Formatted messages with severity-based color coding and actionable links.

Log: Stdout logging for development and debugging.

Dashboard Views

Overview: Aggregated health of all tracked workflows with category and status filters.

Nightly Status: Dedicated view for scheduled/nightly workflows with regression highlighting.

Alerts: Active and historical alerts with acknowledgment workflow.

Workflow Detail: Deep-dive into individual workflow history, runs, and job matrix.

Run Detail: Job-level breakdown with architecture grouping and divergence visualization.

Installation

Prerequisites

  • Go 1.21 or higher
  • GitHub personal access token with repo and workflow scopes

Build

git clone https://github.com/your-username/ci-sentinel.git
cd ci-sentinel
go build -o ci-sentinel ./cmd/ci-sentinel

Configuration

Create a config.yaml file:

owner: your-org
repo: your-repo
database: sentinel.db
user_agent: ci-sentinel-v1.0

ingestion:
  lookback_days: 30
  rate_limit_buffer: 100

alerts:
  cooldown_minutes: 60
  flaky_threshold: 3
  consecutive_failure_threshold: 3

notifications:
  sinks:
    - type: slack
      webhook_url: ${SLACK_WEBHOOK_URL}
      channel: "#ci-alerts"
    - type: webhook
      url: https://your-webhook-endpoint.com/ci
      headers:
        Authorization: "Bearer ${WEBHOOK_TOKEN}"
    - type: log

dashboard:
  port: 8080
  default_window: 10

Set your GitHub token:

export GITHUB_TOKEN=your_github_token
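The `${SLACK_WEBHOOK_URL}` and `${WEBHOOK_TOKEN}` placeholders in the sample config suggest that values are filled in from the environment. How CI Sentinel does this is not stated here; one plausible approach, sketched below with the standard library's `os.Expand`, substitutes each placeholder and reports any variable that is unset. The `expandConfig` helper is hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// expandConfig replaces ${VAR} (and $VAR) placeholders in raw config text
// using lookup, and returns the names of variables that were not set.
// This is an illustrative helper, not part of CI Sentinel.
func expandConfig(raw string, lookup func(string) (string, bool)) (string, []string) {
	var missing []string
	out := os.Expand(raw, func(name string) string {
		if v, ok := lookup(name); ok {
			return v
		}
		missing = append(missing, name)
		return ""
	})
	return out, missing
}

func main() {
	raw := "webhook_url: ${SLACK_WEBHOOK_URL}\n"
	out, missing := expandConfig(raw, os.LookupEnv)
	fmt.Print(out)
	if len(missing) > 0 {
		fmt.Fprintln(os.Stderr, "unset variables:", missing)
	}
}
```

Reporting unset variables up front avoids silently sending alerts to an empty URL.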

Usage

Ingest Data

Fetch and store workflow run data from GitHub:

./ci-sentinel ingest --config config.yaml

This command also evaluates alert conditions and dispatches notifications for any triggered alerts.

Analyze CI Health

Generate health metrics for a workflow:

./ci-sentinel analyze --config config.yaml --workflow <WORKFLOW_ID> --window 10

Output includes:

  • Pass rate percentage
  • Flakiness score (status transitions)
  • Architecture divergence warnings
  • Recent run status

Start Dashboard

Launch the web interface:

./ci-sentinel serve --config config.yaml --port 8080

The dashboard is then available at http://localhost:8080 (or on whichever port was passed to --port).

Configure Workflows

Set workflow category and notification preferences:

./ci-sentinel configure --config config.yaml --workflow <WORKFLOW_ID> \
  --category nightly \
  --priority 1 \
  --notify

Categories: nightly, pr, release, other

Manage Alerts

List active alerts:

./ci-sentinel alerts --config config.yaml --unacknowledged

Acknowledge an alert:

./ci-sentinel ack --config config.yaml <ALERT_ID>

API Endpoints

Endpoint                          Method  Description
/api/workflows                    GET     List all workflows with health metrics
/api/workflows?category=nightly   GET     Filter by category
/api/workflows?status=failing     GET     Filter by status (failing/flaky/healthy)
/api/workflow/{id}                GET     Workflow detail with runs and config
/api/nightly                      GET     Nightly workflows with regression detection
/api/alerts                       GET     List alerts
/api/alerts?unacknowledged=true   GET     Unacknowledged alerts only
/api/alerts/{id}                  GET     Single alert detail
/api/alerts/{id}/ack              POST    Acknowledge an alert
/api/run/{id}                     GET     Run detail with job matrix
/api/trends/{id}                  GET     Daily metrics for trend analysis
/api/health                       GET     Health check endpoint
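As a rough illustration of calling these endpoints from Go, the sketch below builds a filtered `/api/workflows` URL and acknowledges an alert with a POST. The paths and query parameters come from the table above; the helper names, the example alert ID, and the lack of a request body are assumptions, and response bodies are not decoded because their schema is not documented here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// workflowsURL builds the /api/workflows URL with optional filters.
// Empty arguments are omitted from the query string.
func workflowsURL(base, category, status string) string {
	q := url.Values{}
	if category != "" {
		q.Set("category", category)
	}
	if status != "" {
		q.Set("status", status)
	}
	u := strings.TrimRight(base, "/") + "/api/workflows"
	if enc := q.Encode(); enc != "" {
		u += "?" + enc
	}
	return u
}

// ackAlert sends POST /api/alerts/{id}/ack and treats any non-2xx
// status as an error.
func ackAlert(client *http.Client, base, id string) error {
	u := fmt.Sprintf("%s/api/alerts/%s/ack", strings.TrimRight(base, "/"), url.PathEscape(id))
	resp, err := client.Post(u, "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("ack %s: unexpected status %s", id, resp.Status)
	}
	return nil
}

func main() {
	base := "http://localhost:8080" // assumes `ci-sentinel serve` is running locally
	fmt.Println(workflowsURL(base, "nightly", "failing"))

	client := &http.Client{Timeout: 5 * time.Second}
	if err := ackAlert(client, base, "42"); err != nil { // "42" is a made-up alert ID
		fmt.Println("ack failed:", err)
	}
}
```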

Architecture

Components

Ingestion Service: Fetches workflows, runs, and jobs from GitHub Actions API with rate limit awareness.

Metrics Engine: Computes health statistics using deterministic heuristics.

Alerting Engine: Evaluates runs against alert conditions with cooldown and deduplication.

Notification Dispatcher: Routes alerts to configured sinks (Slack, Webhook, Log).

Storage Layer: SQLite database with schema for workflows, runs, jobs, alerts, and metrics.

API Server: HTTP endpoints serving dashboard and metrics data.

Data Model

Workflows: Repository CI workflows with category classification (nightly, pr, release, or other).

Workflow Runs: Individual executions triggered by push, PR, or schedule events.

Jobs: Atomic units of work within a run, including matrix variants with logical grouping.

Alerts: Generated notifications with severity, type, and acknowledgment state.

Daily Metrics: Precomputed aggregations for trend visualization.
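The "logical grouping" of matrix variants mentioned under Jobs can be approximated from job names alone. By default GitHub names matrix jobs as the job name followed by the matrix values in parentheses, for example `build (ubuntu-latest, arm64)`, unless the workflow overrides `name`. The sketch below splits such a name into a group and a variant; `splitJobName` is a hypothetical helper, and the project may group jobs differently.

```go
package main

import (
	"fmt"
	"strings"
)

// splitJobName separates a job name of the form "group (variant)" into its
// two parts. Names without a trailing parenthesised suffix are returned
// unchanged with an empty variant.
func splitJobName(name string) (group, variant string) {
	name = strings.TrimSpace(name)
	if strings.HasSuffix(name, ")") {
		if i := strings.LastIndex(name, " ("); i > 0 {
			return name[:i], name[i+2 : len(name)-1]
		}
	}
	return name, ""
}

func main() {
	for _, n := range []string{"build (ubuntu-latest, arm64)", "build (windows-latest, amd64)", "lint"} {
		g, v := splitJobName(n)
		fmt.Printf("group=%q variant=%q\n", g, v)
	}
}
```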

Alert Types

Type               Trigger                                         Severity
nightly_failure    Nightly workflow fails after passing            Critical
regression         Mainline branch (main/master) regresses         Critical
sustained_failure  3+ consecutive failures                         Warning
divergence         Architecture variants have inconsistent status  Warning

Alert Fatigue Prevention

  • Deduplication: One alert per (workflow, run, type) combination
  • Cooldown: Configurable minimum interval between repeat alerts
  • Suppression: Manual rules to silence known issues
  • Priority: Critical alerts sorted before warnings

Status Normalization

GitHub Actions run and job conclusions are normalized to three states:

  • Pass: success
  • Fail: failure, timed_out, action_required
  • Skip: cancelled, skipped, neutral
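The mapping above translates directly into a lookup function. In this sketch the `State` type and `normalize` function are hypothetical names; values outside the list (for example, the empty value of a run that is still in progress) are reported as unknown rather than guessed.

```go
package main

import "fmt"

// State is the normalized outcome of a run or job.
type State int

const (
	Pass State = iota
	Fail
	Skip
)

func (s State) String() string {
	switch s {
	case Pass:
		return "pass"
	case Fail:
		return "fail"
	default:
		return "skip"
	}
}

// normalize maps a GitHub value to one of the three states.
// ok is false for values that are not in the documented mapping.
func normalize(conclusion string) (state State, ok bool) {
	switch conclusion {
	case "success":
		return Pass, true
	case "failure", "timed_out", "action_required":
		return Fail, true
	case "cancelled", "skipped", "neutral":
		return Skip, true
	}
	return Skip, false
}

func main() {
	for _, c := range []string{"success", "timed_out", "cancelled", ""} {
		s, ok := normalize(c)
		fmt.Printf("%q -> %v (known=%v)\n", c, s, ok)
	}
}
```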

Health Metrics

Pass Rate: Percentage of passing runs in the analysis window

Flakiness: Number of status transitions between consecutive runs

Stability: Workflow is marked unstable if pass rate is between 40% and 90%
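The three definitions above can be computed in one pass over the analysis window. The sketch below is an assumption-laden illustration, not the project's metrics engine: `computeHealth` is a hypothetical name, skipped runs are assumed to have been filtered out already, the 40% and 90% bounds are treated as inclusive, and a workflow is called flaky when its transition count reaches `flaky_threshold`.

```go
package main

import "fmt"

// Health summarizes one analysis window.
type Health struct {
	PassRate    float64 // percentage of passing runs, 0 to 100
	Transitions int     // pass<->fail flips between consecutive runs
	Flaky       bool
	Unstable    bool
}

// computeHealth takes run outcomes ordered oldest to newest (true = pass).
func computeHealth(passed []bool, flakyThreshold int) Health {
	n := len(passed)
	if n == 0 {
		return Health{}
	}
	passes := 0
	transitions := 0
	for i, p := range passed {
		if p {
			passes++
		}
		if i > 0 && p != passed[i-1] {
			transitions++
		}
	}
	rate := 100 * float64(passes) / float64(n)
	return Health{
		PassRate:    rate,
		Transitions: transitions,
		Flaky:       transitions >= flakyThreshold,
		Unstable:    rate >= 40 && rate <= 90,
	}
}

func main() {
	window := []bool{true, false, true, true, false, true, true, true, true, true}
	fmt.Printf("%+v\n", computeHealth(window, 3))
}
```

For the ten-run window in `main`, eight runs pass and the outcome flips four times, so the workflow would be reported as both flaky and unstable under these assumptions.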

Architecture Divergence: Detected when jobs in the same logical group have different statuses across variants
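Divergence detection reduces to checking, per logical group, whether more than one distinct status appears among its variants. The following sketch uses a hypothetical `Job` struct with string states and assumes skipped jobs are ignored; the real data model may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is one matrix variant of a logical job group within a single run.
type Job struct {
	Group   string // e.g. "build"
	Variant string // e.g. "arm64"
	State   string // "pass", "fail", or "skip"
}

// divergentGroups returns the sorted names of groups whose variants
// finished with more than one distinct non-skip state.
func divergentGroups(jobs []Job) []string {
	states := map[string]map[string]bool{}
	for _, j := range jobs {
		if j.State == "skip" {
			continue // assumption: skipped variants do not count as divergence
		}
		if states[j.Group] == nil {
			states[j.Group] = map[string]bool{}
		}
		states[j.Group][j.State] = true
	}
	var out []string
	for group, seen := range states {
		if len(seen) > 1 {
			out = append(out, group)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	jobs := []Job{
		{"build", "amd64", "pass"},
		{"build", "arm64", "fail"},
		{"test", "amd64", "pass"},
		{"test", "arm64", "skip"},
	}
	fmt.Println(divergentGroups(jobs))
}
```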

Deployment

Docker

docker build -t ci-sentinel .
docker run -p 8080:8080 -v "$(pwd)/config.yaml:/config.yaml" -e GITHUB_TOKEN="$GITHUB_TOKEN" ci-sentinel serve --config /config.yaml

Kubernetes

Deploy using the provided manifests:

kubectl apply -f k8s/

The k8s/ directory contains:

  • deployment.yaml: API server deployment
  • cronjob.yaml: Scheduled ingestion
  • configmap.yaml: Configuration values

Use Cases

Nightly CI Monitoring: Track mainline health separately from PR noise

Matrix Build Validation: Ensure all platform variants pass consistently

Flakiness Detection: Identify workflows and jobs that flip between pass and fail states

Release Readiness: Verify CI stability before cutting releases

Design Principles

Backend-First Architecture: UI is a thin consumer of the metrics engine.

No Log Ingestion: Focuses on metadata and state, not log content. Links to GitHub for log access.

Deterministic Logic: No ML or probabilistic models, only rule-based heuristics.

API Safety: Respects GitHub rate limits with caching and incremental updates.

Signal Over Noise: Nightly failures are prioritized over PR failures; cooldowns prevent alert spam.

Limitations

  • Single repository per instance (multi-repo planned)
  • Polling-based updates (no real-time streaming)
  • No workflow re-execution capabilities
  • No log analysis or test result parsing

Configuration Reference

Section                Option                         Description                          Default
Root                   owner                          GitHub repository owner              Required
Root                   repo                           GitHub repository name               Required
Root                   database                       SQLite database file path            sentinel.db
Root                   window                         Default runs for health calculation  10
ingestion              lookback_days                  Days of history to fetch             30
ingestion              rate_limit_buffer              Reserved API calls                   100
alerts                 cooldown_minutes               Minimum time between repeat alerts   60
alerts                 flaky_threshold                Transitions to mark as flaky         3
alerts                 consecutive_failure_threshold  Failures for sustained alert         3
notifications.sinks[]  type                           Sink type: slack, webhook, log       -
dashboard              port                           HTTP server port                     8080

Development

Project Structure

ci-sentinel/
├── cmd/ci-sentinel/       CLI entry point
├── internal/
│   ├── alerting/          Alert detection engine
│   ├── config/            Configuration loading
│   ├── github/            GitHub API client
│   ├── ingest/            Data ingestion service
│   ├── metrics/           Health computation engine
│   ├── notify/            Notification sinks (Slack, Webhook, Log)
│   ├── server/            HTTP server and API
│   │   └── static/        Dashboard UI assets
│   └── storage/           SQLite database layer
├── k8s/                   Kubernetes manifests
└── Dockerfile             Container image definition

Running Tests

go test ./...

Building

go build -o ci-sentinel ./cmd/ci-sentinel
