llm/skills: add skill for debugging buildkite/ci failures

aljoscha · aljoscha · commit 5f48e2dd8e67 · 2026-02-27T15:34:42.000+01:00
diff --git a/.claude/skills/debug-ci/SKILL.md b/.claude/skills/debug-ci/SKILL.md
@@ -0,0 +1,88 @@
+---
+name: debug-ci
+description: Investigate CI failures on a PR using gh and bk CLI tools. Triggers when asked about failing checks, Buildkite failures, or CI issues on a PR.
+argument-hint: <PR number or GitHub PR URL>
+---
+
+Investigate CI failures for a Materialize PR. Requires `gh` (GitHub CLI) and `bk` (Buildkite CLI) to be installed and authenticated.
+
+**Important**: Both `gh` and `bk` make network requests that are blocked by the default sandbox. All Bash commands in this workflow must use `dangerouslyDisableSandbox: true`.
+
+## Step 1: Extract PR number
+
+Parse `$ARGUMENTS` to get the PR number. Handle both formats:
+- Plain number: `35192`
+- Full URL: `https://github.com/MaterializeInc/materialize/pull/35192`
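+
+A minimal sketch of this parsing, assuming the argument is either a bare number or a URL containing `/pull/<number>`. The helper name `parse_pr_number` is hypothetical and not part of `gh` or `bk`:
+
+```bash
+# Hypothetical helper: print the PR number for a bare number or a PR URL.
+parse_pr_number() {
+  arg="$1"
+  # Strip everything up to and including "/pull/", if present.
+  num="${arg##*/pull/}"
+  # Drop any trailing non-numeric suffix (e.g. "/files" or "#issuecomment-...").
+  num="${num%%[!0-9]*}"
+  if [ -n "$num" ]; then
+    printf '%s\n' "$num"
+  else
+    echo "unrecognized PR reference: $arg" >&2
+    return 1
+  fi
+}
+```
+
+For example, `parse_pr_number "$ARGUMENTS"` prints `35192` for either format listed above.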
+
+## Step 2: List failing checks
+
+```bash
+gh pr checks <PR_NUMBER> 2>&1
+```
+
+Filter the output to lines containing `fail`. Each line has tab-separated fields (name, status, elapsed time, URL, description):
+```
+name	fail	0	https://buildkite.com/materialize/<PIPELINE>/builds/<BUILD>#<JOB_ID>	description
+```
+
+Extract from the URL:
+- **Pipeline**: path segment after `materialize/` (usually `test`)
+- **Build number**: the number after `builds/`
+- **Job ID**: the UUID after `#`
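+
+The filtering and extraction can be sketched with `awk`, assuming the tab-separated layout and URL shape shown above. The helper name `failing_jobs` is hypothetical, and it matches the status field exactly rather than searching the whole line for `fail`:
+
+```bash
+# Hypothetical helper: read `gh pr checks` output on stdin and print
+# name, pipeline, build number, and job ID (tab-separated) for each
+# failing check whose URL points at Buildkite.
+failing_jobs() {
+  awk -F'\t' '$2 == "fail" && $4 ~ /buildkite\.com/ {
+    url = $4
+    job = ""
+    hash = index(url, "#")
+    if (hash > 0) {
+      job = substr(url, hash + 1)      # job ID after "#"
+      url = substr(url, 1, hash - 1)
+    }
+    # Split on "/": https:, "", buildkite.com, materialize, <PIPELINE>, builds, <BUILD>
+    split(url, parts, "/")
+    printf "%s\t%s\t%s\t%s\n", $1, parts[5], parts[7], job
+  }'
+}
+```
+
+Usage: `gh pr checks <PR_NUMBER> 2>&1 | failing_jobs`.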
+
+## Step 3: Fetch logs in triage order
+
+Fetch logs starting with the jobs most likely to reveal root causes:
+
+1. **clippy** — compilation/lint errors that often explain everything
+2. **lint-and-rustfmt** — formatting and lint-check failures
+3. **cargo-test** — unit/integration test failures
+4. **fast-sql-logic-tests** — SLT failures
+5. **testdrive** — integration test failures (often cascading)
+6. **Everything else** (checks-parallel, cluster-tests, dbt, etc.)
+
+To fetch a job's log:
+```bash
+bk job log <JOB_ID> -p <PIPELINE> -b <BUILD_NUMBER> --no-timestamps --no-pager 2>&1 | tail -100
+```
+
+For large logs, first grep for errors to find the relevant section:
+```bash
+bk job log <JOB_ID> -p <PIPELINE> -b <BUILD_NUMBER> --no-timestamps --no-pager 2>&1 | grep -E -B2 -A5 'error|FAIL|panicked'
+```
+
+Fetch multiple job logs in parallel when they are independent (e.g., clippy + lint at the same time).
+
+## Step 4: Categorize failures
+
+Use these Materialize-specific patterns to diagnose:
+
+### Clippy errors
+Code lint issues in changed files. Common ones: `as_conversions`, `needless_borrow`, `clone_on_ref_ptr`. Fix the code, not the lint config.
+
+### `check-test-flags` lint failure
+A new configuration flag was introduced but not registered in the required places:
+- `misc/python/materialize/parallel_workload/action.py` (FlipFlagsAction)
+- `misc/python/materialize/mzcompose/__init__.py` (get_variable_system_parameters / get_minimal_system_parameters / UNINTERESTING_SYSTEM_PARAMETERS)
+
+### Cargo test failures
+Read the panic message or assertion diff. Common patterns:
+- `unwrap_err() on Ok` → test expected an error but the code now succeeds
+- `assertion left == right failed` → behavioral change in output
+
+### Testdrive cascades
+After one test crashes environmentd, all subsequent tests in that shard fail with `Name or service not known` or `connection closed`. **Only the first failure in a shard matters** — everything after it is a cascade. Look for the first `error:` or `FAIL` in the log.
+
+Testdrive shards with the same number (e.g., `testdrive-10` and `testdrive-with-alloydb-10`) run the same tests — if both fail, it's the same root cause.
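+
+To locate the first failure in a shard's log, a small filter along these lines may help. It is a sketch that assumes the `error:` and `FAIL` markers mentioned above, and the helper name `first_failure` is hypothetical:
+
+```bash
+# Hypothetical helper: print the first line on stdin that matches
+# `error:` or `FAIL`, prefixed with its line number, then stop reading.
+first_failure() {
+  grep -n -m1 -E 'error:|FAIL'
+}
+```
+
+Usage: pipe the output of the `bk job log` command from Step 3 into `first_failure`, then read the log around the reported line number.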
+
+### SLT failures
+Check whether the failure is wrong output (behavioral change) or a connection error (crash/timeout). Wrong output means the query semantics changed.
+
+## Step 5: Summarize
+
+Group failures by **root cause**, not by job name. Typically, many failing jobs share just one or two root causes. Present the summary as:
+
+1. **Root cause A** — description, which jobs it affects, what to fix
+2. **Root cause B** — description, which jobs it affects, what to fix
+
+Distinguish issues that are clearly caused by the PR's changes from pre-existing flaky tests.