Commit 5f48e2d

llm/skills: add skill for debugging buildkite/ci failures
1 parent 8cfaf78 commit 5f48e2d

1 file changed

+88
-0
lines changed

.claude/skills/debug-ci/SKILL.md

Lines changed: 88 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,88 @@
1+
---
2+
name: debug-ci
3+
description: Investigate CI failures on a PR using gh and bk CLI tools. Triggers when asked about failing checks, Buildkite failures, or CI issues on a PR.
4+
argument-hint: <PR number or GitHub PR URL>
5+
---
6+
7+
Investigate CI failures for a Materialize PR. Requires `gh` (GitHub CLI) and `bk` (Buildkite CLI) to be installed and authenticated.
8+
9+
**Important**: Both `gh` and `bk` make network requests that are blocked by the default sandbox. All Bash commands in this workflow must use `dangerouslyDisableSandbox: true`.
10+
11+
## Step 1: Extract PR number
12+
13+
Parse `$ARGUMENTS` to get the PR number. Handle both formats:
14+
- Plain number: `35192`
15+
- Full URL: `https://github.com/MaterializeInc/materialize/pull/35192`
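
One way to normalize both input forms is sketched below. The function name `extract_pr_number` is hypothetical, and the sketch assumes a URL always contains `/pull/<number>`, optionally followed by a sub-path, query string, or fragment.

```bash
# Hypothetical helper: print the PR number for either accepted input form.
# The body runs in a subshell so its variables do not leak into the caller.
extract_pr_number() (
  arg=${1%%[?#]*}                 # drop any query string or fragment
  case "$arg" in
    */pull/*)
      arg=${arg##*/pull/}         # keep what follows "/pull/"
      arg=${arg%%/*}              # drop a trailing sub-path such as "/files"
      ;;
  esac
  case "$arg" in
    ''|*[!0-9]*)
      echo "unrecognized PR reference: $1" >&2
      exit 1
      ;;
  esac
  printf '%s\n' "$arg"
)
```

For example, `extract_pr_number "$ARGUMENTS"` could feed the result into the `gh pr checks` call in the next step.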
16+
17+
## Step 2: List failing checks
18+
19+
```bash
gh pr checks <PR_NUMBER> 2>&1
```
22+
23+
Filter the output to lines containing `fail`. Each line has tab-separated fields:
24+
```
name fail 0 https://buildkite.com/materialize/<PIPELINE>/builds/<BUILD>#<JOB_ID> description
```
27+
28+
Extract from the URL:
29+
- **Pipeline**: path segment after `materialize/` (usually `test`)
30+
- **Build number**: the number after `builds/`
31+
- **Job ID**: the UUID after `#`
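
The filtering and URL extraction described above could be combined in a small filter like the following. This is a sketch: `failing_buildkite_jobs` is a hypothetical name, and it assumes the check state is the second tab-separated field and the URL the fourth, as in the sample line.

```bash
# Hypothetical helper: read `gh pr checks` output on stdin and print one
# tab-separated line "<name> <pipeline> <build> <job_id>" per failing check.
failing_buildkite_jobs() {
  awk -F '\t' '
    $2 == "fail" && $4 ~ /buildkite\.com\/materialize\// {
      url = $4
      # Strip everything up to the pipeline: <PIPELINE>/builds/<BUILD>#<JOB_ID>
      sub(/^.*buildkite\.com\/materialize\//, "", url)
      n = split(url, parts, /\/builds\/|#/)
      if (n == 3) printf "%s\t%s\t%s\t%s\n", $1, parts[1], parts[2], parts[3]
    }'
}
```

A possible invocation is `gh pr checks <PR_NUMBER> 2>&1 | failing_buildkite_jobs`. Lines whose URL does not have the expected shape are skipped rather than guessed at.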
32+
33+
## Step 3: Fetch logs in triage order
34+
35+
Fetch logs starting with the jobs most likely to reveal root causes:
36+
37+
1. **clippy** — compilation/lint errors that often explain everything
38+
2. **lint-and-rustfmt** — formatting and lint-check failures
39+
3. **cargo-test** — unit/integration test failures
40+
4. **fast-sql-logic-tests** — SLT failures
41+
5. **testdrive** — integration test failures (often cascading)
42+
6. **Everything else** (checks-parallel, cluster-tests, dbt, etc.)
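
The triage order above can be encoded as a rank lookup, sketched here. Both function names are hypothetical, and the sketch assumes each check name contains the job label from the list (for example, a name containing `testdrive`).

```bash
# Hypothetical helper: map a job name to its triage rank (lower = fetch first).
triage_rank() {
  case "$1" in
    *clippy*)               echo 1 ;;
    *lint-and-rustfmt*)     echo 2 ;;
    *cargo-test*)           echo 3 ;;
    *fast-sql-logic-tests*) echo 4 ;;
    *testdrive*)            echo 5 ;;
    *)                      echo 6 ;;  # everything else
  esac
}

# Read job names on stdin (one per line) and print them in triage order.
sort_by_triage() {
  while IFS= read -r name; do
    printf '%s\t%s\n' "$(triage_rank "$name")" "$name"
  done | sort -n | cut -f2-
}
```

Jobs with the same rank come out in plain lexical order, which is enough for deciding what to fetch first.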
43+
44+
To fetch a job's log:
45+
```bash
bk job log <JOB_ID> -p <PIPELINE> -b <BUILD_NUMBER> --no-timestamps --no-pager 2>&1 | tail -n 100
```
48+
49+
For large logs, first grep for errors to find the relevant section:
50+
```bash
bk job log <JOB_ID> -p <PIPELINE> -b <BUILD_NUMBER> --no-timestamps --no-pager 2>&1 | grep -E -B2 -A5 'error|FAIL|panicked'
```
53+
54+
Fetch multiple job logs in parallel when they are independent (e.g., clippy + lint at the same time).
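
In a plain shell, parallel fetching might look like the sketch below, which backgrounds one fetch per job and waits for all of them. The names `bk_fetch_log`, `fetch_logs_parallel`, and the `FETCH_LOG` override are hypothetical; the `bk job log` flags are the ones shown above.

```bash
# Hypothetical wrapper around the `bk job log` invocation shown above.
bk_fetch_log() {  # args: <JOB_ID> <PIPELINE> <BUILD_NUMBER>
  bk job log "$1" -p "$2" -b "$3" --no-timestamps --no-pager 2>&1
}

# Hypothetical helper: fetch several job logs concurrently into
# <OUTDIR>/<JOB_ID>.log. Set FETCH_LOG to swap in a different fetch command.
fetch_logs_parallel() (  # args: <PIPELINE> <BUILD_NUMBER> <OUTDIR> <JOB_ID>...
  pipeline=$1
  build=$2
  outdir=$3
  shift 3
  mkdir -p "$outdir"
  for job in "$@"; do
    "${FETCH_LOG:-bk_fetch_log}" "$job" "$pipeline" "$build" > "$outdir/$job.log" &
  done
  wait  # block until every background fetch has finished
)
```

A call such as `fetch_logs_parallel <PIPELINE> <BUILD_NUMBER> ./ci-logs <JOB_ID_1> <JOB_ID_2>` leaves one log file per job, which can then be searched with the `grep` command above.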
55+
56+
## Step 4: Categorize failures
57+
58+
Use these Materialize-specific patterns to diagnose:
59+
60+
### Clippy errors
61+
Code lint issues in changed files. Common ones: `as_conversions`, `needless_borrow`, `clone_on_ref_ptr`. Fix the code, not the lint config.
62+
63+
### `check-test-flags` lint failure
64+
A new configuration flag was introduced but not registered in the required places:
65+
- `misc/python/materialize/parallel_workload/action.py` (FlipFlagsAction)
66+
- `misc/python/materialize/mzcompose/__init__.py` (get_variable_system_parameters / get_minimal_system_parameters / UNINTERESTING_SYSTEM_PARAMETERS)
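
As a quick first check, one could grep both files for the new flag's name, as sketched below. The function name `check_flag_registered` is hypothetical. The sketch assumes the flag name appears verbatim in each file and that it runs from the repository root; it does not reimplement the lint itself.

```bash
# Hypothetical helper: report which of the two files mention a flag name.
# Exits non-zero if the flag is missing from either file.
check_flag_registered() (
  flag=$1
  status=0
  for f in \
    misc/python/materialize/parallel_workload/action.py \
    misc/python/materialize/mzcompose/__init__.py
  do
    # -F treats the flag as a literal string rather than a regex.
    if grep -q -F -- "$flag" "$f" 2>/dev/null; then
      echo "found:   $f"
    else
      echo "missing: $f"
      status=1
    fi
  done
  exit "$status"
)
```

A `missing:` line points at the file that likely still needs the registration.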
67+
68+
### Cargo test failures
69+
Read the panic message or assertion diff. Common patterns:
70+
- `unwrap_err() on Ok` → test expected an error but the code now succeeds
71+
- ``assertion `left == right` failed`` → behavioral change in output
72+
73+
### Testdrive cascades
74+
After one test crashes environmentd, all subsequent tests in that shard fail with `Name or service not known` or `connection closed`. **Only the first failure in a shard matters** — everything after it is a cascade. Look for the first `error:` or `FAIL` in the log.
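
To apply the first-failure rule to a long log, a filter like this sketch prints only the first matching line plus a little context. The name `first_failure` and the default of five context lines are assumptions, not part of any existing tooling.

```bash
# Hypothetical helper: print the first line matching `error:` or `FAIL`
# (prefixed with its line number) and the next few lines, then stop.
first_failure() {  # usage: first_failure [context_lines] < log
  awk -v ctx="${1:-5}" '
    !found && /error:|FAIL/ { found = 1; last = NR + ctx }
    found && NR <= last     { printf "%d: %s\n", NR, $0 }
    found && NR > last      { exit }
  '
}
```

Piping a testdrive job log through `first_failure` skips the later `Name or service not known` and `connection closed` noise.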
75+
76+
Testdrive shards with the same number (e.g., `testdrive-10` and `testdrive-with-alloydb-10`) run the same tests — if both fail, it's the same root cause.
77+
78+
### SLT failures
79+
Check whether the failure is wrong output (a behavioral change) or a connection error (a crash or timeout). Wrong output means the query semantics changed.
80+
81+
## Step 5: Summarize
82+
83+
Group failures by **root cause**, not by job name. Typically many failing jobs share just 1-2 root causes. Present the summary as:
84+
85+
1. **Root cause A** — description, which jobs it affects, what to fix
86+
2. **Root cause B** — description, which jobs it affects, what to fix
87+
88+
Distinguish issues that are clearly caused by the PR's changes from pre-existing flaky tests.
