llm/skills: add skill for debugging buildkite/ci failures #35248
aljoscha merged 1 commit into MaterializeInc:main
antiguru left a comment
Did you figure out a way for the CLI to fetch artifacts? I failed because it'd need GraphQL permissions.
I fixed this by giving the token GraphQL access. It doesn't allow limiting by scopes, but ... 🤷♂️
## Step 1: Extract PR number

Parse `$ARGUMENTS` to get the PR number. Handle both formats:
- Plain number: `35192`
- Full URL: `https://github.com/MaterializeInc/materialize/pull/35192`

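As a rough illustration of Step 1, the two accepted formats could be normalized with a small shell function. This is only a sketch: `parse_pr_number` is a hypothetical helper name, not part of the skill, and it assumes the URL form always contains `/pull/<number>`.

```bash
# Hypothetical helper (not part of the skill): print the PR number for
# either a plain number or a full GitHub pull request URL.
parse_pr_number() {
  _arg="$1"
  case "$_arg" in
    */pull/*)
      # Full URL: keep what follows "/pull/", then keep only the leading digits
      # so trailing paths such as "/files" are ignored.
      _rest="${_arg##*/pull/}"
      _num="${_rest%%[!0-9]*}"
      ;;
    *[!0-9]*|'')
      # Contains a non-digit (or is empty) and is not a pull URL: reject.
      _num=''
      ;;
    *)
      # Plain number, e.g. 35192
      _num="$_arg"
      ;;
  esac
  if [ -z "$_num" ]; then
    printf 'unrecognized PR reference: %s\n' "$_arg" >&2
    return 1
  fi
  printf '%s\n' "$_num"
}
```

For example, `parse_pr_number "$ARGUMENTS"` would print `35192` for both inputs listed above and return a non-zero status for anything else.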
## Step 2: List failing checks

```bash
gh pr checks <PR_NUMBER> 2>&1
```

Filter the output to lines containing `fail`. Each line has tab-separated fields:
```
name	fail	0	https://buildkite.com/materialize/<PIPELINE>/builds/<BUILD>#<JOB_ID>	description
```

Extract from the URL:
- **Pipeline**: path segment after `materialize/` (usually `test`)
- **Build number**: the number after `builds/`
- **Job ID**: the UUID after `#`
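
The filtering and URL extraction in Step 2 might look roughly like the sketch below. It assumes the field layout shown in the sample line (status in the second tab-separated field, URL in the fourth). `failing_check_urls` and `parse_buildkite_url` are hypothetical helper names, and the build number and job ID in the usage note are made-up values.

```bash
# Hypothetical helper: read `gh pr checks` output on stdin and print the
# URL (4th tab-separated field) of every check whose status field is "fail".
failing_check_urls() {
  awk -F'\t' '$2 == "fail" { print $4 }'
}

# Hypothetical helper: split a Buildkite job URL of the form
#   https://buildkite.com/materialize/<PIPELINE>/builds/<BUILD>#<JOB_ID>
# and print "<PIPELINE> <BUILD> <JOB_ID>" on one line.
parse_buildkite_url() {
  _path="${1#*buildkite.com/materialize/}"   # <PIPELINE>/builds/<BUILD>#<JOB_ID>
  _pipeline="${_path%%/*}"                   # path segment after materialize/
  _rest="${_path#*/builds/}"                 # <BUILD>#<JOB_ID>
  _build="${_rest%%"#"*}"                    # number after builds/
  case "$_rest" in
    *"#"*) _job="${_rest#*"#"}" ;;           # UUID after '#'
    *)     _job='' ;;                        # URL carries no job fragment
  esac
  printf '%s %s %s\n' "$_pipeline" "$_build" "$_job"
}
```

With these in place, `gh pr checks <PR_NUMBER> 2>&1 | failing_check_urls` would yield one URL per failing check, and `parse_buildkite_url 'https://buildkite.com/materialize/test/builds/123#abc-def'` would print `test 123 abc-def`.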
It might be easier to just run
`bk build list --branch=def-:pr-fix-secret-cli`
where `def-` is the username of my fork and `pr-fix-secret-cli` is the branch name, instead of these GitHub API round trips.
Claude said it wouldn't want to add this, because it's too complicated to know the username of the fork, but we did incorporate the other suggestions. 🙇♂️
.claude/skills/debug-ci/SKILL.md
Outdated
- **Build number**: the number after `builds/`
- **Job ID**: the UUID after `#`

## Step 3: Fetch logs in triage order
I would look at the annotation for test failures first, before looking into logs. Can save you a bunch of tokens or grepping around.
.claude/skills/debug-ci/SKILL.md
Outdated
### Testdrive cascades
After one test crashes environmentd, all subsequent tests in that shard fail with `Name or service not known` or `connection closed`. **Only the first failure in a shard matters** — everything after it is a cascade. Look for the first `error:` or `FAIL` in the log.

Testdrive shards with the same number (e.g., `testdrive-10` and `testdrive-with-alloydb-10`) run the same tests — if both fail, it's the same root cause.
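
One way to apply the cascade rule is sketched below, assuming the job log has already been saved to a local file. `first_failure` and `shard_number` are hypothetical helper names, and `grep -m` is a GNU/BSD extension rather than strict POSIX.

```bash
# Hypothetical helper: print the first line of a saved job log that contains
# "error:" or "FAIL", prefixed with its line number. Later matches are
# treated as cascade noise and ignored.
first_failure() {
  # -n adds the line number; -m 1 stops after the first matching line.
  grep -n -m 1 -E 'error:|FAIL' "$1"
}

# Hypothetical helper: print the trailing shard number of a job name,
# e.g. testdrive-with-alloydb-10 -> 10, so matching shards can be grouped.
shard_number() {
  printf '%s\n' "${1##*-}"
}
```

Comparing `shard_number testdrive-10` with `shard_number testdrive-with-alloydb-10` gives the same value, which flags the two jobs as candidates for a shared root cause.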
Suggested change:
- Testdrive shards with the same number (e.g., `testdrive-10` and `testdrive-with-alloydb-10`) run the same tests — if both fail, it's the same root cause.
+ Testdrive shards with the same number (e.g., `testdrive-10` and `testdrive-with-alloydb-10`) run the same tests — if both fail, it's likely to be the same root cause.
.claude/skills/debug-ci/SKILL.md
Outdated
1. **Root cause A** — description, which jobs it affects, what to fix
2. **Root cause B** — description, which jobs it affects, what to fix

Distinguish between issues that are clearly caused by the PR's changes vs. pre-existing flaky tests.
Pre-existing flaky tests can often be discovered through the annotations, which will link to the issue that is causing it.
5f48e2d to 320c6dd
320c6dd to b6d0a08
No description provided.