llm/skills: add skill for debugging buildkite/ci failures#35248

Merged
aljoscha merged 1 commit intoMaterializeInc:mainfrom
aljoscha:push-srwssurrzvmx
Mar 2, 2026

Conversation

@aljoscha
Contributor

No description provided.

@github-actions

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

Member

@antiguru antiguru left a comment

Did you figure out a way for the CLI to fetch artifacts? I failed because it'd need GraphQL permissions.

@aljoscha
Contributor Author

> Did you figure out a way for the CLI to fetch artifacts? I failed because it'd need GraphQL permissions.

I fixed this by giving the token GraphQL access. That access can't be limited by scopes, but ... 🤷‍♂️

@aljoscha aljoscha requested a review from antiguru February 27, 2026 19:39
Comment on lines 11 to 31
## Step 1: Extract PR number

Parse `$ARGUMENTS` to get the PR number. Handle both formats:
- Plain number: `35192`
- Full URL: `https://github.com/MaterializeInc/materialize/pull/35192`
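
A minimal sketch of how both input formats could be normalized. The function name `extract_pr_number` is hypothetical and not part of the skill; it assumes the argument is either a bare number or a URL containing `/pull/<number>`.

```bash
# Hypothetical helper: print the PR number for either input format.
extract_pr_number() {
  local ref
  # Drop any query string or fragment.
  ref="${1%%[?#]*}"
  case "$ref" in
    */pull/*)
      ref="${ref#*/pull/}"   # keep what follows "/pull/"
      ref="${ref%%/*}"       # drop trailing segments such as "/files"
      ;;
  esac
  # Reject anything that is not purely digits.
  case "$ref" in
    ''|*[!0-9]*)
      echo "unrecognized PR reference: $1" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$ref"
}

# Both calls print 35192.
extract_pr_number "35192"
extract_pr_number "https://github.com/MaterializeInc/materialize/pull/35192"
```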

## Step 2: List failing checks

```bash
gh pr checks <PR_NUMBER> 2>&1
```

Filter the output to lines containing `fail`. Each line has tab-separated fields:
```
name fail 0 https://buildkite.com/materialize/<PIPELINE>/builds/<BUILD>#<JOB_ID> description
```

Extract from the URL:
- **Pipeline**: path segment after `materialize/` (usually `test`)
- **Build number**: the number after `builds/`
- **Job ID**: the UUID after `#`
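
The filtering and URL parsing above could be sketched as follows. The function name `failing_buildkite_jobs` is hypothetical, and the sketch assumes the tab-separated layout shown above, with the status in the second field and the URL in the fourth. It matches the status field exactly, which is slightly stricter than searching the whole line for `fail`.

```bash
# Hypothetical helper: read `gh pr checks` output on stdin and print
# "name<TAB>pipeline<TAB>build<TAB>job_id" for each failing Buildkite check.
failing_buildkite_jobs() {
  awk -F '\t' '
    $2 == "fail" && index($4, "buildkite.com/materialize/") > 0 {
      url = $4
      job = ""
      # The job ID is the fragment after "#", when present.
      hash = index(url, "#")
      if (hash > 0) {
        job = substr(url, hash + 1)
        url = substr(url, 1, hash - 1)
      }
      # Remaining path: .../materialize/<PIPELINE>/builds/<BUILD>
      n = split(url, parts, "/")
      pipeline = ""
      build = ""
      for (i = 1; i < n; i++) {
        if (parts[i] == "materialize") pipeline = parts[i + 1]
        if (parts[i] == "builds") build = parts[i + 1]
      }
      printf "%s\t%s\t%s\t%s\n", $1, pipeline, build, job
    }'
}

# Usage, assuming $pr holds the PR number:
#   gh pr checks "$pr" 2>&1 | failing_buildkite_jobs
```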
Contributor

It might be easier to just do

`bk build list --branch=def-:pr-fix-secret-cli`

where `def-` is the username of my fork and `pr-fix-secret-cli` is the branch name, instead of these GitHub API round trips.

Contributor Author

Claude said it wouldn't want to add this, because it's too complicated to know the username of the fork, but we did incorporate the other suggestions. 🙇‍♂️

- **Build number**: the number after `builds/`
- **Job ID**: the UUID after `#`

## Step 3: Fetch logs in triage order
Contributor

I would look at the annotations for test failures first, before looking into the logs. That can save you a bunch of tokens or grepping around.
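
One possible way to read annotations before any logs is the Buildkite REST API, which has an endpoint that lists the annotations of a build. This is a sketch under assumptions, not what the skill does: the helper names are hypothetical, the `materialize` organization slug is taken from the check URLs above, and `BUILDKITE_API_TOKEN` is assumed to hold a token that is allowed to read builds.

```bash
# Hypothetical helper: print the REST API URL for a build's annotations.
annotations_url() {
  local pipeline="$1" build="$2"
  printf 'https://api.buildkite.com/v2/organizations/materialize/pipelines/%s/builds/%s/annotations\n' \
    "$pipeline" "$build"
}

# Hypothetical helper: fetch the annotations as JSON.
# Assumes BUILDKITE_API_TOKEN is set to a token that may read builds.
fetch_annotations() {
  curl -fsS \
    -H "Authorization: Bearer ${BUILDKITE_API_TOKEN:?set a Buildkite API token}" \
    "$(annotations_url "$1" "$2")"
}

# Usage, assuming the pipeline and build number were extracted earlier:
#   fetch_annotations "$pipeline" "$build"
```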

### Testdrive cascades
After one test crashes environmentd, all subsequent tests in that shard fail with `Name or service not known` or `connection closed`. **Only the first failure in a shard matters** — everything after it is a cascade. Look for the first `error:` or `FAIL` in the log.
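
Finding that first failure in a log that has already been saved to disk could look like this. The function name `first_failure` is hypothetical, and the two patterns are the ones named above.

```bash
# Hypothetical helper: print the first line, prefixed with its line number,
# that contains "error:" or "FAIL" in a saved job log.
first_failure() {
  grep -n -E 'error:|FAIL' "$1" | head -n 1
}

# Usage, assuming the job log was saved to job.log:
#   first_failure job.log
```

Everything the log reports after that line in the same shard can be treated as a cascade until shown otherwise.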

Testdrive shards with the same number (e.g., `testdrive-10` and `testdrive-with-alloydb-10`) run the same tests — if both fail, it's the same root cause.
Contributor

Suggested change
Testdrive shards with the same number (e.g., `testdrive-10` and `testdrive-with-alloydb-10`) run the same tests — if both fail, it's the same root cause.
Testdrive shards with the same number (e.g., `testdrive-10` and `testdrive-with-alloydb-10`) run the same tests — if both fail, it's likely to be the same root cause.

1. **Root cause A** — description, which jobs it affects, what to fix
2. **Root cause B** — description, which jobs it affects, what to fix

Distinguish between issues that are clearly caused by the PR's changes vs. pre-existing flaky tests.
Contributor

Pre-existing flaky tests can often be discovered through the annotations, which will link to the issue that is causing it.

@aljoscha aljoscha force-pushed the push-srwssurrzvmx branch from 5f48e2d to 320c6dd Compare March 1, 2026 13:50
@aljoscha aljoscha force-pushed the push-srwssurrzvmx branch from 320c6dd to b6d0a08 Compare March 2, 2026 06:51
@aljoscha aljoscha merged commit ee7b07a into MaterializeInc:main Mar 2, 2026
5 checks passed
@aljoscha aljoscha deleted the push-srwssurrzvmx branch March 2, 2026 11:02
3 participants