ci: persist build cache across runs, parallelize release upload #28561
Conversation
The build cache (ccache, zig cache, WebKit/dep tarballs) lives at
${buildDir}/cache by default, which is inside the git checkout. Buildkite
runs `git clean -ffxdq` between builds, so every build starts cold —
the caching infrastructure exists but has a 0% hit rate in CI.
- config.ts: read BUN_BUILD_CACHE_PATH env var as cacheDir default
- ci.ts: don't rm cacheDir during disk cleanup when it's outside buildDir
- Dockerfile + bootstrap.sh: set BUN_BUILD_CACHE_PATH in the buildkite
environment hook, pointing at /var/lib/buildkite-agent/cache/build
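As a sketch, the environment-hook side of this amounts to something like the following. The standalone form and the `CACHE_ROOT` fallback are assumptions for illustration; the actual change is spread across .buildkite/Dockerfile and scripts/bootstrap.sh.

```shell
# Hypothetical standalone version of the buildkite environment hook.
# On the real agents the cache root is /var/lib/buildkite-agent/cache;
# CACHE_ROOT defaults to a /tmp path here so the sketch runs anywhere.
CACHE_ROOT="${CACHE_ROOT:-/tmp/buildkite-agent-cache}"
export BUN_BUILD_CACHE_PATH="$CACHE_ROOT/build"
# -p creates parent directories and is a no-op when the path already
# exists, so the hook stays idempotent across builds.
mkdir -p "$BUN_BUILD_CACHE_PATH"
```

With this exported before the build step runs, config.ts picks the path up as the cacheDir default and ci.ts leaves it alone during disk cleanup.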
Separately, upload-release.sh processed 22 artifacts sequentially
(download + S3×2 + GitHub upload per artifact). The inner uploads were
already parallel; the outer loop wasn't. Now runs up to 6 artifacts
concurrently with a bounded job pool. `wait -n` + `set -e` propagates
failures correctly.
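Sketched standalone (the stub function and toy artifact list are placeholders, not the real upload-release.sh), the bounded pool pattern looks like:

```shell
#!/usr/bin/env bash
set -eo pipefail

max_jobs=3

# Placeholder for the real upload_artifact; sleeps to simulate network I/O.
upload_artifact() {
  sleep 0.1
  echo "uploaded $1"
}

artifacts=(a b c d e f g)
for artifact in "${artifacts[@]}"; do
  # While all slots are busy, `wait -n` (bash 4.3+) blocks until any one
  # background job finishes, then the count is re-checked.
  while (( $(jobs -rp | wc -l) >= max_jobs )); do
    wait -n
  done
  upload_artifact "$artifact" &
done
# Drain the remaining jobs; with set -e, a failing job surfaces here too.
wait
```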
Walkthrough

These changes introduce Bun build cache support by establishing persistent cache directories in Buildkite infrastructure, configuring cache paths through environment variables, optimizing artifact uploads with bounded parallelism, and conditionally preserving external cache directories during cleanup.
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/build/config.ts`:
- Around line 416-418: The env-driven cacheDir resolution currently treats empty
or relative BUN_BUILD_CACHE_PATH as cwd which diverges from partial.cacheDir;
update the logic that computes cacheDir in scripts/build/config.ts to (1) treat
an empty string as "not set", (2) if BUN_BUILD_CACHE_PATH is set and absolute
use it, otherwise resolve it relative to the repo build root (the same base used
for partial.cacheDir, e.g. buildDir) and (3) fallback to resolve(buildDir,
"cache") when unset — adjust the branch that references
process.env.BUN_BUILD_CACHE_PATH to implement these checks so cacheDir behavior
matches partial.cacheDir semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: bfb3fccb-50c8-4a8b-94ac-3d1d1a3d504b
📒 Files selected for processing (5)
- .buildkite/Dockerfile
- .buildkite/scripts/upload-release.sh
- scripts/bootstrap.sh
- scripts/build/ci.ts
- scripts/build/config.ts
```ts
      : process.env.BUN_BUILD_CACHE_PATH !== undefined
        ? resolve(process.env.BUN_BUILD_CACHE_PATH)
        : resolve(buildDir, "cache");
```
BUN_BUILD_CACHE_PATH resolution is inconsistent for relative/empty values.
At Line 416, env-driven cacheDir can resolve relative to process cwd (not repo root), and an empty value resolves to cwd. This diverges from partial.cacheDir semantics and can place cache unexpectedly.
🔧 Proposed fix
```diff
+  const envCacheDir = process.env.BUN_BUILD_CACHE_PATH?.trim();
   const cacheDir =
     partial.cacheDir !== undefined
       ? isAbsolute(partial.cacheDir)
         ? partial.cacheDir
         : resolve(cwd, partial.cacheDir)
-      : process.env.BUN_BUILD_CACHE_PATH !== undefined
-        ? resolve(process.env.BUN_BUILD_CACHE_PATH)
+      : envCacheDir
+        ? isAbsolute(envCacheDir)
+          ? envCacheDir
+          : resolve(cwd, envCacheDir)
       : resolve(buildDir, "cache");
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/build/config.ts` around lines 416 - 418, The env-driven cacheDir
resolution currently treats empty or relative BUN_BUILD_CACHE_PATH as cwd which
diverges from partial.cacheDir; update the logic that computes cacheDir in
scripts/build/config.ts to (1) treat an empty string as "not set", (2) if
BUN_BUILD_CACHE_PATH is set and absolute use it, otherwise resolve it relative
to the repo build root (the same base used for partial.cacheDir, e.g. buildDir)
and (3) fallback to resolve(buildDir, "cache") when unset — adjust the branch
that references process.env.BUN_BUILD_CACHE_PATH to implement these checks so
cacheDir behavior matches partial.cacheDir semantics.
Found 5 issues this PR may fix:
🤖 Generated with Claude Code
```diff
 local max_jobs=6
 for artifact in "${artifacts[@]}"; do
-  upload_artifact "$artifact"
+  while (( $(jobs -rp | wc -l) >= max_jobs )); do
+    wait -n
+  done
+  upload_artifact "$artifact" &
 done
+wait
```
🔴 When any upload_artifact background job fails, wait -n returns non-zero and set -eo pipefail immediately exits the script, leaving up to 5 other in-flight upload_artifact background processes orphaned (bash does not kill background jobs on script exit). As a result, the outer wait and the subsequent update_github_release/create_sentry_release/send_discord_announcement calls are never reached, leaving the release in an unpredictable partial state. Fix by adding a trap that kills remaining background jobs on EXIT.
Extended reasoning...
What the bug is and how it manifests
The script has set -eo pipefail at line 3. The new semaphore loop (lines 246-254) launches upload_artifact as background jobs and uses wait -n inside the throttle loop to block when max_jobs=6 slots are all occupied. If any in-flight background job exits with a non-zero status (e.g. a transient AWS/GitHub network error), wait -n returns that non-zero exit code. Because set -e is active, the shell immediately exits at that point.
The specific code path that triggers it
The semaphore while-loop body calls wait -n to get a job slot. On failure, wait -n returns the failed job exit code. set -e immediately exits the script. Up to 5 other upload_artifact background jobs are still in-flight at this point. In non-interactive bash (the mode used by CI scripts), background jobs are NOT sent SIGHUP when the parent exits - they are reparented to PID 1 and continue running as orphans. The outer wait at line 253 is never reached, and neither are update_github_release/create_sentry_release/send_discord_announcement.
Why existing code does not prevent it
The PR description notes "wait -n + set -e propagates failures correctly" - and it does propagate the failure (the step fails). But it does not clean up. There is no trap on EXIT to kill remaining jobs. The orphaned processes continue uploading to S3/GitHub without the build script tracking them. The old sequential code did not have this problem because upload_artifact ran synchronously - the inner wait inside upload_artifact completed all inner jobs before the outer loop advanced. No background outer jobs existed to orphan.
Impact
After a single upload failure: (1) up to 5 other in-flight upload_artifact jobs keep writing artifacts to S3 and GitHub Release with no parent tracking their success or failure; (2) update_github_release is never called so canary release notes are not updated; (3) create_sentry_release is never called; (4) send_discord_announcement is never called. The release may end up partially uploaded with no build-step signal about which artifacts are missing. A re-run may double-upload some artifacts and miss others.
Step-by-step proof
- The loop starts; artifacts 1-6 are launched as background jobs, filling max_jobs=6 slots.
- Artifact 3 fails (aws s3 cp returns exit code 1 inside upload_artifact).
- The while condition re-evaluates: jobs -rp | wc -l is still >= 6, so wait -n is called.
- wait -n returns 1 (the exit code from artifact 3 job).
- set -e triggers: the script exits immediately.
- Artifacts 1, 2, 4, 5, 6 are still running their three parallel inner uploads (S3 commit path, S3 tag path, GitHub release) - now orphaned, reparented to PID 1.
- The outer wait, update_github_release, create_sentry_release, and send_discord_announcement are never reached.
How to fix it
Add a trap before the loop: `trap 'kill $(jobs -rp) 2>/dev/null; wait' EXIT`. This ensures that on any exit all in-flight background jobs are killed and waited on, preventing orphans. Alternatively, collect failures manually with an error flag variable rather than relying on set -e to propagate them from wait -n.
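A standalone sketch of that trap, with a stub in place of upload_artifact. The `|| true` is an addition here, not part of the review's suggestion: it keeps the trap body from failing under set -e when the job list is already empty at exit.

```shell
#!/usr/bin/env bash
set -eo pipefail

# On ANY exit, including the early one set -e forces when `wait -n`
# reports a failed job, kill whatever background jobs are still running,
# then reap them so nothing outlives the script as an orphan.
trap 'kill $(jobs -rp) 2>/dev/null || true; wait' EXIT

# Stub standing in for the real upload_artifact.
fake_upload() { sleep 0.1; }

max_jobs=2
for artifact in a b c d; do
  while (( $(jobs -rp | wc -l) >= max_jobs )); do
    wait -n  # a non-zero status here exits the script; the trap cleans up
  done
  fake_upload "$artifact" &
done
wait
```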
```ts
      : process.env.BUN_BUILD_CACHE_PATH !== undefined
        ? resolve(process.env.BUN_BUILD_CACHE_PATH)
        : resolve(buildDir, "cache");
```
🟡 In scripts/build/config.ts, the BUN_BUILD_CACHE_PATH check guards against undefined but not an empty string — if someone runs export BUN_BUILD_CACHE_PATH= in their shell, resolve("") returns process.cwd() (the repo root), silently writing all build caches there. Because the repo root is not inside buildDir, the startsWith(buildDir + sep) guard in ci.ts skips cleanup, leaving the cache permanently in the repo root. Fix: use a truthy check (process.env.BUN_BUILD_CACHE_PATH) instead of !== undefined.
Extended reasoning...
The condition at scripts/build/config.ts:416 reads:

```ts
process.env.BUN_BUILD_CACHE_PATH !== undefined
  ? resolve(process.env.BUN_BUILD_CACHE_PATH)
  : resolve(buildDir, "cache");
```

In POSIX shells, `export BUN_BUILD_CACHE_PATH=` sets the variable to an empty string. `process.env.BUN_BUILD_CACHE_PATH !== undefined` evaluates to true for an empty string, so the branch is taken. Node.js/Bun's `path.resolve("")` is equivalent to `path.resolve()` — it returns `process.cwd()`, which in this repository is the repo root (e.g., /home/claude/bun).
The consequence is that all build cache directories — ccache, zig-cache, and WebKit prebuilt tarballs — are created directly under the repository root instead of the intended subdirectory. The build succeeds, but artifacts end up in unexpected locations that are hard to notice.
A secondary effect compounds the issue. In scripts/build/ci.ts, the post-build cleanup guard is:
```ts
if (cfg.cacheDir.startsWith(cfg.buildDir + sep)) {
  rmSync(cfg.cacheDir, { recursive: true, force: true });
}
```

Since the repo root (resolved from `BUN_BUILD_CACHE_PATH=""`) is a parent of buildDir, not a child, `startsWith(buildDir + sep)` is false and `rmSync` is never called. Cache files accumulate in the repo root and are never cleaned up.
Step-by-step proof:
- Developer runs `export BUN_BUILD_CACHE_PATH=` (clearing the variable without unsetting it).
- `process.env.BUN_BUILD_CACHE_PATH` is `""` — not `undefined` — so the guard passes.
- `resolve("")` returns `process.cwd()` = /home/claude/bun (the repo root).
- `cfg.cacheDir` is set to /home/claude/bun.
- ccache, zig-cache, and WebKit tarballs are written into /home/claude/bun.
- Cleanup check: `"/home/claude/bun".startsWith("/home/claude/bun/build/release/")` is `false` — skipped.
- Cache files accumulate in the repo root permanently.
In practice this is low-risk: all CI environments (Dockerfile and bootstrap.sh) assign concrete non-empty paths, and no mechanism in the PR's CI setup would result in an empty value. A developer would have to explicitly set the variable to empty to trigger this. The one-word fix is to replace !== undefined with a truthy check.
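The distinction this finding turns on — set-but-empty versus unset — is easy to see from the shell side. A small sketch (`${VAR+set}` is the standard POSIX test for "variable exists"):

```shell
#!/usr/bin/env bash
unset BUN_BUILD_CACHE_PATH
# ${VAR+set} expands to "set" when VAR exists (even if empty), "" when unset.
state_unset="${BUN_BUILD_CACHE_PATH+set}"

export BUN_BUILD_CACHE_PATH=
# The variable is now IN the environment with value "": a child Node/Bun
# process sees process.env.BUN_BUILD_CACHE_PATH === "" (not undefined),
# so a `!== undefined` guard passes and resolve("") yields process.cwd().
state_empty="${BUN_BUILD_CACHE_PATH+set}"

echo "unset: '${state_unset}'  set-but-empty: '${state_empty}'"
```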
Summary
Two independent CI speed fixes:
1. Persist build cache outside the git checkout
The build cache (ccache, zig cache, WebKit/dep tarballs) defaults to `${buildDir}/cache`, which is inside the git checkout. Buildkite runs `git clean -ffxdq` between builds, wiping it every time — so the caching infrastructure exists but has a 0% hit rate in CI.

- `scripts/build/config.ts` — read `BUN_BUILD_CACHE_PATH` env var as the cacheDir default (falls back to `${buildDir}/cache` when unset, so local builds are unchanged)
- `scripts/build/ci.ts` — skip the post-build `rmSync(cacheDir)` disk cleanup when cacheDir is outside buildDir
- `.buildkite/Dockerfile` + `scripts/bootstrap.sh` — export `BUN_BUILD_CACHE_PATH=/var/lib/buildkite-agent/cache/build` in the buildkite environment hook and create the directory

This should turn cold ccache/zig-cache into warm on persistent agents. The `--cache-dir` CLI flag still takes precedence for manual overrides.

2. Parallelize release artifact upload

`upload-release.sh` processed 22 zips sequentially — each iteration downloads the artifact, then runs three parallel uploads (S3 commit path, S3 tag path, GitHub release), then `wait`s before moving to the next. The inner uploads were already parallel; the outer loop wasn't.

Now runs up to 6 artifacts concurrently with a bounded job pool (`wait -n` semaphore). Bounded to avoid hammering the GitHub API and keep peak disk usage reasonable. `wait -n` + `set -e` propagates failures correctly — if any upload fails, the step fails.

Not in this PR

Left `CCACHE_SLOPPINESS` alone — the `!cfg.ci` guard exists because Ubuntu 20.04 ships ccache 3.7.7 and some sloppiness values (gcno_cwd) need 4.5+. That needs version detection or a base image bump, separate PR.

Test plan

- `bash -n` on both shell scripts
- `bun -e 'await import(...)'`: `BUN_BUILD_CACHE_PATH=/tmp/test-cache` is picked up by `resolveConfig` and the `startsWith(buildDir + sep)` check correctly returns false
- unset env var falls back to the default (`${buildDir}/cache`)
- `wait -n` + `set -e` propagates background job failures