Fix race between mc bucket creation and MinIO startup #406
Conversation
The `create-s3-bucket` / `remove-s3-bucket` one-shot containers ran `mc alias set` the moment MinIO's container had Started, without waiting for the MinIO server to actually be listening on :9000. Under CI load this occasionally raced MinIO's initialization: `mc` failed to connect, the shell swallowed the error because the entrypoint used `;` chaining plus an unconditional `exit 0`, and `docker compose up` reported success. Weaviate's subsequent backup access-check then got `NoSuchBucket` → 422. Retry `mc alias set` until MinIO accepts connections (bounded to 60s), and use `set -e` so any later `mc` failure surfaces instead of being masked. Applied to all four affected compose files. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
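The retry-plus-`set -e` pattern described in the commit can be sketched standalone. This is a minimal illustration, not the actual compose entrypoint: `minio_up` and `ATTEMPTS_FILE` are hypothetical stand-ins for `mc alias set` probing a MinIO that finishes initializing on the third attempt.

```shell
# Sketch of the bounded retry. `minio_up` is an illustrative stand-in for
# `mc alias set` (non-zero until MinIO listens on :9000); here it
# simulates MinIO becoming ready on the third probe.
ATTEMPTS_FILE=$(mktemp)

minio_up() {
  echo probe >> "$ATTEMPTS_FILE"
  [ "$(wc -l < "$ATTEMPTS_FILE")" -ge 3 ]
}

set -e                    # any later failure aborts the one-shot container
i=0
until minio_up; do
  i=$((i + 1))
  if [ "$i" -ge 60 ]; then
    echo "ERROR: MinIO not ready after 60s" >&2
    exit 1
  fi
  echo "Waiting for MinIO to be ready..."
  sleep 1
done
echo "alias set after $(wc -l < "$ATTEMPTS_FILE" | tr -d ' ') attempts"
```

A failing command in an `until` condition is exempt from `set -e`, which is what makes this loop safe to run with errexit already enabled.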
Orca Security Scan Summary
| Status | Check | Issues by priority |
|---|---|---|
|  | Infrastructure as Code | View in Orca |
|  | SAST | View in Orca |
|  | Secrets | View in Orca |
|  | Vulnerabilities | View in Orca |
With `set -e`, a pre-existing bucket would now fail the entrypoint (previously silently tolerated via `;` + `exit 0`). Add `--ignore-existing` so `mc mb` succeeds whether the bucket is new or already present. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Pull request overview
Fixes flakiness in S3-backup test flows by ensuring the MinIO client (mc) waits for MinIO to actually accept connections before creating/removing the test bucket, and by surfacing failures instead of masking them.
Changes:
- Add a bounded retry loop (up to 60s) around `mc alias set` in the one-shot bucket create/remove containers.
- Enable `set -e` in those entrypoints and remove the unconditional `exit 0` so failures propagate.
- Apply the same entrypoint changes across all compose files that define `create-s3-bucket`/`remove-s3-bucket`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| apps/weaviate/docker-compose-backup.yml | Add retry + fail-fast behavior to bucket create/remove entrypoints for the backup flow |
| apps/weaviate/docker-compose-backup-3nodes.yml | Same retry + fail-fast behavior for 3-node backup compose |
| apps/weaviate/docker-compose-backup-3nodes-renamed.yml | Same retry + fail-fast behavior for renamed 3-node backup compose |
| apps/replicated_import_with_backup/docker-compose.yml | Same retry + fail-fast behavior for replicated import + backup compose |
```sh
/usr/bin/mc mb --ignore-existing chaos/weaviate-backups;
/usr/bin/mc policy set public chaos/weaviate-backups;
```
With `set -e` enabled, `mc mb chaos/weaviate-backups` will now fail the one-shot container if the bucket already exists from a previous run (MinIO data persists via the host bind mount). Making bucket creation idempotent (e.g., `mc mb --ignore-existing` or a pre-check) would avoid unnecessary failures on reruns.
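The idempotency concern above can be seen with plain `mkdir`, which has the same failure mode and where `-p` plays the role of `--ignore-existing`. This is a standalone illustration using a temporary directory, not the compose entrypoint:

```shell
# Creating something that already exists fails by default, which under
# `set -e` would abort a rerun; the idempotent flag makes reruns safe.
dir="$(mktemp -d)/bucket"

mkdir "$dir" && echo "first run: created"
mkdir "$dir" 2>/dev/null || echo "rerun: plain mkdir fails"

# `mkdir -p`, like `mc mb --ignore-existing`, succeeds either way:
mkdir -p "$dir" && echo "rerun with -p: exit 0"
```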
Summary
The `create-s3-bucket`/`remove-s3-bucket` one-shot containers were running `mc alias set` the moment MinIO's container hit Started, without waiting for the MinIO server to actually accept connections on :9000. Under CI load this occasionally lost the race, `mc` failed to connect, and the original entrypoint swallowed the error via `;` chaining + unconditional `exit 0`. Weaviate's subsequent backup access-check `PutObject` then got `NoSuchBucket` → HTTP 422.

Fix: update `create-s3-bucket` and `remove-s3-bucket` in all four compose files that define them: retry `mc alias set` until MinIO accepts a connection (bounded to 60 attempts at 1s apart), and use `set -e` so any later `mc` failure surfaces instead of being masked. Every test flow that uses these compose files (`backup_and_restore_version_compatibility.sh`, `backup_and_restore_multi_node_crud.sh`, `backup_and_restore_multi_node_out_of_sync.sh`, `backup_and_restore_node_mapping.sh`, `replication_importing_with_backup.sh`) benefits automatically.

Context
This was observed in the `backup_and_restore_version_compatibility` pipeline on iteration 14 of a 30-version sweep: https://github.com/weaviate/weaviate-chaos-engineering/actions/runs/24540725312/job/71745720199

Job-log timing shows the race clearly:

- `00:23:06.411` — `backup-s3 Started` (container up, MinIO still initializing)
- `00:23:14.612` — `Weaviate node1 is ready` (script proceeds — never probes MinIO)
- `00:23:15.199` — `mc: connection refused` on 172.23.0.10:9000 (MinIO still not listening)
- `00:23:15.223` — `Bucket created successfully` (30 ms later; MinIO just barely came up, bucket not durable yet)
- `00:23:15.354` — `create-s3-bucket exited with code 0` (masked by `exit 0`)
- `00:23:16.564` — Weaviate access-check PUT → `NoSuchBucket` → 422
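The masked exit 0 in the timeline comes down to `;` chaining plus the trailing `exit 0`. A minimal reproduction, with `false` standing in for the failing `mc` commands (illustrative only, not the real entrypoint):

```shell
# `;` ignores each command's exit status, and the trailing `exit 0`
# forces success, so the one-shot container reports 0 even though
# every command failed. `false` stands in for the failing mc calls.
sh -c 'false; false; exit 0'
echo "container exit code: $?"   # prints: container exit code: 0
```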
Added 'chaos' successfully.and abackup-s3 Runningevent beforecreate-s3-bucketstarted. In the failing iteration both are absent, confirming MinIO was still starting whenmcraced it.Test plan
- `docker compose -f apps/weaviate/docker-compose-backup.yml config` — all four compose files parse cleanly.
- `sh -n` on the rendered entrypoint — valid.
- `up -d backup-s3` then `up create-s3-bucket` — bucket created, exit 0.
- `up remove-s3-bucket` — bucket removed, exit 0.
- Negative path: `Waiting for MinIO to be ready...` messages followed by `ERROR: MinIO not ready after 3s` and a non-zero exit (confirms `set -e` + bounded loop surface failures).
- Re-ran the `backup_and_restore_version_compatibility` Test matrix job that previously flaked to confirm it no longer reproduces.

🤖 Generated with Claude Code
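The negative test in the plan can be reproduced standalone. A sketch with a probe that never succeeds (a hypothetical stand-in for `mc alias set` against a stopped MinIO) and the bound lowered to 3 attempts to match the `not ready after 3s` message:

```shell
# Failure path: `probe` simulates MinIO never accepting connections,
# so the bounded loop gives up and the entrypoint exits non-zero
# instead of masking the error. All names here are illustrative.
probe() { return 1; }

run_entrypoint() (
  set -e
  i=0
  until probe; do
    i=$((i + 1))
    if [ "$i" -ge 3 ]; then
      echo "ERROR: MinIO not ready after 3s" >&2
      exit 1
    fi
    echo "Waiting for MinIO to be ready..."
    sleep 1
  done
  echo "never reached on the failure path"
)

run_entrypoint 2>&1
echo "exit status: $?"
```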