Fix race between mc bucket creation and MinIO startup#406

Open
jfrancoa wants to merge 2 commits into main from jose/fix-minio-bucket-race

Conversation

@jfrancoa
Contributor

Summary

  • The backup-s3 test flow races MinIO startup: the create-s3-bucket / remove-s3-bucket one-shot containers ran mc alias set the moment MinIO's container reached Started, without waiting for the MinIO server to actually accept connections on :9000. Under CI load mc occasionally lost this race and failed to connect, but the original entrypoint swallowed the error via ; chaining plus an unconditional exit 0. Weaviate's subsequent backup access-check PutObject then got NoSuchBucket → HTTP 422.
  • Fix the entrypoints for create-s3-bucket and remove-s3-bucket in all four compose files that define them: retry mc alias set until MinIO accepts a connection (bounded to 60 attempts at 1s apart), and use set -e so any later mc failure surfaces instead of being masked.
  • No shell-script changes were needed — the fix is isolated to the mc entrypoints in the compose files, so every caller (backup_and_restore_version_compatibility.sh, backup_and_restore_multi_node_crud.sh, backup_and_restore_multi_node_out_of_sync.sh, backup_and_restore_node_mapping.sh, replication_importing_with_backup.sh) benefits automatically.
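The retry pattern described above can be sketched as a small POSIX-sh helper. This is illustrative only — in the PR the loop lives inline in the compose entrypoints, and the `mc alias set` invocation shown in the comment uses placeholder credentials (only the alias name `chaos` and the MinIO host appear in the PR itself):

```shell
#!/bin/sh
# Bounded retry: run a command once per second until it succeeds,
# giving up after a fixed number of attempts (the PR uses 60 attempts
# around `mc alias set`).
set -e

retry_until_ready() {
  max="$1"; shift   # $1 = max attempts; remaining args = command to retry
  i=0
  until "$@"; do
    i=$((i + 1))
    if [ "$i" -ge "$max" ]; then
      echo "ERROR: MinIO not ready after ${max}s" >&2
      return 1
    fi
    echo "Waiting for MinIO to be ready..." >&2
    sleep 1
  done
}

# In the real entrypoint this would be invoked roughly as (hypothetical
# credential variables):
#   retry_until_ready 60 /usr/bin/mc alias set chaos http://backup-s3:9000 "$ACCESS_KEY" "$SECRET_KEY"
```

Because `set -e` stays active after the loop, any later `mc` command (bucket creation, policy setting) aborts the one-shot container on failure instead of being masked by a trailing `exit 0`.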

Context

This was observed in the backup_and_restore_version_compatibility pipeline on iteration 14 of a 30-version sweep: https://github.com/weaviate/weaviate-chaos-engineering/actions/runs/24540725312/job/71745720199

Backup Create returned status code 422 with body:
{'error': [{'message': 'init uploader: failed to access-check s3 backup module:
 put object: weaviate-backups:backup_1776385396/access-check:
 The specified bucket does not exist'}]}

Job-log timing shows the race clearly:

  • 00:23:06.411 backup-s3 Started (container up, MinIO still initializing)
  • 00:23:14.612 Weaviate node1 is ready (script proceeds; never probes MinIO)
  • 00:23:15.199 mc: connection refused on 172.23.0.10:9000 (MinIO still not listening)
  • 00:23:15.223 Bucket created successfully (24 ms later; MinIO just barely came up, bucket not durable yet)
  • 00:23:15.354 create-s3-bucket exited with code 0 (masked by exit 0)
  • 00:23:16.564 Weaviate access-check PUT → NoSuchBucket → 422

In the immediately preceding iteration the log shows Added 'chaos' successfully. and a backup-s3 Running event before create-s3-bucket started. In the failing iteration both are absent, confirming MinIO was still starting when mc raced it.

Test plan

  • docker compose -f apps/weaviate/docker-compose-backup.yml config (and the equivalent for the other three files): all four compose files parse cleanly.
  • sh -n on the rendered entrypoint — valid.
  • Happy path: up -d backup-s3 then up create-s3-bucket — bucket created, exit 0.
  • Happy path: up remove-s3-bucket — bucket removed, exit 0.
  • Failure path: ran the retry loop against a non-existent host with a 3-attempt cap — observed three Waiting for MinIO to be ready... messages followed by ERROR: MinIO not ready after 3s and a non-zero exit (confirms set -e + bounded loop surface failures).
  • Re-run the backup_and_restore_version_compatibility Test matrix job that previously flaked to confirm it no longer reproduces.

🤖 Generated with Claude Code

The create-s3-bucket / remove-s3-bucket one-shot containers ran `mc alias
set` the moment MinIO's container had Started, without waiting for the
MinIO server to actually be listening on :9000. Under CI load this
occasionally raced MinIO's initialization: mc failed to connect, the
shell swallowed the error because the entrypoint used `;` chaining plus
an unconditional `exit 0`, and `docker compose up` reported success.
Weaviate's subsequent backup access-check then got NoSuchBucket -> 422.

Retry `mc alias set` until MinIO accepts connections (bounded to 60s),
and use `set -e` so any later mc failure surfaces instead of being
masked. Applied to all four affected compose files.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

@orca-security-eu (bot) left a comment


Orca Security Scan Summary

All checks passed:

  • Infrastructure as Code: Passed (high 0, medium 0, low 0, info 0)
  • SAST: Passed (high 0, medium 0, low 0, info 0)
  • Secrets: Passed (high 0, medium 0, low 0, info 0)
  • Vulnerabilities: Passed (high 0, medium 0, low 0, info 0)

With set -e, a pre-existing bucket would now fail the entrypoint
(previously silently tolerated via ; + exit 0). Add --ignore-existing
so mc mb succeeds whether the bucket is new or already present.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
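With this follow-up applied, the tail of the create-s3-bucket entrypoint plausibly looks like the fragment below. This is a hedged reconstruction from the two commit messages and the diff hunk quoted in the review thread, not the verbatim diff; it is a compose-entrypoint fragment, so it assumes `mc` is present and the `chaos` alias has already been registered:

```shell
# Runs after the bounded retry around `mc alias set` has succeeded.
# --ignore-existing makes bucket creation idempotent across reruns, so
# `set -e` does not abort when the bucket survives a previous run via
# the host bind mount.
set -e
/usr/bin/mc mb --ignore-existing chaos/weaviate-backups
/usr/bin/mc policy set public chaos/weaviate-backups
```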

Copilot AI left a comment


Pull request overview

Fixes flakiness in S3-backup test flows by ensuring the MinIO client (mc) waits for MinIO to actually accept connections before creating/removing the test bucket, and by surfacing failures instead of masking them.

Changes:

  • Add a bounded retry loop (up to 60s) around mc alias set in the one-shot bucket create/remove containers.
  • Enable set -e in those entrypoints and remove unconditional exit 0 so failures propagate.
  • Apply the same entrypoint changes across all compose files that define create-s3-bucket / remove-s3-bucket.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

  • apps/weaviate/docker-compose-backup.yml: add retry + fail-fast behavior to bucket create/remove entrypoints for the backup flow
  • apps/weaviate/docker-compose-backup-3nodes.yml: same retry + fail-fast behavior for the 3-node backup compose
  • apps/weaviate/docker-compose-backup-3nodes-renamed.yml: same retry + fail-fast behavior for the renamed 3-node backup compose
  • apps/replicated_import_with_backup/docker-compose.yml: same retry + fail-fast behavior for the replicated import + backup compose


Comment thread apps/weaviate/docker-compose-backup.yml
Comment on lines +165 to 166
/usr/bin/mc mb --ignore-existing chaos/weaviate-backups;
/usr/bin/mc policy set public chaos/weaviate-backups;

Copilot AI Apr 17, 2026


With set -e enabled, mc mb chaos/weaviate-backups will now fail the one-shot container if the bucket already exists from a previous run (MinIO data persists via the host bind mount). Making bucket creation idempotent (e.g., mc mb --ignore-existing or pre-check) would avoid unnecessary failures on reruns.

