Skip to content

fix(sdk): add max retry limit to rate-limited API calls#6542

Open
zhoufengen wants to merge 1 commit intocomet-ml:mainfrom
zhoufengen:fix/rate-limit-retry-max-attempts
Open

fix(sdk): add max retry limit to rate-limited API calls#6542
zhoufengen wants to merge 1 commit intocomet-ml:mainfrom
zhoufengen:fix/rate-limit-retry-max-attempts

Conversation

@zhoufengen
Copy link
Copy Markdown

Summary

The ensure_rest_api_call_respecting_rate_limit function in the Python SDK retries indefinitely on persistent HTTP 429 responses, which can cause callers to hang forever when account quota is exhausted or rate limits are misconfigured.

Changes

  • Replace while True loop with a bounded for loop capped at MAX_RATE_LIMIT_RETRIES = 10 attempts
  • Re-raise the ApiError after exhausting all retries
  • Add attempt count to log messages for better observability

Testing

  • The change is backward-compatible: normal rate-limited calls that succeed within 10 retries behave identically
  • The retry decorator (retry_decorator.py) already caps at 3 attempts for other errors; this brings rate limit handling in line with that pattern
  • Edge case: after 10 consecutive 429 responses, the original ApiError is raised to the caller

Related Issues

None (discovered during code review)


This PR was authored with AI assistance. All changes have been reviewed for correctness.

The ensure_rest_api_call_respecting_rate_limit function retried indefinitely
on persistent HTTP 429 responses, which could cause callers to hang forever
(e.g., when account quota is exhausted). Add a MAX_RATE_LIMIT_RETRIES cap
of 10 attempts, after which the 429 ApiError is re-raised.
@zhoufengen zhoufengen requested a review from a team as a code owner April 29, 2026 10:30
@github-actions github-actions Bot added python Pull requests that update Python code Python SDK labels Apr 29, 2026
Comment on lines +40 to +45
if attempt >= MAX_RATE_LIMIT_RETRIES - 1:
LOGGER.error(
"Rate limit retries exhausted after %d attempts",
MAX_RATE_LIMIT_RETRIES,
)
raise
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the 10th 429 retry is exhausted, this raise rethrows the original ApiError instead of converting it to opik.exceptions.OpikCloudRequestsRateLimited with headers/retry_after—should we wrap the last 429 here?

Finding type: AI Coding Guidelines | Severity: 🟠 Medium


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/api_objects/rest_helpers.py around lines 40-45 inside
`ensure_rest_api_call_respecting_rate_limit`, the code currently re-raises the original
`ApiError` when the final 429 retry is exhausted (`attempt >= MAX_RATE_LIMIT_RETRIES -
1`). Refactor this branch to convert that last rate-limit failure into the standardized
`opik.exceptions.OpikCloudRequestsRateLimited`, populating it with the response
`headers` and the computed `retry_after` (use
`rate_limit.parse_rate_limit(exception.headers).retry_after()` when possible, otherwise
fall back to 1). Ensure the raised exception type matches what callers in `dataset.py`
and `annotation_queue.py` (and their existing retry guidance in
`online_message_processor.py` / `queue_consumer.py`) expect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Python SDK python Pull requests that update Python code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant