Skip to content

fix(api): add query-level retry with primary fallback to rls_transaction via execute_wrapper#10379

Open
josema-xyz wants to merge 5 commits intomasterfrom
PROWLER-1225-fix-read-queries-on-the-read-replica-that-doesnt-use-the-write-replica-on-retries--second-attempt
Open

fix(api): add query-level retry with primary fallback to rls_transaction via execute_wrapper#10379
josema-xyz wants to merge 5 commits intomasterfrom
PROWLER-1225-fix-read-queries-on-the-read-replica-that-doesnt-use-the-write-replica-on-retries--second-attempt

Conversation

@josema-xyz
Copy link
Contributor

@josema-xyz josema-xyz commented Mar 18, 2026

Context

When the RDS read replica enters recovery mode, SQL queries inside with rls_transaction(...) crash with OperationalError: SSL SYSCALL error: EOF detected. The existing retry loop only covers connection-setup failures (before the cursor is yielded). After yield, the yielded_cursor guard re-raises immediately with no retry and no primary fallback. Sentry CLOUD-API-3W7 shows 40 such crashes during a single replica recovery window.

There was also an off-by-one: REPLICA_MAX_ATTEMPTS=3 gave 2 replica tries + 1 primary instead of 3 + 1.

Supersedes #10374, which rewrote rls_transaction as an iterator class and changed all 100+ call sites. This PR fixes the same bug with zero call-site changes.

Description

Adds query-level retry to rls_transaction using Django's execute_wrapper API. When a cursor.execute() call fails on the replica:

  1. Retries on replica with exponential backoff (same REPLICA_MAX_ATTEMPTS / REPLICA_RETRY_BASE_DELAY settings)
  2. If all replica retries fail, falls back to primary: opens a transaction, sets RLS config, re-executes the failed SQL
  3. Swaps the inner psycopg2 cursor so fetchall()/fetchone() read from the new connection transparently
  4. Suppresses the transaction.atomic() cleanup error on the dead replica after a successful failover

Also fixes the off-by-one in max_attempts and closes stale connections between pre-yield retries.

Zero call-site changes. All 100+ with rls_transaction(...) usages work as before.

Limitation: .iterator() fetches rows via fetchmany(), which the wrapper does not intercept. ~4 of 28 replica call sites use .iterator() and need caller-side retry for mid-iteration failures. This is the same behavior as before, as the previous implementation also could not safely retry mid-iteration without risking duplicate rows.

Steps to review

Deep review:

  1. api/src/backend/api/db_utils.py: the _query_failover closure and changes to the retry loop
  2. api/src/backend/api/tests/test_db_utils.py: 6 new tests for mid-query failover + 4 updated for off-by-one

Also run tests and run a local instance, API + UI to check is working.

Checklist

API

  • All issue/task requirements work as expected on the API
  • Endpoint response output (if applicable)
  • EXPLAIN ANALYZE output for new/modified queries or indexes (if applicable)
  • Performance test results (if applicable)
  • Any other relevant evidence of the implementation (if applicable)
  • Verify if API specs need to be regenerated.
  • Check if version updates are required (e.g., specs, Poetry, etc.).
  • Ensure new entries are added to CHANGELOG.md, if applicable.

License

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@josema-xyz josema-xyz requested a review from a team as a code owner March 18, 2026 15:27
Copilot AI review requested due to automatic review settings March 18, 2026 15:27
@github-actions
Copy link
Contributor

github-actions bot commented Mar 18, 2026

✅ All necessary CHANGELOG.md files have been updated.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 18, 2026

Conflict Markers Resolved

All conflict markers have been successfully resolved in this pull request.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 18, 2026

🔒 Container Security Scan

Image: prowler-api:17ded06
Last scan: 2026-03-19 12:28:39 UTC

📊 Vulnerability Summary

Severity Count
🔴 Critical 5
Total 5

4 package(s) affected

⚠️ Action Required

Critical severity vulnerabilities detected. These should be addressed before merging:

  • Review the detailed scan results
  • Update affected packages to patched versions
  • Consider using a different base image if updates are unavailable

📋 Resources:

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens rls_transaction against RDS read-replica recovery events by adding query-level retries and automatic primary fallback for mid-query OperationalErrors, without requiring call-site changes.

Changes:

  • Add an execute_wrapper-based mechanism to retry failed replica cursor.execute() calls and fall back to primary when retries are exhausted.
  • Fix an off-by-one in replica retry counting (now REPLICA_MAX_ATTEMPTS replica tries + 1 primary fallback) and close stale connections between pre-yield retries.
  • Add/adjust unit tests covering mid-query failover and the updated retry attempt semantics.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
api/src/backend/api/db_utils.py Implements mid-query retry/failover logic via a query wrapper; adjusts pre-yield retry loop and cleanup behavior.
api/src/backend/api/tests/test_db_utils.py Updates existing retry-count assertions and adds new tests for mid-query failover behavior.
api/CHANGELOG.md Documents the fix in the API changelog under 🐞 Fixed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@codecov
Copy link

codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 97.83784% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.42%. Comparing base (0f2fdcf) to head (dd3de0b).
⚠️ Report is 12 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #10379       +/-   ##
===========================================
+ Coverage   56.86%   93.42%   +36.55%     
===========================================
  Files          87      218      +131     
  Lines        2847    30742    +27895     
===========================================
+ Hits         1619    28721    +27102     
- Misses       1228     2021      +793     
Flag Coverage Δ
api 93.42% <97.83%> (?)
prowler-py3.10-oraclecloud ?
prowler-py3.11-oraclecloud ?
prowler-py3.12-oraclecloud ?
prowler-py3.9-oraclecloud ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
prowler ∅ <ø> (∅)
api 93.42% <97.83%> (∅)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants