Skip to content

fix(java): remove Java-side inflight counter, let Rust be sole authority#5656

Merged
avifenesh merged 1 commit intorelease-2.2from
fix/remove-java-inflight-counter-release
Mar 24, 2026
Merged

fix(java): remove Java-side inflight counter, let Rust be sole authority#5656
avifenesh merged 1 commit intorelease-2.2from
fix/remove-java-inflight-counter-release

Conversation

@avifenesh
Copy link
Member

Remove Java-side inflight counter, let Rust be sole authority

Follow-up to #5642 (InflightRequestTracker cherry-pick). Separated into its own PR for clean review.

The Problem

Both Rust and Java had independent inflight counters that released on timeout -- while zombie sub-commands were still alive in the event loop:

         Java counter          Rust counter
Send:       0 -> 1              1000 -> 999
Timeout:    1 -> 0  (freed!)    999 (held -- zombies alive)
Send:       0 -> 1  (allowed!)  999 -> 998  <-- new zombie
  ...60 threads doing this...
  ...180 new zombies/second...
  ...54K zombies after 5 min -> permanent freeze

#5642 fixed Rust (RAII tracker holds until zombie finishes). But Java's AsyncRegistry counter still recycled on timeout, defeating the fix. Default config (maxInflightRequests=1000) meant this was active for all users.

The Fix

Remove Java's counter entirely (-95 lines, 3 files). Rust is the sole inflight authority.

Chaos Test A/B (4-core pods, 60 threads, sustained 5s netem)

Old v2.2 Fix
Under chaos Freeze within 5min Stable at 60/s
After chaos removed Permanent freeze (9+ min) Recovered in 15s

@chatgpt-codex-connector
Copy link

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

Copy link
Collaborator

@shohamazon shohamazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Clean, well-motivated bugfix. The dual-counter desync is a real and critical issue — consolidating to Rust's AtomicIsize counter (Client::reserve_inflight_request / release_inflight_request) is the right architectural call. The diff is focused and the removals are consistent across all three files.

Suggestions

1. Dead maxInflightRequests field in GlideCoreClient

The PR removes all 7 usages of this.maxInflightRequests in AsyncRegistry.register() calls, but the field itself (line 92), the getter getMaxInflightRequests() (lines 94-96), and the constructor parameter maxInflight (line 114) remain. After this PR, nothing in GlideCoreClient reads the field anymore.

Consider removing the field and simplifying the constructor, or if it's intentionally kept (e.g. for the Rust-side config passed via ConnectionManager.connectToValkey()), a brief comment noting it's unused on the Java side would help future readers.

2. Stale comments after cleanupClient removal

In GlideCoreClient.close() (line ~419) and CleanupAction.run() (line ~455), the comment // Clean up per-client inflight tracking is left behind but the AsyncRegistry.cleanupClient() call it described is removed. The remaining GlideNativeBridge.closeClient() is about closing the native handle, not inflight tracking. These comments are now misleading.

Similarly in ConnectionManager.closeConnectionSync() (line ~429), the comment // Clean up any pending async operations for this client remains with a blank line where the call used to be.

3. Verify no other callers of the removed cleanupClient method

The PR removes AsyncRegistry.cleanupClient() and its 4 call sites. Worth confirming there are no other callers in tests or other modules that would fail to compile after this change.

What looks good

  • Javadoc updated correctly to document Rust as sole authority with InflightRequestTracker reference
  • estimateInitialCapacity() properly removed along with the env var / system property reading
  • clientInflightCounts field and all its usages in register(), whenComplete, shutdown(), reset() cleanly removed
  • AsyncRegistry import removed from ConnectionManager
  • New register() Javadoc clearly documents the simplified contract

Overall: solid fix. The stale comments and dead field are minor nits.

@avifenesh
Copy link
Member Author

Review response

Thanks for the thorough review. Addressing each point:

1. maxInflightRequests field in GlideCoreClient — Not dead. BaseClient.java:503 passes connectionManager.getMaxInflightRequests() to the constructor. The value flows through ConnectionManager.java:274requestBuilder.setInflightRequestsLimit() → protobuf → Rust's Client::new() which initializes the AtomicIsize counter. This is how the user configures Rust's inflight limit. Field stays.

2. estimateInitialCapacity() — Already removed. Replaced with new ConcurrentHashMap<>() (default capacity, resizes automatically).

3. diagPeriodicLog() O(n) scan — Does not exist on the release-2.2 branch. Main branch only, pre-existing, not touched by this PR.

4. cleanupClient() external callers — All removed: 2 in GlideCoreClient (close + CleanupAction), 2 in ConnectionManager (async + sync close), plus the import statement. No other callers exist.

5. shutdown()/reset() clientInflightCounts references — Both .clear() calls removed.

6. Javadoc — Updated: "Inflight request limits are enforced exclusively by the Rust core via InflightRequestTracker."

7. Import cleanupimport glide.internal.AsyncRegistry removed from ConnectionManager.java.

All points addressed in the current commit.

@avifenesh avifenesh force-pushed the fix/remove-java-inflight-counter-release branch from f7eaca1 to 6ef3a01 Compare March 24, 2026 14:18
@avifenesh
Copy link
Member Author

Addressed all three suggestions in the latest push:

  1. Stale comments — Removed // Clean up per-client inflight tracking from GlideCoreClient.close() and CleanupAction.run(), and // Clean up any pending async operations from ConnectionManager.

  2. maxInflightRequests field — Added Javadoc clarifying it's passed to Rust via protobuf and not enforced on the Java side.

  3. No other callers of cleanupClient — Verified: only the 4 removed call sites existed (2 in GlideCoreClient, 2 in ConnectionManager). No test or other module references.

Copy link
Collaborator

@shohamazon shohamazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three suggestions addressed — verified against the updated diff. LGTM.

One tiny nit: ConnectionManager.closeConnectionSync() has a leftover double blank line (~line 427-429) where the removed code was. Not a blocker.

@avifenesh avifenesh force-pushed the fix/remove-java-inflight-counter-release branch from 6ef3a01 to 93e0aca Compare March 24, 2026 14:42
@avifenesh avifenesh force-pushed the fix/remove-java-inflight-counter-release branch from 93e0aca to 236e1a7 Compare March 24, 2026 15:01
AsyncRegistry had its own per-client inflight counter that recycled
immediately on timeout via whenComplete callback. This defeated the
Rust InflightRequestTracker — Java thought slots were free while Rust
still held them for zombie sub-commands.

Remove the Java counter entirely. Rust's send_command() with
InflightRequestTracker::try_new() is now the sole inflight authority.

Chaos test A/B on 4-core pods with 5s netem:
- Old v2.2: permanent freeze after 35min, no recovery after chaos removed
- Fix: stable 15min, recovered to 23K/s in 15 seconds

Signed-off-by: Avi Fenesh <aviarchi1994@gmail.com>
@avifenesh avifenesh force-pushed the fix/remove-java-inflight-counter-release branch from 236e1a7 to 3aa3ad6 Compare March 24, 2026 15:27
@avifenesh avifenesh merged commit 072ccdf into release-2.2 Mar 24, 2026
22 checks passed
avifenesh added a commit that referenced this pull request Mar 25, 2026
With Java's inflight counter removed (PR #5656), rejection now happens
asynchronously in Rust's send_command() rather than synchronously in
Java's AsyncRegistry. The tokio task ordering is non-deterministic, so
any of the N+1 requests may be the one rejected — not necessarily the
last one sent.

Update the test to wait for async processing and count rejected
requests instead of asserting specific request ordering.

Signed-off-by: Avi Fenesh <aviarchi1994@gmail.com>
avifenesh added a commit that referenced this pull request Mar 25, 2026
With Java's inflight counter removed (PR #5656), rejection now happens
asynchronously in Rust's send_command() rather than synchronously in
Java's AsyncRegistry. The tokio task ordering is non-deterministic, so
any of the N+1 requests may be the one rejected — not necessarily the
last one sent.

Update the test to wait for async processing and count rejected
requests instead of asserting specific request ordering.

Signed-off-by: Avi Fenesh <aviarchi1994@gmail.com>
avifenesh added a commit that referenced this pull request Mar 25, 2026
With Java's inflight counter removed (PR #5656), rejection now happens
asynchronously in Rust's send_command() rather than synchronously in
Java's AsyncRegistry. The tokio task ordering is non-deterministic, so
any of the N+1 requests may be the one rejected — not necessarily the
last one sent.

Update the test to wait for async processing and count rejected
requests instead of asserting specific request ordering.

Signed-off-by: Avi Fenesh <aviarchi1994@gmail.com>
avifenesh added a commit that referenced this pull request Mar 25, 2026
With Java's inflight counter removed (PR #5656), rejection now happens
asynchronously in Rust's send_command() rather than synchronously in
Java's AsyncRegistry. The tokio task ordering is non-deterministic, so
any of the N+1 requests may be the one rejected — not necessarily the
last one sent.

Update the test to wait for async processing and count rejected
requests instead of asserting specific request ordering.

Signed-off-by: Avi Fenesh <aviarchi1994@gmail.com>
avifenesh added a commit that referenced this pull request Mar 25, 2026
With Java's inflight counter removed (PR #5656), rejection now happens
asynchronously in Rust's send_command() rather than synchronously in
Java's AsyncRegistry. The tokio task ordering is non-deterministic, so
any of the N+1 requests may be the one rejected — not necessarily the
last one sent.

Update the test to wait for async processing and count rejected
requests instead of asserting specific request ordering.

Signed-off-by: Avi Fenesh <aviarchi1994@gmail.com>
avifenesh added a commit that referenced this pull request Mar 25, 2026
…on (#5658)

With Java's inflight counter removed (PR #5656), rejection now happens
asynchronously in Rust's send_command() rather than synchronously in
Java's AsyncRegistry. The tokio task ordering is non-deterministic, so
any of the N+1 requests may be the one rejected — not necessarily the
last one sent.

Update the test to wait for async processing and count rejected
requests instead of asserting specific request ordering.

Signed-off-by: Avi Fenesh <aviarchi1994@gmail.com>
avifenesh added a commit that referenced this pull request Mar 25, 2026
avifenesh added a commit that referenced this pull request Mar 25, 2026
avifenesh added a commit that referenced this pull request Mar 25, 2026
…e authority (#5656)"

This reverts commit 072ccdf.

Signed-off-by: Avi Fenesh <aviarchi1994@gmail.com>
avifenesh added a commit that referenced this pull request Mar 25, 2026
* Revert "fix(java): update inflight_requests_limit test for async Rust rejection (#5658)"

This reverts commit 9331bf4.

Signed-off-by: Avi Fenesh <aviarchi1994@gmail.com>

* Revert "fix(java): remove Java-side inflight counter, let Rust be sole authority (#5656)"

This reverts commit 072ccdf.

Signed-off-by: Avi Fenesh <aviarchi1994@gmail.com>

---------

Signed-off-by: Avi Fenesh <aviarchi1994@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants