Skip to content

fix(jwt): race conditions and IAM reliability issues with identity verification enabled#2609

Open
nan-li wants to merge 7 commits intofeat/identity_verification_5.8from
fix/jwt_misc_race_conditions
Open

fix(jwt): race conditions and IAM reliability issues with identity verification enabled#2609
nan-li wants to merge 7 commits intofeat/identity_verification_5.8from
fix/jwt_misc_race_conditions

Conversation

@nan-li
Copy link
Copy Markdown
Contributor

@nan-li nan-li commented Apr 9, 2026

Description

One Line Summary

Fix multiple issues with identity verification (IV) enabled: startup race conditions, missing JWT invalidated callback, stuck IAM fetch after login, and IAM retry on expired JWT.

Details

Motivation

With identity verification enabled, several issues surfaced during testing:

  • Anonymous operations were sometimes not purged on cold start due to timing
  • The onUserJwtInvalidated listener may not be called with JWT was missing at SDK init time
  • In-App Messages may not be fetched after the first login
  • If a JWT expired mid-session, the IAM fetch could silently failed with no recovery

Scope

All changes are gated behind identity verification being enabled (useIdentityVerification == true). Normal (non-IV) SDK behavior is unaffected.

Commit 1 -- Fix race condition: purge anonymous ops AFTER queue is loaded

  • IdentityVerificationService.onModelReplaced (config HYDRATE) could fire before OperationRepo.loadSavedOperations() finished, causing removeOperationsWithoutExternalId() to run against an empty queue
  • Fixed by wrapping the HYDRATE handler in suspendifyOnIO + awaitInitialized(), following the same pattern as RecoverFromDroppedLoginBug

Commit 2 -- Replay JWT invalidated event to late-registered listeners

  • fireJwtInvalidated now buffers the externalId when no listeners are subscribed (e.g. during SDK init HYDRATE) and replays it when the first IUserJwtInvalidatedListener is added

Commit 3 -- Fix IAM fetch stuck after login with IV enabled

  • LoginUserOperationExecutor.createUser() was not passing the RYW token from the backend response to the ConsistencyManager, causing IamFetchReadyCondition to not resolve
  • Added rywData field to CreateUserResponse, parsed ryw_token/ryw_delay in JSONConverter, and set RYW data in ConsistencyManager after successful createUser

Commit 4 -- Retry IAM fetch after JWT refresh on 401/403 response

  • InAppBackendService now throws BackendException on unauthorized (401/403) responses instead of returning null
  • InAppMessagesManager catches the exception, stores a pending retry state, and retries the fetch when JwtTokenStore notifies that the JWT has been refreshed for the same user
  • Pending retry is cleared on user switch to avoid stale retries

Commit 5 -- Fix PR review issues: fallback 401 handling, volatile fields, TOCTOU race

  • fetchInAppMessagesWithoutRywToken (the fallback path after RYW retries are exhausted) was missing the 401/403 check, silently returning null and bypassing the JWT retry mechanism
  • Added @Volatile to pendingJwtRetryExternalId and pendingJwtRetryRywData for cross-thread memory visibility
  • Fixed a TOCTOU race in UserManager between fireJwtInvalidated and addJwtInvalidatedListener: without synchronization, a listener could subscribe between the hasSubscribers check and the pendingJwtInvalidatedExternalId store, causing the event to be lost permanently

Testing

Unit testing

  • Updated LoginUserOperationExecutorTests to include the new IConsistencyManager dependency (all 16 constructor calls)
  • Added test in InAppBackendServiceTests: 401 response throws BackendException
  • Added test in InAppBackendServiceTests: 401 from fallback (no RYW token) path throws BackendException
  • Added 5 tests in InAppMessagesManagerTests covering the JWT 401 retry flow:
    • 401 stores pending retry state
    • JWT refresh for matching user triggers retry
    • JWT refresh for different user does not retry
    • No-op when no pending retry exists
    • User switch clears pending retry state

Manual testing

  • Tested with IV enabled on a fresh install: verified no IAM fetch for anonymous user, IAM fetched after login
  • Tested JWT expiry mid-session: verified IAM fetch retries after updateUserJwt is called
  • Tested login/logout cycle: verified pending retry is cleared on user switch

Affected code checklist

  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing (fix IV-related issues for 5.8 release)
  • Any Public API changes are explained in the PR details and conform to existing APIs

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible
  • I have personally tested this on my device, or explained why that is not possible

Final pass

  • Code is as readable as possible
  • I have reviewed this PR myself, ensuring it meets each checklist item

nan-li added 3 commits April 9, 2026 10:08
IdentityVerificationService.onModelReplaced (config HYDRATE) could
fire before OperationRepo.loadSavedOperations() finished, causing
removeOperationsWithoutExternalId() to run against an empty queue.

Wrap the HYDRATE handler in suspendifyOnIO + awaitInitialized() so
the purge waits for the queue to be fully populated, following the
same pattern as RecoverFromDroppedLoginBug.

Made-with: Cursor
fireJwtInvalidated now buffers the externalId when no listeners
are subscribed (e.g. during SDK init HYDRATE) and replays it when
the first IUserJwtInvalidatedListener is added. Preserves async
dispatch and runCatching for runtime 401 path.

Made-with: Cursor
LoginUserOperationExecutor.createUser() was not passing the RYW token
from the backend response to the ConsistencyManager, causing the
IamFetchReadyCondition to never resolve after login. This meant
InAppMessagesManager coroutines awaited forever and IAMs were never
fetched for the logged-in user.

- Add rywData field to CreateUserResponse
- Parse ryw_token/ryw_delay in JSONConverter.convertToCreateUserResponse
- Set RYW data in ConsistencyManager after successful createUser

Made-with: Cursor
@nan-li nan-li changed the title Fix/jwt misc race conditions fix(jwt): misc race conditions Apr 9, 2026
@nan-li nan-li changed the title fix(jwt): misc race conditions fix(jwt): Fix race conditions and reliability issues with identity verification enabled Apr 9, 2026
@nan-li nan-li changed the title fix(jwt): Fix race conditions and reliability issues with identity verification enabled fix(jwt): Fix race conditions and IAM reliability issues with identity verification enabled Apr 9, 2026
@nan-li nan-li changed the title fix(jwt): Fix race conditions and IAM reliability issues with identity verification enabled fix(jwt): race conditions and IAM reliability issues with identity verification enabled Apr 9, 2026
@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch from 847102b to e841049 Compare April 9, 2026 19:56
When an IAM fetch returns an unauthorized response (401 or 403), the SDK
now saves the pending fetch state and automatically retries once the JWT
is refreshed for the same user. Switching users clears any stale retry.

Made-with: Cursor
@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch from 917e157 to c61f42e Compare April 9, 2026 20:01
@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch from f948912 to 2b6e2cf Compare April 9, 2026 20:21
@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch from 2b6e2cf to 60c8d9a Compare April 10, 2026 03:01
@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 10, 2026

@claude re-review

1 similar comment
@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 10, 2026

@claude re-review

@nan-li nan-li requested review from a team and abdulraqeeb33 April 10, 2026 04:22
@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 10, 2026

@claude[agent] review

@Claude
Copy link
Copy Markdown

Claude AI commented Apr 10, 2026

@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: a77cf2eb-5f09-4178-bd26-a35f6f4f18ec

5 similar comments
@Claude
Copy link
Copy Markdown

Claude AI commented Apr 10, 2026

@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: a77cf2eb-5f09-4178-bd26-a35f6f4f18ec

@Claude
Copy link
Copy Markdown

Claude AI commented Apr 10, 2026

@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: a77cf2eb-5f09-4178-bd26-a35f6f4f18ec

@Claude
Copy link
Copy Markdown

Claude AI commented Apr 10, 2026

@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: a77cf2eb-5f09-4178-bd26-a35f6f4f18ec

@Claude
Copy link
Copy Markdown

Claude AI commented Apr 10, 2026

@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: a77cf2eb-5f09-4178-bd26-a35f6f4f18ec

@Claude
Copy link
Copy Markdown

Claude AI commented Apr 10, 2026

@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: a77cf2eb-5f09-4178-bd26-a35f6f4f18ec

@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 10, 2026

@claude review

…race

- Add UNAUTHORIZED check to fetchInAppMessagesWithoutRywToken so the
  JWT retry mechanism works when the fallback (no RYW token) path hits
  a 401/403.
- Add @volatile to pendingJwtRetryExternalId and pendingJwtRetryRywData
  for cross-thread memory visibility.
- Fix TOCTOU race in UserManager between fireJwtInvalidated and
  addJwtInvalidatedListener using a synchronized lock, preventing a
  lost JWT invalidated event when the two methods interleave.

Made-with: Cursor
@nan-li nan-li force-pushed the fix/jwt_misc_race_conditions branch from 650e96f to cfc3e6e Compare April 10, 2026 18:01
@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 10, 2026

@claude review

- Add @volatile to lastTimeFetchedIAMs so the null write in the 401
  catch block is visible to the JWT retry coroutine on another thread.
- Reset lastTimeFetchedIAMs on 401 so the retry is not rate-limited.
- Remove fetchIAMMinInterval=0L test workaround that masked the bug.
- Wrap resolveConditionsWithID in mutex.withLock to match setRywData
  and getRywDataFromAwaitableCondition, preventing a data race on the
  shared conditions list.

Made-with: Cursor
@nan-li
Copy link
Copy Markdown
Contributor Author

nan-li commented Apr 10, 2026

@claude review

Comment on lines +350 to +353
if (NetworkUtils.getResponseStatusType(ex.statusCode) == NetworkUtils.ResponseStatusType.UNAUTHORIZED) {
Logging.debug("InAppMessagesManager.fetchMessages: ${ex.statusCode} response. Will retry after JWT refresh for externalId=$externalId")
pendingJwtRetryExternalId = externalId
pendingJwtRetryRywData = rywData
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The volatile write order in the 401 catch block is wrong: pendingJwtRetryExternalId (the sentinel) is written at line 352 BEFORE pendingJwtRetryRywData (the data) at line 353, but onJwtUpdated uses the sentinel as a happens-before anchor. Under the JVM Memory Model, a volatile read of the sentinel on Thread R only establishes happens-before with writes that preceded that volatile write on Thread W; since pendingJwtRetryRywData is written AFTER the sentinel, Thread R can observe the sentinel as non-null but pendingJwtRetryRywData=null, triggering the early-return guard and permanently dropping the IAM retry for that JWT update. Fix: reverse the write order so pendingJwtRetryRywData is written first and pendingJwtRetryExternalId last, using the sentinel as a proper safe-publication barrier.

Extended reasoning...

What the bug is and how it manifests

In fetchMessages() at lines 352-353, both fields are @volatile (correctly added by Commit 5), but the write order between them is wrong. The sentinel pendingJwtRetryExternalId is written first, then pendingJwtRetryRywData second. In onJwtUpdated() at lines 1049-1051, the reader checks if (retryExternalId == null || retryRywData == null || ...) and returns early if either field is null, permanently dropping the retry.

The specific code path that triggers it

Thread W (IO dispatcher) in the 401 catch block executes in program order:

  1. pendingJwtRetryExternalId = externalId - volatile write W1 (sentinel, written FIRST)
  2. pendingJwtRetryRywData = rywData - volatile write W2 (data, written SECOND)

Thread R (the thread calling updateUserJwt) in onJwtUpdated executes:

  1. val retryExternalId = pendingJwtRetryExternalId - volatile read R1
  2. val retryRywData = pendingJwtRetryRywData - volatile read R2

Why existing code does not prevent it

Under the JVM Memory Model, a volatile write W1 synchronizes-with a volatile read R1 of the same field, establishing happens-before for all actions that preceded W1 on Thread W - but NOT for actions that follow W1. W2 (writing pendingJwtRetryRywData) occurs AFTER W1 on Thread W, so the synchronization edge W1 hb R1 does not put W2 in the happens-before chain visible to Thread R at R2. Thread R can observe R1 = non-null (W1 visible) while R2 = null (W2 not yet propagated). On ARM (all Android devices), stlr/ldar implement volatile stores/loads: a ldar for R1 acquires all stores preceding W1's stlr, but W2 is a later independent stlr - Thread R's ldar for R2 is not required to observe it.

What the impact would be

Thread R returns early from onJwtUpdated because retryRywData == null. The field pendingJwtRetryExternalId still holds the external ID (W2 eventually completes), but onJwtUpdated has already returned without scheduling a retry. Since developers are expected to call updateUserJwt() once in response to onUserJwtInvalidated, the IAM fetch is permanently lost for that session and in-app messages are never shown.

Step-by-step proof

  1. Thread W enters the 401 catch block and performs W1: pendingJwtRetryExternalId = "alice" (volatile write).
  2. Thread W performs W2: pendingJwtRetryRywData = rywData (volatile write, program-order AFTER W1).
  3. Before W2 is globally observable, Thread R calls updateUserJwt("alice"), which fires onJwtUpdated("alice").
  4. Thread R performs R1 on pendingJwtRetryExternalId. The volatile read establishes the synchronization edge W1 hb R1, so R1 = "alice" (non-null).
  5. Thread R performs R2 on pendingJwtRetryRywData. There is no happens-before edge to W2 (W2 follows W1 on Thread W, outside the hb chain), so R2 may still read null.
  6. Guard retryRywData == null is true. onJwtUpdated returns immediately. The IAM retry is permanently dropped.

How to fix it

Reverse the write order in the 401 catch block so pendingJwtRetryRywData is written FIRST (the data) and pendingJwtRetryExternalId is written LAST (the sentinel):

pendingJwtRetryRywData = rywData       // write data first
pendingJwtRetryExternalId = externalId // sentinel last: reader seeing non-null externalId is JMM-guaranteed to see rywData

This is the standard JMM safe-publication pattern. The sentinel is the last volatile write on the writer side, so the happens-before edge it creates with the reader's volatile read of the sentinel covers all prior writes on Thread W, including the data field.

Comment on lines +62 to +73
for ((condition, deferred) in conditions) {
if (condition.id == id) {
if (!deferred.isCompleted) {
deferred.complete(null)
}
}
completedConditions.add(Pair(condition, deferred))
}
completedConditions.add(Pair(condition, deferred))
}

// Remove completed conditions from the list
conditions.removeAll(completedConditions)
conditions.removeAll(completedConditions)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟣 In ConsistencyManager.resolveConditionsWithID(), the line completedConditions.add(Pair(condition, deferred)) at line 68 is placed OUTSIDE the if (condition.id == id) block (which closes at line 67), so every condition in the list—regardless of ID match—is added to completedConditions and then removed by conditions.removeAll(completedConditions). This is a pre-existing logic error that the PR wrapped with mutex.withLock but did not fix; the PR adds a new call site at LoginUserOperationExecutor.kt:234 that increases exposure.

Extended reasoning...

What the bug is and how it manifests

In ConsistencyManager.resolveConditionsWithID() (lines 58–74), the statement completedConditions.add(Pair(condition, deferred)) at line 68 sits outside the if (condition.id == id) block. That if block ends at line 67 (closing brace). As a result, every iteration of the loop—regardless of whether condition.id matches—appends to completedConditions, and the subsequent conditions.removeAll(completedConditions) removes every registered condition, not just those matching id.

The specific code path that triggers it

Calling resolveConditionsWithID("IamFetchReadyCondition") when two conditions with different IDs are registered (e.g., type A with ID "IamFetchReadyCondition" and type B with a different static ID) would: (1) complete type A's deferred with null (correct), (2) silently remove type B from the list WITHOUT completing its deferred (incorrect). Type B's coroutine remains suspended on .await() indefinitely—the condition can never be fulfilled because it no longer exists in the list to be checked.

Why existing code does not prevent it

The indentation in the diff makes it look as though completedConditions.add() is inside the if block, but it is not—there is no closing brace before it for the if. The contrast with checkConditionsAndComplete() (line 88) makes the intent clear: that method correctly places completedConditions.add() inside if (condition.isMet(indexedTokens)), only accumulating entries that actually satisfied the predicate. The PR adds a mutex.withLock wrapper (fixing the data-race issue separately reported) but leaves the misplaced add() unchanged.

What the impact would be

In the current codebase all registered conditions are IamFetchReadyCondition instances sharing a single static ID constant (IamFetchReadyCondition.ID = "IamFetchReadyCondition"), so every entry matches the if check anyway and the bug is a no-op today. However, if any second condition type with a different static ID is ever registered and resolveConditionsWithID(typeA_ID) is called, type B conditions are silently removed with their deferreds uncompleted, permanently hanging any coroutine that called getRywDataFromAwaitableCondition() for a type B condition. The PR adds a new call site in LoginUserOperationExecutor.kt:234, increasing the surface area and making the latent bug more likely to manifest as the SDK's consistency manager usage grows.

Step-by-step proof

  1. Type A condition (id = "IamFetchReadyCondition") and type B condition (id = "SomeOtherCondition") are both registered; each creates a CompletableDeferred. conditions list has two entries.
  2. resolveConditionsWithID("IamFetchReadyCondition") is called.
  3. Loop iteration 1: condition.id == "IamFetchReadyCondition" → true; deferred A is completed with null. Then completedConditions.add() appends entry A (correct).
  4. Loop iteration 2: condition.id == "IamFetchReadyCondition" → false (type B); deferred B is NOT completed. Then completedConditions.add() appends entry B anyway (incorrect).
  5. conditions.removeAll(completedConditions) removes both A and B from the live list.
  6. Coroutine awaiting deferred B remains suspended forever; it cannot be resolved because B no longer exists in conditions to be matched by any future setRywData or resolveConditionsWithID call.

How to fix it

Move completedConditions.add(Pair(condition, deferred)) inside the if (condition.id == id) block, mirroring the pattern in checkConditionsAndComplete():

for ((condition, deferred) in conditions) {
    if (condition.id == id) {
        if (\!deferred.isCompleted) {
            deferred.complete(null)
        }
        completedConditions.add(Pair(condition, deferred))  // inside the if
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants