fix(jwt): race conditions and IAM reliability issues with identity verification enabled#2609
fix(jwt): race conditions and IAM reliability issues with identity verification enabled#2609nan-li wants to merge 7 commits intofeat/identity_verification_5.8from
Conversation
IdentityVerificationService.onModelReplaced (config HYDRATE) could fire before OperationRepo.loadSavedOperations() finished, causing removeOperationsWithoutExternalId() to run against an empty queue. Wrap the HYDRATE handler in suspendifyOnIO + awaitInitialized() so the purge waits for the queue to be fully populated, following the same pattern as RecoverFromDroppedLoginBug. Made-with: Cursor
fireJwtInvalidated now buffers the externalId when no listeners are subscribed (e.g. during SDK init HYDRATE) and replays it when the first IUserJwtInvalidatedListener is added. Preserves async dispatch and runCatching for runtime 401 path. Made-with: Cursor
LoginUserOperationExecutor.createUser() was not passing the RYW token from the backend response to the ConsistencyManager, causing the IamFetchReadyCondition to never resolve after login. This meant InAppMessagesManager coroutines awaited forever and IAMs were never fetched for the logged-in user. - Add rywData field to CreateUserResponse - Parse ryw_token/ryw_delay in JSONConverter.convertToCreateUserResponse - Set RYW data in ConsistencyManager after successful createUser Made-with: Cursor
847102b to
e841049
Compare
When an IAM fetch returns an unauthorized response (401 or 403), the SDK now saves the pending fetch state and automatically retries once the JWT is refreshed for the same user. Switching users clears any stale retry. Made-with: Cursor
917e157 to
c61f42e
Compare
...sages/src/main/java/com/onesignal/inAppMessages/internal/backend/impl/InAppBackendService.kt
Show resolved
Hide resolved
...ain/java/com/onesignal/user/internal/operations/impl/executors/LoginUserOperationExecutor.kt
Show resolved
Hide resolved
...l/in-app-messages/src/main/java/com/onesignal/inAppMessages/internal/InAppMessagesManager.kt
Show resolved
Hide resolved
OneSignalSDK/onesignal/core/src/main/java/com/onesignal/user/internal/UserManager.kt
Outdated
Show resolved
Hide resolved
f948912 to
2b6e2cf
Compare
2b6e2cf to
60c8d9a
Compare
|
@claude re-review |
1 similar comment
|
@claude re-review |
|
@claude[agent] review |
|
@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it. You can try again without specifying a model (just If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: |
5 similar comments
|
@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it. You can try again without specifying a model (just If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: |
|
@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it. You can try again without specifying a model (just If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: |
|
@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it. You can try again without specifying a model (just If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: |
|
@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it. You can try again without specifying a model (just If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: |
|
@nan-li The model is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it. You can try again without specifying a model (just If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: |
|
@claude review |
...l/in-app-messages/src/main/java/com/onesignal/inAppMessages/internal/InAppMessagesManager.kt
Show resolved
Hide resolved
…race - Add UNAUTHORIZED check to fetchInAppMessagesWithoutRywToken so the JWT retry mechanism works when the fallback (no RYW token) path hits a 401/403. - Add @volatile to pendingJwtRetryExternalId and pendingJwtRetryRywData for cross-thread memory visibility. - Fix TOCTOU race in UserManager between fireJwtInvalidated and addJwtInvalidatedListener using a synchronized lock, preventing a lost JWT invalidated event when the two methods interleave. Made-with: Cursor
650e96f to
cfc3e6e
Compare
|
@claude review |
...l/in-app-messages/src/main/java/com/onesignal/inAppMessages/internal/InAppMessagesManager.kt
Show resolved
Hide resolved
...ain/java/com/onesignal/user/internal/operations/impl/executors/LoginUserOperationExecutor.kt
Show resolved
Hide resolved
- Add @volatile to lastTimeFetchedIAMs so the null write in the 401 catch block is visible to the JWT retry coroutine on another thread. - Reset lastTimeFetchedIAMs on 401 so the retry is not rate-limited. - Remove fetchIAMMinInterval=0L test workaround that masked the bug. - Wrap resolveConditionsWithID in mutex.withLock to match setRywData and getRywDataFromAwaitableCondition, preventing a data race on the shared conditions list. Made-with: Cursor
|
@claude review |
| if (NetworkUtils.getResponseStatusType(ex.statusCode) == NetworkUtils.ResponseStatusType.UNAUTHORIZED) { | ||
| Logging.debug("InAppMessagesManager.fetchMessages: ${ex.statusCode} response. Will retry after JWT refresh for externalId=$externalId") | ||
| pendingJwtRetryExternalId = externalId | ||
| pendingJwtRetryRywData = rywData |
There was a problem hiding this comment.
🔴 The volatile write order in the 401 catch block is wrong: pendingJwtRetryExternalId (the sentinel) is written at line 352 BEFORE pendingJwtRetryRywData (the data) at line 353, but onJwtUpdated uses the sentinel as a happens-before anchor. Under the JVM Memory Model, a volatile read of the sentinel on Thread R only establishes happens-before with writes that preceded that volatile write on Thread W; since pendingJwtRetryRywData is written AFTER the sentinel, Thread R can observe the sentinel as non-null but pendingJwtRetryRywData=null, triggering the early-return guard and permanently dropping the IAM retry for that JWT update. Fix: reverse the write order so pendingJwtRetryRywData is written first and pendingJwtRetryExternalId last, using the sentinel as a proper safe-publication barrier.
Extended reasoning...
What the bug is and how it manifests
In fetchMessages() at lines 352-353, both fields are @volatile (correctly added by Commit 5), but the write order between them is wrong. The sentinel pendingJwtRetryExternalId is written first, then pendingJwtRetryRywData second. In onJwtUpdated() at lines 1049-1051, the reader checks if (retryExternalId == null || retryRywData == null || ...) and returns early if either field is null, permanently dropping the retry.
The specific code path that triggers it
Thread W (IO dispatcher) in the 401 catch block executes in program order:
- pendingJwtRetryExternalId = externalId - volatile write W1 (sentinel, written FIRST)
- pendingJwtRetryRywData = rywData - volatile write W2 (data, written SECOND)
Thread R (the thread calling updateUserJwt) in onJwtUpdated executes:
- val retryExternalId = pendingJwtRetryExternalId - volatile read R1
- val retryRywData = pendingJwtRetryRywData - volatile read R2
Why existing code does not prevent it
Under the JVM Memory Model, a volatile write W1 synchronizes-with a volatile read R1 of the same field, establishing happens-before for all actions that preceded W1 on Thread W - but NOT for actions that follow W1. W2 (writing pendingJwtRetryRywData) occurs AFTER W1 on Thread W, so the synchronization edge W1 hb R1 does not put W2 in the happens-before chain visible to Thread R at R2. Thread R can observe R1 = non-null (W1 visible) while R2 = null (W2 not yet propagated). On ARM (all Android devices), stlr/ldar implement volatile stores/loads: a ldar for R1 acquires all stores preceding W1's stlr, but W2 is a later independent stlr - Thread R's ldar for R2 is not required to observe it.
What the impact would be
Thread R returns early from onJwtUpdated because retryRywData == null. The field pendingJwtRetryExternalId still holds the external ID (W2 eventually completes), but onJwtUpdated has already returned without scheduling a retry. Since developers are expected to call updateUserJwt() once in response to onUserJwtInvalidated, the IAM fetch is permanently lost for that session and in-app messages are never shown.
Step-by-step proof
- Thread W enters the 401 catch block and performs W1: pendingJwtRetryExternalId = "alice" (volatile write).
- Thread W performs W2: pendingJwtRetryRywData = rywData (volatile write, program-order AFTER W1).
- Before W2 is globally observable, Thread R calls updateUserJwt("alice"), which fires onJwtUpdated("alice").
- Thread R performs R1 on pendingJwtRetryExternalId. The volatile read establishes the synchronization edge W1 hb R1, so R1 = "alice" (non-null).
- Thread R performs R2 on pendingJwtRetryRywData. There is no happens-before edge to W2 (W2 follows W1 on Thread W, outside the hb chain), so R2 may still read null.
- Guard
retryRywData == nullis true. onJwtUpdated returns immediately. The IAM retry is permanently dropped.
How to fix it
Reverse the write order in the 401 catch block so pendingJwtRetryRywData is written FIRST (the data) and pendingJwtRetryExternalId is written LAST (the sentinel):
pendingJwtRetryRywData = rywData // write data first
pendingJwtRetryExternalId = externalId // sentinel last: reader seeing non-null externalId is JMM-guaranteed to see rywDataThis is the standard JMM safe-publication pattern. The sentinel is the last volatile write on the writer side, so the happens-before edge it creates with the reader's volatile read of the sentinel covers all prior writes on Thread W, including the data field.
| for ((condition, deferred) in conditions) { | ||
| if (condition.id == id) { | ||
| if (!deferred.isCompleted) { | ||
| deferred.complete(null) | ||
| } | ||
| } | ||
| completedConditions.add(Pair(condition, deferred)) | ||
| } | ||
| completedConditions.add(Pair(condition, deferred)) | ||
| } | ||
|
|
||
| // Remove completed conditions from the list | ||
| conditions.removeAll(completedConditions) | ||
| conditions.removeAll(completedConditions) | ||
| } |
There was a problem hiding this comment.
🟣 In ConsistencyManager.resolveConditionsWithID(), the line completedConditions.add(Pair(condition, deferred)) at line 68 is placed OUTSIDE the if (condition.id == id) block (which closes at line 67), so every condition in the list—regardless of ID match—is added to completedConditions and then removed by conditions.removeAll(completedConditions). This is a pre-existing logic error that the PR wrapped with mutex.withLock but did not fix; the PR adds a new call site at LoginUserOperationExecutor.kt:234 that increases exposure.
Extended reasoning...
What the bug is and how it manifests
In ConsistencyManager.resolveConditionsWithID() (lines 58–74), the statement completedConditions.add(Pair(condition, deferred)) at line 68 sits outside the if (condition.id == id) block. That if block ends at line 67 (closing brace). As a result, every iteration of the loop—regardless of whether condition.id matches—appends to completedConditions, and the subsequent conditions.removeAll(completedConditions) removes every registered condition, not just those matching id.
The specific code path that triggers it
Calling resolveConditionsWithID("IamFetchReadyCondition") when two conditions with different IDs are registered (e.g., type A with ID "IamFetchReadyCondition" and type B with a different static ID) would: (1) complete type A's deferred with null (correct), (2) silently remove type B from the list WITHOUT completing its deferred (incorrect). Type B's coroutine remains suspended on .await() indefinitely—the condition can never be fulfilled because it no longer exists in the list to be checked.
Why existing code does not prevent it
The indentation in the diff makes it look as though completedConditions.add() is inside the if block, but it is not—there is no closing brace before it for the if. The contrast with checkConditionsAndComplete() (line 88) makes the intent clear: that method correctly places completedConditions.add() inside if (condition.isMet(indexedTokens)), only accumulating entries that actually satisfied the predicate. The PR adds a mutex.withLock wrapper (fixing the data-race issue separately reported) but leaves the misplaced add() unchanged.
What the impact would be
In the current codebase all registered conditions are IamFetchReadyCondition instances sharing a single static ID constant (IamFetchReadyCondition.ID = "IamFetchReadyCondition"), so every entry matches the if check anyway and the bug is a no-op today. However, if any second condition type with a different static ID is ever registered and resolveConditionsWithID(typeA_ID) is called, type B conditions are silently removed with their deferreds uncompleted, permanently hanging any coroutine that called getRywDataFromAwaitableCondition() for a type B condition. The PR adds a new call site in LoginUserOperationExecutor.kt:234, increasing the surface area and making the latent bug more likely to manifest as the SDK's consistency manager usage grows.
Step-by-step proof
- Type A condition (id = "IamFetchReadyCondition") and type B condition (id = "SomeOtherCondition") are both registered; each creates a
CompletableDeferred.conditionslist has two entries. resolveConditionsWithID("IamFetchReadyCondition")is called.- Loop iteration 1:
condition.id == "IamFetchReadyCondition"→ true; deferred A is completed with null. ThencompletedConditions.add()appends entry A (correct). - Loop iteration 2:
condition.id == "IamFetchReadyCondition"→ false (type B); deferred B is NOT completed. ThencompletedConditions.add()appends entry B anyway (incorrect). conditions.removeAll(completedConditions)removes both A and B from the live list.- Coroutine awaiting deferred B remains suspended forever; it cannot be resolved because B no longer exists in
conditionsto be matched by any futuresetRywDataorresolveConditionsWithIDcall.
How to fix it
Move completedConditions.add(Pair(condition, deferred)) inside the if (condition.id == id) block, mirroring the pattern in checkConditionsAndComplete():
for ((condition, deferred) in conditions) {
if (condition.id == id) {
if (\!deferred.isCompleted) {
deferred.complete(null)
}
completedConditions.add(Pair(condition, deferred)) // inside the if
}
}
Description
One Line Summary
Fix multiple issues with identity verification (IV) enabled: startup race conditions, missing JWT invalidated callback, stuck IAM fetch after login, and IAM retry on expired JWT.
Details
Motivation
With identity verification enabled, several issues surfaced during testing:
onUserJwtInvalidatedlistener may not be called with JWT was missing at SDK init timeScope
All changes are gated behind identity verification being enabled (
useIdentityVerification == true). Normal (non-IV) SDK behavior is unaffected.Commit 1 -- Fix race condition: purge anonymous ops AFTER queue is loaded
IdentityVerificationService.onModelReplaced(config HYDRATE) could fire beforeOperationRepo.loadSavedOperations()finished, causingremoveOperationsWithoutExternalId()to run against an empty queuesuspendifyOnIO+awaitInitialized(), following the same pattern asRecoverFromDroppedLoginBugCommit 2 -- Replay JWT invalidated event to late-registered listeners
fireJwtInvalidatednow buffers theexternalIdwhen no listeners are subscribed (e.g. during SDK init HYDRATE) and replays it when the firstIUserJwtInvalidatedListeneris addedCommit 3 -- Fix IAM fetch stuck after login with IV enabled
LoginUserOperationExecutor.createUser()was not passing the RYW token from the backend response to theConsistencyManager, causingIamFetchReadyConditionto not resolverywDatafield toCreateUserResponse, parsedryw_token/ryw_delayinJSONConverter, and set RYW data inConsistencyManagerafter successfulcreateUserCommit 4 -- Retry IAM fetch after JWT refresh on 401/403 response
InAppBackendServicenow throwsBackendExceptionon unauthorized (401/403) responses instead of returningnullInAppMessagesManagercatches the exception, stores a pending retry state, and retries the fetch whenJwtTokenStorenotifies that the JWT has been refreshed for the same userCommit 5 -- Fix PR review issues: fallback 401 handling, volatile fields, TOCTOU race
fetchInAppMessagesWithoutRywToken(the fallback path after RYW retries are exhausted) was missing the 401/403 check, silently returningnulland bypassing the JWT retry mechanism@VolatiletopendingJwtRetryExternalIdandpendingJwtRetryRywDatafor cross-thread memory visibilityUserManagerbetweenfireJwtInvalidatedandaddJwtInvalidatedListener: without synchronization, a listener could subscribe between thehasSubscriberscheck and thependingJwtInvalidatedExternalIdstore, causing the event to be lost permanentlyTesting
Unit testing
LoginUserOperationExecutorTeststo include the newIConsistencyManagerdependency (all 16 constructor calls)InAppBackendServiceTests: 401 response throwsBackendExceptionInAppBackendServiceTests: 401 from fallback (no RYW token) path throwsBackendExceptionInAppMessagesManagerTestscovering the JWT 401 retry flow:Manual testing
updateUserJwtis calledAffected code checklist
Checklist
Overview
Testing
Final pass