Support delaying definition recompute to the background #1495
qtomlinson merged 44 commits into clearlydefined:master from
Conversation
… compute policy path
- Rename queueComputePolicy files/classes to delayedComputePolicy / DelayedComputePolicy
- Update imports, tests, and tsconfig includes to new names
- Add upgrade.service.delayed alias to delayed recompute factory
- Keep upgradeQueue for backward compatibility (deprecated in typings)
- Add existingDefinitionPolicy to treat existing definitions as valid on compute queue processing
- Wire delayed compute setupProcessing to use existingDefinitionPolicy
- Add tests for skip-on-existing and recompute-on-missing compute-queue behavior
- Include new policy files in tsconfig includes
- Create a shared in-memory lock cache in RecomputeHandler
- Pass shared cache to both upgrade and delayed compute setupProcessing paths
- Extend process.setup to accept optional sharedCache parameter
- Keep existingDefinitionPolicy for compute missing-only semantics
- Align JS and d.ts signatures for optional shared cache injection
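The sharing described above can be sketched as follows. This is a minimal illustration with hypothetical class and method names (`MemoryCache`, `RecomputeHandlerSketch`), not the project's actual implementation: the handler creates one TTL-bounded cache and hands the same instance to both processing paths, so a coordinate locked by one path is visible to the other.

```javascript
// Minimal in-memory TTL cache; the TTL acts as an upper bound on any lock.
class MemoryCache {
  constructor() {
    this._entries = new Map()
  }
  get(key) {
    const entry = this._entries.get(key)
    if (!entry) return undefined
    if (Date.now() > entry.expires) {
      this._entries.delete(key) // expired lock: treat as released
      return undefined
    }
    return entry.value
  }
  set(key, value, ttlSeconds) {
    this._entries.set(key, { value, expires: Date.now() + ttlSeconds * 1000 })
  }
}

// Hypothetical handler: the authoritative owner of the shared cache and its TTL.
class RecomputeHandlerSketch {
  static lockTtlSeconds = 20 * 60
  constructor(upgradePolicy, computePolicy) {
    this._sharedCache = new MemoryCache()
    this.upgradePolicy = upgradePolicy
    this.computePolicy = computePolicy
  }
  setupProcessing() {
    // Both paths receive the *same* cache instance, coordinating their locks.
    this.upgradePolicy.setupProcessing(this._sharedCache)
    this.computePolicy.setupProcessing(this._sharedCache)
  }
}
```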
c0d5a98 to 195f946
The previous name implied a domain concept about definition existence. The class is a null-behavior upgrade validator that skips staleness checks entirely — SkipUpgradePolicy better expresses that intent.
- Rename files and class in providers/upgrade/
- Update import in delayedComputePolicy
- Update tsconfig includes
defaultFactory only uses logger from its options object; any other
properties (e.g. queue passed from app wiring) were silently ignored.
Destructure { logger } in the parameter to make this explicit at the
call boundary, and narrow the type declaration accordingly.
…terface
The UpgradeHandler interface declares validate returning Promise<Definition | null>. SkipUpgradePolicy was returning undefined on the falsy path instead of null, causing a minor type mismatch. Changed the return to null to align with the interface.
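The contract fix can be illustrated with a small sketch (hypothetical class name; the real class lives in providers/upgrade/): the falsy path resolves to null explicitly, never undefined, so the resolved type matches Definition | null.

```javascript
// Sketch of the null-behavior validator: validate() must resolve to
// Definition | null, so the falsy branch returns null rather than undefined.
class SkipUpgradePolicySketch {
  async validate(definition) {
    // Skips staleness checks entirely; only normalizes the falsy case.
    return definition || null
  }
}
```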
- Convert OnDemandComputePolicy from plain object to named class so getPolicyName() returns a meaningful name instead of "Object"
- Remove createOnDemandComputePolicy and createDelayedComputePolicy factory functions as they were thin wrappers over new
- Update recomputeHandler and tests to use constructors directly
The compute queue processor uses SkipUpgradePolicy, which ignores definition contents entirely. Only coordinates are read from the message, so the empty _meta field was dead weight.
Pass a plain DefinitionVersionChecker instead of relying on setup()'s default parameter, making it clear the processor intentionally uses a version checker (not the queue upgrader itself) to avoid re-queuing stale definitions during background processing.
…andler
RecomputeHandler is the authoritative owner of the shared cache, so its TTL constant belongs there. DefinitionUpgrader.defaultTtlSeconds made private (_defaultTtlSeconds) as it is an internal fallback only. Removes the implicit coupling where changing the upgrader TTL would silently affect the handler cache.
- DelayedComputePolicy constructor now requires options (removes misleading default stub)
- OnDemandComputePolicy.setupProcessing is now async (consistent with interface and sibling class)
- SkipUpgradePolicy drops currentSchema getter/setter (irrelevant since it skips staleness checks)
- Update .d.ts files to match
- Resolve conflict in definitionServiceTest.ts: keep recomputeHandler with full mock (initialize, setupProcessing, compute) while adopting TS types from master's upgradeHandler refactor
- Add types/chai-as-promised.d.ts to fix rejectedWith TS type errors introduced by master's test-to-TypeScript migration
…ation
The Function type is required to match chai-as-promised's API signature for error constructors passed to rejectedWith.
Replace verbose inline JSDoc typedef with a named interface in the .d.ts file. The .js file now imports the type instead of redeclaring it, along with DelayedFactoryOptions.
@types/chai-as-promised already provides sufficient types for rejectedWith and fulfilled assertions.
Replace direct DefinitionVersionChecker/DefinitionQueueUpgrader instantiation with defaultFactory/delayedFactory to test the real production wiring. Move on-demand-specific tests out of the shared helper since they don't apply to the queued path. Extract duplicated passthrough recompute handler into createPassthroughRecomputeHandler.
Internal methods _queueCompute and _constructMessage should not be part of the public type surface.
… Promise<Definition>
No implementation returns undefined — _cast() at the call site would crash if it did.
…tlSeconds from public API
sharedCache: The 'shared' qualifier is an implementation detail of RecomputeHandler, which creates one cache and passes it to both policies to coordinate locking. The receiving layers (setupProcessing, setup, DefinitionUpgrader) don't care whether the cache is shared — cache is the right name there.
defaultTtlSeconds: _defaultTtlSeconds is a private field in DefinitionUpgrader. It was incorrectly declared as a public static in process.d.ts. Removed from the type declaration; the default TTL remains in the implementation.
29d46d8 to b80f4c9
```js
while (this._upgradeLock.get(coordinates.toString())) {
  await new Promise(resolve => setTimeout(resolve, DefinitionUpgrader.delayInMSeconds))
}
```
This spins at 500ms intervals with no upper bound. If a compute hangs or takes a long time for one coordinate, every queued message for that same coordinate backs up here and holds the processor. A max-retry count with a bail-out would keep a stuck compute from stalling the queue indefinitely. Even logging after N iterations would help with debugging.
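The suggested bail-out could look like the following sketch (hypothetical helper name and parameters; the real loop lives in process.js): the wait is bounded by a retry count, and the processor logs and gives up rather than spinning forever on a stuck coordinate.

```javascript
// Bounded spin-wait: poll the lock at a fixed interval, but bail out and log
// after a maximum number of attempts so one stuck compute cannot stall the
// queue processor indefinitely. Returns true if the lock cleared in time.
const delayInMs = 500
const maxRetries = 10 // upper bound: maxRetries * delayInMs total wait

async function waitForUnlock(lock, key, { delay = delayInMs, retries = maxRetries, logger = console } = {}) {
  for (let attempt = 0; attempt < retries; attempt++) {
    if (!lock.get(key)) return true
    await new Promise(resolve => setTimeout(resolve, delay))
  }
  logger.info(`Lock for ${key} still held after ${retries} attempts; bailing out`)
  return false
}
```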
- The cache TTL for _upgradeLock is set to 5 minutes. This bounds any single wait and prevents stalling indefinitely.
- However, if compute is fundamentally broken for a coordinate, each attempt gets a full 5-minute TTL window with no counter tracking how many times it has failed. Repeated failures across expiry cycles are unbounded. Adding logging after N expiry cycles, or a max-retry count to stop retrying a stuck coordinate, would address this.
The existing on-demand compute path uses the same spin-lock pattern with the same 5-minute TTL bound. Neither path has compute-level retry logging or circuit-breaking for a coordinate that repeatedly fails. Adding logging after N expiry cycles would improve observability in both paths, and is a good follow-up improvement. Thoughts?
After reviewing the compute duration logs, the TTL has been updated from 5 minutes to 20 minutes to better reflect the actual compute window (p95 ~16min over 20 days). Logging after N expiry cycles remains a good follow-up improvement for observability.
```js
dequeueOptions: {
  numOfMessages:
    config.get('DEFINITION_COMPUTE_DEQUEUE_BATCH_SIZE') || config.get('DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE') || 16,
  visibilityTimeout: 10 * 60 // 10 min. The default value is 30 seconds.
}
```
The compute queue sets visibilityTimeout: 10 * 60 (10 min), but the upgrade queue doesn't set one (Azure default is 30 seconds). That makes sense if compute is expected to be heavier, but if a compute actually takes longer than 10 minutes, the message becomes visible again and a second consumer could pick it up. The in-memory lock would prevent double-compute on the same instance, but not across multiple service instances. Is this a realistic scenario, or is compute expected to finish well under 10 minutes?
Good question! Both the upgrade and compute queues use the same visibilityTimeout of 10 minutes. The visibility timeout was initially set based on the harvest queue configuration, at providers/harvest/azureQueueConfig.js. After reviewing the logs, the actual compute times are higher than expected. Here are the compute and store duration stats:
7-day: p50 245ms, p85 ~1min, p95 ~11min
20-day: p50 245ms, p95 ~16min
The visibility timeout and related cache TTLs have been bumped to 20 minutes to cover the p95 compute window. Aligning the harvest queue visibility timeout is a good follow-up.
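After this thread, the queue options would plausibly look like the sketch below (the option names mirror the excerpt above; the hardcoded values reflect the decisions in this conversation, not verified final code). Azure Storage Queue caps GetMessages at 32 messages per call, so 32 is the effective batch ceiling.

```javascript
// Assumed post-review shape of the compute queue options: batch size pinned
// to the Azure Storage Queue maximum, visibility timeout anchored to the
// observed p95 compute duration (~16 min over 20 days).
const dequeueOptions = {
  numOfMessages: 32, // Azure Storage Queue maximum per GetMessages call
  visibilityTimeout: 20 * 60 // 20 min; the Azure default is 30 seconds
}
```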
```json
"providers/upgrade/computePolicy.d.ts",
"providers/upgrade/skipUpgradePolicy.d.ts",
"providers/upgrade/skipUpgradePolicy.js",
"providers/upgrade/onDemandComputePolicy.d.ts",
"providers/upgrade/onDemandComputePolicy.js",
"providers/upgrade/delayedComputePolicy.d.ts",
"providers/upgrade/delayedComputePolicy.js",
"providers/upgrade/recomputeHandler.d.ts",
"providers/upgrade/recomputeHandler.js"
```
JamieMagee left a comment
The upgrade/compute policy split is the right call here. I like that defaultFactory reproduces existing behavior without touching any of the new code paths, so delayed is purely opt-in.
I'd want the unbounded while(lock) spin in process.js addressed before merge. If a compute hangs, that loop stalls the queue processor forever. Also, the delayed path enqueues a new message for every request hitting the same missing coordinate. The consumer deduplicates, but at 7K-10K req/hr you're generating a lot of queue trash that gets dequeued, locked, checked, and thrown away. Even a quick computeLock.get() check before enqueuing would help.
Suppress duplicate queue messages for the same coordinate within a single service instance. Before enqueuing, check an in-memory cache keyed by coordinates.toString(); if already enqueued, skip and return the placeholder. Cache is memory-only (not Redis): the goal is burst suppression within one instance, not cross-instance dedup. Cross-instance correctness is already handled by the consumer side — the shared Redis definition cache serializes processing, and SkipUpgradePolicy skips recompute when a definition already exists. Adding Redis here would introduce hot-path latency for a benefit already provided elsewhere. TTL is 5 minutes (matching the spin-lock TTL), well above the p95 compute window (~6s). Cache key uses coordinates.toString() without toLowerCase() — casing normalization is definitionService's concern only. Cache is injectable via options.enqueueCache for testing.
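The pre-enqueue check described above can be sketched as follows (hypothetical helper names; the real code wires the cache via `options.enqueueCache`): a memory-only TTL cache keyed by `coordinates.toString()` suppresses duplicate enqueues within one instance, while cross-instance correctness stays with the consumer side.

```javascript
// Burst suppression within a single instance: skip the enqueue when a
// message for the same coordinate was already sent within the TTL window.
const ENQUEUE_TTL_SECONDS = 20 * 60

function createEnqueueCache() {
  const expiries = new Map()
  return {
    get: key => (expiries.has(key) && Date.now() < expiries.get(key) ? true : undefined),
    set: (key, ttlSeconds) => expiries.set(key, Date.now() + ttlSeconds * 1000)
  }
}

async function enqueueCompute(coordinates, queue, enqueueCache) {
  // No toLowerCase(): casing normalization is definitionService's concern only.
  const key = coordinates.toString()
  if (enqueueCache.get(key)) return false // duplicate within TTL; caller returns the placeholder
  enqueueCache.set(key, ENQUEUE_TTL_SECONDS)
  await queue.queue(JSON.stringify({ coordinates: key }))
  return true
}
```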
The delayed compute path is meant to defer on-demand behavior to the background, not introduce new retry semantics. The on-demand path computes exactly once — if it fails, the next request retries. The delayed path should mirror that: single attempt, demand-driven recovery. Previously, a processing error left the message in the queue to be retried up to 5 times (Azure visibility timeout). Now the message is always deleted in a finally block. If no definition exists, the next HTTP request will re-enqueue.
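The single-attempt semantics can be sketched like this (hypothetical function and parameter names): the message is removed in a finally block whether the compute succeeded or failed, so recovery is driven by the next HTTP request rather than by queue redelivery.

```javascript
// Single-attempt processing: always delete the message, even on failure.
// A failed compute is not retried by the queue; if the definition is still
// missing, the next request re-enqueues it (mirroring on-demand semantics).
async function processMessage(message, queue, computeOnce, logger = console) {
  try {
    const { coordinates } = JSON.parse(message.data)
    await computeOnce(coordinates)
  } catch (error) {
    logger.error('delayed compute failed; will recover on next request', error)
  } finally {
    await queue.delete(message) // unconditional delete: demand-driven recovery
  }
}
```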
- Drop `void [...]` in defVersionCheck.setupProcessing; `_` prefix is sufficient to signal unused parameters
- Add @deprecated JSDoc + TODO to versionCheck and upgradeQueue legacy aliases in providers/index.d.ts
- Add comment in definitionService.js explaining force=true bypasses curations intentionally
- Add comment in recomputeHandler.setupProcessing clarifying it resolves after the first polling batch; queue consumers continue running in the background
The compute queue is expected to handle ~10K requests/hour when delayed compute is enabled. Azure Storage Queue caps getMessages at 32 per call, so there is no benefit to making this configurable. Hardcoding to 32 maximizes throughput and removes the accidental dependency on DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE, which was inherited from the upgrade queue default and would have silently under-provisioned the compute queue in any environment that sets that var.
Based on compute stats (p50: 245ms, p85: ~1min, p95: 16.35min over 20 days; p95: 11min over 7 days), the previous 10-min visibilityTimeout and 5-min TTLs were too short for the tail of the distribution. - visibilityTimeout (both queues): 10 min → 20 min, covers p95 compute duration - Spin lock TTL (DefinitionUpgrader): 5 min → 20 min, aligned with visibilityTimeout to prevent same-instance duplicate computes while a slow compute is still running - Pre-enqueue cache TTL (DelayedComputePolicy): 5 min → 20 min, consistent with the compute window all three values are anchored to
Follow-up: consolidate _upgradeLock into computeLock by refactoring computeAndStoreIfNecessary to accept a policy predicate.
Based on Application Insights logging, "definition not available" events occur at 7K–10K per hour (data window: 12–18 hours). These definitions are currently computed on demand, increasing response time.
This PR adds a delayed compute path: when a definition is missing from the blob store, the service immediately returns an empty placeholder (a schema-valid definition with no tools), queues a background compute job, and triggers harvest if needed.
`UpgradeHandler` is extended into `RecomputeHandler` using a policy-composed architecture with two independent paths:

- Upgrade path (`upgradePolicy`): evaluates existing definitions against schema rules and queues upgrade work when stale. Unchanged from the existing `DefinitionVersionCheck`/`DefinitionQueueUpgrader` behavior. Queue errors are swallowed — returning a stale definition that will be retried on the next request is acceptable. This is the existing upgrade workflow.
- Compute path (`computePolicy`): handles requests where no stored definition exists (`MissingDefinitionComputePolicy`).
  - `OnDemandComputePolicy` — computes synchronously (existing behavior, default).
  - `DelayedComputePolicy` — returns a placeholder immediately and enqueues background compute. Queue errors propagate to the caller — a missing definition with no queued work is worse than a failed request.

Delayed queue processing uses missing-only semantics (`SkipUpgradePolicy`) so the background processor backfills missing definitions only, not stale ones. A shared in-memory cache is passed to both processing paths to prevent duplicate concurrent compute.

The provider is selected via `DEFINITION_UPGRADE_PROVIDER`: `onDemand` (default) or `delayed`. Backward-compatible aliases `versionCheck` and `upgradeQueue` are preserved. Class names and environment variables related to upgrade are minimally changed and can be updated in a follow-up if needed.
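The opt-in selection could be sketched as below. The function name and the mapping of the legacy aliases (`versionCheck` to the on-demand factory, `upgradeQueue` to the delayed factory) are assumptions for illustration, not confirmed from the code.

```javascript
// Hypothetical provider selection: onDemand stays the default; delayed is
// purely opt-in via DEFINITION_UPGRADE_PROVIDER. Legacy aliases map onto the
// new factories for backward compatibility (assumed mapping).
function selectComputePolicyFactory(providerName = 'onDemand') {
  switch (providerName) {
    case 'onDemand':
    case 'versionCheck': // legacy alias (assumed to map to the default path)
      return 'defaultFactory'
    case 'delayed':
    case 'upgradeQueue': // legacy alias, deprecated in typings
      return 'delayedFactory'
    default:
      throw new Error(`Unknown DEFINITION_UPGRADE_PROVIDER: ${providerName}`)
  }
}
```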
Test Cases
In addition to the unit tests and integration tests added, the following end-to-end test cases were completed.