
Support delaying definition recompute to the background#1495

Merged
qtomlinson merged 44 commits into clearlydefined:master from qtomlinson:qt/recompute-picked
Apr 4, 2026

Conversation

@qtomlinson
Collaborator

@qtomlinson qtomlinson commented Mar 17, 2026

Based on Application Insights logging, "definition not available" events occur at 7K–10K per hour (data window: 12–18 hours). These definitions are currently computed on demand, increasing response time.

This PR adds a delayed compute path: when a definition is missing from the blob store, the service immediately returns an empty placeholder (a schema-valid definition with no tools), queues a background compute job, and triggers harvest if needed.

UpgradeHandler is extended into RecomputeHandler using a policy-composed architecture with two independent paths:

  • Upgrade path (upgradePolicy): evaluates existing definitions against schema rules and queues upgrade work when stale. Unchanged from existing DefinitionVersionCheck / DefinitionQueueUpgrader behavior. Queue errors are swallowed — returning a stale definition that will be retried on the next request is acceptable. This is the existing upgrade workflow.
  • Compute path (computePolicy): handles requests where no stored definition exists (MissingDefinitionComputePolicy):
    • OnDemandComputePolicy — computes synchronously (existing behavior, default).

    • DelayedComputePolicy — returns a placeholder immediately and enqueues background compute. Queue errors propagate to the caller — a missing definition with no queued work is worse than a failed request.

      Delayed queue processing uses missing-only semantics (SkipUpgradePolicy) so the background processor backfills missing definitions only, not stale ones. A shared in-memory cache is passed to both processing paths to prevent duplicate concurrent compute.
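
The two-path composition described above can be sketched roughly as follows. This is a hedged illustration: the method names (getDefinition, handleMissing, validate) loosely follow the PR description and are assumptions, not the service's actual API.

```javascript
// Hedged sketch of RecomputeHandler's policy composition. All names besides
// RecomputeHandler itself are illustrative assumptions.
class RecomputeHandler {
  constructor({ upgradePolicy, computePolicy }) {
    this.upgradePolicy = upgradePolicy
    this.computePolicy = computePolicy
  }

  async getDefinition(coordinates, stored) {
    // Compute path: no stored definition exists
    if (!stored) return this.computePolicy.handleMissing(coordinates)
    // Upgrade path: return the (possibly stale) definition, queuing work if stale
    return this.upgradePolicy.validate(stored)
  }
}
```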

The provider is selected via DEFINITION_UPGRADE_PROVIDER: onDemand (default) or delayed. Backward-compatible aliases versionCheck and upgradeQueue are preserved.
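
As a minimal sketch of that selection, assuming a simple lookup over the names listed above (selectUpgradeProvider and its error handling are illustrative, not the service's actual factory wiring):

```javascript
// Illustrative selection logic for DEFINITION_UPGRADE_PROVIDER. The accepted
// names come from the PR description; everything else here is an assumption.
const KNOWN_PROVIDERS = ['onDemand', 'delayed', 'versionCheck', 'upgradeQueue']

function selectUpgradeProvider(env = process.env) {
  const name = env.DEFINITION_UPGRADE_PROVIDER || 'onDemand' // onDemand is the default
  if (!KNOWN_PROVIDERS.includes(name)) {
    throw new Error(`Unknown DEFINITION_UPGRADE_PROVIDER: ${name}`)
  }
  return name
}
```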

Class names and environment variables related to upgrade are minimally changed and can be updated in a follow-up if needed.

Test Cases

In addition to the added unit and integration tests, the following end-to-end test cases were completed.

Step Test Result
Pass 0: Default file-based dev setup, on-demand compute (Regression)
0.1 Default dev: full harvest + recompute (no harvest data) PASS
0.2 Default dev: fast recompute (harvest data exists) PASS
0.3 Default dev: existing definition returned as-is PASS
0.4 Default dev: force recompute works PASS
0.5 Default dev: stale schema triggers blocking recompute PASS
Pass 1: Azurite on-demand compute (Regression)
1.1 On-demand: full harvest + recompute PASS
1.2 On-demand: fast recompute (harvest exists) PASS
1.3 On-demand: force recompute works PASS
Pass 2: Azurite upgrade-queue mode (Regression)
2.1 Queue mode: current definition returned as-is PASS
2.2 Queue mode: stale definition triggers upgrade queue PASS
2.3 Queue mode: upgrade processor recomputes PASS
2.4 Queue mode: stale def + queue unavailable → stale returned (not error) PASS
Pass 3: Azurite delayed compute path (new)
3.1 Delayed: missing definition returns placeholder + message queued PASS
3.2 Delayed: background processor computes real definition PASS
3.3 Delayed: full harvest + recompute via background PASS
3.4 Delayed: existing definition not re-queued PASS
3.5 Delayed: force=true bypasses delayed path PASS
3.6 Delayed: stale → upgrade queue, not compute queue PASS
Pass 4: Operational resilience (Azurite, new)
4.1 Operational: queue unavailable → request fails PASS
4.2 Operational: restart - queued message not lost PASS
4.3 Operational: duplicate requests - one compute runs PASS

@qtomlinson qtomlinson changed the title from "Qt/recompute picked" to "Support delaying definition recompute in the background" on Mar 17, 2026
- Rename queueComputePolicy files/classes to delayedComputePolicy / DelayedComputePolicy
- Update imports, tests, and tsconfig includes to new names
- Add upgrade.service.delayed alias to delayed recompute factory
- Keep upgradeQueue for backward compatibility (deprecated in typings)
- Add existingDefinitionPolicy to treat existing definitions as valid on compute queue processing
- Wire delayed compute setupProcessing to use existingDefinitionPolicy
- Add tests for skip-on-existing and recompute-on-missing compute-queue behavior
- Include new policy files in tsconfig includes
- Create a shared in-memory lock cache in RecomputeHandler
- Pass shared cache to both upgrade and delayed compute setupProcessing paths
- Extend process.setup to accept optional sharedCache parameter
- Keep existingDefinitionPolicy for compute missing-only semantics
- Align JS and d.ts signatures for optional shared cache injection
@qtomlinson qtomlinson force-pushed the qt/recompute-picked branch from c0d5a98 to 195f946 on March 28, 2026 04:15
The previous name implied a domain concept about definition existence.
The class is a null-behavior upgrade validator that skips staleness
checks entirely — SkipUpgradePolicy better expresses that intent.

- Rename files and class in providers/upgrade/
- Update import in delayedComputePolicy
- Update tsconfig includes
defaultFactory only uses logger from its options object; any other
properties (e.g. queue passed from app wiring) were silently ignored.
Destructure { logger } in the parameter to make this explicit at the
call boundary, and narrow the type declaration accordingly.
…terface

The UpgradeHandler interface declares validate returning Promise<Definition | null>.
SkipUpgradePolicy was returning undefined on the falsy path instead of null,
causing a minor type mismatch. Changed return to null to align with the interface.
- Convert OnDemandComputePolicy from plain object to named class
  so getPolicyName() returns a meaningful name instead of "Object"
- Remove createOnDemandComputePolicy and createDelayedComputePolicy
  factory functions as they were thin wrappers over new
- Update recomputeHandler and tests to use constructors directly
The compute queue processor uses SkipUpgradePolicy, which ignores
definition contents entirely. Only coordinates are read from the
message, so the empty _meta field was dead weight.
Pass a plain DefinitionVersionChecker instead of relying on setup()'s
default parameter, making it clear the processor intentionally uses
a version checker (not the queue upgrader itself) to avoid re-queuing
stale definitions during background processing.
…andler

RecomputeHandler is the authoritative owner of the shared cache, so its TTL
constant belongs there. DefinitionUpgrader.defaultTtlSeconds made private
(_defaultTtlSeconds) as it is an internal fallback only. Removes the implicit
coupling where changing the upgrader TTL would silently affect the handler cache.
- DelayedComputePolicy constructor now requires options (removes misleading default stub)
- OnDemandComputePolicy.setupProcessing is now async (consistent with interface and sibling class)
- SkipUpgradePolicy drops currentSchema getter/setter (irrelevant since it skips staleness checks)
- Update .d.ts files to match
@qtomlinson qtomlinson changed the title from "Support delaying definition recompute in the background" to "Support delaying definition recompute to the background" on Mar 31, 2026
- Resolve conflict in definitionServiceTest.ts: keep recomputeHandler
  with full mock (initialize, setupProcessing, compute) while adopting
  TS types from master's upgradeHandler refactor
- Add types/chai-as-promised.d.ts to fix rejectedWith TS type errors
  introduced by master's test-to-TypeScript migration
…ation

The Function type is required to match chai-as-promised's API signature
for error constructors passed to rejectedWith.
Replace verbose inline JSDoc typedef with a named interface in the
.d.ts file. The .js file now imports the type instead of redeclaring
it, along with DelayedFactoryOptions.
@types/chai-as-promised already provides sufficient types for
rejectedWith and fulfilled assertions.
Replace direct DefinitionVersionChecker/DefinitionQueueUpgrader
instantiation with defaultFactory/delayedFactory to test the real
production wiring. Move on-demand-specific tests out of the shared
helper since they don't apply to the queued path. Extract duplicated
passthrough recompute handler into createPassthroughRecomputeHandler.
Internal methods _queueCompute and _constructMessage should not be
part of the public type surface.
… Promise<Definition>

No implementation returns undefined — _cast() at the call site would crash if it did.
…tlSeconds from public API

sharedCache: The 'shared' qualifier is an implementation detail of
RecomputeHandler, which creates one cache and passes it to both policies to
coordinate locking. The receiving layers (setupProcessing, setup,
DefinitionUpgrader) don't care whether the cache is shared — cache is the
right name there.

defaultTtlSeconds: _defaultTtlSeconds is a private field in DefinitionUpgrader.
It was incorrectly declared as a public static in process.d.ts. Removed from
the type declaration; the default TTL remains in the implementation.
@qtomlinson qtomlinson force-pushed the qt/recompute-picked branch from 29d46d8 to b80f4c9 on April 1, 2026 23:17
@qtomlinson qtomlinson marked this pull request as ready for review April 1, 2026 23:20
@qtomlinson qtomlinson requested a review from JamieMagee April 1, 2026 23:21
Comment on lines 88 to 90
while (this._upgradeLock.get(coordinates.toString())) {
  await new Promise(resolve => setTimeout(resolve, DefinitionUpgrader.delayInMSeconds))
}
Contributor


This spins at 500ms intervals with no upper bound. If a compute hangs or takes a long time for one coordinate, every queued message for that same coordinate backs up here and holds the processor. A max-retry count with a bail-out would keep a stuck compute from stalling the queue indefinitely. Even logging after N iterations would help with debugging.
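
The bounded wait suggested here could look something like the sketch below, assuming the lock is a Map-like cache keyed by coordinates. waitIntervalMs mirrors the existing 500ms spin interval; maxIterations and the logging cadence are illustrative values, not part of the PR.

```javascript
// Hedged sketch of a spin-wait with an upper bound and periodic logging.
async function waitForUnlock(lock, key, { waitIntervalMs = 500, maxIterations = 2400, logger = console } = {}) {
  for (let i = 0; i < maxIterations; i++) {
    if (!lock.get(key)) return true
    // Surface a stuck compute for debugging instead of waiting silently
    if (i > 0 && i % 100 === 0) logger.warn(`still waiting on compute lock for ${key} (${i} checks)`)
    await new Promise(resolve => setTimeout(resolve, waitIntervalMs))
  }
  return false // bail out instead of stalling the queue processor indefinitely
}
```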

Collaborator Author

@qtomlinson qtomlinson Apr 3, 2026


  • The cache TTL for _upgradeLock is set to 5 minutes. This bounds any single wait and prevents stalling indefinitely.
  • However, if compute is fundamentally broken for a coordinate, each attempt gets a full 5-minute TTL window with no counter tracking how many times it has failed. Repeated failures across expiry cycles are unbounded. Adding logging after N expiry cycles, or a max-retry count to stop retrying a stuck coordinate, would address this.
    The existing on-demand compute path uses the same spin-lock pattern with the same 5-minute TTL bound. Neither path has compute-level retry logging or circuit-breaking for a coordinate that repeatedly fails. Adding logging after N expiry cycles would improve observability in both paths, and is a good follow-up improvement. Thoughts?

Collaborator Author


After reviewing the compute duration logs, the TTL has been updated from 5 minutes to 20 minutes to better reflect the actual compute window (p95 ~16min over 20 days). Logging after N expiry cycles remains a good follow-up improvement for observability.

dequeueOptions: {
  numOfMessages:
    config.get('DEFINITION_COMPUTE_DEQUEUE_BATCH_SIZE') || config.get('DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE') || 16,
  visibilityTimeout: 10 * 60 // 10 min. The default value is 30 seconds.
Contributor


The compute queue sets visibilityTimeout: 10 * 60 (10 min), but the upgrade queue doesn't set one (Azure default is 30 seconds). That makes sense if compute is expected to be heavier, but if a compute actually takes longer than 10 minutes, the message becomes visible again and a second consumer could pick it up. The in-memory lock would prevent double-compute on the same instance, but not across multiple service instances. Is this a realistic scenario, or is compute expected to finish well under 10 minutes?

Collaborator Author

@qtomlinson qtomlinson Apr 3, 2026


Good question! Both the upgrade and compute queues use the same visibilityTimeout of 10 minutes. The visibility timeout was initially set based on the harvest queue configuration, at providers/harvest/azureQueueConfig.js. After reviewing the logs, the actual compute times are higher than expected. Here are the compute and store duration stats:

7-day: p50 245ms, p85 ~1min, p95 ~11min
20-day: p50 245ms, p95 ~16min

The visibility timeout and related cache TTLs have been bumped to 20 minutes to cover the p95 compute window. Aligning the harvest queue visibility timeout is a good follow-up.

Comment on lines +183 to +191
"providers/upgrade/computePolicy.d.ts",
"providers/upgrade/skipUpgradePolicy.d.ts",
"providers/upgrade/skipUpgradePolicy.js",
"providers/upgrade/onDemandComputePolicy.d.ts",
"providers/upgrade/onDemandComputePolicy.js",
"providers/upgrade/delayedComputePolicy.d.ts",
"providers/upgrade/delayedComputePolicy.js",
"providers/upgrade/recomputeHandler.d.ts",
"providers/upgrade/recomputeHandler.js",
Contributor


❤️ Thank you!

Contributor

@JamieMagee JamieMagee left a comment


The upgrade/compute policy split is the right call here. I like that defaultFactory reproduces existing behavior without touching any of the new code paths, so delayed is purely opt-in.

I'd want the unbounded while(lock) spin in process.js addressed before merge. If a compute hangs, that loop stalls the queue processor forever. Also, the delayed path enqueues a new message for every request hitting the same missing coordinate. The consumer deduplicates, but at 7K-10K req/hr you're generating a lot of queue trash that gets dequeued, locked, checked, and thrown away. Even a quick computeLock.get() check before enqueuing would help.

Suppress duplicate queue messages for the same coordinate within a single
service instance. Before enqueuing, check an in-memory cache keyed by
coordinates.toString(); if already enqueued, skip and return the placeholder.

Cache is memory-only (not Redis): the goal is burst suppression within one
instance, not cross-instance dedup. Cross-instance correctness is already
handled by the consumer side — the shared Redis definition cache serializes
processing, and SkipUpgradePolicy skips recompute when a definition already
exists. Adding Redis here would introduce hot-path latency for a benefit
already provided elsewhere.

TTL is 5 minutes (matching the spin-lock TTL), well above the p95 compute
window (~6s). Cache key uses coordinates.toString() without toLowerCase() —
casing normalization is definitionService's concern only.

Cache is injectable via options.enqueueCache for testing.
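
A minimal sketch of the burst-suppression cache described in this commit message. The cache API (has/add with TTL) and shouldEnqueue are assumptions standing in for the real DelayedComputePolicy internals.

```javascript
// In-memory pre-enqueue cache: suppress duplicate queue messages for the same
// coordinate within one service instance. TTL handling is illustrative.
class EnqueueCache {
  constructor(ttlMs) {
    this._ttlMs = ttlMs
    this._entries = new Map() // key -> expiry timestamp (ms)
  }

  has(key) {
    const expiry = this._entries.get(key)
    if (expiry === undefined) return false
    if (Date.now() > expiry) {
      this._entries.delete(key) // lazily evict expired entries
      return false
    }
    return true
  }

  add(key) {
    this._entries.set(key, Date.now() + this._ttlMs)
  }
}

function shouldEnqueue(cache, coordinates) {
  // Key uses coordinates.toString() without toLowerCase(): casing
  // normalization is definitionService's concern only.
  const key = coordinates.toString()
  if (cache.has(key)) return false // already queued within the TTL window
  cache.add(key)
  return true
}
```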
The delayed compute path is meant to defer on-demand behavior to the
background, not introduce new retry semantics. The on-demand path computes
exactly once — if it fails, the next request retries. The delayed path
should mirror that: single attempt, demand-driven recovery.

Previously, a processing error left the message in the queue to be retried
up to 5 times (Azure visibility timeout). Now the message is always deleted
in a finally block. If no definition exists, the next HTTP request will
re-enqueue.
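
The single-attempt semantics above might be sketched as follows; queue.delete, message.data.coordinates, and computeDefinition are hypothetical stand-ins for the service's actual queue and compute wiring.

```javascript
// Delete-in-finally: the message is always removed, so a failed compute is
// not retried by the queue; recovery is demand-driven via the next request.
async function processComputeMessage(queue, message, computeDefinition, logger = console) {
  try {
    await computeDefinition(message.data.coordinates)
  } catch (error) {
    // Mirror the on-demand path: log and move on; the next HTTP request re-enqueues
    logger.error('delayed compute failed', error)
  } finally {
    await queue.delete(message) // always delete: no queue-level retries
  }
}
```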
- Drop `void [...]` in defVersionCheck.setupProcessing; `_` prefix is sufficient
  to signal unused parameters
- Add @deprecated JSDoc + TODO to versionCheck and upgradeQueue legacy aliases
  in providers/index.d.ts
- Add comment in definitionService.js explaining force=true bypasses curations
  intentionally
- Add comment in recomputeHandler.setupProcessing clarifying it resolves after
  the first polling batch; queue consumers continue running in the background
The compute queue is expected to handle ~10K requests/hour when delayed
compute is enabled.

Azure Storage Queue caps getMessages at 32 per call, so there is no benefit
to making this configurable. Hardcoding to 32 maximizes throughput and
removes the accidental dependency on DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE,
which was inherited from the upgrade queue default and would have silently
under-provisioned the compute queue in any environment that sets that var.
Based on compute stats (p50: 245ms, p85: ~1min, p95: 16.35min over 20 days;
p95: 11min over 7 days), the previous 10-min visibilityTimeout and 5-min TTLs
were too short for the tail of the distribution.

- visibilityTimeout (both queues): 10 min → 20 min, covers p95 compute duration
- Spin lock TTL (DefinitionUpgrader): 5 min → 20 min, aligned with visibilityTimeout
  to prevent same-instance duplicate computes while a slow compute is still running
- Pre-enqueue cache TTL (DelayedComputePolicy): 5 min → 20 min, consistent with
  the compute window all three values are anchored to
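
The three aligned values might be expressed around a single anchor constant, as in this sketch; the variable names are assumptions, only the numbers come from the commit message above.

```javascript
// Illustrative consolidation of the three 20-minute values described above.
const COMPUTE_WINDOW_SECONDS = 20 * 60 // anchored to the observed p95 compute duration (~16 min over 20 days)

const dequeueOptions = {
  numOfMessages: 32, // Azure Storage Queue caps getMessages at 32 per call
  visibilityTimeout: COMPUTE_WINDOW_SECONDS // was 10 * 60
}

const spinLockTtlSeconds = COMPUTE_WINDOW_SECONDS // DefinitionUpgrader same-instance lock
const preEnqueueCacheTtlSeconds = COMPUTE_WINDOW_SECONDS // DelayedComputePolicy burst suppression
```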
@qtomlinson
Collaborator Author

Follow-up: consolidate _upgradeLock into computeLock by refactoring computeAndStoreIfNecessary to accept a policy predicate.

@qtomlinson qtomlinson merged commit 17df9ab into clearlydefined:master Apr 4, 2026
4 checks passed
@qtomlinson qtomlinson deleted the qt/recompute-picked branch April 4, 2026 23:15