
Support delaying definition recompute to the background#1495

Merged
qtomlinson merged 44 commits into clearlydefined:master from qtomlinson:qt/recompute-picked
Apr 4, 2026

Conversation

@qtomlinson
Collaborator

@qtomlinson qtomlinson commented Mar 17, 2026

Based on Application Insights logging, "definition not available" events occur at 7K–10K per hour (data window: 12–18 hours). These definitions are currently computed on demand, increasing response time.

This PR adds a delayed compute path: when a definition is missing from the blob store, the service immediately returns an empty placeholder (a schema-valid definition with no tools), queues a background compute job, and triggers harvest if needed.

UpgradeHandler is extended into RecomputeHandler using a policy-composed architecture with two independent paths:

  • Upgrade path (upgradePolicy): evaluates existing definitions against schema rules and queues upgrade work when stale. Unchanged from existing DefinitionVersionCheck / DefinitionQueueUpgrader behavior. Queue errors are swallowed — returning a stale definition that will be retried on the next request is acceptable. This is the existing upgrade workflow.
  • Compute path (computePolicy): handles requests where no stored definition exists (MissingDefinitionComputePolicy):
    • OnDemandComputePolicy — computes synchronously (existing behavior, default).

    • DelayedComputePolicy — returns a placeholder immediately and enqueues background compute. Queue errors propagate to the caller — a missing definition with no queued work is worse than a failed request.

      Delayed queue processing uses missing-only semantics (SkipUpgradePolicy) so the background processor backfills missing definitions only, not stale ones. A shared in-memory cache is passed to both processing paths to prevent duplicate concurrent compute.
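
The two-path composition described above can be sketched roughly as follows. This is a hedged illustration: the method names (getDefinition, handleMissing, validate) loosely follow the PR description and are assumptions, not the service's actual API.

```javascript
// Hedged sketch of RecomputeHandler's policy composition. All names besides
// RecomputeHandler itself are illustrative assumptions.
class RecomputeHandler {
  constructor({ upgradePolicy, computePolicy }) {
    this.upgradePolicy = upgradePolicy
    this.computePolicy = computePolicy
  }

  async getDefinition(coordinates, stored) {
    // Compute path: no stored definition exists
    if (!stored) return this.computePolicy.handleMissing(coordinates)
    // Upgrade path: return the (possibly stale) definition, queuing work if stale
    return this.upgradePolicy.validate(stored)
  }
}
```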

The provider is selected via DEFINITION_UPGRADE_PROVIDER: onDemand (default) or delayed. Backward-compatible aliases versionCheck and upgradeQueue are preserved.
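
As a minimal sketch of that selection, assuming a simple lookup over the names listed above (selectUpgradeProvider and its error handling are illustrative, not the service's actual factory wiring):

```javascript
// Illustrative selection logic for DEFINITION_UPGRADE_PROVIDER. The accepted
// names come from the PR description; everything else here is an assumption.
const KNOWN_PROVIDERS = ['onDemand', 'delayed', 'versionCheck', 'upgradeQueue']

function selectUpgradeProvider(env = process.env) {
  const name = env.DEFINITION_UPGRADE_PROVIDER || 'onDemand' // onDemand is the default
  if (!KNOWN_PROVIDERS.includes(name)) {
    throw new Error(`Unknown DEFINITION_UPGRADE_PROVIDER: ${name}`)
  }
  return name
}
```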

Class names and environment variables related to upgrade are minimally changed and can be updated in a follow-up if needed.

Test Cases

In addition to the added unit and integration tests, the following end-to-end test cases were completed.

Step Test Result
Pass 0: Default file-based dev setup, on-demand compute (Regression)
0.1 Default dev: full harvest + recompute (no harvest data) PASS
0.2 Default dev: fast recompute (harvest data exists) PASS
0.3 Default dev: existing definition returned as-is PASS
0.4 Default dev: force recompute works PASS
0.5 Default dev: stale schema triggers blocking recompute PASS
Pass 1: Azurite on-demand compute (Regression)
1.1 On-demand: full harvest + recompute PASS
1.2 On-demand: fast recompute (harvest exists) PASS
1.3 On-demand: force recompute works PASS
Pass 2: Azurite upgrade-queue mode (Regression)
2.1 Queue mode: current definition returned as-is PASS
2.2 Queue mode: stale definition triggers upgrade queue PASS
2.3 Queue mode: upgrade processor recomputes PASS
2.4 Queue mode: stale def + queue unavailable → stale returned (not error) PASS
Pass 3: Azurite delayed compute path (new)
3.1 Delayed: missing definition returns placeholder + message queued PASS
3.2 Delayed: background processor computes real definition PASS
3.3 Delayed: full harvest + recompute via background PASS
3.4 Delayed: existing definition not re-queued PASS
3.5 Delayed: force=true bypasses delayed path PASS
3.6 Delayed: stale → upgrade queue, not compute queue PASS
Pass 4: Operational resilience (Azurite, new)
4.1 Operational: queue unavailable → request fails PASS
4.2 Operational: restart - queued message not lost PASS
4.3 Operational: duplicate requests - one compute runs PASS

@qtomlinson qtomlinson changed the title from "Qt/recompute picked" to "Support delaying definition recompute in the background" on Mar 17, 2026
- Rename queueComputePolicy files/classes to delayedComputePolicy / DelayedComputePolicy
- Update imports, tests, and tsconfig includes to new names
- Add upgrade.service.delayed alias to delayed recompute factory
- Keep upgradeQueue for backward compatibility (deprecated in typings)
- Add existingDefinitionPolicy to treat existing definitions as valid on compute queue processing
- Wire delayed compute setupProcessing to use existingDefinitionPolicy
- Add tests for skip-on-existing and recompute-on-missing compute-queue behavior
- Include new policy files in tsconfig includes
- Create a shared in-memory lock cache in RecomputeHandler
- Pass shared cache to both upgrade and delayed compute setupProcessing paths
- Extend process.setup to accept optional sharedCache parameter
- Keep existingDefinitionPolicy for compute missing-only semantics
- Align JS and d.ts signatures for optional shared cache injection
@qtomlinson qtomlinson force-pushed the qt/recompute-picked branch from c0d5a98 to 195f946 on March 28, 2026 04:15
The previous name implied a domain concept about definition existence.
The class is a null-behavior upgrade validator that skips staleness
checks entirely — SkipUpgradePolicy better expresses that intent.

- Rename files and class in providers/upgrade/
- Update import in delayedComputePolicy
- Update tsconfig includes
defaultFactory only uses logger from its options object; any other
properties (e.g. queue passed from app wiring) were silently ignored.
Destructure { logger } in the parameter to make this explicit at the
call boundary, and narrow the type declaration accordingly.
…terface

The UpgradeHandler interface declares validate returning Promise<Definition | null>.
SkipUpgradePolicy was returning undefined on the falsy path instead of null,
causing a minor type mismatch. Changed return to null to align with the interface.
- Convert OnDemandComputePolicy from plain object to named class
  so getPolicyName() returns a meaningful name instead of "Object"
- Remove createOnDemandComputePolicy and createDelayedComputePolicy
  factory functions as they were thin wrappers over new
- Update recomputeHandler and tests to use constructors directly
The compute queue processor uses SkipUpgradePolicy, which ignores
definition contents entirely. Only coordinates are read from the
message, so the empty _meta field was dead weight.
Pass a plain DefinitionVersionChecker instead of relying on setup()'s
default parameter, making it clear the processor intentionally uses
a version checker (not the queue upgrader itself) to avoid re-queuing
stale definitions during background processing.
…andler

RecomputeHandler is the authoritative owner of the shared cache, so its TTL
constant belongs there. DefinitionUpgrader.defaultTtlSeconds made private
(_defaultTtlSeconds) as it is an internal fallback only. Removes the implicit
coupling where changing the upgrader TTL would silently affect the handler cache.
- DelayedComputePolicy constructor now requires options (removes misleading default stub)
- OnDemandComputePolicy.setupProcessing is now async (consistent with interface and sibling class)
- SkipUpgradePolicy drops currentSchema getter/setter (irrelevant since it skips staleness checks)
- Update .d.ts files to match
@qtomlinson qtomlinson changed the title from "Support delaying definition recompute in the background" to "Support delaying definition recompute to the background" on Mar 31, 2026
- Resolve conflict in definitionServiceTest.ts: keep recomputeHandler
  with full mock (initialize, setupProcessing, compute) while adopting
  TS types from master's upgradeHandler refactor
- Add types/chai-as-promised.d.ts to fix rejectedWith TS type errors
  introduced by master's test-to-TypeScript migration
…ation

The Function type is required to match chai-as-promised's API signature
for error constructors passed to rejectedWith.
Replace verbose inline JSDoc typedef with a named interface in the
.d.ts file. The .js file now imports the type instead of redeclaring
it, along with DelayedFactoryOptions.
@types/chai-as-promised already provides sufficient types for
rejectedWith and fulfilled assertions.
Replace direct DefinitionVersionChecker/DefinitionQueueUpgrader
instantiation with defaultFactory/delayedFactory to test the real
production wiring. Move on-demand-specific tests out of the shared
helper since they don't apply to the queued path. Extract duplicated
passthrough recompute handler into createPassthroughRecomputeHandler.
Internal methods _queueCompute and _constructMessage should not be
part of the public type surface.
… Promise<Definition>

No implementation returns undefined — _cast() at the call site would crash if it did.
…tlSeconds from public API

sharedCache: The 'shared' qualifier is an implementation detail of
RecomputeHandler, which creates one cache and passes it to both policies to
coordinate locking. The receiving layers (setupProcessing, setup,
DefinitionUpgrader) don't care whether the cache is shared — cache is the
right name there.

defaultTtlSeconds: _defaultTtlSeconds is a private field in DefinitionUpgrader.
It was incorrectly declared as a public static in process.d.ts. Removed from
the type declaration; the default TTL remains in the implementation.
@qtomlinson qtomlinson force-pushed the qt/recompute-picked branch from 29d46d8 to b80f4c9 on April 1, 2026 23:17
@qtomlinson qtomlinson marked this pull request as ready for review April 1, 2026 23:20
@qtomlinson qtomlinson requested a review from JamieMagee April 1, 2026 23:21
Comment on lines 88 to 90
while (this._upgradeLock.get(coordinates.toString())) {
  await new Promise(resolve => setTimeout(resolve, DefinitionUpgrader.delayInMSeconds))
}
Contributor


This spins at 500ms intervals with no upper bound. If a compute hangs or takes a long time for one coordinate, every queued message for that same coordinate backs up here and holds the processor. A max-retry count with a bail-out would keep a stuck compute from stalling the queue indefinitely. Even logging after N iterations would help with debugging.
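
The bounded wait suggested here could look something like the sketch below, assuming the lock is a Map-like cache keyed by coordinates. waitIntervalMs mirrors the existing 500ms spin interval; maxIterations and the logging cadence are illustrative values, not part of the PR.

```javascript
// Hedged sketch of a spin-wait with an upper bound and periodic logging.
async function waitForUnlock(lock, key, { waitIntervalMs = 500, maxIterations = 2400, logger = console } = {}) {
  for (let i = 0; i < maxIterations; i++) {
    if (!lock.get(key)) return true
    // Surface a stuck compute for debugging instead of waiting silently
    if (i > 0 && i % 100 === 0) logger.warn(`still waiting on compute lock for ${key} (${i} checks)`)
    await new Promise(resolve => setTimeout(resolve, waitIntervalMs))
  }
  return false // bail out instead of stalling the queue processor indefinitely
}
```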

Collaborator Author

@qtomlinson qtomlinson Apr 3, 2026


  • The cache TTL for _upgradeLock is set to 5 minutes. This bounds any single wait and prevents stalling indefinitely.
  • However, if compute is fundamentally broken for a coordinate, each attempt gets a full 5-minute TTL window with no counter tracking how many times it has failed. Repeated failures across expiry cycles are unbounded. Adding logging after N expiry cycles, or a max-retry count to stop retrying a stuck coordinate, would address this.
    The existing on-demand compute path uses the same spin-lock pattern with the same 5-minute TTL bound. Neither path has compute-level retry logging or circuit-breaking for a coordinate that repeatedly fails. Adding logging after N expiry cycles would improve observability in both paths, and is a good follow-up improvement. Thoughts?

Collaborator Author


After reviewing the compute duration logs, the TTL has been updated from 5 minutes to 20 minutes to better reflect the actual compute window (p95 ~16min over 20 days). Logging after N expiry cycles remains a good follow-up improvement for observability.

dequeueOptions: {
  numOfMessages:
    config.get('DEFINITION_COMPUTE_DEQUEUE_BATCH_SIZE') || config.get('DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE') || 16,
  visibilityTimeout: 10 * 60 // 10 min. The default value is 30 seconds.
Contributor


The compute queue sets visibilityTimeout: 10 * 60 (10 min), but the upgrade queue doesn't set one (Azure default is 30 seconds). That makes sense if compute is expected to be heavier, but if a compute actually takes longer than 10 minutes, the message becomes visible again and a second consumer could pick it up. The in-memory lock would prevent double-compute on the same instance, but not across multiple service instances. Is this a realistic scenario, or is compute expected to finish well under 10 minutes?

Collaborator Author

@qtomlinson qtomlinson Apr 3, 2026


Good question! Both the upgrade and compute queues use the same visibilityTimeout of 10 minutes. The visibility timeout was initially set based on the harvest queue configuration, at providers/harvest/azureQueueConfig.js. After reviewing the logs, the actual compute times are higher than expected. Here are the compute and store duration stats:

7-day: p50 245ms, p85 ~1min, p95 ~11min
20-day: p50 245ms, p95 ~16min

The visibility timeout and related cache TTLs have been bumped to 20 minutes to cover the p95 compute window. Aligning the harvest queue visibility timeout is a good follow-up.

Comment on lines +183 to +191
"providers/upgrade/computePolicy.d.ts",
"providers/upgrade/skipUpgradePolicy.d.ts",
"providers/upgrade/skipUpgradePolicy.js",
"providers/upgrade/onDemandComputePolicy.d.ts",
"providers/upgrade/onDemandComputePolicy.js",
"providers/upgrade/delayedComputePolicy.d.ts",
"providers/upgrade/delayedComputePolicy.js",
"providers/upgrade/recomputeHandler.d.ts",
"providers/upgrade/recomputeHandler.js",
Contributor


❤️ Thank you!

Contributor

@JamieMagee JamieMagee left a comment


The upgrade/compute policy split is the right call here. I like that defaultFactory reproduces existing behavior without touching any of the new code paths, so delayed is purely opt-in.

I'd want the unbounded while(lock) spin in process.js addressed before merge. If a compute hangs, that loop stalls the queue processor forever. Also, the delayed path enqueues a new message for every request hitting the same missing coordinate. The consumer deduplicates, but at 7K-10K req/hr you're generating a lot of queue trash that gets dequeued, locked, checked, and thrown away. Even a quick computeLock.get() check before enqueuing would help.

Suppress duplicate queue messages for the same coordinate within a single
service instance. Before enqueuing, check an in-memory cache keyed by
coordinates.toString(); if already enqueued, skip and return the placeholder.

Cache is memory-only (not Redis): the goal is burst suppression within one
instance, not cross-instance dedup. Cross-instance correctness is already
handled by the consumer side — the shared Redis definition cache serializes
processing, and SkipUpgradePolicy skips recompute when a definition already
exists. Adding Redis here would introduce hot-path latency for a benefit
already provided elsewhere.

TTL is 5 minutes (matching the spin-lock TTL), well above the p95 compute
window (~6s). Cache key uses coordinates.toString() without toLowerCase() —
casing normalization is definitionService's concern only.

Cache is injectable via options.enqueueCache for testing.
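
A minimal sketch of the burst-suppression cache described in this commit message. The cache API (has/add with TTL) and shouldEnqueue are assumptions standing in for the real DelayedComputePolicy internals.

```javascript
// In-memory pre-enqueue cache: suppress duplicate queue messages for the same
// coordinate within one service instance. TTL handling is illustrative.
class EnqueueCache {
  constructor(ttlMs) {
    this._ttlMs = ttlMs
    this._entries = new Map() // key -> expiry timestamp (ms)
  }

  has(key) {
    const expiry = this._entries.get(key)
    if (expiry === undefined) return false
    if (Date.now() > expiry) {
      this._entries.delete(key) // lazily evict expired entries
      return false
    }
    return true
  }

  add(key) {
    this._entries.set(key, Date.now() + this._ttlMs)
  }
}

function shouldEnqueue(cache, coordinates) {
  // Key uses coordinates.toString() without toLowerCase(): casing
  // normalization is definitionService's concern only.
  const key = coordinates.toString()
  if (cache.has(key)) return false // already queued within the TTL window
  cache.add(key)
  return true
}
```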
The delayed compute path is meant to defer on-demand behavior to the
background, not introduce new retry semantics. The on-demand path computes
exactly once — if it fails, the next request retries. The delayed path
should mirror that: single attempt, demand-driven recovery.

Previously, a processing error left the message in the queue to be retried
up to 5 times (Azure visibility timeout). Now the message is always deleted
in a finally block. If no definition exists, the next HTTP request will
re-enqueue.
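
The single-attempt semantics above might be sketched as follows; queue.delete, message.data.coordinates, and computeDefinition are hypothetical stand-ins for the service's actual queue and compute wiring.

```javascript
// Delete-in-finally: the message is always removed, so a failed compute is
// not retried by the queue; recovery is demand-driven via the next request.
async function processComputeMessage(queue, message, computeDefinition, logger = console) {
  try {
    await computeDefinition(message.data.coordinates)
  } catch (error) {
    // Mirror the on-demand path: log and move on; the next HTTP request re-enqueues
    logger.error('delayed compute failed', error)
  } finally {
    await queue.delete(message) // always delete: no queue-level retries
  }
}
```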
- Drop `void [...]` in defVersionCheck.setupProcessing; `_` prefix is sufficient
  to signal unused parameters
- Add @deprecated JSDoc + TODO to versionCheck and upgradeQueue legacy aliases
  in providers/index.d.ts
- Add comment in definitionService.js explaining force=true bypasses curations
  intentionally
- Add comment in recomputeHandler.setupProcessing clarifying it resolves after
  the first polling batch; queue consumers continue running in the background
The compute queue is expected to handle ~10K requests/hour when delayed
compute is enabled.

Azure Storage Queue caps getMessages at 32 per call, so there is no benefit
to making this configurable. Hardcoding to 32 maximizes throughput and
removes the accidental dependency on DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE,
which was inherited from the upgrade queue default and would have silently
under-provisioned the compute queue in any environment that sets that var.
Based on compute stats (p50: 245ms, p85: ~1min, p95: 16.35min over 20 days;
p95: 11min over 7 days), the previous 10-min visibilityTimeout and 5-min TTLs
were too short for the tail of the distribution.

- visibilityTimeout (both queues): 10 min → 20 min, covers p95 compute duration
- Spin lock TTL (DefinitionUpgrader): 5 min → 20 min, aligned with visibilityTimeout
  to prevent same-instance duplicate computes while a slow compute is still running
- Pre-enqueue cache TTL (DelayedComputePolicy): 5 min → 20 min, consistent with
  the compute window all three values are anchored to
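
The three aligned values might be expressed around a single anchor constant, as in this sketch; the variable names are assumptions, only the numbers come from the commit message above.

```javascript
// Illustrative consolidation of the three 20-minute values described above.
const COMPUTE_WINDOW_SECONDS = 20 * 60 // anchored to the observed p95 compute duration (~16 min over 20 days)

const dequeueOptions = {
  numOfMessages: 32, // Azure Storage Queue caps getMessages at 32 per call
  visibilityTimeout: COMPUTE_WINDOW_SECONDS // was 10 * 60
}

const spinLockTtlSeconds = COMPUTE_WINDOW_SECONDS // DefinitionUpgrader same-instance lock
const preEnqueueCacheTtlSeconds = COMPUTE_WINDOW_SECONDS // DelayedComputePolicy burst suppression
```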
@qtomlinson
Collaborator Author

Follow-up: consolidate _upgradeLock into computeLock by refactoring computeAndStoreIfNecessary to accept a policy predicate.

@qtomlinson qtomlinson merged commit 17df9ab into clearlydefined:master Apr 4, 2026
4 checks passed
@qtomlinson qtomlinson deleted the qt/recompute-picked branch April 4, 2026 23:15