1. Introduction: The Scalability Wall of Cluster-Wide Locking
Currently, Quartz relies on a pessimistic cluster-wide locking strategy via the QRTZ_LOCKS table (specifically the TRIGGER_ACCESS lock). While this ensures high data integrity, it creates a significant scalability bottleneck in clustered environments or high-throughput instances processing thousands of jobs per minute.
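For context, the default semaphore implementation (StdRowLockSemaphore) serializes every node on a single row; its acquisition is roughly the following (a simplified sketch, not the exact delegate SQL):
SQL
-- All nodes queue here until the current holder commits
SELECT * FROM QRTZ_LOCKS
WHERE SCHED_NAME = 'MyScheduler'
AND LOCK_NAME = 'TRIGGER_ACCESS'
FOR UPDATE;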
The official Quartz documentation acknowledges this architectural "speed limit" in the Quartz FAQ:
"...The clustering feature works best for scaling out long-running and/or cpu-intensive jobs (distributing the work-load over multiple nodes). If you need to scale out to support thousands of short-running (e.g 1 second) jobs, consider partitioning the set of jobs by using multiple distinct schedulers. Using one scheduler forces the use of a cluster-wide lock, a pattern that degrades performance as you add more clients."
The Problem:
- Sequential Acquisition: Even with multiple nodes, the cluster-wide lock acts as a global mutex. Only one node can "acquire" jobs at a time; the others sit idle in a database "wait" state.
- Completion Bottleneck: Because the completion phase is thread-per-job, hundreds of worker threads across the cluster compete for the same single row in QRTZ_LOCKS to update trigger states.
- Payload Tax: Jobs with large JobDataMaps (property bags) lengthen the transaction hold time, which in turn lengthens the wait queue for the cluster-wide lock.
2. Conceptual Proposal: Row-Based Concurrency
We propose an optional high-performance locking mode that moves synchronization from a Cluster-Wide Mutex to Fine-Grained Row-Level Locking on the QRTZ_TRIGGERS table.
By utilizing modern SQL features like FOR UPDATE SKIP LOCKED (available in PostgreSQL 9.5+, Oracle 11g+, and MySQL 8.0+; MS SQL Server offers the same semantics via its READPAST hint, as sketched after the list below), Quartz can achieve true horizontal scalability:
- Parallel Acquisition: Multiple nodes can query the triggers table simultaneously. Node A grabs a batch; Node B skips those locked rows and immediately grabs the next available batch without blocking.
- Parallel Completion: Completion logic would lock only the specific row being updated, allowing hundreds of threads to commit state changes simultaneously.
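MS SQL Server has no literal SKIP LOCKED clause, but the same non-blocking batch acquisition can be approximated with its locking hints. A hypothetical T-SQL sketch, reusing the table and columns above:
SQL
-- UPDLOCK + READPAST: lock what you read, skip rows other sessions hold
SELECT TOP (@batchSize) TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY
FROM QRTZ_TRIGGERS WITH (UPDLOCK, READPAST, ROWLOCK)
WHERE SCHED_NAME = 'MyScheduler'
AND TRIGGER_STATE = 'WAITING'
AND NEXT_FIRE_TIME <= @now
ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC;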
3. Technical Example: Proposed SQL Implementation
The DriverDelegate would be updated to support a "Row-Locking" mode. Below is the conceptual SQL logic:
A. High-Concurrency Acquisition
Instead of locking a global semaphore, the acquisition query targets rows directly:
SQL
-- Acquisition with Skip Locked
SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY
FROM QRTZ_TRIGGERS
WHERE SCHED_NAME = 'MyScheduler'
AND TRIGGER_STATE = 'WAITING'
AND NEXT_FIRE_TIME <= :now
ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC
LIMIT :batchSize  -- PostgreSQL/MySQL syntax: LIMIT precedes the locking clause
FOR UPDATE SKIP LOCKED;
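The SELECT only locks the rows it returns; within the same transaction, the acquiring node would then flip those triggers to the ACQUIRED state before committing. A conceptual sketch, reusing the naming above:
SQL
-- Still inside the acquisition transaction: claim each locked trigger
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'ACQUIRED'
WHERE SCHED_NAME = 'MyScheduler'
AND TRIGGER_NAME = :name
AND TRIGGER_GROUP = :group;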
B. Row-Level State Transition (Completion)
During completion, we avoid the cluster-wide lock and target the specific trigger row to ensure atomicity.
Step 1: Lock the specific trigger row
SQL
-- Explicitly lock only the specific row being finalized
SELECT TRIGGER_STATE
FROM QRTZ_TRIGGERS
WHERE SCHED_NAME = 'MyScheduler'
AND TRIGGER_NAME = :name
AND TRIGGER_GROUP = :group
FOR UPDATE;
Step 2: Update the state
SQL
-- Completion update using the held row-level lock
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'WAITING',
NEXT_FIRE_TIME = :nextTime,
PREV_FIRE_TIME = :lastTime
WHERE SCHED_NAME = 'MyScheduler'
AND TRIGGER_NAME = :name
AND TRIGGER_GROUP = :group;
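Taken together, the two steps would run in one short transaction so the row lock spans both the read and the write; conceptually (a PostgreSQL-flavored sketch):
SQL
BEGIN;
-- Step 1: lock only the trigger row being finalized
SELECT TRIGGER_STATE
FROM QRTZ_TRIGGERS
WHERE SCHED_NAME = 'MyScheduler'
AND TRIGGER_NAME = :name
AND TRIGGER_GROUP = :group
FOR UPDATE;
-- Step 2: apply the state transition under the held lock
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'WAITING',
NEXT_FIRE_TIME = :nextTime,
PREV_FIRE_TIME = :lastTime
WHERE SCHED_NAME = 'MyScheduler'
AND TRIGGER_NAME = :name
AND TRIGGER_GROUP = :group;
COMMIT;  -- releases the row lock; no other trigger row was ever blocked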
4. Suggested Configuration
We suggest a new property to enable this behavior: org.quartz.jobStore.useRowLevelLocking = true
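In quartz.properties, the flag might sit alongside the existing JDBC job store settings (useRowLevelLocking is the proposed name, not an existing Quartz option):
Properties
# Hypothetical configuration for the proposed row-locking mode
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.useRowLevelLocking = true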
This feature would allow a single Quartz scheduler instance to handle the "thousands of short-running jobs" use case without requiring the complex partitioning currently recommended in the documentation.