Intermittent Chronos Issue:
On our Chronos cluster, we have been encountering an intermittent issue where Chronos jobs stop being executed on Mesos. The sequence of observed events is as follows:
- Chronos jobs are not executed by Mesos.
- The status of the jobs on the Chronos dashboard is ‘Queued’.
- Mesos master logs show that
-- the master has not been sending resource offers to the framework, i.e. Chronos.
-- the master keeps receiving updates from slaves for old tasks.
-- it keeps trying to forward the updates to Chronos.
-- Zookeeper and slaves are not down. They are working fine.
- After restarting Chronos and Zookeeper, the system starts working fine. Chronos jobs start getting executed.
Whys:
- Why did Chronos jobs stop getting executed?
Chronos, as a Mesos application (framework), waits for resource offers from the Mesos master.
The Mesos master generally sends resource offers at a very high frequency, i.e. every 100 ms to a few seconds. However, in this case, the master stopped sending resource offers. Without these resource offers, Chronos is stuck.
- Why did the Mesos master stop sending resource offers?
The Mesos slaves were occupied with FINISHED tasks. The slaves were telling the master that the tasks were FINISHED, and the master was trying to tell the Chronos leader the same and waiting for an ACK. Chronos was not sending the ACK.
- Why did Chronos not send the ACK?
The "JobScheduler::handleFinishedTask" thread in the Chronos leader was waiting on a ReentrantLock which was held by the "JobScheduler::mainLoop" thread.
- Why did the "JobScheduler::mainLoop" thread not release the lock?
The mainLoop thread was trying to reload jobs from ZK and was blocked on ZK.
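To make the blocking chain easier to follow, here is a minimal Python sketch of the same pattern. It is not Chronos code. The names `main_loop`, `handle_finished_task`, `zk_responds` and `simulate` are made up for illustration, `threading.RLock` stands in for the Java ReentrantLock, and an `Event` that is never set stands in for the ZK call that does not return. The handler uses a lock timeout only so that the sketch terminates; in the situation described above the handler thread simply waits.

```python
import threading


def simulate(timeout=0.2):
    """Reproduce the pattern: one thread blocks on ZK while holding the lock,
    so the finished-task handler cannot take the lock and never sends an ACK."""
    lock = threading.RLock()          # stand-in for the Java ReentrantLock
    zk_responds = threading.Event()   # hypothetical: set only when "ZK" answers
    lock_taken = threading.Event()    # lets the caller wait until the lock is held
    acked = []                        # task ids for which an ACK was "sent"

    def main_loop():
        # Analogue of JobScheduler::mainLoop: take the lock, then block on ZK.
        with lock:
            lock_taken.set()
            zk_responds.wait()        # blocked on "ZK" while still holding the lock

    def handle_finished_task(task_id):
        # Analogue of JobScheduler::handleFinishedTask: needs the same lock
        # before it can process the update and ACK it.
        if not lock.acquire(timeout=timeout):
            return False              # lock still held by main_loop, no ACK sent
        try:
            acked.append(task_id)
            return True
        finally:
            lock.release()

    worker = threading.Thread(target=main_loop, daemon=True)
    worker.start()
    lock_taken.wait()

    # While "ZK" hangs, the handler cannot get the lock.
    acked_while_blocked = handle_finished_task("task-1")

    # Once "ZK" answers, main_loop releases the lock and the ACK goes through.
    zk_responds.set()
    worker.join()
    acked_after_release = handle_finished_task("task-1")

    return acked_while_blocked, acked_after_release, acked
```

Calling `simulate()` shows the handler failing to get the lock while the simulated ZK call hangs, and succeeding once it returns. This matches what we see: as long as the lock holder is blocked on ZK, no ACK reaches the master, so the master keeps retrying the FINISHED updates.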
Software Versions:
- Chronos 3.0.3
- Mesos 1.4.0
- Zookeeper 3.4.5