
Chronos Intermittent Issue: Jobs get stuck #897

@harjinder-flipkart

Intermittent Chronos Issue:
On our Chronos cluster, we have been intermittently hitting an issue where Chronos jobs stop being executed on Mesos. The observed sequence of events is as follows:

  • Chronos jobs are not executed by Mesos.
  • The status of the jobs on the Chronos dashboard is ‘Queued’.
  • Mesos master logs show that
    -- the master has not been sending resource offers to the framework, i.e. Chronos.
    -- the master keeps receiving status updates from slaves for old tasks.
    -- the master keeps trying to forward those updates to Chronos.
    -- Zookeeper and the slaves are not down; they are working fine.
  • After restarting Chronos and Zookeeper, the system works again and Chronos jobs start getting executed.

Whys:

  • Why do Chronos jobs stop getting executed?
    Chronos, as a Mesos application (framework), waits for resource offers from the Mesos master.
    The Mesos master generally sends resource offers at a very high frequency, i.e. every 100 ms to a few seconds. In this case, however, the master stopped sending resource offers. Without these offers, Chronos is stuck.
  • Why did the Mesos master stop sending resource offers?
    The Mesos slaves were occupied with FINISHED tasks. The slaves were telling the master that a task is FINISHED, and the master was trying to forward that update to the Chronos leader and waiting for an ACK. Chronos was not sending the ACK.
  • Why did Chronos not send the ACK?
    The "JobScheduler::handleFinishedTask" thread in the Chronos leader was waiting on a ReentrantLock that was held by the "JobScheduler::mainLoop" thread.
  • Why did the "JobScheduler::mainLoop" thread not release the lock?
    The mainLoop thread was trying to reload jobs from ZK and was blocked on ZK.
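
The last two "whys" describe a plain lock-contention pattern: one thread holds a ReentrantLock across a blocking call, and a second thread that needs the same lock cannot make progress. The sketch below is not Chronos code; it only illustrates that pattern under stated assumptions. The class name `LockContentionSketch`, the method `handlerAcquires`, and the use of `Thread.sleep` as a stand-in for a blocking ZK read are all hypothetical. The timed `tryLock` is used only to make the stall observable, not as a proposed fix.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockContentionSketch {
    // Hypothetical stand-in for the lock shared by mainLoop and handleFinishedTask.
    private static final ReentrantLock lock = new ReentrantLock();

    /**
     * Starts a "mainLoop" thread that holds the lock for holdMs, then reports
     * whether the calling ("handler") thread could take the lock within timeoutMs.
     */
    static boolean handlerAcquires(long holdMs, long timeoutMs) {
        CountDownLatch lockHeld = new CountDownLatch(1);

        Thread mainLoop = new Thread(() -> {
            lock.lock();
            try {
                lockHeld.countDown();  // signal that the lock is now held
                Thread.sleep(holdMs);  // assumption: stands in for a blocking ZK read
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        }, "mainLoop");
        mainLoop.start();

        try {
            lockHeld.await();
            // The handler needs the same lock before it could acknowledge the task.
            boolean acquired = lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.unlock();
            }
            mainLoop.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Short hold: the handler gets the lock once mainLoop releases it.
        System.out.println("short hold: " + handlerAcquires(50, 2000));
        // Long hold: the handler gives up while mainLoop is still blocked.
        System.out.println("long hold: " + handlerAcquires(1000, 100));
    }
}
```

With a plain `lock()` instead of the timed `tryLock`, the handler thread would wait for as long as the holder stays blocked, which matches the behaviour described above: no ACK reaches the master until the mainLoop thread returns from ZK.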

Software Versions:

  • Chronos 3.0.3
  • Mesos 1.4.0
  • Zookeeper 3.4.5
