Skip to content

Job Launcher and Job Handle: design doc/implementation/unit tests (TA: NVFlare developers)#4336

Open
IsaacYangSLA wants to merge 1 commit intoNVIDIA:mainfrom
IsaacYangSLA:add_k8s_job_launcher_and_design_docs
Open

Job Launcher and Job Handle: design doc/implementation/unit tests (TA: NVFlare developers)#4336
IsaacYangSLA wants to merge 1 commit intoNVIDIA:mainfrom
IsaacYangSLA:add_k8s_job_launcher_and_design_docs

Conversation

@IsaacYangSLA
Copy link
Collaborator

Description

Job Launcher and Job Handle design docs
Implementation of Job Launcher and Job Handle for kubernetes (docker is WIP)
Unit tests for implementation

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings March 19, 2026 00:04
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 19, 2026

Greptile Summary

This PR introduces the JobLauncher and JobHandle abstractions for NVFlare's Kubernetes (and in-progress Docker) backends, along with a Best-Effort resource manager/consumer, a configurable ignore_host mode for the GPU resource manager, and a parameterised cell-creation timeout in the client communicator. The design follows an event-driven selection model where the upper engine layers program exclusively against JobLauncherSpec/JobHandleSpec interfaces.

Key changes:

  • K8sJobHandle / K8sJobLauncher: Full Kubernetes pod lifecycle management including manifest construction, stuck-detection with configurable grace period, and terminal state persistence. Most issues surfaced in earlier review rounds (None args in pod spec, unset terminal_state after deletion, timeout=None disabling stuck detection, wait() not caching final state) have been addressed in this revision.
  • DockerJobHandle / DockerJobLauncher: The UnboundLocalError caused by handle being referenced before assignment in except blocks is fixed — handle = DockerJobHandle() is now created before the try block.
  • BEResourceManager / BEResourceConsumer: New "best-effort" implementations that accept all resource requests and defer failure to runtime.
  • GPUResourceManager: New ignore_host flag skips local GPU validation for Kubernetes deployments where the NVFlare control-plane node has no GPUs.
  • communicator.py: Replaces a hardcoded 15-second cell-creation timeout with a configurable cell_creation_timeout parameter.
  • One outstanding P0 bug: In K8sJobLauncher.launch_job(), the except ApiException handler logs self.job_id (line 314), but K8sJobLauncher has no such attribute — the correct reference is the local variable job_id. This will raise AttributeError on any pod-creation failure, masking the real error.

Confidence Score: 3/5

  • Not safe to merge as-is — there is an AttributeError in the K8sJobLauncher error path that will suppress the real exception on any pod-creation failure.
  • The majority of previously-identified issues have been resolved in this revision and the test suite is thorough. However, a new P0 bug was introduced: self.job_id in the except ApiException handler of K8sJobLauncher.launch_job() will raise AttributeError at runtime and swallow the real API error. This single bug blocks safe merging.
  • nvflare/app_opt/job_launcher/k8s_launcher.py line 314 — self.job_id must be replaced with the local variable job_id.

Important Files Changed

Filename Overview
nvflare/app_opt/job_launcher/k8s_launcher.py Core Kubernetes launcher implementation; contains a P0 AttributeError in the exception handler (self.job_id should be job_id). Most previously-reported issues (None args, terminal_state persistence, stuck detection with timeout=None, dead guard) are resolved in this revision.
nvflare/app_opt/job_launcher/docker_launcher.py Docker launcher implementation; handle = DockerJobHandle() is now correctly initialised before the try block, fixing the previously-reported UnboundLocalError. Logic looks correct.
tests/unit_test/app_opt/job_launcher/k8s_launcher_test.py Comprehensive unit tests for K8sJobHandle and launcher init/event handling; covers stuck detection, terminal state persistence, None module args, and terminate edge cases. The launch_job() error path (ApiException) is not exercised, which is why the self.job_id bug on line 314 of the launcher was not caught.
tests/unit_test/app_opt/job_launcher/docker_launcher_test.py Thorough tests for DockerJobHandle and DockerJobLauncher covering all state mappings, poll/wait/terminate lifecycle, and error paths. All previously-noted edge cases are now covered correctly.
nvflare/app_common/resource_managers/gpu_resource_manager.py Adds ignore_host flag that skips local GPU validation for K8s deployments. Logic is clean and well-guarded.
nvflare/private/fed/client/communicator.py Replaces hardcoded 15.0 cell-creation timeout with a configurable cell_creation_timeout parameter; increases polling sleep from 0.5s to 1s. Straightforward and correct.

Sequence Diagram

sequenceDiagram
    participant E as ServerEngine / ClientExecutor
    participant EV as Event Bus
    participant L as K8sJobLauncher
    participant H as K8sJobHandle
    participant K8s as Kubernetes API

    E->>EV: fire BEFORE_JOB_LAUNCH
    EV->>L: handle_event()
    L->>EV: add_launcher(self, fl_ctx)

    E->>L: launch_job(job_meta, fl_ctx)
    L->>H: K8sJobHandle(job_id, api, job_config)
    H->>H: _make_manifest()
    L->>K8s: create_namespaced_pod(manifest)
    L->>H: enter_states([RUNNING], timeout)

    loop Poll until RUNNING or timeout/stuck
        H->>K8s: read_namespaced_pod()
        K8s-->>H: pod phase
        alt phase == Pending (too long)
            H->>H: _stuck() → True
            H->>K8s: delete_namespaced_pod()
            H-->>L: return False
        else phase == Running
            H-->>L: return True
        end
    end

    L-->>E: job_handle

    E->>H: wait()
    loop Until terminal state
        H->>K8s: read_namespaced_pod()
        K8s-->>H: Succeeded / Failed
        H->>H: terminal_state = SUCCEEDED/TERMINATED
    end

    E->>H: poll()
    H-->>E: JobReturnCode (from terminal_state)

    alt Abort requested
        E->>H: terminate()
        H->>K8s: delete_namespaced_pod(grace=0)
        H->>H: terminal_state = TERMINATED
    end
Loading

Comments Outside Diff (1)

  1. nvflare/app_opt/job_launcher/k8s_launcher.py, line 314 (link)

    self.job_id does not exist on K8sJobLauncher

    K8sJobLauncher.__init__ never assigns self.job_id — the class stores self.workspace_pvc, self.etc_pvc, self.data_pvc_file_path, self.timeout, self.namespace, and self.job_handle, but no job_id. Whenever create_namespaced_pod raises an ApiException, this error-logging line will itself raise AttributeError: 'K8sJobLauncher' object has no attribute 'job_id', hiding the real exception entirely.

    The intended reference is the local variable job_id defined two lines above:

Last reviewed commit: "Design document for ..."

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds documentation, Kubernetes-oriented job launching support, and unit tests around the new JobLauncher/JobHandle abstractions, plus “best effort” resource management utilities intended for orchestration backends.

Changes:

  • Add detailed design documentation for JobLauncherSpec/JobHandleSpec and the Process/Docker/K8s implementations.
  • Extend the Kubernetes launcher/handle to build PVC-backed pod manifests, support GPU limits, and add timeout/stuck-handling logic.
  • Introduce BE (best-effort) resource manager/consumer utilities and expand GPUResourceManager with an ignore_host option.
  • Add unit tests for K8s and Docker job launchers/handles.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
docs/design/JobLauncher_and_JobHandle.md New design doc describing the launcher/handle architecture and backends.
nvflare/apis/job_launcher_spec.py Docstring updates to clarify expected return values.
nvflare/app_opt/job_launcher/k8s_launcher.py Major updates to K8s pod manifest creation, lifecycle handling, and resource/PVC wiring.
nvflare/private/fed/client/communicator.py Adds configurable timeout for waiting on cell creation during client registration.
nvflare/app_common/resource_managers/gpu_resource_manager.py Adds ignore_host option to skip host GPU validation.
nvflare/app_common/resource_managers/BE_resource_manager.py New “best effort” resource manager implementation.
nvflare/app_common/resource_consumers/BE_resource_consumer.py New no-op resource consumer.
tests/unit_test/app_opt/job_launcher/k8s_launcher_test.py New unit tests for K8s job handle/launcher behavior.
tests/unit_test/app_opt/job_launcher/docker_launcher_test.py New unit tests for Docker job handle/launcher behavior.
tests/unit_test/app_opt/job_launcher/__init__.py Test package init.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch from 8607f8a to 5ac3bf1 Compare March 19, 2026 00:48
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch from 5ac3bf1 to 7eb5b0e Compare March 19, 2026 01:50
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch from 7eb5b0e to 1048b0b Compare March 19, 2026 02:05
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch 2 times, most recently from de9d011 to 9708848 Compare March 19, 2026 03:09
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch from 9708848 to 4138a23 Compare March 19, 2026 03:21
Job Launcher for K8s environement
GPU, image, pvc updated and working
Add codes

Add unit tests
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch from 4138a23 to fb21e24 Compare March 19, 2026 03:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants