Job Launcher and Job Handle: design doc/implementation/unit tests (TA: NVFlare developers) #4336
Greptile Summary
This PR introduces the
Key changes:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant E as ServerEngine / ClientExecutor
participant EV as Event Bus
participant L as K8sJobLauncher
participant H as K8sJobHandle
participant K8s as Kubernetes API
E->>EV: fire BEFORE_JOB_LAUNCH
EV->>L: handle_event()
L->>EV: add_launcher(self, fl_ctx)
E->>L: launch_job(job_meta, fl_ctx)
L->>H: K8sJobHandle(job_id, api, job_config)
H->>H: _make_manifest()
L->>K8s: create_namespaced_pod(manifest)
L->>H: enter_states([RUNNING], timeout)
loop Poll until RUNNING or timeout/stuck
H->>K8s: read_namespaced_pod()
K8s-->>H: pod phase
alt phase == Pending (too long)
H->>H: _stuck() → True
H->>K8s: delete_namespaced_pod()
H-->>L: return False
else phase == Running
H-->>L: return True
end
end
L-->>E: job_handle
E->>H: wait()
loop Until terminal state
H->>K8s: read_namespaced_pod()
K8s-->>H: Succeeded / Failed
H->>H: terminal_state = SUCCEEDED/TERMINATED
end
E->>H: poll()
H-->>E: JobReturnCode (from terminal_state)
alt Abort requested
E->>H: terminate()
H->>K8s: delete_namespaced_pod(grace=0)
H->>H: terminal_state = TERMINATED
end
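The startup loop in the diagram (poll until RUNNING, give up when the pod sits in Pending too long) can be sketched as below. This is a minimal illustration, not the PR's code: `FakePodApi`, `SketchJobHandle`, `max_pending_polls`, and the phase-to-state mapping are all hypothetical names chosen for this example, and the fake API stands in for the real Kubernetes client so the snippet runs without a cluster.

```python
import time
from enum import Enum


class JobState(Enum):
    STARTING = "starting"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    TERMINATED = "terminated"
    UNKNOWN = "unknown"


# Kubernetes pod phases mapped to handle states (this mapping is an assumption).
POD_PHASE_TO_STATE = {
    "Pending": JobState.STARTING,
    "Running": JobState.RUNNING,
    "Succeeded": JobState.SUCCEEDED,
    "Failed": JobState.TERMINATED,
    "Unknown": JobState.UNKNOWN,
}


class FakePodApi:
    """Stand-in for the Kubernetes API: replays a scripted list of pod phases."""

    def __init__(self, phases):
        self._phases = list(phases)
        self.deleted = False

    def read_phase(self):
        # Return the next scripted phase; repeat the last one once exhausted.
        if len(self._phases) > 1:
            return self._phases.pop(0)
        return self._phases[0]

    def delete_pod(self):
        self.deleted = True


class SketchJobHandle:
    """Hypothetical handle mirroring the enter_states / _stuck / terminate steps."""

    def __init__(self, api, max_pending_polls=3, poll_interval=0.0):
        self.api = api
        self.max_pending_polls = max_pending_polls
        self.poll_interval = poll_interval
        self.pending_polls = 0
        self.terminal_state = None

    def _stuck(self, state):
        # Count consecutive polls spent in STARTING (pod phase Pending).
        if state is JobState.STARTING:
            self.pending_polls += 1
        else:
            self.pending_polls = 0
        return self.pending_polls > self.max_pending_polls

    def terminate(self):
        self.api.delete_pod()
        self.terminal_state = JobState.TERMINATED

    def enter_states(self, targets, timeout=None):
        start = time.monotonic()
        while True:
            state = POD_PHASE_TO_STATE.get(self.api.read_phase(), JobState.UNKNOWN)
            if state in targets:
                return True
            if state in (JobState.SUCCEEDED, JobState.TERMINATED):
                # Assumption: a pod that already finished can never reach the target.
                self.terminal_state = state
                return False
            if self._stuck(state):
                self.terminate()  # delete the pod, as in the "Pending (too long)" branch
                return False
            if timeout is not None and time.monotonic() - start > timeout:
                return False
            time.sleep(self.poll_interval)
```

In this sketch the timeout branch only returns `False` and leaves the pod in place; the diagram does not say whether the real handle deletes the pod on timeout, so that choice is an assumption.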
Pull request overview
Adds documentation, Kubernetes-oriented job launching support, and unit tests around the new JobLauncher/JobHandle abstractions, plus “best effort” resource management utilities intended for orchestration backends.
Changes:
- Add detailed design documentation for JobLauncherSpec/JobHandleSpec and the Process/Docker/K8s implementations.
- Extend the Kubernetes launcher/handle to build PVC-backed pod manifests, support GPU limits, and add timeout/stuck-handling logic.
- Introduce BE (best-effort) resource manager/consumer utilities and expand GPUResourceManager with an `ignore_host` option.
- Add unit tests for K8s and Docker job launchers/handles.
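To make the "PVC-backed pod manifest with GPU limits" item concrete, here is a rough sketch of what such a manifest could look like when built as a plain Python dict. The function name `make_pod_manifest`, its parameters, and the volume naming scheme are hypothetical and are not taken from the PR; the manifest fields themselves (`persistentVolumeClaim.claimName`, `volumeMounts`, `resources.limits` with the `nvidia.com/gpu` key) are standard Kubernetes Pod fields.

```python
def make_pod_manifest(job_id, image, command, pvc_mounts, num_gpus=0):
    """Build a Pod manifest dict.

    pvc_mounts maps a PersistentVolumeClaim name to the mount path
    inside the container (hypothetical parameter for this sketch).
    """
    volumes = []
    volume_mounts = []
    for index, (claim_name, mount_path) in enumerate(sorted(pvc_mounts.items())):
        volume_name = f"vol-{index}"
        volumes.append(
            {"name": volume_name, "persistentVolumeClaim": {"claimName": claim_name}}
        )
        volume_mounts.append({"name": volume_name, "mountPath": mount_path})

    container = {
        "name": f"container-{job_id}",
        "image": image,
        "command": list(command),
        "volumeMounts": volume_mounts,
    }
    if num_gpus > 0:
        # GPUs are requested through the limits section of the container resources.
        container["resources"] = {"limits": {"nvidia.com/gpu": num_gpus}}

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": job_id},
        "spec": {
            "restartPolicy": "Never",
            "containers": [container],
            "volumes": volumes,
        },
    }
```

A dict of this shape could then be passed as the `body` argument of `create_namespaced_pod`, the call shown in the sequence diagram.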
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| `docs/design/JobLauncher_and_JobHandle.md` | New design doc describing the launcher/handle architecture and backends. |
| `nvflare/apis/job_launcher_spec.py` | Docstring updates to clarify expected return values. |
| `nvflare/app_opt/job_launcher/k8s_launcher.py` | Major updates to K8s pod manifest creation, lifecycle handling, and resource/PVC wiring. |
| `nvflare/private/fed/client/communicator.py` | Adds configurable timeout for waiting on cell creation during client registration. |
| `nvflare/app_common/resource_managers/gpu_resource_manager.py` | Adds `ignore_host` option to skip host GPU validation. |
| `nvflare/app_common/resource_managers/BE_resource_manager.py` | New “best effort” resource manager implementation. |
| `nvflare/app_common/resource_consumers/BE_resource_consumer.py` | New no-op resource consumer. |
| `tests/unit_test/app_opt/job_launcher/k8s_launcher_test.py` | New unit tests for K8s job handle/launcher behavior. |
| `tests/unit_test/app_opt/job_launcher/docker_launcher_test.py` | New unit tests for Docker job handle/launcher behavior. |
| `tests/unit_test/app_opt/job_launcher/__init__.py` | Test package init. |
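The "best effort" manager and no-op consumer listed above can be pictured roughly as follows. This is only a sketch of the idea: the class names and the method names (`check_resources`, `allocate_resources`, `free_resources`, `consume`) are assumptions made for this example and may not match the interfaces in the PR or in NVFlare's resource manager spec.

```python
import uuid


class BestEffortResourceManager:
    """Always says yes: no local bookkeeping of GPUs or memory is done."""

    def check_resources(self, resource_requirement):
        # Approve every request and hand back a reservation token.
        return True, str(uuid.uuid4())

    def cancel_resources(self, resource_requirement, token):
        # Nothing was reserved, so there is nothing to cancel.
        return None

    def allocate_resources(self, resource_requirement, token):
        # No concrete resources are pinned to the job.
        return {}

    def free_resources(self, resources, token):
        return None


class BestEffortResourceConsumer:
    """No-op consumer: accepts whatever was allocated and does nothing with it."""

    def consume(self, resources):
        return None
```

Under this reading, the orchestration backend (for example, the Kubernetes scheduler honoring a pod's GPU limits) is presumably what decides whether the job actually fits, so the site-level manager has no reason to reject a request up front.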
Job Launcher for K8s environment: GPU, image, PVC updated and working. Add code. Add unit tests.
Description
Job Launcher and Job Handle design docs
Implementation of Job Launcher and Job Handle for Kubernetes (Docker is WIP)
Unit tests for implementation
Types of changes
./runtest.sh.