WIP - Add a TorchX Kubernetes job trait.#2227
Open
johnwhumphreys wants to merge 1 commit into main
@johnwhumphreys has imported this pull request. If you are a Meta employee, you can view this in D90090355.
facebook-github-bot pushed a commit that referenced this pull request on Jan 6, 2026
Summary: This adds a TorchX Kubernetes job trait and also fixes an issue in the Rust bootstrapping code. It currently includes some extra files that I will thin out before landing.
Differential Revision: D90090355
Pulled By: johnwhumphreys
233e662 to 7408d55
@johnwhumphreys has exported this pull request. If you are a Meta employee, you can view the originating Diff in D90090355.
facebook-github-bot pushed a commit that referenced this pull request on Jan 11, 2026
Summary:
# TorchX Kubernetes Job Trait
This diff adds a JobTrait for running Monarch on Kubernetes via the TorchX integration (which uses the Volcano gang scheduler under the covers). The TorchX Kubernetes integration already worked, so this change mainly updates it to the JobTrait and provides a test script and its output to show it working on a live Kubernetes cluster.
Differential Revision: D90090355
Pulled By: johnwhumphreys
7408d55 to 40ef8e9
Summary:
# TorchX Kubernetes Job Trait
This diff adds a JobTrait for running Monarch on Kubernetes via the TorchX integration (which uses the Volcano gang scheduler under the covers). The TorchX Kubernetes integration already worked, so this change mainly updates it to the JobTrait and provides a test script and its output to show it working on a live Kubernetes cluster.
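For readers unfamiliar with the job identifiers that appear in the test output, here is a minimal sketch of how such a handle can be taken apart. It assumes the TorchX app handle layout `scheduler://session/app_id` and that the Kubernetes scheduler encodes the app id as `namespace:name`, which matches handles such as `kubernetes://torchx/monarch-tests:monarch-test-1768178365-qtqgnwv3mcwrnd` in the log. The names `ParsedHandle` and `parse_k8s_handle` are hypothetical helpers for illustration and are not part of this diff, Monarch, or TorchX.

```python
from typing import NamedTuple


class ParsedHandle(NamedTuple):
    """Pieces of a TorchX-style Kubernetes app handle (illustrative only)."""

    scheduler: str
    session: str
    namespace: str
    job_name: str


def parse_k8s_handle(handle: str) -> ParsedHandle:
    """Split 'scheduler://session/namespace:name' into its parts.

    Assumption: the handle follows the layout seen in the test log; this
    helper is hypothetical and does not call into TorchX.
    """
    scheduler, sep, rest = handle.partition("://")
    if not sep:
        raise ValueError(f"missing '://' in handle: {handle!r}")
    session, sep, app_id = rest.partition("/")
    if not sep:
        raise ValueError(f"missing session separator '/' in handle: {handle!r}")
    namespace, sep, job_name = app_id.partition(":")
    if not sep:
        raise ValueError(f"missing namespace separator ':' in handle: {handle!r}")
    return ParsedHandle(scheduler, session, namespace, job_name)


if __name__ == "__main__":
    handle = "kubernetes://torchx/monarch-tests:monarch-test-1768178365-qtqgnwv3mcwrnd"
    print(parse_k8s_handle(handle))
```

Splitting with `str.partition` keeps the sketch dependency-free; inside a real TorchX environment, the library's own handle utilities would be the better choice.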
Test Plan:
A working example is provided in the /examples/kubernetes directory, and it was tested on the Monarch AWS Kubernetes cluster. The test requests 2 workers and runs a distributed operation across them. Here is the output:
```bash
root@monarch-client-demo-john:/workspace# ipython
Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.9.0 -- An enhanced Interactive Python. Type '?' for help.
Tip: Use `object?` to see the help on `object`, `object??` to view its source
...:         test_actors = workers_proc_mesh.spawn("test_actor", TestActor)
...:
...:         # Test 1: Call hello on all actors
...:         print("\n--- Test 1: Hello from all actors ---")
...:         greetings = await test_actors.hello.call()
...:         for rank, greeting in greetings.items():
...:             print(f" Rank {rank}: {greeting}")
...:
...:         # Test 2: Run a simple computation on all actors
...:         print("\n--- Test 2: Distributed computation ---")
...:         input_value = 10
...:         print(f"Input value: {input_value}")
...:         results = await test_actors.compute.call(input_value)
...:         print(f"Results from all workers: {results}")
...:         total = sum(results.values())
...:         print(f"Sum of all results: {total}")
...:
...:         # Verify results
...:         expected_total = sum(input_value * (i + 1) for i in range(2))
...:         if total == expected_total:
...:             print("\nSUCCESS! All tests passed.")
...:             print(f" Expected total: {expected_total}, Got: {total}")
...:         else:
...:             print("\nFAILURE! Results don't match.")
...:             print(f" Expected total: {expected_total}, Got: {total}")
...:             sys.exit(1)
...:
...:     except Exception as e:
...:         print(f"\n❌ ERROR: {e}")
...:         import traceback
...:
...:         traceback.print_exc()
...:         sys.exit(1)
...:     finally:
...:         # Clean up the job
...:         print("\n[Cleanup] Killing job...")
...:         job.kill()
...:         print("Done!")
...:
...:
...: if __name__ == "__main__":
...:     asyncio.run(main())
...:
================================================================================
Monarch Kubernetes JobTrait Test
================================================================================
[1/4] Creating TorchXKubernetesJob...
[2/4] Submitting job and waiting for workers to start...
This may take a few minutes while Kubernetes schedules pods...
Found cached job at path: .monarch/job_state.pkl
TorchX job kubernetes://torchx/monarch-tests:monarch-test-1768178212-ssrtpkzfzsl93c not found
Cached job cannot run this spec, removing cache
Cancelled TorchX job kubernetes://torchx/monarch-tests:monarch-test-1768178212-ssrtpkzfzsl93c
Applying current job
Submitting TorchX Kubernetes job 'monarch-test-1768178365' with 2 replicas to namespace 'monarch-tests'
TorchX job submitted: kubernetes://torchx/monarch-tests:monarch-test-1768178365-qtqgnwv3mcwrnd
Job has started, connecting to current state
TorchX job kubernetes://torchx/monarch-tests:monarch-test-1768178365-qtqgnwv3mcwrnd is running with 2 pods: ['10-0-159-6.monarch-tests.pod.cluster.local', '10-0-159-3.monarch-tests.pod.cluster.local']
Connecting to workers for mesh 'workers': ['tcp://10-0-159-6.monarch-tests.pod.cluster.local:22222', 'tcp://10-0-159-3.monarch-tests.pod.cluster.local:22222']
Monarch internal logs are being written to /tmp/root/monarch_log.log; execution id root_Jan-12_00:39_472
Saving job to cache at .monarch/job_state.pkl
[3/4] Workers ready! HostMesh extent: {hosts: 2}
Waiting for HostMesh initialization...
HostMesh initialized!
Spawning ProcMesh from HostMesh...
ProcMesh extent: {hosts: 2}
Waiting for ProcMesh initialization...
ProcMesh initialized!
[4/4] Spawning test actors and running computations...
--- Test 1: Hello from all actors ---
Rank {'hosts': 0/2}: Hello from rank {'hosts': 0}, world size: {'hosts': 2}
Rank {'hosts': 1/2}: Hello from rank {'hosts': 1}, world size: {'hosts': 2}
--- Test 2: Distributed computation ---
Input value: 10
Results from all workers: ValueMesh({hosts: 2}):
(({'hosts': 0/2}, 10), ({'hosts': 1/2}, 20))
Sum of all results: 30
SUCCESS! All tests passed.
Expected total: 30, Got: 30
[Cleanup] Killing job...
Cancelled TorchX job kubernetes://torchx/monarch-tests:monarch-test-1768178365-qtqgnwv3mcwrnd
Done!
```
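The arithmetic behind the verification step in the output above can be reproduced locally without a cluster. The sketch below assumes each worker returns `input_value * (rank + 1)`, which is inferred from the logged results (rank 0 returned 10, rank 1 returned 20) rather than taken from the actor's source. The functions `simulated_compute` and `run_check` are hypothetical stand-ins, not Monarch APIs.

```python
def simulated_compute(rank: int, input_value: int) -> int:
    """Stand-in for the remote compute call.

    Assumption: inferred from the log, where rank 0 -> 10 and rank 1 -> 20
    for an input of 10. The real actor implementation is not shown here.
    """
    return input_value * (rank + 1)


def run_check(num_workers: int, input_value: int) -> tuple[dict[int, int], int, int]:
    """Collect per-rank results and compare their sum with a closed form."""
    results = {rank: simulated_compute(rank, input_value) for rank in range(num_workers)}
    total = sum(results.values())
    # Sum of input_value * (i + 1) for i in 0..n-1 equals input_value * n * (n + 1) / 2.
    expected_total = input_value * num_workers * (num_workers + 1) // 2
    return results, total, expected_total


if __name__ == "__main__":
    results, total, expected_total = run_check(num_workers=2, input_value=10)
    print(f"Results: {results}")
    print(f"Expected total: {expected_total}, Got: {total}")
```

Using the closed form for the expected value keeps the check independent of the per-rank formula it is validating. With 2 workers and an input of 10 it gives 10 * 2 * 3 / 2 = 30, matching the total reported in the log.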
Differential Revision: D90090355
Pulled By: johnwhumphreys
40ef8e9 to 9e4bc3b
johnwhumphreys added a commit to johnwhumphreys/monarch that referenced this pull request on Jan 12, 2026
Summary: This adds a TorchX Kubernetes job trait and also fixes an issue in the Rust bootstrapping code. It currently includes some extra files that I will thin out before landing.
Pull Request resolved: meta-pytorch#2227
Differential Revision: D90090355
Pulled By: johnwhumphreys