
WIP - Add a TorchX Kubernetes job trait. #2227

Open
johnwhumphreys wants to merge 1 commit into main from torchx-job-trait

Conversation

@johnwhumphreys
Contributor

This adds a TorchX Kubernetes job trait, but also has to fix an issue in bootstrapping Rust code. It has some extra files for now that I will thin out before landing.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 5, 2026
@meta-codesync

meta-codesync bot commented Jan 5, 2026

@johnwhumphreys has imported this pull request. If you are a Meta employee, you can view this in D90090355.

facebook-github-bot pushed a commit that referenced this pull request Jan 6, 2026
Summary:
This adds a TorchX Kubernetes job trait, but also has to fix an issue in bootstrapping Rust code. It has some extra files for now that I will thin out before landing.


Differential Revision: D90090355

Pulled By: johnwhumphreys
@meta-codesync

meta-codesync bot commented Jan 6, 2026

@johnwhumphreys has exported this pull request. If you are a Meta employee, you can view the originating Diff in D90090355.

facebook-github-bot pushed a commit that referenced this pull request Jan 11, 2026
Summary:
# TorchX Kubernetes Job Trait

This diff adds a JobTrait for running Monarch on Kubernetes via the TorchX integration (which uses the Volcano gang scheduler under the covers). The TorchX Kubernetes integration already worked, so this is really just updating to the JobTrait and providing a test script and output to show it working on a live Kubernetes cluster.

Test Plan:
A working example is provided in the /examples/kubernetes directory; it was tested on the Monarch AWS Kubernetes cluster. The test requests 2 workers and runs a distributed operation. Here is the output:
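For orientation, the client-side flow the example exercises can be reconstructed from names that appear in the transcript (TorchXKubernetesJob, TestActor, job.kill()). Argument lists and exact signatures below are assumptions, so treat this as pseudocode rather than the shipped script:

```
# Pseudocode -- reconstructed from the transcript, not the actual Monarch API.
job = TorchXKubernetesJob(...)                  # [1/4] create the job spec
hosts = job.host_mesh()                         # [2/4] submit, wait for 2 pods
procs = hosts.spawn_procs()                     # [3/4] one proc per host
actors = procs.spawn("test_actor", TestActor)   # [4/4] spawn test actors
greetings = await actors.hello.call()           # Test 1: hello from all ranks
results = await actors.compute.call(10)         # Test 2: rank i returns 10*(i+1)
job.kill()                                      # cleanup: cancel the TorchX job
```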

```bash
root@monarch-client-demo-john:/workspace# ipython
Python 3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.9.0 -- An enhanced Interactive Python. Type '?' for help.
Tip: Use `object?` to see the help on `object`, `object??` to view its source

   ...:         test_actors = workers_proc_mesh.spawn("test_actor", TestActor)
   ...:
   ...:         # Test 1: Call hello on all actors
   ...:         print("\n--- Test 1: Hello from all actors ---")
   ...:         greetings = await test_actors.hello.call()
   ...:         for rank, greeting in greetings.items():
   ...:             print(f"  Rank {rank}: {greeting}")
   ...:
   ...:         # Test 2: Run a simple computation on all actors
   ...:         print("\n--- Test 2: Distributed computation ---")
   ...:         input_value = 10
   ...:         print(f"Input value: {input_value}")
   ...:         results = await test_actors.compute.call(input_value)
   ...:         print(f"Results from all workers: {results}")
   ...:         total = sum(results.values())
   ...:         print(f"Sum of all results: {total}")
   ...:
   ...:         # Verify results
   ...:         expected_total = sum(input_value * (i + 1) for i in range(2))
   ...:         if total == expected_total:
   ...:             print("\nSUCCESS! All tests passed.")
   ...:             print(f"   Expected total: {expected_total}, Got: {total}")
   ...:         else:
   ...:             print("\nFAILURE! Results don't match.")
   ...:             print(f"   Expected total: {expected_total}, Got: {total}")
   ...:             sys.exit(1)
   ...:
   ...:     except Exception as e:
   ...:         print(f"\n❌ ERROR: {e}")
   ...:         import traceback
   ...:
   ...:         traceback.print_exc()
   ...:         sys.exit(1)
   ...:     finally:
   ...:         # Clean up the job
   ...:         print("\n[Cleanup] Killing job...")
   ...:         job.kill()
   ...:         print("Done!")
   ...:
   ...:
   ...: if __name__ == "__main__":
   ...:     asyncio.run(main())
   ...:
================================================================================
Monarch Kubernetes JobTrait Test
================================================================================

[1/4] Creating TorchXKubernetesJob...

[2/4] Submitting job and waiting for workers to start...
This may take a few minutes while Kubernetes schedules pods...
Found cached job at path: .monarch/job_state.pkl
TorchX job kubernetes://torchx/monarch-tests:monarch-test-1768178212-ssrtpkzfzsl93c not found
Cached job cannot run this spec, removing cache
Cancelled TorchX job kubernetes://torchx/monarch-tests:monarch-test-1768178212-ssrtpkzfzsl93c
Applying current job
Submitting TorchX Kubernetes job 'monarch-test-1768178365' with 2 replicas to namespace 'monarch-tests'
TorchX job submitted: kubernetes://torchx/monarch-tests:monarch-test-1768178365-qtqgnwv3mcwrnd
Job has started, connecting to current state
TorchX job kubernetes://torchx/monarch-tests:monarch-test-1768178365-qtqgnwv3mcwrnd is running with 2 pods: ['10-0-159-6.monarch-tests.pod.cluster.local', '10-0-159-3.monarch-tests.pod.cluster.local']
Connecting to workers for mesh 'workers': ['tcp://10-0-159-6.monarch-tests.pod.cluster.local:22222', 'tcp://10-0-159-3.monarch-tests.pod.cluster.local:22222']
Monarch internal logs are being written to /tmp/root/monarch_log.log; execution id root_Jan-12_00:39_472
Saving job to cache at .monarch/job_state.pkl

[3/4] Workers ready! HostMesh extent: {hosts: 2}
    Waiting for HostMesh initialization...
    HostMesh initialized!
    Spawning ProcMesh from HostMesh...
    ProcMesh extent: {hosts: 2}
    Waiting for ProcMesh initialization...
    ProcMesh initialized!

[4/4] Spawning test actors and running computations...

--- Test 1: Hello from all actors ---
  Rank {'hosts': 0/2}: Hello from rank {'hosts': 0}, world size: {'hosts': 2}
  Rank {'hosts': 1/2}: Hello from rank {'hosts': 1}, world size: {'hosts': 2}

--- Test 2: Distributed computation ---
Input value: 10
Results from all workers: ValueMesh({hosts: 2}):
  (({'hosts': 0/2}, 10), ({'hosts': 1/2}, 20))
Sum of all results: 30

SUCCESS! All tests passed.
   Expected total: 30, Got: 30

[Cleanup] Killing job...
Cancelled TorchX job kubernetes://torchx/monarch-tests:monarch-test-1768178365-qtqgnwv3mcwrnd
Done!
```
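As a sanity check on the arithmetic in Test 2: with a world size of 2 and rank i returning input * (i + 1), an input of 10 yields per-rank results 10 and 20, summing to 30, as the transcript shows. A minimal standalone version of that check (the helper name is ours, not part of Monarch):

```python
def expected_sum(input_value: int, world_size: int) -> int:
    # Rank i (0-indexed) computes input_value * (i + 1), so the total is
    # input_value * (1 + 2 + ... + world_size).
    return sum(input_value * (i + 1) for i in range(world_size))


# Matches the transcript: ranks return 10 and 20, summing to 30.
print(expected_sum(10, 2))  # 30
```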

Differential Revision: D90090355

Pulled By: johnwhumphreys
johnwhumphreys added a commit to johnwhumphreys/monarch that referenced this pull request Jan 12, 2026

Summary:
This adds a TorchX Kubernetes job trait, but also has to fix an issue in bootstrapping Rust code. It has some extra files for now that I will thin out before landing.

Pull Request resolved: meta-pytorch#2227

Differential Revision: D90090355

Pulled By: johnwhumphreys

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported
