Native HashiCorp Nomad execution backend for Metaflow
Metaflow supports @kubernetes, @batch, and @slurm for remote step execution. There is no native backend for HashiCorp Nomad.
This project implements a @nomad StepDecorator that executes Metaflow steps as Nomad batch jobs using the Docker task driver — bringing Metaflow's workflow capabilities to the entire HashiCorp ecosystem.
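
    from metaflow import FlowSpec, step, nomad


    class TrainFlow(FlowSpec):
        # Metaflow requires every flow to begin with a step named `start`.
        @step
        def start(self):
            self.next(self.train)

        @nomad(image="python:3.11", cpu=500, memory=1024)
        @step
        def train(self):
            # numpy must be installed in the image; stock python:3.11 does not ship it.
            import numpy as np

            self.result = np.random.rand(100).mean()
            self.next(self.end)

        @step
        def end(self):
            print(f"Result: {self.result}")


    if __name__ == "__main__":
        TrainFlow()

| | Kubernetes | Nomad |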
|---|---|---|
| Architecture | Complex control plane, etcd, multiple components | Single binary, no etcd |
| Workloads | Containers only | Docker, binaries, VMs |
| Ops overhead | High | Low |
| HashiCorp fit | Separate ecosystem | Native Vault, Consul, Terraform integration |
| Edge/hybrid | Difficult | First-class support |
Teams running Vault, Consul, or Terraform are already in the HashiCorp ecosystem. Nomad is their natural scheduler. Until now, adopting Metaflow meant migrating their entire scheduling infrastructure. This project removes that blocker.
The @nomad decorator communicates with Nomad exclusively via its HTTP Jobs API — no SSH, no shell scripts, no HCL.
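
As a rough illustration of what "HTTP only, no HCL" can look like, the sketch below builds a JSON job spec for a single Docker batch task and prepares the `POST /v1/jobs` request without sending it. The function names `build_job_spec` and `build_register_request`, the task and group name `step`, and the choice of `urllib` are assumptions made for this example; they are not taken from the project's code.

```python
import json
import urllib.request


def build_job_spec(job_id, image, command, cpu=256, memory=512,
                   datacenter="dc1", namespace="default", env=None):
    """Return a Nomad JSON job spec for one Docker batch task (hypothetical helper)."""
    task = {
        "Name": "step",
        "Driver": "docker",
        "Config": {"image": image, "command": command[0], "args": command[1:]},
        "Env": dict(env or {}),
        "Resources": {"CPU": cpu, "MemoryMB": memory},  # MHz and MB
    }
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "batch",
            "Namespace": namespace,
            "Datacenters": [datacenter],
            "TaskGroups": [{"Name": "step", "Count": 1, "Tasks": [task]}],
        }
    }


def build_register_request(address, spec, token=None):
    """Build, but do not send, the POST /v1/jobs request."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Nomad reads ACL tokens from the X-Nomad-Token header.
        headers["X-Nomad-Token"] = token
    return urllib.request.Request(
        url=address.rstrip("/") + "/v1/jobs",
        data=json.dumps(spec).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Against a running agent, the prepared request could be sent with `urllib.request.urlopen(req)`; keeping construction separate from sending makes the spec easy to inspect and test offline.
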

    python flow.py run
            |
    @nomad decorator detected
            |
    NomadJob builds JSON job spec (CPU, memory, Docker image, env vars)
            |
    POST /v1/jobs ──────────────────────────> Nomad Cluster
            |
    GET /v1/job/:id/allocations <────────── Allocation scheduled
            |
    Poll ClientStatus: pending | running | complete | failed
            |
    GET /v1/client/fs/logs/:alloc_id (two-phase log retrieval)
            |
    Task state + logs returned to Metaflow runtime

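
The polling stage of this flow can be sketched as a small loop over allocation statuses. This is a minimal sketch, not the project's implementation: `job_outcome`, `wait_for_job`, and the injected `fetch_allocations` callable (standing in for `GET /v1/job/:id/allocations`) are hypothetical names, and treating the most recently created allocation as the current attempt is an assumption. The `lost` status is a Nomad allocation status not listed in the diagram; it is treated as terminal here.

```python
import time

# Allocation ClientStatus values after which polling can stop.
TERMINAL_STATUSES = {"complete", "failed", "lost"}


def job_outcome(allocations):
    """Reduce a list of allocation dicts to a single status string."""
    if not allocations:
        return "pending"  # nothing scheduled yet
    # Assumption: the newest allocation reflects the current attempt.
    latest = max(allocations, key=lambda a: a.get("CreateTime", 0))
    return latest.get("ClientStatus", "pending")


def wait_for_job(fetch_allocations, poll_interval=2.0, timeout=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = clock() + timeout
    while True:
        status = job_outcome(fetch_allocations())
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError("job did not finish within %.0f seconds" % timeout)
        sleep(poll_interval)
```

Injecting `fetch_allocations`, `sleep`, and `clock` keeps the loop testable without a Nomad cluster; a real client would pass a function that performs the HTTP call.
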
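Key design decision: retries use ReschedulePolicy, not RestartPolicy. RestartPolicy restarts a failed task in place on the same node, whereas ReschedulePolicy has the scheduler place a new allocation, which matches Metaflow's job-level failure recovery semantics.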
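
One way this decision could appear in the JSON job spec is sketched below: in-place restarts are disabled and the retry count is carried by the group-level `ReschedulePolicy`. The helper name `with_retries` and the 30-second delay and 24-hour interval are illustrative assumptions, not values from the project. Durations in Nomad's JSON API are expressed in nanoseconds; fields omitted here are left for Nomad to default.

```python
import copy

NANOS_PER_SECOND = 1_000_000_000


def with_retries(spec, retries, delay_seconds=30, interval_seconds=24 * 3600):
    """Return a copy of a job spec whose task groups retry via rescheduling."""
    out = copy.deepcopy(spec)  # leave the caller's spec untouched
    for group in out["Job"]["TaskGroups"]:
        # Fail fast on the node instead of restarting the task in place.
        group["RestartPolicy"] = {"Attempts": 0, "Mode": "fail"}
        # Let the scheduler place up to `retries` replacement allocations.
        group["ReschedulePolicy"] = {
            "Attempts": retries,
            "Interval": interval_seconds * NANOS_PER_SECOND,
            "Delay": delay_seconds * NANOS_PER_SECOND,
            "DelayFunction": "constant",
            "Unlimited": False,
        }
    return out
```

For example, `with_retries(spec, retries=3)` would map a Metaflow `@retry(times=3)` onto three rescheduling attempts, assuming the decorator forwards the retry count this way.
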
This repository is the active development scaffold for the GSoC 2026 Metaflow Nomad Integration project.
| Component | Status |
|---|---|
| Extension namespace package structure | Done |
| @nomad StepDecorator skeleton | Done |
| NomadClient HTTP wrapper | In progress |
| Job spec builder | In progress |
| Allocation polling + status mapping | Planned |
| Two-phase log streaming | Planned |
| ReschedulePolicy retry integration | Planned |
| OOM detection via TaskStates Events | Planned |
| CI against real Nomad dev agent | Planned |
| Full test suite | Planned |
| Documentation + setup guide | Planned |

    metaflow-nomad/
    ├── setup.py
    └── metaflow_extensions/
        └── nomad_ext/
            └── plugins/
                ├── mfextinit_nomad_ext.py   # STEP_DECORATORS_DESC registration
                └── nomad/
                    ├── __init__.py
                    ├── nomad_decorator.py   # @nomad StepDecorator
                    ├── nomad_client.py      # HTTP API wrapper
                    └── nomad_job.py         # Job spec builder

Follows the official metaflow-extensions-template namespace package pattern.
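
For orientation, the registration file in this layout might contain little more than the following. Each entry pairs a decorator name with a class path relative to the `plugins` package; the class name `NomadDecorator` is an assumption for this sketch and may differ in the actual code.

```python
# metaflow_extensions/nomad_ext/plugins/mfextinit_nomad_ext.py
# Each entry is (decorator name, class path relative to this package).
# "NomadDecorator" is a hypothetical class name used for illustration.
STEP_DECORATORS_DESC = [
    ("nomad", ".nomad.nomad_decorator.NomadDecorator"),
]
```

With an entry like this in place, Metaflow's extension loader is what makes `from metaflow import nomad` resolve once the package is installed.
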
| Parameter | Default | Description |
|---|---|---|
| address | http://localhost:4646 | Nomad API endpoint |
| image | required | Docker image to run the step |
| cpu | 256 | CPU in MHz |
| memory | 512 | Memory in MB |
| datacenter | dc1 | Nomad datacenter |
| namespace | default | Nomad namespace |
| token | None | ACL token (optional) |
| timeout_minutes | 60 | Job-level timeout |
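
To make the defaults concrete, here is a minimal sketch of how decorator arguments could be merged with the values in the table and validated. `DEFAULTS` and `resolve_attributes` are hypothetical names for this example; the actual decorator may handle its attributes differently.

```python
# Defaults copied from the parameter table; `image` has none and is required.
DEFAULTS = {
    "address": "http://localhost:4646",
    "image": None,
    "cpu": 256,             # MHz
    "memory": 512,          # MB
    "datacenter": "dc1",
    "namespace": "default",
    "token": None,
    "timeout_minutes": 60,
}


def resolve_attributes(**overrides):
    """Merge user-supplied values over the defaults and validate them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError("unknown @nomad parameters: %s" % ", ".join(sorted(unknown)))
    attrs = {**DEFAULTS, **overrides}
    if not attrs["image"]:
        raise ValueError("@nomad requires an image")
    for key in ("cpu", "memory", "timeout_minutes"):
        attrs[key] = int(attrs[key])  # accept numeric strings such as "500"
        if attrs[key] <= 0:
            raise ValueError("%s must be a positive integer" % key)
    return attrs
```

Calling `resolve_attributes(image="python:3.11", cpu=500)` returns the full attribute set with `memory`, `datacenter`, and the other unset parameters filled from the table.
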
Requirements: Nomad 1.7.7+, Python 3.9+, Docker

    # Start Nomad dev agent
    nomad agent -dev

    # Verify Nomad is running
    curl http://localhost:4646/v1/agent/self

    # Clone and install
    git clone https://github.com/prabindersinghh/metaflow-nomad
    cd metaflow-nomad
    pip install -e .

    # Run example flow
    python examples/basic_flow.py run

This is not a port of @slurm to Nomad. The two backends are architecturally different:
| | @slurm | @nomad |
|---|---|---|
| Job submission | SSH + sbatch | POST /v1/jobs HTTP |
| Execution | Linux-native, no Docker required | Docker task driver |
| Monitoring | squeue via SSH | GET /v1/job/:id/allocations |
| Log retrieval | SSH + tail file | GET /v1/client/fs/logs/:alloc_id |
| Retry | sbatch --requeue | ReschedulePolicy |
| Auth | SSH key pair | ACL Token header |
Every component is designed ground-up for Nomad's HTTP-first, container-native model.
This project will be submitted as a Google Summer of Code 2026 proposal under the Metaflow / Outerbounds organization.
- Contributor: Prabinder Singh (@prabindersinghh)
- Mentor: Madhur Tandon
- Organization: Outerbounds / Netflix Metaflow
- Project size: 350 hours (Large)
- Timeline: June to August 2026
This project is under active early development. If you are running Nomad and interested in a native Metaflow integration, feel free to open an issue or reach out on the Metaflow Slack in #gsoc-metaflow-nomad-integration.
- HashiCorp Nomad Documentation
- Nomad Jobs API
- Metaflow Extensions Template
- metaflow-slurm Reference Implementation
- Metaflow Documentation
Built with care for the Metaflow and HashiCorp communities