prabindersinghh/metaflow-nomad


Nomad extension for Metaflow

metaflow-nomad

Native HashiCorp Nomad execution backend for Metaflow



What This Is

Metaflow supports @kubernetes, @batch, and @slurm for remote step execution. There is no native backend for HashiCorp Nomad.

This project implements a @nomad StepDecorator that executes Metaflow steps as Nomad batch jobs using the Docker task driver — bringing Metaflow's workflow capabilities to the entire HashiCorp ecosystem.

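from metaflow import FlowSpec, step, nomad

class TrainFlow(FlowSpec):

    # Every Metaflow flow must begin with a step named "start".
    @step
    def start(self):
        self.next(self.train)

    # Runs as a Nomad batch job. The image must provide numpy
    # (the stock python:3.11 image does not ship it).
    @nomad(image="python:3.11", cpu=500, memory=1024)
    @step
    def train(self):
        import numpy as np
        self.result = np.random.rand(100).mean()
        self.next(self.end)

    @step
    def end(self):
        print(f"Result: {self.result}")

if __name__ == "__main__":
    TrainFlow()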

Why Nomad

| | Kubernetes | Nomad |
|---|---|---|
| Architecture | Complex control plane, etcd, multiple components | Single binary, no etcd |
| Workloads | Containers only | Docker, binaries, VMs |
| Ops overhead | High | Low |
| HashiCorp fit | Separate ecosystem | Native Vault, Consul, Terraform integration |
| Edge/hybrid | Difficult | First-class support |

Teams running Vault, Consul, or Terraform are already in the HashiCorp ecosystem. Nomad is their natural scheduler. Until now, adopting Metaflow meant migrating their entire scheduling infrastructure. This project removes that blocker.


Architecture

The @nomad decorator communicates with Nomad exclusively over its HTTP API (job submission, allocation status, and log retrieval): no SSH, no shell scripts, no HCL.

python flow.py run
       |
@nomad decorator detected
       |
NomadJob builds JSON job spec (CPU, memory, Docker image, env vars)
       |
POST /v1/jobs  ──────────────────────────>  Nomad Cluster
                                                   |
GET /v1/job/:id/allocations  <──────────  Allocation scheduled
       |
Poll ClientStatus: pending | running | complete | failed
       |
GET /v1/client/fs/logs/:alloc_id  (two-phase log retrieval)
       |
Task state + logs returned to Metaflow runtime

Key design decision: Retry uses ReschedulePolicy (not RestartPolicy) — matching Metaflow's job-level failure recovery semantics, not in-place node restart.
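
As a rough illustration of the "NomadJob builds JSON job spec" step and the ReschedulePolicy decision, the sketch below assembles a body for POST /v1/jobs as a plain dict. It is not the project's actual builder: the function name `build_job_payload`, the task and group names, and the concrete delay and interval values are assumptions chosen for the example. The field names follow Nomad's JSON job format, which expresses durations in nanoseconds.

```python
import json

NANOS_PER_SECOND = 1_000_000_000  # Nomad's JSON API expresses durations in nanoseconds


def build_job_payload(job_id, image, command, cpu=256, memory=512,
                      datacenter="dc1", namespace="default", env=None, retries=0):
    """Return a request body for POST /v1/jobs (hypothetical helper)."""
    task = {
        "Name": "metaflow-step",  # assumed task name
        "Driver": "docker",
        "Config": {"image": image, "command": command[0], "args": list(command[1:])},
        "Env": dict(env or {}),
        "Resources": {"CPU": cpu, "MemoryMB": memory},  # MHz and MB
    }
    group = {
        "Name": "metaflow",  # assumed group name
        "Count": 1,
        "Tasks": [task],
        # Disable in-place restarts on the same node.
        "RestartPolicy": {"Attempts": 0, "Mode": "fail"},
        # Retries are expressed as reschedules, which create a new allocation.
        "ReschedulePolicy": {
            "Attempts": retries,
            "Interval": 24 * 3600 * NANOS_PER_SECOND,  # assumed window
            "Delay": 5 * NANOS_PER_SECOND,             # assumed delay
            "DelayFunction": "constant",
            "Unlimited": False,
        },
    }
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "batch",
            "Datacenters": [datacenter],
            "Namespace": namespace,
            "TaskGroups": [group],
        }
    }


payload = build_job_payload(
    "trainflow-1-train", "python:3.11", ["python", "-c", "print('hi')"],
    cpu=500, memory=1024, retries=2,
)
print(json.dumps(payload, indent=2))
```

Keeping the builder a pure function that returns a dict makes it straightforward to unit-test without a running Nomad agent.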


Current Status

This repository is the active development scaffold for the GSoC 2026 Metaflow Nomad Integration project.

| Component | Status |
|---|---|
| Extension namespace package structure | Done |
| @nomad StepDecorator skeleton | Done |
| NomadClient HTTP wrapper | In progress |
| Job spec builder | In progress |
| Allocation polling + status mapping | Planned |
| Two-phase log streaming | Planned |
| ReschedulePolicy retry integration | Planned |
| OOM detection via TaskStates Events | Planned |
| CI against real Nomad dev agent | Planned |
| Full test suite | Planned |
| Documentation + setup guide | Planned |
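
Two of the planned pieces, status mapping and OOM detection, can be sketched as pure functions over the JSON that Nomad returns for allocations. This is only one possible shape, not the project's implementation. The function names and the returned state strings are hypothetical, and the OOM check assumes that the Docker driver records an `oom_killed` entry in the `Details` of a `Terminated` task event, which should be verified against a real agent.

```python
# Map Nomad's allocation ClientStatus to a coarse step state (names are assumptions).
TERMINAL = {"complete": "succeeded", "failed": "failed", "lost": "failed"}


def latest_allocation(allocations):
    """Pick the most recently created allocation; a reschedule creates a new one."""
    return max(allocations, key=lambda a: a.get("CreateIndex", 0), default=None)


def step_state(allocations):
    """Reduce the allocation list for a job to a single state string."""
    alloc = latest_allocation(allocations)
    if alloc is None:
        return "pending"  # job registered, nothing scheduled yet
    status = alloc.get("ClientStatus", "pending")
    if status in TERMINAL:
        return TERMINAL[status]
    return "running" if status == "running" else "pending"


def was_oom_killed(allocation):
    """Check task events for an OOM kill (assumed event shape, see lead-in)."""
    for task_state in (allocation.get("TaskStates") or {}).values():
        for event in task_state.get("Events") or []:
            details = event.get("Details") or {}
            if event.get("Type") == "Terminated" and details.get("oom_killed") == "true":
                return True
    return False


allocs = [
    {"ID": "a1", "CreateIndex": 10, "ClientStatus": "failed"},
    {"ID": "a2", "CreateIndex": 12, "ClientStatus": "running"},
]
print(step_state(allocs))
```

The sketch deliberately ignores one subtlety: between a failure and its reschedule, the newest allocation is still the failed one, so a real poller would also need to consult the allocation's reschedule information before reporting failure.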

Repository Structure

metaflow-nomad/
├── setup.py
└── metaflow_extensions/
    └── nomad_ext/
        └── plugins/
            ├── mfextinit_nomad_ext.py    # STEP_DECORATORS_DESC registration
            └── nomad/
                ├── __init__.py
                ├── nomad_decorator.py    # @nomad StepDecorator
                ├── nomad_client.py       # HTTP API wrapper
                └── nomad_job.py          # Job spec builder

The layout follows the official metaflow-extensions-template namespace package pattern.
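
For orientation, the registration file in this layout is typically very small. The sketch below shows what mfextinit_nomad_ext.py might contain, using the same `STEP_DECORATORS_DESC` format as Metaflow's built-in plugins: a list of (decorator name, relative class path) pairs. The class name `NomadDecorator` is an assumption; the repository may use a different one.

```python
# metaflow_extensions/nomad_ext/plugins/mfextinit_nomad_ext.py
#
# Each entry maps a decorator name to a class path relative to this
# plugins package. "NomadDecorator" is an assumed class name.
STEP_DECORATORS_DESC = [
    ("nomad", ".nomad.nomad_decorator.NomadDecorator"),
]
```

With an entry like this in place, Metaflow resolves `from metaflow import nomad` to the decorator class once the package is installed.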


Decorator Parameters

| Parameter | Default | Description |
|---|---|---|
| address | http://localhost:4646 | Nomad API endpoint |
| image | required | Docker image to run the step |
| cpu | 256 | CPU in MHz |
| memory | 512 | Memory in MB |
| datacenter | dc1 | Nomad datacenter |
| namespace | default | Nomad namespace |
| token | None | ACL token (optional) |
| timeout_minutes | 60 | Job-level timeout |
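
The table above can be read as a defaults mapping that user-supplied arguments override. The following is a minimal sketch of that idea, not the decorator's real code: `NOMAD_DEFAULTS`, `resolve_attributes`, and `auth_headers` are hypothetical names. The `X-Nomad-Token` header is the one Nomad's HTTP API uses for ACL tokens.

```python
# Defaults copied from the parameter table; image has no default and is required.
NOMAD_DEFAULTS = {
    "address": "http://localhost:4646",
    "image": None,
    "cpu": 256,             # MHz
    "memory": 512,          # MB
    "datacenter": "dc1",
    "namespace": "default",
    "token": None,
    "timeout_minutes": 60,
}


def resolve_attributes(**overrides):
    """Merge user overrides onto the defaults and validate them (hypothetical helper)."""
    unknown = set(overrides) - set(NOMAD_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown @nomad parameters: {sorted(unknown)}")
    attrs = {**NOMAD_DEFAULTS, **overrides}
    if not attrs["image"]:
        raise ValueError("@nomad requires an image")
    return attrs


def auth_headers(attrs):
    """Build request headers, adding the ACL token only when one is set."""
    headers = {"Content-Type": "application/json"}
    if attrs["token"]:
        headers["X-Nomad-Token"] = attrs["token"]
    return headers


attrs = resolve_attributes(image="python:3.11", cpu=500, memory=1024)
print(attrs["datacenter"], auth_headers(attrs))
```

Rejecting unknown keys early turns a misspelled parameter into an immediate error rather than a silently ignored setting.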

Local Development Setup

Requirements: Nomad 1.7.7+, Python 3.9+, Docker

# Start Nomad dev agent (runs in the foreground; use a second terminal for the commands below)
nomad agent -dev

# Verify Nomad is running
curl http://localhost:4646/v1/agent/self

# Clone and install
git clone https://github.com/prabindersinghh/metaflow-nomad
cd metaflow-nomad
pip install -e .

# Run example flow
python examples/basic_flow.py run

Relation to @slurm

This is not a port of @slurm to Nomad. The two backends are architecturally different:

| | @slurm | @nomad |
|---|---|---|
| Job submission | SSH + sbatch | POST /v1/jobs HTTP |
| Execution | Linux-native, no Docker required | Docker task driver |
| Monitoring | squeue via SSH | GET /v1/job/:id/allocations |
| Log retrieval | SSH + tail file | GET /v1/client/fs/logs/:alloc_id |
| Retry | sbatch --requeue | ReschedulePolicy |
| Auth | SSH key pair | ACL Token header |

Every component is designed ground-up for Nomad's HTTP-first, container-native model.
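
To make the log-retrieval row concrete, here is a small sketch of how a request URL for GET /v1/client/fs/logs/:alloc_id could be built. The helper name `logs_url` and the task name are assumptions. The query parameters (`task`, `type`, `origin`, `offset`, `plain`) are ones the Nomad logs endpoint accepts. The sketch does not attempt the two-phase strategy mentioned earlier, which this README does not spell out.

```python
from urllib.parse import quote, urlencode


def logs_url(address, alloc_id, task, stream="stdout", offset=0):
    """Build the URL for reading one task's log stream as plain text (hypothetical helper)."""
    query = urlencode({
        "task": task,
        "type": stream,      # "stdout" or "stderr"
        "origin": "start",   # interpret offset from the beginning of the log
        "offset": offset,    # bytes already read, so a poller can resume
        "plain": "true",     # raw text instead of JSON-framed chunks
    })
    return f"{address.rstrip('/')}/v1/client/fs/logs/{quote(alloc_id, safe='')}?{query}"


print(logs_url("http://localhost:4646", "abc123", "metaflow-step"))
```

Passing the number of bytes already consumed as `offset` lets a poller fetch only new output on each call instead of re-reading the whole log.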


GSoC 2026

This project is being proposed for development as part of Google Summer of Code 2026 under the Metaflow / Outerbounds organization.

  • Contributor: Prabinder Singh (@prabindersinghh)
  • Mentor: Madhur Tandon
  • Organization: Outerbounds / Netflix Metaflow
  • Project size: 350 hours (Large)
  • Timeline: June to August 2026

Contributing

This project is under active early development. If you are running Nomad and interested in a native Metaflow integration, feel free to open an issue or reach out on the Metaflow Slack in #gsoc-metaflow-nomad-integration.


Built with care for the Metaflow and HashiCorp communities
