Kubeflow Trainer



Latest News 🔥

  • [2025/11] Kubeflow Trainer v2.1 is officially released with support for the Distributed Data Cache, topology-aware scheduling with Kueue and Volcano, and LLM post-training enhancements. Check out the GitHub release notes.
  • [2025/09] Kubeflow SDK v0.1 is officially released with support for CustomTrainer, BuiltinTrainer, and local PyTorch execution. Check out the GitHub release notes.
  • [2025/07] PyTorch on Kubernetes: Kubeflow Trainer Joins the PyTorch Ecosystem. Find the announcement in the PyTorch blog post.

Overview

Kubeflow Trainer is a Kubernetes-native distributed training platform for scalable fine-tuning of large language models (LLMs) and training of AI models across a wide range of frameworks, including PyTorch, MLX, HuggingFace, DeepSpeed, JAX, XGBoost, and more.

Kubeflow Trainer brings MPI to Kubernetes, orchestrating multi-node, multi-GPU distributed jobs efficiently across high-performance computing (HPC) clusters. This enables high-throughput communication between processes, making it ideal for large-scale AI training that requires ultra-fast synchronization between GPU nodes.

Kubeflow Trainer seamlessly integrates with the Cloud Native AI ecosystem, including Kueue for topology-aware scheduling and multi-cluster job dispatching, as well as JobSet and LeaderWorkerSet for AI workload orchestration.

Kubeflow Trainer provides a distributed data cache designed to stream large-scale data with zero-copy transfer directly to GPU nodes. This ensures memory-efficient training jobs while maximizing GPU utilization.

With the Kubeflow Python SDK, AI practitioners can effortlessly develop and fine-tune LLMs while leveraging the Kubeflow Trainer APIs: TrainJob and Runtimes.
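To make the TrainJob and Runtimes concepts concrete, here is a minimal sketch of a TrainJob manifest that references a training runtime. This is an illustrative assumption based on the v2 API described above, not the authoritative schema; the exact API version, runtime name, and field names may differ from the released CRDs, so consult the official Kubeflow Trainer documentation for the definitive spec.

```yaml
# Hypothetical TrainJob sketch: field names and values are assumptions
# for illustration; verify against the official Kubeflow Trainer docs.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-example          # hypothetical job name
spec:
  runtimeRef:
    name: torch-distributed      # assumed name of a pre-installed training runtime
  trainer:
    numNodes: 2                  # number of training nodes to launch
```

A TrainJob stays intentionally small because the referenced runtime carries the framework-specific orchestration details, letting AI practitioners submit jobs without owning the full Kubernetes configuration.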


Kubeflow Trainer Introduction

The following KubeCon + CloudNativeCon 2024 talk provides an overview of Kubeflow Trainer capabilities:

Kubeflow Trainer

Getting Started

Please check the official Kubeflow Trainer documentation to install and get started with Kubeflow Trainer.

Community

The following links provide information on how to get involved in the community:

Contributing

Please refer to the CONTRIBUTING guide.

Changelog

Please refer to the CHANGELOG.

Kubeflow Training Operator V1

The Kubeflow Trainer project is currently in alpha status, and its APIs may change. If you are using Kubeflow Training Operator V1, please refer to this migration document.

The Kubeflow community will maintain the Training Operator V1 source code in the release-1.9 branch.

You can find the documentation for Kubeflow Training Operator V1 in these guides.

Acknowledgement

This project originally started as a distributed training operator for TensorFlow. We later merged efforts from the other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to everyone who filed issues or helped resolve them, asked and answered questions, and took part in inspiring discussions. We'd also like to thank everyone who contributed to and maintained the original operators.