1 change: 1 addition & 0 deletions docs/index.md
@@ -80,6 +80,7 @@ apidocs/index.rst
:caption: Releases
:hidden:
releases/release-process.md
releases/software-versions.md
releases/changelog.md
releases/known-issues.md
1 change: 1 addition & 0 deletions docs/releases/README.md
@@ -6,6 +6,7 @@ This directory contains release information, version history, and known issues f

| Document | Purpose | When to Read |
|----------|---------|--------------|
| **[Release Process](release-process.md)** | Release cycle, RC cadence, code-freeze, golden values, CI annotations | Participating in or understanding a release |
| **[Software Versions](software-versions.md)** | Current software versions and dependencies | Checking compatibility, planning upgrades |
| **[Changelog](https://github.com/NVIDIA-NeMo/Megatron-Bridge/releases)** | Detailed release history and changes | Understanding what changed in each release |
| **[Known Issues](known-issues.md)** | Known bugs, limitations, and workarounds | Troubleshooting issues, planning workarounds |
85 changes: 85 additions & 0 deletions docs/releases/release-process.md
@@ -0,0 +1,85 @@
# Release Developer Guide

## Overview

Our release cycle spans **2 months**. During this window, we develop and land features through a series of Release Candidates (RCs), before entering a code-freeze period for stabilization and a final release.

-----

## Release Candidate Cadence

New RCs are cut every **Saturday**, when the weekly pipeline runs.

|RC |Approximate Timing|Key Activity |
|---|------------------|----------------------------------|
|RC0|Week 1 (7th–10th) |Major dependency bump: NGC PyTorch|
|RC1|Week 2 |Dependency bump: TransformerEngine|
|RC2|Week 3 |Feature development continues |
|RC3|Week 4 |**Code-freeze begins** |
| |Week 5 |Bug fixes, small improvements |
| |Week 6 |Bug fixes, small improvements |
| |Week 7 |QA exit, release |
Comment on lines +11 to +21
⚠️ Potential issue | 🟡 Minor

Clarify Saturday cadence vs date-range example to avoid ambiguity.

Line 11 states RCs are cut every Saturday, but Line 15 shows “Week 1 (7th–10th)”, which reads as a multi-day window. Consider making the table strictly week-based (or explicitly marking date ranges as approximate windows).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/releases/release-process.md` around lines 11 - 21, The table in the
release-process doc mixes a "Saturday" cadence with a multi-day date example
(e.g., the "RC0 | Week 1 (7th–10th)" row), which is ambiguous; update the table
entries for RC0–RC3 to use consistent week-based labels or explicitly mark any
multi-day ranges as "approximate" or "window" (e.g., change "Week 1 (7th–10th)"
to "Week 1 (approx. 7th–10th)" or simply "Week 1") and ensure the top text (the
sentence about Saturdays) clarifies that RCs are cut on a Saturday within those
week windows; adjust rows referenced by RC0, RC1, RC2, RC3 accordingly.


RC0 through RC2 are a **feature development phase** — new features are actively being landed. Stabilization begins at RC3 with code-freeze.

From RC3 onward, RCs are cut **more frequently and as needed**, rather than strictly on Saturdays.
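
As an illustration of the weekly cadence, the sketch below computes the Saturday cut dates for RC0 through RC3. The function names and the example start date are hypothetical and not part of any release tooling; the only rule taken from this guide is that the weekly RCs are cut on Saturdays.

```python
from datetime import date, timedelta

SATURDAY = 5  # date.weekday(): Monday is 0, Saturday is 5


def first_saturday_on_or_after(day: date) -> date:
    """Return the first Saturday that falls on or after `day`."""
    return day + timedelta(days=(SATURDAY - day.weekday()) % 7)


def weekly_rc_dates(cycle_start: date, count: int = 4) -> dict[str, date]:
    """Map RC0..RC{count-1} to consecutive Saturdays from `cycle_start`."""
    first = first_saturday_on_or_after(cycle_start)
    return {f"RC{i}": first + timedelta(weeks=i) for i in range(count)}


# Hypothetical cycle whose first week starts on Friday 2025-02-07.
schedule = weekly_rc_dates(date(2025, 2, 7))
for name, cut in schedule.items():
    print(name, cut.isoformat())
```

RCs cut after RC3 are deliberately left out, since they follow no fixed schedule.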

-----

## Golden Values

Golden values are reference outputs used to validate model behavior in CI.

### During the RC Phase (before code-freeze)

Golden values are updated **selectively**:

- They are updated if the new values represent an **improvement**, or
- If the team **collectively decides** that a regression is acceptable.

This means golden values are not automatically updated with every run — a deliberate decision is required for any regression.

### On the Release Branch (during code-freeze)

When the release branch is created at code-freeze, all golden values are updated **unconditionally**. Whatever the current output is becomes the new reference baseline for the release.
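
The two update policies can be summarized as a small decision function. This is a minimal sketch with hypothetical names (`Phase`, `should_update_golden`); it mirrors the rules described above rather than any actual CI code.

```python
from enum import Enum


class Phase(Enum):
    RC = "rc"                          # before code-freeze
    RELEASE_BRANCH_CUT = "branch_cut"  # release branch created at code-freeze


def should_update_golden(
    phase: Phase,
    is_improvement: bool,
    regression_accepted: bool = False,
) -> bool:
    """Decide whether a new output replaces the stored golden value."""
    if phase is Phase.RELEASE_BRANCH_CUT:
        # At branch creation, the current output becomes the baseline.
        return True
    # During the RC phase, update only on an improvement or on a
    # regression the team has collectively agreed to accept.
    return is_improvement or regression_accepted
```

Keeping `regression_accepted` as an explicit argument reflects the requirement that accepting a regression is a deliberate decision, not a default.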

-----

## Code-Freeze

Code-freeze lasts **two weeks** and begins when RC3 is cut. This is the **stabilization phase** — no new features are landed.

### First Half

- **Release branches are created.**
- All golden values on the release branch are updated unconditionally (see above).
- The **last bulk CI run** occurs one week into the code-freeze period.
- RCs continue to be cut as needed.

### Second Half

- **Engineers are responsible for updating golden values** on the release branch — reviewing any remaining discrepancies and ensuring the suite is in a clean state ahead of release.
- RCs continue to be cut as needed.

### Release Day

The release goes out on the **first Wednesday after the code-freeze window ends**.
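
The release date can be derived from the RC3 cut date. The sketch below assumes that the code-freeze window ends exactly 14 days after RC3 is cut and that "after" means strictly after, so a window ending on a Wednesday releases the following Wednesday. Both assumptions, and the function names, are illustrative.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday(): Monday is 0, Wednesday is 2
CODE_FREEZE = timedelta(weeks=2)


def first_wednesday_after(day: date) -> date:
    """Return the first Wednesday strictly after `day`."""
    days_ahead = (WEDNESDAY - day.weekday() - 1) % 7 + 1
    return day + timedelta(days=days_ahead)


def release_day(rc3_cut: date) -> date:
    """Release day, given the date RC3 was cut (start of code-freeze)."""
    freeze_end = rc3_cut + CODE_FREEZE  # assumption: exactly two weeks
    return first_wednesday_after(freeze_end)


# Hypothetical RC3 cut on Saturday 2025-03-01.
print(release_day(date(2025, 3, 1)).isoformat())
```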

-----

## CI and Known Failures

### Ticket-Annotated Tests

Failing CI tests can be linked to a tracking ticket. When a test fails with the **same error code** as the one recorded on its linked ticket, CI reports it as **"passing, with known error"** rather than a hard failure.

This means **a green CI result does not guarantee a fully healthy test suite** — it means there are no *unexpected* failures.

### Important: Keeping Annotations Up to Date

Ticket annotations must be actively maintained in **both directions**:

- **Add** a ticket annotation when a test starts failing with a known, accepted error.
- **Remove** the ticket annotation when the test heals.

If a test recovers but its ticket annotation is not removed, CI will report it as **failing** — because the actual error code no longer matches the one on record. The test being healthy is not enough; the annotation must be cleaned up for CI to go green again.
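
The reporting rules above can be sketched as a single function. The names and status strings are hypothetical, and treating a *different* error on an annotated test as a hard failure is an assumption; the guide only states that a matching error code is reported as a known error and that a healed test with a stale annotation fails.

```python
from typing import Optional


def ci_status(actual_error: Optional[str], annotated_error: Optional[str]) -> str:
    """Classify one test result; `None` means no error / no annotation."""
    if annotated_error is None:
        # Unannotated test: ordinary pass/fail.
        return "passed" if actual_error is None else "failed"
    if actual_error == annotated_error:
        # Fails exactly as recorded on the linked ticket.
        return "passing, with known error"
    # The annotation no longer matches: the test healed (actual_error is
    # None) or fails differently (assumed to be a hard failure).
    return "failed"


def pipeline_is_green(statuses: list[str]) -> bool:
    """Green means no unexpected failures, not a fully healthy suite."""
    return "failed" not in statuses
```

For example, `ci_status(None, "E42")` returns `"failed"`: the test has healed, but the stale annotation keeps CI red until it is removed.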