Change default sliding window pattern to the recommended "L" when FA3 is not available#509

Open
ddudek wants to merge 2 commits into karpathy:master from ddudek:sliding-window-fa3-fallback
Conversation

@ddudek ddudek commented Feb 6, 2026

Changes the default sliding window pattern, for setups without out-of-the-box FA3 support, to the one recommended in the warning.

This simplifies configuration for beginners running nanochat on their local setups, e.g. consumer-grade GPUs like the 3090/4090 and others without FA3 support.

Before:

$ python -m scripts.base_train --depth=12 --device-batch-size=16
...
GPU: NVIDIA GeForce RTX 3090 | Peak FLOPS (BF16): 7.10e+13
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Flash Attention 3 not available, using PyTorch SDPA fallback
WARNING: Training will be less efficient without FA3
WARNING: SDPA has no support for sliding window attention (window_pattern='SSSL'). Your GPU utilization will be terrible.
WARNING: Recommend using --window-pattern L for full context attention without alternating sliding window patterns.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Vocab size: 32,768
Model config:
{
  "sequence_len": 2048,
  "vocab_size": 32768,
  "n_layer": 12,
  "n_head": 6,
  "n_kv_head": 6,
  "n_embd": 768,
  "window_pattern": "SSSL"
}
...
step 00011/02205 (0.50%) | loss: 8.170549 | lrm: 1.00 | dt: 13119.33ms | tok/sec: 39,963 | mfu: 45.15 | epoch: 1 | total time: 0.22m | eta: 479.7m

After:

$ python -m scripts.base_train --depth=12 --device-batch-size=16
...
GPU: NVIDIA GeForce RTX 3090 | Peak FLOPS (BF16): 7.10e+13
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Flash Attention 3 not available, using PyTorch SDPA fallback
WARNING: Training will be less efficient without FA3
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Vocab size: 32,768
Model config:
{
  "sequence_len": 2048,
  "vocab_size": 32768,
  "n_layer": 12,
  "n_head": 6,
  "n_kv_head": 6,
  "n_embd": 768,
  "window_pattern": "L"
}
...
step 00011/02205 (0.50%) | loss: 8.177470 | lrm: 1.00 | dt: 7127.85ms | tok/sec: 73,554 | mfu: 91.90 | epoch: 1 | total time: 0.12m | eta: 260.6m
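For reference, comparing the two step-11 lines above works out to roughly a 1.84x throughput gain on the RTX 3090 (figures taken directly from the logs; this is just the arithmetic, not a new benchmark):

```python
# Throughput (tok/sec) and step time (ms) copied from the before/after logs.
before_tok_s, after_tok_s = 39_963, 73_554
before_dt_ms, after_dt_ms = 13_119.33, 7_127.85

throughput_gain = after_tok_s / before_tok_s  # ~1.84x more tokens per second
step_speedup = before_dt_ms / after_dt_ms     # ~1.84x shorter step time
```

The MFU jump from 45.15 to 91.90 and the ETA drop from ~480m to ~260m in the logs are consistent with the same factor.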

Collaborator

@svlandeg svlandeg left a comment

Nice idea and a super minimal edit. As you mention, merging this would make the script more user-friendly for beginners with non-FA3 setups.
