Fix RoPE cache overflow for long prompts with KV cache#520

Open

dipeshbabu wants to merge 7 commits into karpathy:master from dipeshbabu:fix/rope-cache-growth-514

Conversation


@dipeshbabu dipeshbabu commented Feb 10, 2026

When using KV cache during generation/inference, RoPE cos/sin buffers are sliced with an absolute offset (T0 = kv_cache.get_pos()), but the code only validated T <= cache_len. This can crash once T0 + T exceeds the cached RoPE length (even when T is small), matching the failure reported in #514.

Root cause:
RoPE cache bounds were validated against T instead of T0 + T, so long contexts / long generation runs can hit an out-of-bounds slice:
cos[:, T0:T0+T], sin[:, T0:T0+T].
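To make the root cause concrete, here is a minimal pure-Python sketch (function names are hypothetical, not nanochat's actual code) of why validating only T misses the overflow when the KV-cache offset is large:

```python
# Illustrative bounds checks for the slice cos[:, T0:T0+T].
# cache_len: length of the precomputed RoPE cos/sin cache
# T0: absolute position offset from the KV cache; T: new tokens this step

def old_check_ok(T, T0, cache_len):
    # Old check: only validates the new chunk length, ignoring the offset.
    return T <= cache_len

def new_check_ok(T, T0, cache_len):
    # Fixed check: the slice cos[:, T0:T0+T] needs the cache to cover T0 + T.
    return T0 + T <= cache_len

cache_len = 256
T, T0 = 8, 252           # small chunk, but deep into generation
assert old_check_ok(T, T0, cache_len)      # passes, yet the slice would overflow
assert not new_check_ok(T, T0, cache_len)  # correctly flags the overflow
```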

Change:

  • Ensure the RoPE cache covers T0 + T before slicing.
  • Grow the cache to the next power of two for amortized behavior.
  • Overwrite existing buffers instead of re-registering them.
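The growth policy in the bullets above could be sketched roughly as follows (a simplified illustration; the helper names are assumptions, not the PR's actual API):

```python
# Sketch of lazy RoPE cache growth with power-of-two sizing (illustrative only).

def next_pow2(n):
    # Smallest power of two >= n, giving amortized O(1) growth.
    p = 1
    while p < n:
        p *= 2
    return p

def ensure_rope_cache(cache_len, T0, T):
    """Return the cache length needed so that cos[:, T0:T0+T] is in bounds."""
    needed = T0 + T
    if needed <= cache_len:
        return cache_len        # cache already covers the slice
    # Grow to the next power of two. In the real fix the cos/sin buffers are
    # overwritten in place rather than re-registered on the module.
    return next_pow2(needed)

assert ensure_rope_cache(256, 252, 8) == 512   # grown
assert ensure_rope_cache(256, 0, 8) == 256     # unchanged
```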

@svlandeg added the `potential_bug` label (Needs investigation/confirmation whether or not it's a bug) Feb 10, 2026
@dipeshbabu changed the title from "Fix RoPE cache overflow for long prompts with KV cache (#514)" to "Fix RoPE cache overflow for long prompts with KV cache" Feb 19, 2026
@dipeshbabu force-pushed the fix/rope-cache-growth-514 branch from 1bf1fda to 5c5cff2 on February 20, 2026 13:29
@svlandeg self-assigned this Feb 20, 2026
@dipeshbabu (Contributor, Author) commented:

@karpathy what do you think about this? could you review it?

@svlandeg (Collaborator) left a comment:

Hi @dipeshbabu, as per the contribution guidelines, can you please declare any parts that had substantial LLM contribution, and whether there are any parts that you have not written and that you do not fully understand?

@svlandeg (Collaborator) left a comment:


Is unbounded growth a good idea?

@svlandeg removed their assignment Feb 24, 2026
@svlandeg added the `waiting` label (Waiting for user feedback/action) Feb 24, 2026
@dipeshbabu requested a review from svlandeg February 24, 2026 21:53
@dipeshbabu (Contributor, Author) commented:

> Is unbounded growth a good idea?

Unbounded growth is not ideal long-term. It fixes the crash, but without a cap it can silently keep allocating memory during very long generation.

Refactor rotary embedding cache handling to improve memory management and error handling.
@dipeshbabu (Contributor, Author) commented:

@svlandeg I updated the stale RoPE cache comment, removed the problematic blanket exception pattern, and changed the lazy RoPE cache growth to be bounded by max_rotary_seq_len to avoid unbounded memory growth in long generation runs. I also simplified _ensure_rope_cache() to infer the device from the existing buffer and kept the forward() fix based on T0 + T.
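As a rough illustration of the bounded-growth behavior described here (a sketch with hypothetical names, not the actual nanochat code):

```python
# Sketch: lazy cache growth clamped by a hard cap (max_rotary_seq_len analog).

def grow_bounded(cache_len, needed, max_len):
    """Return the new cache length covering `needed`, never exceeding `max_len`."""
    if needed <= cache_len:
        return cache_len            # no growth required
    if needed > max_len:
        # Fail loudly instead of silently allocating without limit.
        raise ValueError(
            f"requested RoPE length {needed} exceeds the cap {max_len}"
        )
    new_len = cache_len
    while new_len < needed:
        new_len *= 2                # power-of-two doubling, as in the PR
    return min(new_len, max_len)    # clamp to the cap

assert grow_bounded(256, 300, 16384) == 512
assert grow_bounded(256, 100, 16384) == 256
```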


svlandeg commented Feb 24, 2026

@dipeshbabu:

Thanks! Per the contribution guidelines, can you please declare any parts that had substantial LLM contribution, and whether there are any parts that you have not written and that you do not fully understand?

@dipeshbabu (Contributor, Author) commented:

> Hi @dipeshbabu, as per the contribution guidelines, can you please declare any parts that had substantial LLM contribution, and whether there are any parts that you have not written and that you do not fully understand?

I used an LLM in a limited way to help me think through and explain the RoPE/KV-cache indexing issue more clearly, specifically around why the cache bound needs to be checked against T0 + T during cached generation, not just T. It was used for explanation/reasoning support and wording, not for code I copied directly.

I wrote the final code changes myself and I understand the fix and the affected code path.

@svlandeg added the `code robustness` and `suggest/review` labels and removed the `potential_bug` (Needs investigation/confirmation whether or not it's a bug) and `waiting` (Waiting for user feedback/action) labels Feb 25, 2026

@svlandeg (Collaborator) left a comment:


Thanks for the quick follow-up. To be perfectly honest with you, there were a few too many small errors/issues with the original implementation, which felt vibe-coded to me. When I asked Claude, it gave a very similar implementation, code structure & comments.

I think it's good to look into this, as there is currently a "future TODO" in the code on master to dynamically grow the cache when we reach the limit here, and as #514 demonstrated, this happens with max_seq_len at 256, which I can replicate.

But maybe this is something that Andrej should look into himself...

nanochat/gpt.py (outdated):

    # The cache may also grow lazily in forward() if generation exceeds this length.
    self.rotary_seq_len = config.sequence_len * 10
    # Bound lazy growth to avoid unbounded memory usage during very long generation runs.
    self.max_rotary_seq_len = max(self.rotary_seq_len, config.sequence_len * 64)

@svlandeg (Collaborator) commented on the diff:

I don't like the fact that now there's one more magic number...

@dipeshbabu requested a review from svlandeg February 28, 2026 06:53
Development

Successfully merging this pull request may close these issues:

[BUG] AssertionError in gpt.py: Evaluation prompts exceed static RoPE cache (seq_len=256)
