Fix RoPE cache overflow for long prompts with KV cache #520
dipeshbabu wants to merge 7 commits into karpathy:master from
Conversation
Force-pushed 1bf1fda to 5c5cff2
@karpathy what do you think about this? could you review it?
svlandeg left a comment
Hi @dipeshbabu, as per the contribution guidelines, can you please declare any parts that had substantial LLM contribution, and whether there are any parts that you have not written and that you do not fully understand?
svlandeg left a comment
Is unbounded growth a good idea?
Unbounded growth is not ideal long-term. It fixes the crash, but without a cap it can silently keep allocating memory during very long generation.
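The bounded-growth idea discussed here can be sketched as a small helper (a minimal illustration, not the PR's actual code; the name `grow_rope_cache` and the doubling policy are hypothetical assumptions):

```python
def grow_rope_cache(cache_len: int, needed: int, max_len: int) -> int:
    """Return a new RoPE cache length that covers `needed` positions.

    Grows by doubling from the current `cache_len`, but never past the
    hard cap `max_len`, so very long generation runs cannot allocate
    unbounded memory. Raises if the request exceeds the cap outright.
    """
    if needed > max_len:
        raise ValueError(f"requested {needed} positions exceeds cap {max_len}")
    new_len = cache_len
    while new_len < needed:
        new_len *= 2  # doubling amortizes the cost of reallocating cos/sin buffers
    return min(new_len, max_len)
```

The doubling policy keeps reallocations logarithmic in sequence length, while the cap turns a silent memory blow-up into an explicit error.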
Refactor rotary embedding cache handling to improve memory management and error handling.
@svlandeg I updated the stale RoPE cache comment, removed the problematic blanket exception pattern, and changed the lazy RoPE cache growth to be bounded
Thanks! Per the contribution guidelines, can you please declare any parts that had substantial LLM contribution, and whether there are any parts that you have not written and that you do not fully understand?
I used an LLM in a limited way to help me think through and explain the RoPE/KV-cache indexing issue more clearly, specifically around why the cache bound needs to be checked against T0 + T during cached generation, not just T. It was used for explanation/reasoning support and wording, not for code I copied directly. I wrote the final code changes myself, and I understand the fix and the affected code path.
svlandeg left a comment
Thanks for the quick follow-up. To be perfectly honest with you, there were a few too many small errors/issues with the original implementation, which felt vibe-coded to me. When I asked Claude, it gave a very similar implementation, code structure & comments.
I think it's good to look into this, as there currently is a "future TODO" in the code on master to dynamically grow the cache when we reach the limit here, and as #514 demonstrated this happens with max_seq_len at 256, which I can replicate.
But maybe this is something that Andrej should look into himself...
nanochat/gpt.py
Outdated
# The cache may also grow lazily in forward() if generation exceeds this length.
self.rotary_seq_len = config.sequence_len * 10
# Bound lazy growth to avoid unbounded memory usage during very long generation runs.
self.max_rotary_seq_len = max(self.rotary_seq_len, config.sequence_len * 64)
I don't like the fact that now there's one more magic number...
When using KV cache during generation/inference, RoPE cos/sin buffers are sliced with an absolute offset (T0 = kv_cache.get_pos()), but the code only validated T <= cache_len. This can crash once T0 + T exceeds the cached RoPE length (even when T is small), matching the failure reported in #514.

Root cause: RoPE cache bounds were validated against T instead of T0 + T, so long contexts / long generation runs can hit an out-of-bounds slice: cos[:, T0:T0+T], sin[:, T0:T0+T].

Change: validate T0 + T before slicing.
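The described fix can be sketched with a small stand-in (the helper name `slice_rope` is hypothetical, and a plain list stands in for the real cos/sin tensors; the point is the bound on T0 + T):

```python
def slice_rope(cos_cache, t0: int, t: int):
    """Slice cached RoPE cos values for absolute positions [t0, t0+t).

    The bound must be checked against the absolute end position t0 + t,
    not just the chunk length t: with a KV cache, t can be tiny (e.g. 1
    token per decode step) while t0 keeps advancing past the cache end.
    """
    cache_len = len(cos_cache)
    if t0 + t > cache_len:  # the fix: validate t0 + t, not t alone
        raise ValueError(f"RoPE cache too small: need {t0 + t}, have {cache_len}")
    return cos_cache[t0:t0 + t]
```

Checking only `t <= cache_len` would pass for every single-token decode step yet still read past the buffer once t0 + t exceeds the cached length, which is exactly the #514 failure mode.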