Fix RoPE cache overflow for long prompts with KV cache #520
dipeshbabu wants to merge 7 commits into karpathy:master from
Conversation
Force-pushed 1bf1fda to 5c5cff2
@karpathy what do you think about this? could you review it?
svlandeg left a comment
Hi @dipeshbabu, as per the contribution guidelines, can you please declare any parts that had substantial LLM contribution, and whether there are any parts that you have not written and that you do not fully understand?
svlandeg left a comment
Is unbounded growth a good idea?
Unbounded growth is not ideal long-term. It fixes the crash, but without a cap it can silently keep allocating memory during very long generation.
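The bounded-growth idea discussed here can be sketched as a small helper (a minimal illustration, not the PR's actual code; the name `grow_rope_cache` and the doubling policy are hypothetical assumptions):

```python
def grow_rope_cache(cache_len: int, needed: int, max_len: int) -> int:
    """Return a new RoPE cache length that covers `needed` positions.

    Grows by doubling from the current `cache_len`, but never past the
    hard cap `max_len`, so very long generation runs cannot allocate
    unbounded memory. Raises if the request exceeds the cap outright.
    """
    if needed > max_len:
        raise ValueError(f"requested {needed} positions exceeds cap {max_len}")
    new_len = cache_len
    while new_len < needed:
        new_len *= 2  # doubling amortizes the cost of reallocating cos/sin buffers
    return min(new_len, max_len)
```

The doubling policy keeps reallocations logarithmic in sequence length, while the cap turns a silent memory blow-up into an explicit error.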
Refactor rotary embedding cache handling to improve memory management and error handling.
@svlandeg I updated the stale RoPE cache comment, removed the problematic blanket exception pattern, and changed the lazy RoPE cache growth to be bounded
Thanks! Per the contribution guidelines, can you please declare any parts that had substantial LLM contribution, and whether there are any parts that you have not written and that you do not fully understand?
I used an LLM in a limited way to help me think through and explain the RoPE/KV-cache indexing issue more clearly, specifically around why the cache bound needs to be checked against T0 + T during cached generation, not just T. It was used for explanation/reasoning support and wording, not for code I copied directly. I wrote the final code changes myself, and I understand the fix and the affected code path.
svlandeg left a comment
Thanks for the quick follow-up. To be perfectly honest with you, there were a few too many small errors/issues with the original implementation, which felt vibe-coded to me. When I asked Claude, it gave a very similar implementation, code structure & comments.
I think it's good to look into this, as there currently is a "future TODO" in the code on master to dynamically grow the cache when we reach the limit here, and as #514 demonstrated this happens with max_seq_len at 256, which I can replicate.
But maybe this is something that Andrej should look into himself...
nanochat/gpt.py
Outdated
# The cache may also grow lazily in forward() if generation exceeds this length.
self.rotary_seq_len = config.sequence_len * 10
# Bound lazy growth to avoid unbounded memory usage during very long generation runs.
self.max_rotary_seq_len = max(self.rotary_seq_len, config.sequence_len * 64)
I don't like the fact that now there's one more magic number...
When using KV cache during generation/inference, RoPE cos/sin buffers are sliced with an absolute offset (T0 = kv_cache.get_pos()), but the code only validated T <= cache_len. This can crash once T0 + T exceeds the cached RoPE length (even when T is small), matching the failure reported in #514.

Root cause: RoPE cache bounds were validated against T instead of T0 + T, so long contexts / long generation runs can hit an out-of-bounds slice: cos[:, T0:T0+T], sin[:, T0:T0+T].

Change: validate T0 + T before slicing.
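The described fix can be sketched with a small stand-in (the helper name `slice_rope` is hypothetical, and a plain list stands in for the real cos/sin tensors; the point is the bound on T0 + T):

```python
def slice_rope(cos_cache, t0: int, t: int):
    """Slice cached RoPE cos values for absolute positions [t0, t0+t).

    The bound must be checked against the absolute end position t0 + t,
    not just the chunk length t: with a KV cache, t can be tiny (e.g. 1
    token per decode step) while t0 keeps advancing past the cache end.
    """
    cache_len = len(cos_cache)
    if t0 + t > cache_len:  # the fix: validate t0 + t, not t alone
        raise ValueError(f"RoPE cache too small: need {t0 + t}, have {cache_len}")
    return cos_cache[t0:t0 + t]
```

Checking only `t <= cache_len` would pass for every single-token decode step yet still read past the buffer once t0 + t exceeds the cached length, which is exactly the #514 failure mode.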