Hi,
I am working on a long-sequence prediction problem where the embedding size is 500k, and I want to predict multiple steps ahead (around 100 time steps). I followed a GPT-2-style architecture, but I ran into memory problems because my input embedding vectors are 500k-dimensional. The memory problem initially came from attention, and I was able to fix that part by swapping in your attention, but I still get out-of-memory errors from the nn.Linear(emb_size, emb_size) layers in the GPT-2 architecture. Do you have an example of, or a suggestion for, using the GPT-2 architecture with your attention at this embedding size?
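For reference, here is a minimal sketch of where the allocation fails in my setup (emb_size and the layer shape are from my model; everything else is just illustrative):

```python
import torch.nn as nn

emb_size = 500_000  # dimensionality of my input embedding vectors

# Rough cost of one GPT-2-style square projection at this width:
# 500_000 ** 2 = 2.5e11 weights, i.e. roughly 1 TB in float32,
# so this single layer fails to allocate regardless of the attention used.
n_weights = emb_size ** 2
print(f"{n_weights:.2e} weights ~= {n_weights * 4 / 1e12:.1f} TB in fp32")

# This is the line that actually raises the out-of-memory error:
# proj = nn.Linear(emb_size, emb_size)
```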
Thank you