https://granite-code.github.io/granite-completebench/ runs CrossCodeEval against different models with different prompting strategies, but doesn't exactly reproduce how Continue works.
As an adjunct, having something we can run inside Continue that tests autocompletion exactly as Continue performs it will provide a more accurate measurement of completion performance.
This can also be used to feed granite-completebench more representative context snippets.
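To make the second point concrete, a minimal sketch of what an exported record might look like is below. The field names (`prefix`, `suffix`, `snippets`, `groundTruth`, etc.) are hypothetical placeholders, not Continue's internal types or granite-completebench's actual input schema; the idea is simply to capture the prompt pieces Continue assembled for a completion, plus the text that was really at the cursor, and write them out as JSONL for offline benchmarking.

```typescript
// Hypothetical shape of a captured autocomplete example.
// None of these names come from Continue or granite-completebench.
import { writeFileSync } from "fs";

interface ContextSnippet {
  filepath: string; // file the snippet was pulled from
  content: string;  // snippet text as it appeared in the prompt
}

interface CapturedCompletion {
  filepath: string;           // file being edited
  prefix: string;             // text before the cursor
  suffix: string;             // text after the cursor
  snippets: ContextSnippet[]; // context assembled for this completion
  groundTruth: string;        // text that was actually at the cursor
}

// Serialize captured examples as JSONL so they can be replayed
// in an offline benchmark run.
function exportExamples(examples: CapturedCompletion[], path: string): void {
  const lines = examples.map((e) => JSON.stringify(e)).join("\n");
  writeFileSync(path, lines + "\n", "utf8");
}

// Example usage with a single hand-written record.
exportExamples(
  [
    {
      filepath: "src/math.ts",
      prefix: "export function add(a: number, b: number) {\n  return ",
      suffix: ";\n}\n",
      snippets: [
        { filepath: "src/types.ts", content: "export type Num = number;" },
      ],
      groundTruth: "a + b",
    },
  ],
  "captured-completions.jsonl"
);
```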