Some of that is, or at least was, down to the training: extending the context window without training on sufficiently long data, or while relying on weak evaluation metrics, caused issues. More recent models have been getting better, though long-context performance is still not as good as short-context performance, even if the definition of "short context" has been greatly extended.

RoPE is great and all, but doesn't magically give 100% performance over the lengthened context; that takes more work.
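
To make that a bit more concrete, here's a rough sketch of the rotation RoPE applies, in plain Python. The helper names (`rope`, `score`), the toy vectors, and the `base=10000.0` default are my own assumptions for illustration, not any particular model's code. It only shows that the attention logit depends on the relative offset between positions, which is why the scheme is still defined past the training length; it says nothing about whether the model has learned to use those positions well.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE-style).

    `base` is an assumed default; real models may use other values.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # each pair gets its own frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and key, i.e. an unnormalised attention logit."""
    rq, rk = rope(q, q_pos), rope(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# Toy 4-dimensional query and key (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

near = score(q, k, 10, 7)            # offset of 3, early in the sequence
far = score(q, k, 100_010, 100_007)  # same offset of 3, very deep into the sequence
other = score(q, k, 10, 2)           # different offset, for contrast
```

`near` and `far` agree up to floating-point error, while `other` differs. So the math doesn't break at long range, but that's all it guarantees: the learned weights still need to have seen long sequences to do anything useful there.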