It has already been confirmed that, with the right dataset, we can scale context length effectively from 2k to 4k, and from 4k to 8k, via fine-tuning (you don't even need to train a new foundation model)
We believe the same approach can take us from 16k to well beyond 100k
Research into how RWKV handles its hidden state shows that it is barely utilized (rough estimate: under 5%), meaning there is lots of headroom for scaling context size
(This is actively being experimented on - we don't really know the limit yet)
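As a rough illustration of what this kind of context-extension fine-tune looks like, here is a minimal PyTorch sketch: start from an existing checkpoint and simply continue training on longer sequences (e.g. 2k -> 4k). The names `load_rwkv_checkpoint` and `LongDocumentDataset` are hypothetical placeholders, not actual RWKV-LM APIs, and hyperparameters are illustrative only.

```python
# Hypothetical sketch: extending an RWKV model's usable context length
# by fine-tuning an existing checkpoint on longer sequences (2048 -> 4096).
# `load_rwkv_checkpoint` and `LongDocumentDataset` are placeholders.
import torch
from torch.utils.data import DataLoader

OLD_CTX_LEN = 2048   # context length the base model was trained with
NEW_CTX_LEN = 4096   # target context length for the fine-tune

model = load_rwkv_checkpoint("rwkv-base-2048.pth")    # placeholder loader
dataset = LongDocumentDataset(ctx_len=NEW_CTX_LEN)    # placeholder: 4k-token chunks of long documents
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Small learning rate: we are adapting an existing model, not retraining from scratch.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

model.train()
for tokens in loader:                        # tokens: (batch, NEW_CTX_LEN) int64 token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                   # forward pass over the full longer window
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key point of the sketch is that nothing architectural changes: because RWKV's state carries the context, the same weights are simply trained to make better use of that state over longer documents.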