Do we have estimates of how large the available corpus is? This model's repo describes "multiple strategies to generate massive diverse synthetic reasoning data." FWIW, AI 2027 forecasts a heavy emphasis on synthetic data creation.
Is the lack of an existing corpus just an extra hurdle for Hanzi-first models that are also leading the pack in benchmarks?