>We find that the shared circuitry increases with model scale, with Claude 3.5 Haiku sharing more than twice the proportion of its features between languages as compared to a smaller model.
While this was already a common observation, it confirms once more that larger models generalize better rather than simply using their greater number of parameters to "memorize by rote" (i.e., to overfit).