
Forget single-character tokens: you can go to OpenAI's own tokenizer website [1], construct strings out of whole tokens, and ask ChatGPT to count them. For example, "hello" is a single token, and if I ask ChatGPT how many times "hello" appears in "hellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohello" (or variations thereof), it gets it right.

Be careful to structure your query so that every "hello" falls in its own token; otherwise you can inadvertently phrase the question so that the first or last "hello" gets chunked together with the text just before or just after it.

[1] https://platform.openai.com/tokenizer
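
If you'd rather not eyeball the web UI, the same check can be done locally with OpenAI's tiktoken library. Rough sketch; the encoding name below is an assumption, pick whichever matches the model you're querying:

    # Check that "hello" is a single token and that the repeated string
    # splits cleanly into one token per occurrence.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # assumption: encoding for your model

    print(enc.encode("hello"))                  # a single id -> "hello" is one token
    tokens = enc.encode("hello" * 21)
    print(len(tokens), [enc.decode([t]) for t in tokens])
    # If any decoded piece isn't exactly "hello", the boundaries got merged
    # and the count won't line up one-to-one with tokens.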




Neat finding; does it generalize to larger samples? Someone should randomly generate a few thousand such strings, feed them to 4o or o3, and get some accuracy numbers, then compare against the accuracy of counting individual letters in random strings.
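
A rough sketch of what that harness could look like, using the standard openai Python client (the model name, prompt wording, and sample size are placeholders):

    # Sketch: accuracy of counting whole-token occurrences of "hello".
    import random
    from openai import OpenAI

    client = OpenAI()

    def make_case():
        n = random.randint(2, 40)
        return "hello" * n, n

    trials, correct = 100, 0   # scale up to a few thousand for real numbers
    for _ in range(trials):
        text, expected = make_case()
        resp = client.chat.completions.create(
            model="gpt-4o",  # or "o3"
            messages=[{
                "role": "user",
                "content": f'How many times does "hello" appear in "{text}"? '
                           "Answer with just the number.",
            }],
        )
        correct += resp.choices[0].message.content.strip() == str(expected)

    print(f"accuracy: {correct / trials:.2%}")

The same harness with random letter strings instead of repeated tokens would give the comparison case.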

I find there's a lot of low-hanging fruit: claims about LLMs that are easily testable but for which no benchmarks exist. E.g. the common claim that LLMs are "unable" to multiply isn't fully accurate; someone did a proper benchmark and found a gradual decline in accuracy as digit length increases past 10 digits by 10 digits. I can't find the specific paper, but I also remember there was a way of training a model on increasingly hard problems at the "frontier" (GRPO-esque?) that fixed this, giving very high accuracy up to 20 digits by 20 digits.
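
The multiplication version is just as easy to sketch. Here ask_model is a hypothetical stand-in for the same chat-completion call as above, and the digit ranges are arbitrary:

    # Sketch: multiplication accuracy as a function of digit length.
    import random

    def make_problem(digits: int) -> tuple[str, int]:
        lo, hi = 10 ** (digits - 1), 10 ** digits - 1
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        return f"What is {a} * {b}? Answer with just the number.", a * b

    for digits in range(2, 21, 2):
        trials, correct = 50, 0
        for _ in range(trials):
            prompt, expected = make_problem(digits)
            correct += ask_model(prompt).strip() == str(expected)  # hypothetical helper
        print(f"{digits}x{digits} digits: {correct / trials:.2%}")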


Oh, that's fair. I'm not actually an LLM expert, so I could have some misunderstanding here. I remember hearing this explanation given for why previous ChatGPT models failed to answer "How many "r"s are in strawberry?", but perhaps it was an oversimplification.


Right, that's the explanation I've heard too (and I think Karpathy has even said it, so it's not some fringe theory). I wasn't dismissing the hypothesis but asking out of genuine curiosity, since this feels like something that can easily be tested on "small" large language models. There are lots of little experiments like this that can be done with small-ish models trained on purely synthetic data (the digit-multiplication stuff was done on a GPT-2-scale model, IIRC). Can models learn to count? Can they learn to add? Can they learn to copy text verbatim? Can they learn to recognize regular grammars, or even context-free grammars (this one has already been done, and the answer is yes)? And if the answer to one of these turns out to be no, we'd better find out sooner rather than later, since it means we probably need to rethink the architecture a bit.
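
For the grammar one, the synthetic data is trivial to generate. Here's a rough sketch for a Dyck-2 (balanced brackets) classification set, with the actual small-model training left out:

    # Sketch: balanced bracket strings as positives, corrupted ones as negatives.
    import random

    PAIRS = {"(": ")", "[": "]"}

    def dyck(n_pairs: int) -> str:
        out, stack = [], []
        while n_pairs or stack:
            if n_pairs and (not stack or random.random() < 0.5):
                o = random.choice(list(PAIRS))
                out.append(o); stack.append(o); n_pairs -= 1
            else:
                out.append(PAIRS[stack.pop()])
        return "".join(out)

    def corrupt(s: str) -> str:
        i = random.randrange(len(s))
        wrong = random.choice("()[]".replace(s[i], "", 1))
        return s[:i] + wrong + s[i + 1:]

    data = []
    for _ in range(10_000):
        s = dyck(random.randint(1, 20))
        data.append((s, 1))
        data.append((corrupt(s), 0))

Swapping a single bracket for a different one always breaks balance, so the negatives are guaranteed negatives while still being near-misses, which keeps the task from being solvable by surface statistics.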

I know there's a lot of theoretical CS work deriving upper bounds on these models from a circuit-complexity point of view, but since architectures are revised all the time, it's hard to tell how much of it is still relevant. Nothing beats having a concrete, working example of a model that correctly parses CFGs as a rebuttal to the claim that models just repeat their training data.



