
Forget single-character tokens: you can go to OpenAI's own tokenizer website [1], construct strings out of whole tokens, and ask ChatGPT to count them. For example, "hello" is a single token, and if I ask ChatGPT how many times "hello" appears in "hellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohellohello" (or variations thereof), it gets it right.

Be careful to structure your query so that every "hello" falls in its own token; otherwise you can inadvertently phrase the question so that the first or last "hello" gets chunked together with the text just before or just after it.

[1] https://platform.openai.com/tokenizer
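
If you'd rather not eyeball the web UI, the same check can be done locally with OpenAI's tiktoken library. Rough sketch; the encoding name below is an assumption, pick whichever matches the model you're querying:

    # Check that "hello" is a single token and that the repeated string
    # splits cleanly into one token per occurrence.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # assumption: encoding for your model

    print(enc.encode("hello"))                  # a single id -> "hello" is one token
    tokens = enc.encode("hello" * 21)
    print(len(tokens), [enc.decode([t]) for t in tokens])
    # If any decoded piece isn't exactly "hello", the boundaries got merged
    # and the count won't line up one-to-one with tokens.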




Neat finding; does it generalize to larger samples? Someone should randomly generate a few thousand such strings, feed them to 4o or o3, and get some accuracy numbers, then compare against the accuracy of counting individual letters in random strings.
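
A rough sketch of what that harness could look like, using the standard openai Python client (the model name, prompt wording, and sample size are placeholders):

    # Sketch: accuracy of counting whole-token occurrences of "hello".
    import random
    from openai import OpenAI

    client = OpenAI()

    def make_case():
        n = random.randint(2, 40)
        return "hello" * n, n

    trials, correct = 100, 0   # scale up to a few thousand for real numbers
    for _ in range(trials):
        text, expected = make_case()
        resp = client.chat.completions.create(
            model="gpt-4o",  # or "o3"
            messages=[{
                "role": "user",
                "content": f'How many times does "hello" appear in "{text}"? '
                           "Answer with just the number.",
            }],
        )
        correct += resp.choices[0].message.content.strip() == str(expected)

    print(f"accuracy: {correct / trials:.2%}")

The same harness with random letter strings instead of repeated tokens would give the comparison case.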

I find there's a lot of low-hanging fruit: claims about LLMs that are easily testable but for which no benchmarks exist. E.g. the common claim that LLMs are "unable" to multiply isn't fully accurate; someone did a proper benchmark and found a gradual decline in accuracy as digit length increases past 10 digits by 10 digits. I can't find the specific paper, but I also remember there was a way of training a model on increasingly hard problems at the "frontier" (GRPO-esque?) that fixed this, giving very high accuracy up to 20 digits by 20 digits.
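
The multiplication version is just as easy to sketch. Here ask_model is a hypothetical stand-in for the same chat-completion call as above, and the digit ranges are arbitrary:

    # Sketch: multiplication accuracy as a function of digit length.
    import random

    def make_problem(digits: int) -> tuple[str, int]:
        lo, hi = 10 ** (digits - 1), 10 ** digits - 1
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        return f"What is {a} * {b}? Answer with just the number.", a * b

    for digits in range(2, 21, 2):
        trials, correct = 50, 0
        for _ in range(trials):
            prompt, expected = make_problem(digits)
            correct += ask_model(prompt).strip() == str(expected)  # hypothetical helper
        print(f"{digits}x{digits} digits: {correct / trials:.2%}")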


Oh, that's fair. I'm not actually an LLM expert, so I could have some misunderstanding here. I remember hearing this explanation given for why previous ChatGPT models failed to answer "How many "r"s are in strawberry?", but perhaps it was an oversimplification.


Right, that's the explanation I've heard too (and I think Karpathy has even said it, so it's not some fringe theory). I wasn't dismissing the hypothesis but asking out of genuine curiosity, since this feels like something that can easily be tested on "small" large language models. There are lots of little experiments like this that can be done with small-ish models trained on purely synthetic data (the digit-multiplication stuff was done on a GPT-2-scale model, IIRC). Can models learn to count? Can they learn to add? Can they learn to copy text verbatim? Can they learn to recognize regular grammars, or even context-free grammars (this one has already been done, and the answer is yes)? And if the answer to one of these turns out to be no, we'd better find out sooner rather than later, since it means we probably need to rethink the architecture a bit.
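
For the grammar one, the synthetic data is trivial to generate. Here's a rough sketch for a Dyck-2 (balanced brackets) classification set, with the actual small-model training left out:

    # Sketch: balanced bracket strings as positives, corrupted ones as negatives.
    import random

    PAIRS = {"(": ")", "[": "]"}

    def dyck(n_pairs: int) -> str:
        out, stack = [], []
        while n_pairs or stack:
            if n_pairs and (not stack or random.random() < 0.5):
                o = random.choice(list(PAIRS))
                out.append(o); stack.append(o); n_pairs -= 1
            else:
                out.append(PAIRS[stack.pop()])
        return "".join(out)

    def corrupt(s: str) -> str:
        i = random.randrange(len(s))
        wrong = random.choice("()[]".replace(s[i], "", 1))
        return s[:i] + wrong + s[i + 1:]

    data = []
    for _ in range(10_000):
        s = dyck(random.randint(1, 20))
        data.append((s, 1))
        data.append((corrupt(s), 0))

Swapping a single bracket for a different one always breaks balance, so the negatives are guaranteed negatives while still being near-misses, which keeps the task from being solvable by surface statistics.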

I know there's a lot of theoretical CS work deriving upper bounds on these models from a circuit-complexity point of view, but since architectures are revised all the time, it's hard to tell how much of it is still relevant. Nothing beats having a concrete, working example of a model that correctly parses CFGs as a rebuttal to the claim that models just repeat their training data.



