
>1. There's this huge misconception that LLMs are literally just memorizing stuff and repeating patterns from their training data

For a lay person, what are they actually doing instead?



They can learn to generalize patterns during training and develop some model of the world. So for example, if you were to train an LLM on chess games, it would likely develop an internal model of the chess board. Then when someone plays chess with it and gives a move like Nf3, it can use that internal model to help it reason about its next move.

Or if you ask it, "what is the capital of the state that has the city Dallas?", it understands the relations and can internally reason through the two-step process of Dallas is in Texas -> the capital of Texas is Austin. A simple n-gram model may occasionally get questions like that right by a lucky guess (though usually not), whereas we can see experimentally that the LLM is actually applying the proper reasoning to the question.

You can say this is all just advanced applications of memorizing and predicting patterns, but you would have to use a broad definition of "predicting patterns" that would likely include human learning. People who declare LLMs are just glorified auto-complete are usually trying to imply they are unable to "truly" reason at all.


I don't think anyone really knows, but I also don't think it's quite an either/or. To me a more interesting way to put the question is to ask what it would mean to say that GPT-5 is just applying patterns from its training data when it finds bugs, missed by multiple human reviewers, in 1000 lines of new Rust code. "Applying a memorized pattern" seems well-defined because it is an everyday concept, but I don't think it really is. If the bug "fits a pattern" but is expressed in a different programming language, with different variable names, different context, etc., recognizing that and applying the pattern doesn't seem to me like a merely mechanical process.

Kant has an argument in the Critique of Pure Reason that reason cannot be reducible to the application of rules, because in order to apply rule A to a situation, you would need a rule B to follow for applying rule A, and a rule C for applying rule B, and this is an infinite regress. I think the same is true here: any reasonable characterization of "applying a pattern" that would succeed at reducing what LLMs do to something mechanical is vulnerable to the regress argument.

In short: even if you want to say it's pattern matching, retrieving a pattern and applying it requires something a lot closer to intelligence than the phrase makes it sound.


First: while it's not technically incorrect to say that they're learning "patterns" in the training data, the word "pattern" here is extremely deep and hides a ton of detail. These aren't simple n-grams like "if the last N tokens were ___, then ___ follows." To generate fluent conversation, new code, or poetry, the model must learn highly abstract structures that start to resemble reasoning, inference, and world-modeling. You can't predict tokens well without starting to build these higher-level capabilities on some level.
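To make the contrast concrete, here is a rough Python sketch of the kind of shallow "pattern" a literal n-gram model captures -- a toy bigram counter over a made-up corpus, nothing like a real LLM:

    from collections import Counter, defaultdict

    # Toy bigram model: "if the last token was X, then Y follows with probability p".
    corpus = "the cat sat on the mat the cat ran".split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def predict(prev):
        # Normalize the raw counts into a conditional distribution P(next | prev).
        total = sum(counts[prev].values())
        return {w: c / total for w, c in counts[prev].items()} if total else {}

    print(predict("the"))  # cat: ~0.67, mat: ~0.33 -- pure lookup over pairs it has seen
    print(predict("dog"))  # {} -- unseen context, nothing to fall back on

Everything this model "knows" lives in that count table; there is no mechanism at all for generalizing to contexts it never saw, which is exactly the gap the word "pattern" papers over when applied to LLMs.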

Second: Generative AI is about approximating an unknown data distribution. Every dataset - text, images, video - is treated as a sample from such a distribution. Success depends entirely on the model's ability to generalize outside the training set. For example, "This Person Does Not Exist" (https://this-person-does-not-exist.com/en) was trained on a data set of 1024x1024 RGB images. Each image can be thought of as a vector in a 1024x1024x3 = 3145728-dimensional space, and since all coefficients are in [0,1], these vectors all lie in the interior of a 3145728-dimensional hypercube. But almost all points in that hypercube are going to be random noise that doesn't look like a person. The ones that do will be on a lower-dimensional manifold embedded in the hypercube. The goal of these models is to infer this manifold from the training data and generate a random point on it.
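A quick numerical sketch of that picture (numpy only, numbers taken straight from the 1024x1024x3 figure above):

    import numpy as np

    # Each 1024x1024 RGB image is a point in a ~3.1-million-dimensional unit hypercube.
    dim = 1024 * 1024 * 3
    print(dim)  # 3145728

    # A point drawn uniformly at random from that hypercube is, with overwhelming
    # probability, pure static rather than a face. Realistic images occupy a thin,
    # lower-dimensional manifold inside the cube; the generator's job is to learn
    # roughly where that manifold sits and sample new points from it.
    random_image = np.random.rand(1024, 1024, 3)  # looks like TV noise if rendered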

Third: Models do what they're trained to do. Next-token prediction is one of those things, but not the whole story. A model that literally did just memorize exact fragments would not be able to zero-shot new code examples at all. That is, the transformer architecture would have learned some nonlinear transformation that is only good at repeating exact fragments. Instead, they spend a ton of time training it to get good at generalizing to new things, and it learns whatever other nonlinear transformation makes it good at doing that instead.


The definition of a language model is literally the probability distribution over the next token given the preceding text. When OP says "memorizing patterns and repeating stuff", that's a strawman of a basic n-gram model. Obviously modern language models are more advanced because we use techniques like vector tokenization, but at their core it's still just probability, limited to the corpus they were trained on.

Or, at its core: if you give it a question it's never seen, what's the most likely reply you might get? It will give you that. But that doesn't mean there is an internal world model or anything; it ultimately comes down to whether you think language is sufficient to model reality, which I don't think it is. It would obviously be very convincing, but not necessarily correct.


This isn't true at all. LLMs absolutely do build world models, and researchers have shown this many times on smaller language models.

> techniques like vector tokenization

(I assume you're talking about the input embedding.) This is really not an important part of what gives LLMs their power. The core is that you have a large scale artificial neural net. This is very different than an n-gram model and is probably capable of figuring out anything a human can figure out given sufficient scale and the right weights. We don't have that yet in practice, but it's not due to a theoretical limitation of ANNs.

> probability distribution over the next token given the preceding text.

What you're talking about is an autoregressive model. That's more of an implementation detail. There are other kinds of LLMs.

I think talking about how it's just predicting the next token is misleading. It implies it's not reasoning, not world-modeling, or is somehow limited. Reasoning is predicting, and predicting well requires world-modeling.
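For what "autoregressive" concretely means, here's a minimal sketch of the decoding loop; `model` is a placeholder, not any real API, and greedy picking stands in for whatever sampling strategy is actually used:

    # The *interface* is "given the tokens so far, give me a distribution over the
    # next token". Nothing about that interface constrains what the model does
    # internally to produce the prediction.
    def generate(model, prompt_tokens, max_new_tokens=20):
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            probs = model(tokens)                    # dict: token -> probability
            next_token = max(probs, key=probs.get)   # greedy pick; could sample instead
            tokens.append(next_token)
        return tokens

The loop is the implementation detail; all the interesting questions are about what happens inside `model`.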


>This is really not an important part of what gives LLMs their power. The core is that you have a large scale artificial neural net.

What separates transformers from LSTMs is their ability to process an entire sequence in parallel rather than token by token, and the inclusion of the more efficient "attention" mechanism that allows them to pick up long-range dependencies across a language. We don't actually understand the full nature of the latter, but I suspect it is the basis for the more "intelligent" actions of the LLM. There's quite a general range of problems that long-range dependencies can encompass, but that's still ultimately limited by language itself.
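A bare-bones sketch of the attention step itself (single head, no masking, no learned projections, numpy only), just to show mechanically how every position gets to weigh every other position at once:

    import numpy as np

    def attention(Q, K, V):
        # Scaled dot-product attention: each query position scores all key positions
        # simultaneously, which is what lets the model pick up long-range
        # dependencies and process a whole sequence in parallel.
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                   # (seq, seq) pairwise relevance
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
        return weights @ V                                # weighted mix of value vectors

    seq_len, d_model = 5, 8
    x = np.random.randn(seq_len, d_model)  # stand-in for token embeddings
    print(attention(x, x, x).shape)        # (5, 8) -- self-attention: Q = K = V = x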

But if you're talking about this being fundamentally a probability distribution model, I stand by that, because that's literally the mathematical model (softmax in the encoder and decoder) that's being used in transformers here. It very much is generating a probability distribution over the vocabulary and just picking the highest probability (or using beam search) as the next output.
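Concretely, the output head looks something like this (made-up vocabulary and logits, just to show the softmax-then-pick step):

    import numpy as np

    # The network's final layer produces a raw score (logit) per vocabulary item;
    # softmax turns the scores into a probability distribution, and decoding picks
    # from it -- greedily here, or via sampling / beam search.
    vocab = ["Paris", "London", "Austin", "banana"]
    logits = np.array([2.1, 0.3, 5.0, -1.2])     # made-up numbers
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    print(dict(zip(vocab, probs.round(3))))      # distribution over the vocabulary
    print(vocab[int(probs.argmax())])            # "Austin" -- the greedy choice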

>LLMs absolutely do build world models, and researchers have shown this many times on smaller language models.

We don't have a formal semantic definition of a "world model". I would take a lot of what these researchers are writing with a grain of salt, because something like that crosses more into philosophy (especially regarding the limits of language and logic) than the hard engineering these researchers are trained in.


This question becomes difficult whenever a system becomes sufficiently complex. Take any chaotic system, like a double pendulum, and press play at step 100,000. You ask, 'what is it doing?' Well, it's just applying its rule, step to step.

Zoom out and look at its trajectory over those 100,000 steps and ask again.

The answer is something alien. Probabilistically, it is certain that the description of its behavior is not going to exist in a space we as humans can understand. Maybe if we were god-beings we could say, 'No no, you see, the behavior of the double pendulum isn't seemingly random, you just have to look at it like this.' Encryption is a decent analogy here.

We're fooled into thinking we can understand these systems because we forced them to speak English. Under the hood is a different story.
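You can see the "just applying its rule, step to step" point with something even simpler than a double pendulum -- the logistic map, a one-line chaotic rule, swapped in here only because it fits in a few lines of Python:

    # The rule is trivial: x -> r * x * (1 - x). Step to step there is nothing to
    # explain. But two trajectories that start a billionth apart are unrecognizably
    # different after a hundred steps; the long-run behavior is not something you
    # can read off the rule.
    r = 3.9
    a, b = 0.400000000, 0.400000001
    for _ in range(100):
        a = r * a * (1 - a)
        b = r * b * (1 - b)
    print(abs(a - b))  # no longer tiny -- the gap has grown to the scale of the whole attractor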



