I'm quite skeptical of analyses like this one, because I doubt the metrics themselves. Emergence is something that is intuitively noticed by human observers. The desire to quantify everything then leads to the creation of (imperfect) metrics designed to capture what the observers already know. Those same metrics are then taken as the definition of the properties said to be emergent, and articles like this one are among the consequences of that choice.
The paper's claim is essentially "these metrics which appear to demonstrate emergence can be replaced by other metrics that also represent model behavior, but that do not have scale discontinuities, so emergence isn't a real phenomenon".
But an equally valid interpretation would be "none of these metrics actually capture the properties we are truly interested in". Which, given the complexity of what we are dealing with here, seems entirely reasonable. It's not like we suddenly learned how to accurately quantify performance at language tasks. The whole reason LLMs are so great in the first place is because traditional 'mechanical' language models suck so bad.
I think the claim that "these metrics which appear to demonstrate emergence can be replaced by other metrics that also represent model behavior, but that do not have scale discontinuities (emergence)" is really powerful.
That might mean that behaviors we consider emergent are the consequence of a process that scales continuously with model size.
i.e.: there may exist a bijection between, say, a step-function `can_do_arithmetic(size)` and a smooth, continuous function `arithmetic_skill_metric(size)`
If we can use continuous metrics to back out the step-function equivalents, that'll help us predict when and how to get particular abilities to "emerge."
For example: If a change results in a steeper slope on the continuous metric, we can predict it would cause the associated capability to emerge at relatively smaller model sizes.
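To make that concrete (a toy sketch with made-up numbers, nothing from the paper): if the underlying skill grows smoothly with scale but we only ever report a pass/fail metric, the pass/fail view looks like a sudden jump.

    import math

    def arithmetic_skill_metric(size):
        # hypothetical smooth skill curve that improves with log(model size)
        return 1.0 / (1.0 + math.exp(-2.0 * (math.log10(size) - 9.0)))

    def can_do_arithmetic(size, threshold=0.9):
        # the step-function view: "has the ability" only once skill crosses a bar
        return arithmetic_skill_metric(size) >= threshold

    for size in (10**7, 10**8, 10**9, 10**10, 10**11):
        print(size, round(arithmetic_skill_metric(size), 3), can_do_arithmetic(size))
    # the skill column rises gradually, but the boolean flips only at the largest size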
> That might mean that behaviors we consider emergent are the consequence of a process that scales continuously with model size.
Or it might mean that the metrics used are worthless for describing high-level model behavior. That's my whole point. Emergent behavior was observed, which is why we want those metrics so we can try to understand what is going on. But just because we have metrics that exhibit discontinuities at scales where humans observe emergence, doesn't mean that those metrics really represent the hard-to-define behavioral changes we have observed.
I'm not sure I understand. I think you're suggesting that the metrics currently being used to assess emergence of new capabilities in LLMs are imperfect and potentially worthless, but I don't understand what is missing.
Using the example of 4 digit multiplication in the source paper: The researcher wants to know if the model has developed the ability to multiply two four-digit integers, so they generate a battery of such problems, e.g. "what is 4363*1285? output only your answer." The metric is what percentage of the problems the LLM answers correctly.
This is pretty much the same way a human observer would identify the same emergent behavior, and also how we assess it in other humans. It's not some contrived metric that's detached from the emergent ability in question.
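For what it's worth, that kind of eval is only a few lines, and so is a smoother, partial-credit style of metric in the spirit of the paper's alternatives (a toy sketch, not the authors' code; `model_answer` stands in for whatever the LLM returns):

    import random

    def exact_match_accuracy(model_answer, problems):
        # step-ish metric: full credit only if the entire product string is right
        return sum(model_answer(a, b) == str(a * b) for a, b in problems) / len(problems)

    def digit_accuracy(model_answer, problems):
        # smoother alternative: partial credit for each correct digit position
        total = 0.0
        for a, b in problems:
            truth, guess = str(a * b), model_answer(a, b)
            total += sum(t == g for t, g in zip(truth, guess)) / len(truth)
        return total / len(problems)

    problems = [(random.randint(1000, 9999), random.randint(1000, 9999)) for _ in range(200)]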
Yes, but observed emergence goes beyond performance at individual, easy-to-define tasks.
When using GPT-2 vs GPT-3 vs GPT-4, a human can easily tell that each is leaps and bounds "better" than its predecessor, with "deeper" understanding of the input and "more human-like" responses and reasoning. There is a strong impression that those models aren't just progressing along a scale, but changing qualitatively.
I simply doubt that any of the proposed metrics captures this intuitively observable quality. Furthermore, I claim that it is this quality that actually matters. We don't need a language model in order to multiply two numbers. Any clearly defined and algorithmically solvable (and thus readily quantifiable) task is trivial for regular software.
> When using GPT-2 vs GPT-3 vs GPT-4, a human can easily tell that each is leaps and bounds "better" than its predecessor, with "deeper" understanding of the input and "more human-like" responses and reasoning. There is a strong impression that those models aren't just progressing along a scale, but changing qualitatively.
I agree with you about all of this.
> I simply doubt that any of the proposed metrics captures this intuitively observable quality. Furthermore, I claim that it is this quality that actually matters.
I think there are important qualitative observations about LLMs that create understanding and guide future research, but I think it's bad science/engineering to rely on some indescribable quality of "goodness" when it comes to developing new models.
Metrics may not be perfect, but they are all we have. You could train a model, send it to a bunch of human crowdworkers, and ask them to rate it on how good it is (Anthropic does this!), but the result of that is... another metric.
That said, yeah, I think we definitely could use better metrics. OpenAI agrees, which is why they're pushing their Evals project super hard. The paper we're commenting on agrees, which is why they're proposing alternatives to these step-function metrics.
> We don't need a language model in order to multiply two numbers. Any clearly defined and algorithmically solvable (and thus readily quantifiable) task is trivial for regular software.
A language model multiplying two numbers is the whole point of emergence.
Transformers were designed to translate between different natural languages and trained on completing sentences with some words masked out. As we scaled them bigger and bigger, we found that the same model architecture was suddenly capable of more than completing sentences: it was answering questions about high school geography (MMLU), writing computer programs, and - yes - doing basic arithmetic. New capabilities were emerging with scale.
This means that a "language model," at sufficient scale, is more than a language model. This is the closest thing we have right now to generalized AI, where one model can complete a variety of disparate tasks. The arithmetic thing is exciting and unexpected because of what it represents, and understanding it should be a priority, even if we have better non-ML ways of doing that particular task.
> Metrics may not be perfect, but they are all we have. You could train a model, send it to a bunch of human crowdworkers, and ask them to rate it on how good it is (Anthropic does this!), but the result of that is... another metric.
A metric that is an aggregate of human intuition is not the same as the usual metrics though. It's just a semi-formalized way to capture those intuitive observations, rather than trying to replace them with much-simpler piecemeal mechanical evaluations.
> but I think it's bad science/engineering to rely on some indescribable quality of "goodness" when it comes to developing new models.
IMO, bad science is what is currently happening across the entire field. Astonishing high-level behavior is being observed from models, but the tools to analyze it don't exist, so instead, people are pushing out papers at a record pace that analyze every low-level property imaginable, as if such analysis would eventually yield high-level insights.
It's okay to not know. I wish every paper dealing with LLMs would start and end with the sentence "Overall, we have no idea what is happening." Instead, we get papers that add a few numbers and then wax philosophical about how LLMs supposedly do things (I'm exaggerating here, but the gist is accurate). Not a day goes by without a new article claiming that LLMs have reached their limit or similar, while we have no clue how they even work! This is really bad science, and yes, I know that much of it is coming from non-experts, but I've seen lots and lots of experts contribute to this nonsense by making completely unjustified claims of a similar nature.
I think both of what you said is valid. It's important not to confuse "emergent" and "expected" - who said emergence shouldn't be expected? This paper, I think, argues more that nonlinear metrics producing sharp plots is "expected," but in my view that is still an emergent property. And as you said, some metrics like accuracy are important and irreplaceable.
> There is a strong impression that those models aren't just progressing along a scale, but changing qualitatively.
I don't think that follows. We only get to see GPT-2, 3, 4, not 2.5, 3.33, 3.95. We have no way to assess whether LLM performance is continuous or discontinuous.
It's like if you gave me 3 different cars with 50 horsepower, 100 horsepower, and 250 horsepower. I'd say that these cars aren't "progressing along a scale" but "showing radical leaps in performance", when in fact top speed does scale with engine power.
I think you're right, and the reason is probably that more complex concepts are hierarchies relying on the lower-level concepts to be fully correct before the higher levels can function at all. This would create an inherent thresholding that could not be avoided by any choice of metric.
To be fair though, the paper's discussion says they are not claiming that models cannot display any emergent abilities, just that some may be mirages.
That's not at all how I would assess whether someone understood how to do multiplication – I would ask them to explain their process for performing multiplication or at least to show their work. Maybe I'm misunderstanding what you're saying, though?
If you wanted to know if someone can multiply two numbers, the simplest and quickest way is probably to ask them to multiply two numbers and see if they get it right.
Remember that, in this case, all I want to know is if they have that capability or not. Also, I don't care whether they can tell me how multiplication works or not, only if they can multiply two arbitrary numbers. (here, the abstraction to people breaks down, because people can walk you through their reasoning - whereas for an LLM, explaining how multiplication works and performing multiplication are very different tasks)
There are weaknesses to this approach: you get a binary yes/no, and you have no idea how close they got. You can't tell if they just made a small math error or if they don't even know what numbers are. Going back to the LLM setting, this is why a continuous metric is useful, compared to one that experiences step-function behavior.
> Remember that, in this case, all I want to know is if they have that capability or not. Also, I don't care whether they can tell me how multiplication works or not, only if they can multiply two arbitrary numbers.
Right, but unless you know something about their process, the only way you can determine whether they have the ability to (correctly) multiply two arbitrary four-digit numbers is to have them demonstrate multiplying every combination of four-digit numbers. One can easily imagine a system that gets most answers correct but fails certain cases (e.g. only carrying between pairs of digits).
The reason why an inexhaustive test (using only several dozen examples) works to some degree with schoolchildren is because we know something about their method: the algorithm for multiplication they've been exposed to – that they're explicitly being taught – is an algorithm that we know to be correct.
That has a very large effect on the likelihood of different failure modes and the evidence required to have a particular degree of confidence.
Furthermore, for "system 2" activities (to borrow the term from Kahneman), we can reasonably expect a person's description of their process to match the process they actually performed. (There are exceptions: people will produced incorrect post hoc explanations for their behavior under sufficient duress when "I don't know" isn't perceived as an acceptable answer.) But I'm personally not aware of any reason to believe this about LLMs. I don't know why the network's actual process for performing multiplication should have anything to do with the text it produces after the fact when asked to explain its work.
So I'm not debating that a reasonable way to guess whether a person can (correctly) multiply two numbers is to "ask them to multiply two numbers and see if they get it right"; I'm disagreeing that this works with LLMs.
> There are exceptions: people will produce incorrect post hoc explanations for their behavior under sufficient duress when "I don't know" isn't perceived as an acceptable answer.
Did Kahneman get into this in his book?
Also, as for the point of contention, I think if you tell the LLM to show its work in mathematically formal notation, it's far more likely to be able to produce correct answers (I think there was a post on here in the last week or so demonstrating that?). I think this kind of makes the comparison to humans more fair, because inside their mind humans are doing some sort of intermediate math in their head for anything beyond trivial problems, and an LLM needs to be able to speak explicitly to compete fairly with that (I speculate/propose).
Not from what I recall; this was just my own observation. (Also, in retrospect I should have made it clearer that this exception was meant to be an example more than a particular claim... "under sufficient duress" almost makes it vacuously true.)
> Also, as for the point of contention, [...]
Yes! I've heard this reported as well, multiple times. I suppose it's sort of like using a pen and some scratch paper.
What you are talking about is metacognition, which these models are specifically built to not have. And it relies on the person you are talking to being an accurate reporter, which these models are not.
The model doesn't "intend" to multiply two numbers together: it has been equipped to parrot what a person asked to multiply two numbers together might say. If we asked it how it performs multiplication, it is going to produce a process it thinks a human would claim to use to perform multiplication, but that doesn't mean it can actually apply that process.
The claimed "emergent" property is that a model that can successfully fool humans into thinking they are talking to a person quasi-magically involves becoming "good" (for some measure of "good") at the cognitive tasks humans are capable of. This paper suggests that the measures of "goodness" researchers have been using make the gains on those cognitive tasks look more dramatic than they would be if measured via linear metrics.
I suspect some of the disconnect expressed in these comments here is based on the participatory nature of being lied to by these models. The reader is a full participant in creating meaning from the output of a model. Even when the improvement in models is linear, our willingness to suspend our disbelief is not. Especially when we want to be fooled: it has to avoid anything that would jar us out of our belief rather than proactively and repeatably succeed at cognitive tasks, and that is a non-linear measure.
> Emergence is something that is intuitively noticed by human observers.
The takeaway from this paper is actually that this says more about the human observers than about the LLMs themselves. Humans observe "emergence" using some loose metrics of their own. Whether we formalise the metrics in a quantitative way or stay in the realm of human intuition is not important; we still use some criteria to analyse things. In the former case those criteria are clear and amenable to critique; in the latter case they are obscure and can evade critique by easily moving goalposts and shifting ground.
At the end of the day, whether we talk about quantification or just qualitative observations, it is the same phenomenon we observe, with the same qualities. The problem is that human intuition uses a lot of discontinuous metrics: we judge whether an LLM passes or fails what we ask of it, but it is much harder to judge the underlying token-level process itself. For this reason, and considering the findings of this paper, the observations and claims of emergence in LLMs carry less weight now, imo.
Calling human intuition "just another metric" ignores the fact that human intuition performs spectacularly well at many high-level tasks.
Whatever "insight" and "understanding" actually mean, there is no denying that they are immensely useful, which is why we want AIs that can replicate them.
When trying to understand complex systems that don't yield to quantitative analysis in an obvious way, the starting point should be to assume that the intuitive evaluation is (roughly) correct, until proven otherwise. Trying to cast this intuition into a simple metric and then using that metric (or other simple metrics) to demonstrate that the intuition is wrong is circular reasoning.
> human intuition performs spectacularly well at many high-level tasks.
It's also the number one most common cognitive bias. Humans are especially prone to reification - the confusion that the construction of a measure equates to an objective reality.
Humans often launder subjectivity through the creation of metrics, 1) without knowing they've done so, and 2) becoming emotional when accused of having done so.
I agree that perhaps the metrics are not as useful themselves, but I think you're giving too little credit to the paper where maybe some credit is due.
I think the paper is correct that there are no "emergent abilities", i.e. abilities that suddenly appear when the scale of the model is increased. And though it might not be fully accurate, the paper did make some effort to formalize this, and I think it is a good attempt to prove the point.
However, as we recognize, there are still some weird discontinuities where at one point the model is useless and then suddenly becomes very useful. This "discontinuity" is IMHO probably just perceptual, while the underlying metric is continuous.
> Emergence is something that is intuitively noticed by human observers.
The problem, of course, is that people's intuition is particularly awful for this sort of thing. We have a very strong tendency to anthropomorphize everything, and that illusion can be quite overpowering.
There are many fundamental qualities that can be observed but not quantified. General intelligence being one of them (intelligence tests measure some aspects of it, but not others).
If everything that is real could be quantified, we wouldn't need AI. Traditional computing is already absolutely phenomenal at dealing with quantifiable systems. The whole point of wanting "artificial intelligence" is because we don't know how to quantify the high-level properties of speech, thought, intellect, and consciousness. And not for lack of trying.
Lots of things are intuitively noticed by human observers, only some of which exist. We are heuristic pattern-finding machines who believe in ESP and fairies: why wouldn't we also find mythology in the machines we build?
If by "don't believe in science" you mean "don't believe that every metric claimed to be representing a phenomenon actually does represent that phenomenon", you are correct.
The title of this paper is misleading. They are not arguing that the abilities are a mirage. They are arguing that the sudden ("emergent") appearance of unexpected abilities is not actually sudden, but gradual and predictable with model scale, if measured in an improved way.
It does mean that certain properties not found in the constituents are present in the greater system (gliders in the Game of Life can't do addition, but can be used to build an adder that counts values).
The mistake is attributing these abilities to the model itself, and not to the content being modeled.
Text contains more data than language. Large Language Models work implicitly: they are not limited to finding language-specific patterns in the text that they model.
Humans look at LLMs through a lens of expectation. Any time we find a feature we did not expect, we categorize it after-the-fact. That's our biggest mistake: LLMs are not made of categories!
I've always taken emergence as just a word from the perspective of the beholder. It isn't anything essential to the thing itself. If you understand a complex system enough, emergence goes away and it's reductive again. But that's not to say that emergence as a concept isn't useful. It's very much about our relationship to our discoveries and how much we understand them.
So if I'm reading this halfway correctly, quality isn't suddenly emergent, it's continuous and gradual based on size of the model. It only appears emergent when researchers pick bad metrics.
I, and I assume a lot of people, already thought performance was a function of model size (# of parameters). Is this not the prevailing thought for DNN performance?
Agreed with the other posters that this title is misleading.
> I, and I assume a lot of people, already thought performance was a function of model size (# of parameters).
I guess the disagreement has been in whether this function is "continuous" or not.
I do not think the title is misleading, considering the article answers to quite specific claims in other articles. I agree it sounds misleading if you do not put it in that context.
I think most people expect(ed) that performance vs. size is (or would be) an s-curve. The surprise for most is that we have climbed up the slope so far and so fast. What the shape actually is, though, is not clear to me.
As others have said, it's an awful title. Could instead be something like "Is the emergence aspect of Emergent Abilities in Large Language Models a Mirage?".
Like, there's supposed to be nothing academic researchers like more than re-using the same word in a title, or making it into a clever pun or quip -- it's like the Dad Jokiest subfield -- but instead we get a title that implies one common argument people make, and actually delivers an unexpectedly different argument that seems plausible but not necessarily interesting.
This is interesting. There's another implication here. That reliability/usefulness is an "emergent" phenomenon as underlying abilities become more accurate.
It's the difference between siri not understanding you 1 word out of 10 (very accurate!), and it basically just understanding you. It's a continuous accuracy function and a discontinuous usefulness function.
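To put toy numbers on that: if "it basically just understands you" means getting every word of a 10-word request right, a smoothly improving per-word accuracy turns into a sharp-feeling usefulness curve.

    # made-up numbers: whole-request success is roughly per_word ** 10
    for per_word in (0.80, 0.90, 0.95, 0.99, 0.999):
        print(f"per-word accuracy {per_word:.3f} -> whole request {per_word ** 10:.2f}")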
OK, the system improves with scale. For some metrics which have thresholds of success, that looks like a discontinuity. But the discontinuity comes from the metric, not the improvement.
Anything measured by "winning" has this property. Small changes near the "winning" threshold result in large changes in wins. This is well known in sports.
Is there more to this issue than this amplification effect?
I had someone much more knowledgeable on this topic than myself claim ChatGPT and the like "understand" stuff. My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything; the hallucination is when their approximation of what a real response would be falls short.
These emergent abilities are not actually that, but a result of humans' poor understanding of cognition and communication.
What concerns me very much is how greatly the harms that can be caused by LLMs have been underreported.
I imagine someone with the right power and access telling ChatGPT "find all people who would vote against this candidate, in real time, and devise ad content, social media messaging, and bot interactions to change their minds or discourage them from voting." Heck, any intel org of a major country is probably already working on this. No more whistleblowing or posting anonymously on social media; companies could even share models built on your private email and conversations so that other companies could use LLMs to identify everything you posted elsewhere, and have LLMs assign a score for how hireable you are. Police can crack down on crime better, but also crack down on dissent or any police reforms.
And we aren't even talking about war time use of LLMs or what happens when you marry something like ChatGPT with Dall-E and make it all real-time.
I am warning anyone who will listen. Smartphones are the most dangerous things out there. Any service or interaction that depends on them is detrimental to the peace and liberty of the masses long term. People have not learned a thing from Snowden or the 2016 elections.
And why are all the smart journalists asleep on the job on this topic? Where are the unreasonable scaremongers when you need them!
> My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything, the hallucination is when their approximation of what a real response would be falls short.
This doesn't seem to make sense. If anything the opposite is true - if the things that are hallucinated make sense (even if not true) it means there is some "understanding" or a world model.
No, hallucinations are similar to, but not quite, the expected result, which shows they are approximations.
For example, you tell Dall-E to draw a picture of a man smoking a pipe, but then it draws the pipe coming out of the man's butt, and instead of a head the man has a leg on his neck. This is approximation. Now, if the pipe looked wrong, or it was a trumpet, or the guy's head looked weird but was still a head, then maybe it knows what a head is, and knows that the pipe goes into the mouth so it will be somewhere near the face.
Understanding is a 3-year-old child drawing terribly; approximation is drawing really well but all wrong.
All this applies to LLMs; I used Dall-E because pictures are easier to talk about.
I haven't used Dall-E much, but I've never seen Stable Diffusion or Midjourney make an error like that, unless of course deliberately prompted.
You can see this because of the big deal people made about image-gen tools getting hands wrong: it was the most significant error that was systematically occurring.
There is just something about the mistakes. Let me put it this way: Stable Diffusion, for example, nails hands and faces most of the time, so why would they ever be an issue at any point if it understood what hands and faces were? If I didn't understand what a hand was or how to draw it, I would never get it right. But if I do get it right quite often because I understand what the object is, then it makes no sense for me to have a significant error rate where hands and faces are deformed.
An artist who knows how to draw hands and faces will never make that mistake, especially when self-correcting is so easy.
The only explanation is that it is approximating based on what it learned from the large swath of training data.
Kind of like a human remembering answers to a multiplication table by memorizing every input, output, and trend, as opposed to knowing how to process the data within the rules of math and generate output. LLMs imo don't understand the rules of language and context; they only approximate what a rule-compliant system should output.
As has been pointed out ad nauseam, beginner human artists find hands very hard to draw too - and newer models aren't really making hand mistakes often.
I've never seen significant or systematic errors with faces.
It sounds a little bit like you haven't actually tried these tools. The kinds of errors you seem to think they make just aren't there in practice. I'd encourage you to try them out!
> LLMs imo don't understand the rules of language and context they only approximate what a rule compliant system should output.
This is an area I know a lot about.
There are no real universal rules for English grammar. If you look at something like the Penn treebank you can see that English - as used by humans - is more exceptions than rules. The fact that LLMs outscore any rule based system merely means that our grammar rules are mostly things derived from how English is used in practice, not vice-versa.
> My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything
Alexander Grothendieck, one of the greatest mathematicians of the 20th century, answered "57" when asked in a public lecture to provide an example of a prime number. 57 = 3 * 19 is not a prime number.
According to your argument, this would imply Grothendieck did not understand what a prime number is. Which is laughable.
My argument is not that it shouldn't make a mistake, but that it cannot recognize its mistake and correct it; if it could, it would have. That mathematician, I am sure, would recognize that 57 isn't a prime number if you asked him again.
Also mathematics is not the right field to compare this to because there are rules. In languages and object recognition/synthesis it is all subjective. Understanding here means understanding context and human-subjective interpretation.
When I say "that sunset is beautiful" you understand what I mean, ML models simply approximate based on what they see other humans do or say. I am not calling the sunset beautiful because other people are, it is my own subjective interpretation.
Good point. Within organisms there are a number of survival mechanisms even in juveniles. Braitenberg hypothesized simple mechanical vehicles that evolved survival tactics. Not understanding the latter will ultimately be a fatal flaw irrespective of any insightful understanding of other problems. If Grothendieck had known he faced execution for failing to give a correct response, he would certainly have survived. I'd be interested to see AI configurations linked to a number of direct externalities (not linked to human derived/directed information) that might then determine their fate.
Yeah, a little bit. Our biology makes us unpredictable to a certain degree. In simple terms, we're allowed to have a bad or a good day. LLMs remain unchanging, and the simple mistakes they make can't be explained in the same manner, I think.
> My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything, the hallucination is when their approximation of what a real response would be falls short.
> These emergent abilities are not actually that, but a result of humans' poor understanding of cognition and communication.
My counterargument is that humans hallucinate too, and often. As just one small example, eyewitness testimony is stupefyingly unreliable. Neurological research and even basic behavioral research show our brains act as bullshit machines, constantly fabricating satisfying narratives. Not to even get into the fact that the word hallucination still has a non-AI meaning, and that dreams exist. As I see it, GPT models simply hallucinate more often and, more noticeably, in a different manner than humans. The hallucination frequency need not reach zero, only reach human-equivalent or better, and GPT-4 is already much better than GPT-3.
I agree with everything else you said fully. "True" reasoning machines or not, society will be catastrophically destabilized. Amongst the chaos I expect plenty of "normal" conventional and nuclear war to go on.
What's the difference between "hallucinating" and just getting something wrong?
Hallucinations are where you hear, see, smell, taste or feel things that appear to be real but only exist in your mind. Get medical help if you or someone else have hallucinations.
Why do we say LLMs hallucinate and why do people keep parroting the same thing "oh humans hallucinate", do we really hallucinate all the time? I can think of only one time a healthy adult hallucinates and it's not sitting around drinking a cup of tea.
The word has a slightly different meaning here (confabulation), more akin to the witness example: confidently telling what you think is true, when it turns out to be complete shit.
It’s not about humans intentionally lying or truly hallucinating like on drugs. It’s about confidently thinking they are right about something which turns out not to be true.
I mean, when I put it like this, it sounds like humanity's core business.
I think this: "It’s about confidently thinking they are right about something which turns out not to be true."
Unpopular opinion: most people who say this are actually not great listeners and don't take the time to understand people's messages, so to them it sounds like "everyone is wrong or dumb". I find, nearly always, that if I inquire more deeply, most people's views or opinions are perfectly valid, or they at least have good reasons for believing something I would otherwise dismiss as false. So it's not that they're necessarily hallucinating; they've just got good reasons for having an alternate take on reality.
For an LLM, the truth is just an approximation of what it's been fed. I think for living creatures with past experience of their own, the ability to understand reality is more complex and nuanced and of course, includes their unique experiences.
My observation has been that nearly all intellectual views, opinions, and "facts" are in some ways approximations. Wrong and waiting to be revised at some stage.
> I had someone much more knowledgeable on this topic than myself claim ChatGPT and the like "understand" stuff. My standing argument is that they wouldn't hallucinate incorrect responses if they understood anything; the hallucination is when their approximation of what a real response would be falls short.
ChatGPT models semantic relationships in the data. That's what your smart buddy means by "understanding". That is a high dimensional model of the data set which infers abstract semantic relationships / concepts. But he would not claim that those semantic relationships are exactly the same as any human interpretation of the data (which you refer to as the "real" response).
Language models also have limited reasoning abilities. They are capable of misunderstanding as much as understanding.
When I say understanding, I mean the meaning and context around it. It is one thing to know how to respond to different inputs surrounding a context; it is another to understand why!
Now, for all the disagreements in this thread, I have not seen anyone claim it knows and reasons about why things are the way they are. I don't disagree that it knows what to do for different inputs, but it is not processing the input and making a reasoned decision; it is more like mapping input to the most approximate output. Is that incorrect of me to say?
> When I say understanding I mean the meaning and context around it.
Yes it is actually modelling the meaning and context around it. ChatGPT is a ~100 layer deep neural network that was trained specifically to solve "natural language understanding tasks". The term "understanding" is an important term in NLP research along with the concept of "semantic similarity".
I believe it is more powerful and ultimately (eventually) more dangerous than you fear it is.
Based on what I have seen, it is making a really good approximation. Here is a thought problem: if I memorized every single result of a multiplication between two numbers up to the maximum a human can possibly think of, do I then understand multiplication?
My point about mistakes is not that they were made, but that the way they were made indicates it was attempting to approximate. Someone mentioned a famous mathematician who got basic arithmetic wrong; now, if you are wrong because you missed some steps in the process, that can be proven. But if your mistake is because you guessed, that can also be proven, by showing that the defects in your answer were arbitrary.
Pretend you're grading a student: you can tell when they guessed and got it wrong, as opposed to when they tried to follow the process but misunderstood something or made a critical error.
> if I memorized every single result of a multiplication between two numbers up to the maximum a human can possibly think of, do I then understand multiplication?
No, ChatGPT is not simply regurgitating. It actually understands the data. It is one of the largest deep neural networks ever constructed.
Consider AlphaZero, the neural net that learned how to play Go and became the world champion. It started with nothing but the rules of the game. Deep Learning systems build a generalized model and reason about it. They are not simply regurgitating what they have already seen in their training set.
But language itself does, it's not the model, it's the data that has this ability. It's in the language patterns. Humans use that too - 99.99% of what reasoning we do is just replaying older ideas adapted in context. Being truly original and improving on the best ideas is a rare thing.
This paper is misusing the term "reasoning" in my opinion.
At no point does the LLM know that 5+6 = 11, and if asked to solve a problem in which 5+6 was an implicit component of the solution but not explicitly present in the text, it would be completely lost.
What is a reasoning task we could give an LLM that would demonstrate that it actually is not reasoning? It seems like that should be easy to construct as a very simple task outside its training set would fail utterly, but I have yet to witness one.
1a. generate two numbers using: (random() % BIGNUM)
1b. ask the LLM to multiply them together
Any human who has learned multiplication can do this. AFAIU, LLMs cannot unless the computation exists within the training set. They have zero arithmetic reasoning capability.
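Something like this, say (a minimal sketch; `ask_llm` is a placeholder for whatever model API you're calling):

    import random

    BIGNUM = 10**9

    def make_probe():
        a, b = random.randrange(BIGNUM), random.randrange(BIGNUM)
        return f"What is {a} * {b}? Output only the answer.", a * b

    def probe_accuracy(ask_llm, n=20):
        # ask_llm(prompt) -> str is a stand-in for the actual model call
        correct = 0
        for _ in range(n):
            prompt, truth = make_probe()
            correct += ask_llm(prompt).strip() == str(truth)
        return correct / n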
I just asked ChatGPT: "19191920 multipled by 10292111772"
It said:
--------------
To multiply 19191920 by 10292111772, you can use the standard long
multiplication method as follows:
19191920
x 10292111772
-------------
19191920000 (the product of 19191920 and 1)
153535360000 (the product of 19191920 and 8)
1535353600000 (the product of 19191920 and 2)
-------------
196837644266310240 (the final product)
Therefore, the result of multiplying 19191920 by 10292111772 is 196837644266310240.
----------------------------------
This is completely wrong. It is not doing arithmetic, and it is not capable of doing arithmetic.
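(For anyone who wants to check: Python's arbitrary-precision integers give the exact product in one line, and it does not match ChatGPT's final line.)

    claimed = 196837644266310240       # ChatGPT's "final product" above
    actual = 19191920 * 10292111772    # exact big-integer arithmetic
    print(actual, actual == claimed)   # prints the true product and False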
Many humans would not be able to solve that problem, especially those that are younger or have lower IQs, and obviously those that have not been taught multiplication. That doesn't mean they are devoid of all reasoning ability.
Furthermore, GPT-4 is able to multiply slightly shorter pairs of numbers, and experiments on smaller neural nets - such as one using a training set consisting of 20% of all combinations of two four-digit numbers - show that they quickly learn to generalize and successfully multiply any four-digit pair, even ones that aren't in the training set.
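A rough sketch of the kind of split being described (not the cited experiment's code, just to show that the held-out pairs are genuinely outside the training set):

    import random

    def in_train(a, b):
        # deterministic split putting roughly 1 in 5 four-digit pairs in the training slice
        return (a * 9973 + b) % 5 == 0

    train, held_out = [], []
    for _ in range(10000):  # sample pairs rather than enumerating all ~81 million
        a, b = random.randint(1000, 9999), random.randint(1000, 9999)
        (train if in_train(a, b) else held_out).append((f"{a} * {b} =", str(a * b)))

    # a model that merely memorized `train` cannot answer `held_out` correctly,
    # so good held-out accuracy indicates some generalization of the algorithm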
Try again, give an answer that actually works this time.
PROMPT: I live in a location with heavy clay soil, and plan to build an adobe home. Will it be well-insulated?
RESPONSE (edited and truncated): Building an adobe home in an area with heavy clay soil can be a good option for creating a well-insulated home [... ] The insulation properties of an adobe home will depend on several factors, including the thickness of the walls, the quality of the materials used, and the construction techniques used. Adobe walls tend to be thick, which can help to provide good insulation properties. The high thermal mass of adobe can also help to regulate temperature and humidity inside the house, keeping it cool in the summer and warm in the winter.
So, here it is clear that GPT is just regurgitating information contained in its training set. Why am I so confident? It is an extremely widespread misconception, repeated almost everywhere, that adobe homes are "well insulated", and almost everywhere this is written about, you will find the linkage between high thermal mass and insulation.
But this is actually completely incorrect. Adobe is a very poor insulator - a typical wall has an R factor of around 2. Ask someone who knows and you will find that "adobe is not an insulator at all" (e.g. https://www.greenhomebuilding.com/QandA/adobe/mass.htm)
So I went back to GPT to see if it "knew" about the poor R value:
--------------------
PROMPT: what is the R value of an adobe wall
RESPONSE: The R-value of an adobe wall depends on several factors, including the thickness of the wall, the density of the material, and the specific mix of materials used in the construction. Generally speaking, adobe walls have an R-value of around 0.2 to 0.5 per inch of thickness, which is relatively low compared to some other building materials.
For example, a 12-inch thick adobe wall might have an R-value of around 2.4 to 6.0, depending on the specific mix of materials used. This means that an adobe home will typically require additional insulation, such as fiberglass batts or blown-in cellulose, to achieve a higher level of thermal performance.
-------------------------------------------------
OK, so it does "know" this, and what is has generated here is factually correct, but at odds with its initial response. If it "reasoned" it would know that, given this R value, an adobe wall is pretty terrible for insulation, and would have focused on the details in this response in answering my first prompt.
Then it continues:
-----------------------------------
However, it's important to note that the insulation properties of adobe walls are not solely determined by the R-value [...]
-----------------------------------
and this final claim is completely incorrect. Insulation properties are 100% represented by R values, and anyone who actually knows this stuff would know this. It then goes on to repeat the stuff about thermal mass, which is important for how a house feels, but unrelated to its level of insulation and thus its heating requirements etc.
Now, I imagine that given all this, one could do some prompt "engineering" to get GPT to spit out something that reflects the answer that a human who actually knew and could reason about this stuff might. But I have zero doubt that what you'd actually be doing is adjusting the vocabulary to make it more likely it would base its response on e.g. the Green Building Advisor article above. I do not believe there are any prompts, or anything else, in GPT or any other LLM, that will cause it to "reason" ... hmm, let's check the R value for adobe, nope that's pretty horrible, the house will not be well insulated unless you ....
Everyone knows it has limitations. You have to work within the limitations of the model. No one has claimed that GPT is AGI. Doesn't mean it's incapable of any degree of reasoning. Yes the prompt actually matters. It was trained a specific way to solve specific tasks, and can generalize to solve tasks it has not seen before.
Try this prompt: "Taking into account the r-value of adobe, I live in a location with heavy clay soil, and plan to build an adobe home. Will it be well-insulated?"
If I knew the r-value, I wouldn't need to ask an LLM.
The sort of logic systems that were the focus of a lot of AI work before "deep learning" came along would certainly have "taken the r-value of adobe" into account (had they been exposed to such knowledge). That's because they explicitly reason about things in the world that they are trained to reason about.
Gary Marcus has been quite usefully vocal about this. We used to try to build AI systems (some still are) based on the idea that you need a world model, and you need logic and inference and relationships.
LLMs have convinced, it seems, rather a lot of people that we can just discard all that - "the system will learn the patterns all by itself".
Marcus doesn't agree, and neither do I (not that my opinion is worth much).
You have a narrow definition of reasoning. Formally and technically it is solving a symbolic reasoning task through a sequence of steps. Yes we know it's not conscious and not human reasoning.
> At no point does the LLM know that 5+6 = 11
Does it need to "know" that (by your narrow definition of "know") in order to reason about a word math problem?
> if asked to solve a problem in which 5+6 was an implicit component of the solution but not explicitly present in the text, it would be completely lost
Can you provide an example? What makes you believe it can't be trained to solve those too? That's just a higher abstraction over the language. Add more layers, more training, etc. Many humans cannot solve basic math word puzzles that this artificial neural network can already solve.
They do not "solve" word puzzles. They output text that appears to be the best response to the prompt, based on their training data. If the puzzle is solvable by doing this, then they get the answer right. If the puzzle is not solvable doing that, they are unlikely to get the answer right.
If I ask you to multiply two (largeish) numbers together, you will be able to do so, using an algorithm/process that you can apply to the multiplication of any two numbers, whether anyone has ever told you about those numbers before or not.
LLM's cannot do this. Give them a math problem that doesn't exist in their training set and they cannot solve it. This has been demonstrated many times.
> Give them a math problem that doesn't exist in their training set and they cannot solve it.
They routinely solve math problems (and other reasoning tasks) that don't exist in their training set. Examples were in that paper I linked to. This is one of the incredible emergent properties of LLMs / deep neural networks.
Try it out today on GPT-4. Make up your own math problems and go for it.
First of all, multiplying two numbers together is not "shifting the goal posts", but an absolutely basic test of any system that is claimed to be able to do mathematical reasoning. I know that LLMs are not well suited for this, and that's because they cannot do arithmetic (among other things).
So I tried a word puzzle that would also require simple multiplication:
------------------------------
PROMPT: i am going to cycle 1600 miles, with 234 miles on gravel roads. on paved roads i will ride at 1929288282 millimeters per second but on gravel I will ride at 0.00000000202 parsecs per second. How long will the journey take?
-------------------------------
Now, I have to commend GPT on its ability to understand how you solve a problem like this, though that's not really very surprising given the huge numbers of such problems that exist in written materials. It precisely broke the problem down in a way that I suppose you could call "reasoning", but I would call "copying the formula for solving puzzles like this".
And how did it do with the actual math?
----------------
0.00000000202 parsecs per second is equivalent to 7499.6103827 miles per hour (mph), which we can calculate by converting parsecs to miles (1 parsec = 3.26 light-years = 19,173,511,840,000 miles) and dividing by the number of seconds in an hour:
Utterly and completely wrong. Same issue with the millimeters per second computation.
It is completely obvious why LLMs cannot do this. They cannot perform even basic arithmetic reasoning, and even more fundamentally, the ONLY capability they have is to create likely responses to prompts. For some things, this is extraordinarily (and scarily) powerful. But it is not reasoning.
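(As a sanity check on just the unit conversion, using 1 parsec ≈ 3.086e13 km and 1 mile ≈ 1.609 km: the gravel speed works out to roughly 1.4e8 mph, nowhere near the ~7,500 mph ChatGPT reported.)

    PARSEC_KM = 3.0857e13              # kilometres in one parsec
    MILE_KM = 1.609344                 # kilometres in one mile

    parsecs_per_second = 0.00000000202
    miles_per_second = parsecs_per_second * PARSEC_KM / MILE_KM
    print(miles_per_second * 3600)     # miles per hour; about 1.4e8, not ~7,500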
You are so narrowly fixated on this one specific domain which is an edge case with LLMs.
ChatGPT was trained specifically to solve "natural language understanding tasks".
You are missing the forest for the trees and ignoring everything else it's good at solving outside of its training set.
The emergent properties of neural networks should not be so casually dismissed.
You're arguing that since it's not perfect at arithmetic reasoning, it's not capable of any degree of reasoning in any domain. That is an oversimplification and just doesn't make logical sense.
I do not believe that any part of any current or future LLM (i.e. using the same fundamental architecture) is capable of reasoning, or in fact, capable of anything other than, essentially, doing a really, really, really good job of generating the next word in a response.
I am all about emergent properties of neural networks, but I absolutely do not believe that LLMs have them, specifically because of the way they are designed.
However, as to the specifics, people who seem to believe otherwise claim that they can reason, and so that's merely one specific angle of attack: to show that they cannot reason, and that in fact, everything they do is implicitly contained in their training set.
As I've already said, what they can do is enormously powerful and in many (most?) ways entirely unexpected, so I still regard the advent of LLMs as extremely significant, both from a practical but also a scientific point of view. I think it may force a revision in the most basic aspects of understanding human speech behavior, for example.
Nevertheless, I do not believe that anybody is served by believing that these systems can do things that they cannot. I do not understand why the seemingly magic results of these systems are leading so many into a denial of what they actually do.
> I am all about emergent properties of neural networks, but I absolutely do not believe that LLMs have them, specifically because of the way they are designed.
Even when faced with evidence that contradicts your beliefs and proves that you are wrong?
LLMs are a type of neural network. What fundamentally prevents LLMs from having emergent abilities while other neural networks do have them?
How do you explain the emergent abilities that we have actually observed in LLMs?
Here is an example of an emergent ability of ChatGPT that you can try yourself right now.
Give it this prompt: "Write a short play where "Karen" (who behaves as the Karen meme) is on a romantic date with Hunter S. Thompson. They are eating at a Chinese restaurant."
It is able to script their interaction and dialog in a way that makes sense in the context of the setting and their personalities, including an absurd meme character. That is an emergent ability that it was not trained to do and is certainly not in its training set.
Try it. Add different characters. Ask it to rewrite the play using pirate metaphors. You can go deep into its "mind" and see the emergent abilities at play. Just apply some creativity and skip the boring arithmetic problems, as that's a well accepted weakness of this type of model.
What do you mean we don't have evidence? It has already been presented to you. You choose to reject it for reasons I truly don't understand. Search on Google Scholar if you want a more academic explanation. You can try it out on ChatGPT right now and see it yourself.
> The level of naivete around this stuff is quite incredible.
"Emergent properties" has a formal definition in the literature.
In machine learning we don't explicitly program the machine to understand anything. It automatically learns patterns in the data. That's the entire point of machine learning. With such a large neural network and training set obviously it's hard to predict all of its capabilities due to the sheer scale of it all. Of course we cannot predict exactly how it will model things.
Take this for example. No one programmed it to understand Go. It learned by itself and became the world champion. That's what deep learning is capable of.
Solving natural language understanding tasks requires reasoning, by definition. I think you are sticking to a narrow definition of reasoning that is not very technical. Formally there are many types of reasoning in AI.
But LLMs do not solve natural language understanding in any of the meanings that the phrase meant before LLMs. Instead, they throw a completely new technique at it that completely sidesteps the need for language understanding and what do you know? For the purposes of responding in meaningful, generally sensible ways, it works amazingly well. And that is incredibly cool. But it doesn't solve the (all) problem(s) that more historical approaches to machine language "understanding" were concerned with.
But there is no world representation inside an LLM, only text (words, letters) representations, so nothing the LLM does can be based on reasoning in a traditional sense.
I would wager that if we build an LLM based on a training data set collection, and then we rebuild it with a heavily edited version of the data set that explicitly excludes certain significant areas of human discourse, the LLM will be severely impaired in its apparent ability to "reason" about anything connected with the excluded areas. That sounds as if it ought to surprise you, since you think they are capable of reasoning beyond the training set. It wouldn't surprise me at all, since I do not believe that is what they are doing.
LLMs contain a model of human speech (really text) behavior that is almost unimaginably more complex than anything we've built before. But by itself that doesn't mean very much with respect to general reasoning ability. The fact that LLMs can convince you otherwise points, to me, to the richness of the training data in suitable responses to almost any prompt - suitable, that is, for the purpose of persuading you that there is some kind of reasoning occurring. But there is not. The fact that neither you nor I can really build a model (hah!) of what the LLM is actually doing doesn't change that.
> But LLMs do not solve natural language understanding in any of the meanings that the phrase meant before LLMs.
Are you saying that NLP as a field of research did not exist before LLMs? This is a continuation of research that has been in progress for decades.
> But there is no world representation inside an LLM, only text (words, letters) representations, so nothing the LLM does can be based on reasoning in a traditional sense.
Not true. The model has learned a representation of semantic relationships between words and concepts at multiple levels of abstraction. That is the entire point. That's what it was trained to do.
It's a vast and deep neural network with a very high dimensional representation of the data. Those semantic/meaning relations are automatically learned and encoded in the model.
> It's a vast and deep neural network with a very high dimensional representation of the data.
the data is text, so ...
It's a vast and deep neural network with a very high dimensional representation of *text*
And yes, to some extent, text represents the world in interesting ways. But not adequately, IMO.
If you were an alien seeking to understand the earth, starting with humans' textual encoding thereof might be a place to start. But its inadequacies would rapidly become evident, I claim, and you would realize that you need a "vast and deep representation" of the actual planet.
> Are you saying that NLP as a field of research did not exist before LLMs? This is a continuation of research that has been in progress for decades.
Of course I'm not saying that (the first sentence). Part of my whole point is that LLMs are to NLP as rockets are to airplanes. They're fundamentally a "rip it up and start again" approach that discards almost everything everyone knew about NLP. The results are astounding, but the connection with, yes, "traditional" NLP is tenuous.
> Part of my whole point is that LLMs are to NLP as rockets are to airplanes.
Yes it is deep learning applied to NLP. Makes the old designs obsolete
> the data is text
It is not randomly generated text. There are patterns in that text. It was trained to model the semantics or "meaning" in the text. There is a structure in the text which the machine has recognized.
It automatically learned a model of many concepts without any of those concepts being explicitly programmed into it. That's the entire point of machine learning.
> But not adequately, IMO.
It is adequate for some things and not adequate for other things.
It seems that all you are saying is that GPT is not AGI and doesn't have human level of understanding and reasoning. No one disagrees with that.
> It seems that all you are saying is that GPT is not AGI and doesn't have human level of understanding and reasoning. No one disagrees with that.
No, I'm saying that LLMs perform what is essentially an amazing, really hard to understand party trick that happens to be so good it is actually useful. They may be indicative (surprisingly) of how a lot of actual human speech production happens (or may not), and if so, that's really good to discover. That this is surprising represents a bit of a failure of our imagination, but it also leaves untouched the parts of speech production that don't use this sort of process.
Here's a prompt:
"If there is a bob outside, and a bib inside, and the bib cannot see the bob, can the bob see the bib?"
Answer:
"If the bib cannot see the bob due to being inside, it is likely that the bob, which is outside, also cannot see the bib. The reason for the bib not being able to see the bob is presumably due to some kind of barrier, such as a wall or a window, which would likely obstruct the view for both the bib and the bob."
Prompt:
"What type of reasoning did you just apply to solve that task?"
Answer:
"In solving the task, I applied abductive reasoning. Abductive reasoning involves inferring the most likely explanation for a set of observations or statements. In this case, I inferred that a barrier (e.g., a wall or window) is the most likely explanation for why the bib cannot see the bob, and thus, the bob likely cannot see the bib either."
ChatGPT solves the problem by a mechanical reasoning process, not a "party trick". It's a machine, not a person at a party. Surely you don't believe that it has seen that exact sequence of questions and replies in its training set and is simply regurgitating. You can argue about whether it's actually using "abductive reasoning" or not, but that is beside the point: it involves some mechanical reasoning process over an interpretation of the prompt. It's not simple regurgitation.
AlphaZero learned to play Go starting with nothing but the rules of the game. What is it regurgitating there?
Alright so deep learning, the state of the art of AI, is a "party trick". AlphaZero is likewise a party trick. No "true" reasoning involved.
> Like actual reasoning.
You're relying on intuition and personal beliefs of what constitutes "true" reasoning instead of formal rigorous mathematical definitions of reasoning. The general concept of reasoning includes what the language models are doing when they solve natural language understanding tasks, by definition.
So what I'm saying is, GPT "knows" what a cat is. It "knows" what an orange is. It has inferred these concepts from the data set.
Imagine approaching someone who is tripping on LSD and demanding they immediately solve a 10 digit multiplication problem, then saying "AHA! You cannot solve it, therefore you are incapable of any reasoning whatsoever!"
We are talking about reasoning in a general sense. There are many types of reasoning in AI which I'm sure you know how to look up and read about. "Traditional" is not one of the categories.
I don't agree that LLMs can't reason, but literally saying "Make up your own math problems and go for it", him doing that, and it failing really isn't moving the goalposts.
LLMs are not good at math. But math is only a subset of reasoning.
Chain-of-thought on logical inference tasks (using fake labels so we are outside the training set) shows they can do reasonably well at these.
Nevertheless, it's likely that the best approach for pure reasoning tasks will be to connect an LLM to a real inference engine (Datalog or something) and rely on the LLM to map between natural language and the inference engine's inputs and outputs. This is similar to the "System 1" and "System 2" models of human thought.
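To make that split concrete, here's a minimal sketch in Python. The `llm_extract_facts` function is hypothetical, standing in for whatever LLM call parses the prompt into structured facts; a tiny Datalog-style rule then does the actual deduction.

```python
# Minimal sketch of the "System 1 / System 2" split described above.
# llm_extract_facts is hypothetical -- it stands in for an LLM call that
# turns free text into structured triples; the rule below does the logic.

def llm_extract_facts(text):
    # Pretend the LLM parsed these facts out of the prompt.
    return {("parent", "alice", "bob"), ("parent", "bob", "carol")}

def derive_grandparents(facts):
    # Forward-chain one rule: grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
    derived = set(facts)
    for rel1, x, y1 in facts:
        for rel2, y2, z in facts:
            if rel1 == "parent" and rel2 == "parent" and y1 == y2:
                derived.add(("grandparent", x, z))
    return derived

facts = llm_extract_facts("Alice is Bob's mother. Bob is Carol's father.")
print(derive_grandparents(facts))
# The LLM would then be asked to render ("grandparent", "alice", "carol")
# back into prose; the deduction itself never touches the neural net.
```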
People are right to doubt the claims of notOpenAI and others about the capabilities of their models. The nonlinear output gains do not mean that the quest for intelligence is over. It's already hard to steer them with RL into doing math properly. It's more likely that the transformer will be only one part of a larger architecture.
Two softball players. One hits the ball an average of 230 feet on 40% of at bats. The other hits an average of 210 feet on 40% of at bats. The home run wall is 220 feet.
One is a GREAT home run hitter. The other looks, by that measure, like a poor hitter.
The issue is that the success measure is non-linear.
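Here's a toy simulation of that analogy. The normal distribution and the 15-foot spread are assumptions made up purely for illustration, and the shared 40% contact rate cancels out; the point is that a roughly 10% gap in the continuous skill becomes about a 3x gap in the thresholded metric.

```python
# Toy version of the softball analogy: a ~10% gap in the continuous skill
# (average hit distance) becomes a ~3x gap in the thresholded metric
# (clearing a 220 ft wall). The normal distribution and 15 ft spread are
# invented for illustration only.
import random

def homerun_rate(avg_ft, wall_ft=220.0, spread_ft=15.0, trials=100_000):
    clears = sum(random.gauss(avg_ft, spread_ft) > wall_ft for _ in range(trials))
    return clears / trials

for avg_ft in (210, 230):
    print(f"average {avg_ft} ft -> clears the wall on ~{homerun_rate(avg_ft):.0%} of hits")
# Prints roughly 25% vs 75%: the underlying distributions overlap heavily,
# but the pass/fail measure makes the players look qualitatively different.
```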
Any phenomenon that is not a fundamental property of reality is a mirage, or rather a fuzzy human construct on top of a conglomeration of phenomena without discrete boundaries. And even those "fundamental" properties are suspect.
I have zero technical understanding of the math or statistics, but looking at the graphs it seems suspicious that supposed jumps happen across unrelated tasks and models at the same scales--for example, in figure 1, the discontinuities are consistently in the 10^22 to 10^24 range. Obviously I'm just going by what the authors have chosen to include, but I'd expect more variation. At best I'd assume it's something about LLMs in general.
The number of data points is tiny. There's only a handful of LLMs trained from scratch in the world, and the sizes of models released in a given "generation" tend to be fairly close to each other. The field is very open, so people all over are building on top of the same shared literature. Plus I'm sure there are leaks very often, and companies then rush to train their own pet architecture to whatever parameter size the competition is about to release.
I think that's just because there are only 2-3 points between 10^22 and 10^24, which is more about the data available (and that they have just seen dramatic improvements) than the measures or models themselves.
Could that have something to do with the things I keep reading about how knowledge from, say, an LLM for generative text somehow carries over (in some way) to an LLM for image generation? I'm obviously not very knowledgeable in this area :).
I think that part of the reason conclusions about emergence are tenable is the opaque nature of transformer architectures.
For example if it was possible to train a Hidden Markov Model with billions of hidden states on a trillion tokens, you could more literally look and see what was going on.
Other than the difficulty of scaling HMMs that far, is there any good reason to believe they would not perform equally well, just without the magic?
Variable-length Markov chains would merge some states, sure, but it would still be a similar order of magnitude.
Anything more than a few tokens/words of context and you bump into 30k^4 to 30k^10 possible states, crossing the billion/trillion-state boundary, and you lose any chance of using Markov chains (see the back-of-the-envelope sketch below).
Also - but here I may be wrong - there is no way to "train" Markov chains to generalise: if a given sentence didn't appear on the internet, it won't be available as a state for the chain. In this respect they are more like a database than anything else.
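A quick back-of-the-envelope count makes the state-explosion claim concrete; the ~30,000-token vocabulary is an assumption (BPE vocabularies are in that ballpark).

```python
# Rough state counts for an order-n Markov chain over a ~30k-token vocabulary.
# The vocabulary size is an assumption; GPT-style BPE vocabularies are roughly this size.
VOCAB = 30_000

for context_len in (2, 3, 4, 10):
    print(f"{context_len:>2} tokens of context -> {VOCAB ** context_len:.1e} possible states")
# 2 tokens is already ~9e8 states, 4 tokens ~8e17, 10 tokens ~6e44, which is
# why explicit Markov chains stop being viable long before the context lengths
# LLMs actually use, and why they can't represent contexts they never saw.
```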
I'm not a mathematician, but it appears to me that "emergent" properties are being defined as those which do not appear in a minor form below a threshold.
However, many natural phenomena that are fully explainable from first principles show this property, giving rise to sigmoidal "S-curves", as shown in Figure 1.
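A small numerical sketch of that point: take a per-token accuracy that improves smoothly with scale (the curve below is entirely made up for illustration) and score the model with exact match over a 10-token answer. The smooth quantity turns into something that looks like a discontinuous jump.

```python
# Sketch: a per-token accuracy that improves smoothly (here, a made-up curve
# gaining 0.1 per decade of training compute) yields a sharp S-curve when the
# reported metric is exact-match over a 10-token answer.
import math

def per_token_accuracy(scale):
    # Invented for illustration: 0.4 at 10^18, rising linearly to 1.0 at 10^24.
    return min(1.0, 0.1 * (math.log10(scale) - 14))

ANSWER_LEN = 10  # exact-match scoring: all 10 tokens must be correct

for exponent in range(18, 25):
    p = per_token_accuracy(10.0 ** exponent)
    print(f"10^{exponent}: per-token {p:.2f}, exact-match {p ** ANSWER_LEN:.4f}")
# The per-token column climbs steadily (0.40, 0.50, ... 1.00) while the
# exact-match column sits near zero and then shoots up around 10^22-10^24:
# an "emergence-like" jump produced by a perfectly continuous underlying process.
```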
I’m sorry you find it obnoxious, but emergent phenomena are everywhere in math and science, and, as annoying as it is to you, they also happen with AI.
The quest for more generalized models boils down to studying emergent behavior because we could never prescriptively define all the parameters/behaviors/requirements necessary for such a complex outcome. We don’t even understand how the relatively easily observed interactions between neurons in our own brains result in emergent intelligence.
What’s so impressive about LLMs is they understand the semantics of some concepts so well that they can consistently produce higher quality outputs for tasks like “explain this complicated concept with a nursery song from the perspective of a pirate” than humans could, with approximately no instances of that task in their training data. That is emergent behavior and it’s a pretty big deal.
I agree that emergent behaviors are real, and important.
I am skeptical, though not entirely dismissive of the idea, that LLMs like GPT are going to produce truly emergent phenomena, such as true first-principles logical reasoning. The limitations of the underlying transformer architecture itself are, in my opinion, the problem. The first problem is that the embedding space of the transformer needs to grow much, much larger, and it's already huge. This matters because you need to model something on the order of the number of neurons in the brain. The second problem is that you're never going to train an LLM (as they're designed today) that produces a truly good 'emergent-phenomena' answer without multiple network traversals. This is because the human mind constantly and autonomously refines its thoughts.
Perhaps a good counter-argument is that emergent phenomena are fundamentally a space-time domain concept.
I am aware that things like Conway's Game of Life are a fantastic counterargument to my "the transformer architecture doesn't support it" argument. But I agree that, when it comes to machine learning, the definition of "emergent behavior" too easily gets corrupted into meaning merely "novel" rather than anything rigorous.
It gives finality to the idea that we don't understand the thing or where it came from.
Why? Did we just lose all interest in understanding things? Wasn't that the whole point in the first place?
Somehow, people are throwing up their hands and giving up at understanding the thing; yet at the same time they are acting like the thing will magically evolve into their wildest dreams!
The most fundamental feature of LLMs is that they cannot be literal. They can only infer, never define. Why is it that the people studying LLMs think they have to emulate that trait? It's like they are only allowing themselves to look at it as a mysterious black box: to infer its behavior from its results. Did they forget that they are the ones who wrote the damn thing?
The whole idea of machine learning / AI is to build functionality indirectly though, i.e. to build a system which evolves into another system over time. They are inherently meta-systems, so it does make sense to think of them differently.
Before blackbox AIs from deep learning there were basically a few different kinds of AI: one was “algorithm complicated enough that we thought it required intelligence”, another was the “general problem solver” you get by applying constraint-satisfaction techniques and heuristics, and a third was highly fine-tuned encodings of human knowledge and research (this is the decision tree clinicians use to perform a differential diagnosis of a fever, this is a function that finds edges in an image using hand-crafted CV algorithms). The first group is basically not AI; it was just assumed to be. The other two groups were fully explainable but required a ton of effort to get working outside of very tightly scoped situations. For a long time researchers thought that some combination of the two approaches would lead to more generalized models, but all attempts at morphing the two sucked ass, because all knowledge had to be a hand-crafted ontology of rules and atoms that could only explicitly encode relationships. Also, while computers can solve CSP/graph-traversal problems impossibly fast compared to humans, those tasks are not good models for human cognition or for tasks beyond stuff like crossword puzzles.
You should consider that, despite considerable effort, human brains are themselves black boxes. And you know less about your own knowledge than you think. I do not know where I learned that Timbuktu is both a placeholder name and a real place, though I could go find evidence for both. I don’t have to expend any effort to distinguish the sounds of different words, and I don’t know why two things can both taste “good” but in completely different ways. Nobody ever taught me that newly met acquaintances tend not to care to discuss current events in the business world; I just figured it out based on a collection of experiences whose individual instances I can’t even remember. Even the best neuroscientist could not tell you why neurons interacting in a certain way makes it so I can both drive a car and sing, or why one person’s brain seems to be better at some specific or generalized tasks than another’s.
And, well, deep learning overturned the paradigm of handcrafting AI systems by automating the process of “have the model produce this output from this input” without requiring humans to define the “how” beyond tuning the shape of the model, which was itself a hugely important innovation in reducing the human time required to build an AI system. But it's not just faster to make these models; deep learning is so ridiculously much better at producing models for things like “is there a dog in this picture” that nobody would even consider doing those things without it.
You actually can fiddle with DNNs to get an idea of how they work, similar to what we do with brains and CAT scans: you have the network process inputs that share some commonality and figure out which common parts get activated (a rough sketch of that kind of probing is below). This is easy to do with convolutional layers, as they very commonly learn on their own how to perform edge detection.
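For what it's worth, here's roughly what that probing looks like as a Python/PyTorch sketch using forward hooks. The tiny untrained CNN is just a stand-in: with a real trained network you'd feed images that share a feature (say, strong edges) and look for channels that consistently light up.

```python
# Sketch of the "CAT scan" style probing described above: hook a conv layer
# and record which channels activate for a given input. The untrained toy
# CNN here is a stand-in -- the interesting version uses a trained network
# and inputs chosen to share some property (e.g. strong edges).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),   # first conv layer, the one we probe
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
)

activations = {}

def save_activation(module, inputs, output):
    # Average activation per output channel, over batch and spatial dims.
    activations["conv1"] = output.detach().mean(dim=(0, 2, 3))

model[0].register_forward_hook(save_activation)

image = torch.randn(1, 3, 64, 64)  # stand-in for a batch of real images
model(image)
print(activations["conv1"])  # which of the 8 channels responded most strongly
```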
Anyway, long story short, fully explainable AI utterly sucks ass at many tasks that are a walk in the park for blackbox AI. And we cannot explain our own intelligence and knowledge except in terms of emergent phenomena, nor can we, in many cases, give the full provenance of some factoid or skill on demand (just like an LLM cannot tell you where it learned something)[0], so it seems reasonable that we’d be in the same situation with AI.
[0] The main difference is that we have memory of the various discrete experiences of our lives (which we can associate with some knowledge or skill), and there is no binary separation between “learning mode” and “doing mode” or “active memory” and “long term memory” for us like with AI. We can definitely associate some knowledge with a particular event, but this seems like it could be a false ontological representation of our knowledge because if the knowledge and event were unimportant (like what you had for breakfast on a particular day) we’d forget both of them; it’s actually all the subsequent cases in which the knowledge and memory of the event came in handy that contribute to us being able to explain it.
Most of the confusion here stems from the abuse of the word, "AI". That is the goal, and nothing else. "AI" does not (yet) exist. Every time we call something "an AI", we are telling a lie; and that lie turns the entire discussion away from logic and reason into magic and nonsense.
When we are dealing with a system that is made of logic and reason, we can use logic and reason to construct an understanding of that system. This is the explicit approach to understanding.
When we are dealing with a mysterious black box, we must take the implicit approach: using testing and inference, we build a model of the system, and then we can construct an explicit understanding of that model. This is effectively the same process, but one step removed: we understand our model, not the system it applies to. That model may be incomplete and/or misaligned.
The human mind is a mysterious black box. We have made a lot of progress modeling that system, but our models are not complete or perfectly aligned.
While our study of the human mind is limited to the implicit approach, the human mind itself is capable of both implicit and explicit understanding.
So far, no software has been able to emulate that feature. Every tech that exists today uses only one of the two approaches.
> Anyway, long story short, fully explainable AI utterly sucks ass at many tasks that are a walk in the park for blackbox AI.
What you have called "fully explainable AI" is any tech that uses the explicit approach. Because everything is explicitly defined, there is a clear place for logic to exist as part of the system. Because everything is explicitly defined, there is no room for ambiguity in the system.
What you have called "blackbox AI" is any tech that uses the implicit approach. Because nothing is literally defined, ambiguity can exist in the system. Because nothing is explicitly defined, there is no clear place for logic to exist in the system.
But is it really a black box? The program itself is explicitly defined! We should be able to use the explicit approach to understand it, just like we do any other software.