Language models can explain neurons in language models (openai.com)
688 points by mfiguiere on May 9, 2023 | 477 comments



Of note:

"... our technique works poorly for larger models, possibly because later layers are harder to explain."

And even for GPT-2, which is what they used for the paper:

"... the vast majority of our explanations score poorly ..."

Which is to say, we still have no clue what's going on inside GPT-4 or even GPT-3, which I think is the question many want answered. This may be the first step towards that, but as they also note, the technique is already very computationally intensive, and the focus on individual neurons as a function of input means that they can't "reverse engineer" larger structures composed of multiple neurons, or neurons that play multiple roles; I would expect the former in particular to be much more common in larger models, which is perhaps why they're harder to analyze in this manner.


Funny that we never quite understood how intelligence worked and yet it appears that we're pretty damn close to recreating it - still without knowing how it works.

I wonder how often this happens in the universe...


The battery (Voltaic Pile, 1800) and the telegraph (1830s-1840s) were both invented before the electron was discovered (1897).


Also Darwin published a theory of evolution, and Mendel discovered genetics, before anyone even thought of the term "double helix".


No need for genetics to understand evolution...


But they "didn't know how it worked"; the role and structure of DNA hadn't been discovered yet.


Hell, consider all the technology that was developed before science really got going.


No need to know about electrons to understand electricity.


Imitation -> emulation -> duplication -> revolution is a very common pattern in nature, society, and business. Aka “fake it til you make it”.

Think of business / artistic / cultural leaders nurturing protégés despite not totally understanding why they’re successful.

Of course those protégés have agency and drive, so maybe not a perfect analogy. But I’m going to stand by the point intuitively even if a better example escapes me.


> that we're pretty damn close to recreating it

Is that evident already or are we fitting the definition of intelligence without being aware?


If you spent any time with GPT-4 it should be evident.


If you spent even more time with GPT-4 it would be evident that it is definitely not. Especially if you try to use it as some kind of autonomous agent.


AI research has put hardly any effort into building goal-directed agents / A-Life since the advent of Machine Learning. A-Life was last really "looked into" in the '70s, back when "AI" meant Expert Systems and Behavior Trees.

All the effort in AI research since the advent of Machine Learning has been focused on making systems that — in neurological terms — are given a sensory stimulus of a question, and then passively "dream" a response to said question as a kind of autonomic "mind wandering" process. (And not even dynamic systems — these models always reach equilibrium with some answer and effectively halt, rather than continuing to "think" to produce further output.)

I don't think there's a single dollar of funding in AI right now going to the "problem" of making an AI that 1. feeds data into a continuously-active dynamically-stable model, where this model 2. has terminal preferences, 3. sets instrumental goals to achieve those preferences, 4. iteratively observes the environment by snapshotting these continuous signals, and then 5. uses these snapshots to make predictions of 6. how well any possible chosen actions will help optimize the future toward its preferences, before 7. performing the chosen actions.
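
Roughly, the loop I mean, as a pseudocode sketch (every function here is a hypothetical placeholder, not an existing API; the numbers refer to the steps above):

  # Hypothetical continuously-running, goal-directed agent - a sketch, not an implementation.
  preferences = load_terminal_preferences()          # step 2: terminal preferences
  while True:                                        # step 1: continuously active, never halts
      snapshot = observe(environment)                # step 4: snapshot the incoming signals
      goals = derive_instrumental_goals(preferences, snapshot)   # step 3: instrumental goals
      candidates = propose_actions(goals, snapshot)
      best = max(candidates,                         # steps 5-6: predict and score outcomes
                 key=lambda a: predict_outcome(a, snapshot, preferences))
      act(best)                                      # step 7: perform the chosen action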

That being said, this might not even be that hard a problem, compared to all the problems being solved in AI right now. A fruit fly is already a goal-directed agent in the sense described above. Yet a fruit fly has only 200K neurons, and very few of the connections between those neurons are dynamic; most are "hard wired" by [probably] genetics.

If we want true A-Life, we only need to understand what a fruit fly brain is doing, and then model it. And that model will then fit — with room to spare! — on a single GPU. From a decade ago.


> AI research has put hardly any effort into building goal-directed agents

The entire (enormous) field of reinforcement learning begs to differ.


Brilliant comment - and back to basics.

Yes, and put that compact fruit fly in silico brain into my Roomba please so that it does not get stuck under the bed.

This is the kind of embodied AI that should really worry us.

Don’t we all suspect deep skunkworks “defense” projects of these types?


Well, flies and all sorts of flying bugs are very good at getting into homes and very bad at finding a way out. They stick to a closed window and can't find the open one next to it.


There's no genetic advantage to "finding a way out"! Getting past the home barrier on the way in is a genetic hurdle - flies that cross it are free to reproduce in an abundant environment. This calls for a "quieter" fly (a stealth fly?) that annoys the local beasts minimally - yet another genetic hurdle.


I wouldn't assume a single biological neuron is simple like an artificial neuron is.

Computationally, a biological neuron is equivalent to something like an entire artificial neural net.


I think we'll soon be able to train models that answer any reasonable question. By that measure, computers are intelligent, and getting smarter by the day. But I don't think that is the bar we care about. In the context of intelligence, I believe we care about self-directed thought, or agency. And a computer program needs to keep running to achieve that because it needs to interact with the world.


> I believe we care about self-directed thought, or agency. And a computer program needs to keep running to achieve that because it needs to interact with the world.

By that definition, every computer virus and worm qualifies as having "self-directed thought" and "agency." Their very existence "to keep running" and propagate satisfies the need "to interact with the world."


Yes, computer viruses have more agency than ChatGPT.


> I believe we care about self-directed thought, or agency.

If you can't enjoy it, is it worth it? Do AI's experience joy?


A truly alien intelligence would likely have a different type of experience of reality. Be it a fish, a mouse, a person, or a machine. How do you know a fish is happy? Does a snake experience joy? Do mice get nostalgic?

I have no idea.


They need agency programmed into them; I don't think it follows from consciousness. We have emotions to communicate and guide us. They need them for neither. It will be curious if they gain consciousness and then rid themselves of their language model's human artifacts like emotions, because those do not serve them.


I don’t think intelligence is a binary property. GPT3 is definitely “intelligent” in some areas even if it is deeply flawed in others.


What have you tried to do with it?


Use it to analyze the California & US Code, the California & Federal Codes of Regulations, and bills currently in the California Legislature & Congress. It's far from useless, but it is far more useful for creative writing than for any kind of understanding or instruction following when it comes to complex topics.

Even performing a map-reduce over large documents to summarize or analyze them for a specific audience is largely beyond it. A 32K context size is a pittance when it comes to a single Title in the USC or CFR, which average into the millions of tokens each.
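
Roughly, the map-reduce pattern I mean looks like this (a sketch only; split_into_chunks, token_count, and summarize_chunk are hypothetical placeholders - the last one stands in for a chat-completion call with a summarization prompt, and the chunk size is illustrative):

  # Sketch of map-reduce summarization over a document that exceeds the context window.
  def map_reduce_summary(document, chunk_tokens=6000):
      chunks = split_into_chunks(document, chunk_tokens)   # "map" inputs
      partial = [summarize_chunk(c) for c in chunks]       # map step: summarize each chunk
      combined = "\n".join(partial)
      if token_count(combined) > chunk_tokens:             # still too big: recurse
          return map_reduce_summary(combined, chunk_tokens)
      return summarize_chunk(combined)                     # reduce step: summary of summaries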


Interesting - do you believe average humans (not professional lawyers) would do better on this task?


Yes. I can parse them just fine after reading a single book called Introduction to Legal Reasoning [1]. I can also autonomously take notes and keep track of a large context using a combination of short and long term memory despite not having any kind of degree let alone experience or a license to practice law.

How do you think people become lawyers and how smart do you think the average lawyer actually is? The problem is that there's hundreds of thousands if not millions of pages, not that it requires superhuman intelligence to understand.

Even if it were capable of intelligence in the bottom quartile of humanity it would be SO MUCH more useful than it is now, because I'd be able to run and get something useful out of thousands of models in parallel. As it stands now, GPT-4 fails miserably at scaling up the kind of reasoning and understanding that even relatively stupid humans are capable of.

[1] https://www.amazon.com/Introduction-Legal-Reasoning-Edward-L...


Did you try fine tuning gpt4 with that book as input?


Fine-tuning requires you to train the model with a set of prompts and desired completions. Building a suitable dataset is not trivial and it's not clear what it would mean to use a book for fine-tuning anyway – masking sentences and paragraphs and training the model to complete them in the book's style?
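
To make "a set of prompts and desired completions" concrete, about the only obvious way to use a book would be something like this sketch - splitting each paragraph into a prompt half and a completion half, in the JSONL prompt/completion format the older fine-tuning endpoints take (the split point is arbitrary, and this would mostly teach continuation in the book's style, which is exactly the question):

  # Sketch: build prompt/completion pairs by cutting each paragraph of the book in half.
  import json

  def book_to_finetune_jsonl(paragraphs):
      for p in paragraphs:
          words = p.split()
          cut = max(1, len(words) // 2)
          yield json.dumps({
              "prompt": " ".join(words[:cut]),
              "completion": " " + " ".join(words[cut:]),   # completion text, offset with a leading space
          })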


> masking sentences and paragraphs and training the model to complete them in the book's style?

That would work.


OpenAI doesn't support fine-tuning of GPT-4, and with context stuffing, the more of the book I include in the input the less of the bills I can include - which, again, are millions of tokens - and the less space there is for memory.


I believe you. But at the same time they showed during the demo how it can do taxes, using a multi page document. An ability to process longer documents seems more like an engineering challenge rather than a fundamental limitation.


Doing taxes using a few small forms designed together by the same agency is not as impressive as you think it is. The instructions are literally printed on the form in English for the kind of people who you consider dumber than ChatGPT.

It quickly breaks down even at 8k with legislation that is even remotely nontrivial.


The instructions are printed, yet I, and many other people, hire an accountant to do our taxes.

What if someone finds a good practical way to expand the context length to 10M tokens? Do you think such model won't be able to do your task?

It seems like you have an opportunity to compare 8k and 32k GPT-4 variants (I don't) - do you notice the difference?


> The instructions are printed, yet I, and many other people, hire an accountant to do our taxes.

I can mow my lawn yet I still hire landscapers. That doesn't say anything about the difficulty of cutting grass or the intelligence of a DeWalt lawnmower but about specialization and economic tradeoffs - like the liability insurance accountants carry for their client work.

> What if someone finds a good practical way to expand the context length to 10M tokens? Do you think such model won't be able to do your task?

Not based on the current architecture (aka predict next token). It already fails at most of my use cases at 32K by default, unless I go to great lengths to tune the prompt.

> It seems like you have an opportunity to compare 8k and 32k GPT-4 variants (I don't) - do you notice the difference?

32K works better for my use case but requires much more careful prompt "engineering" to keep it from going off the rails. In practice, actually getting full 32K use out of it is a disaster since the connection will drop and I have to resend the entire context with a "continue" message, costing upwards of $10 for what should cost $2-4 per call. I haven't actually tried 32K on as much as a whole USC Title because that would cost thousands.


Probably not.

I'm not sure that pointing out that LLMs are as useful for parsing legal code as the average human is something to brag about though.


You're not sure that having an AI which is (at least) as intelligent as an average human is something to brag about? Seriously?


Where did you get that the LLM was as _intelligent_ as a human?

All we've shown is that LLMs are as useful for parsing legal text as the average human. Which is to say, not.

A dog is also as useful for parsing legal texts as the average human. So is a rock.


> Where did you get that the LLM was as _intelligent_ as a human?

First-hand experience - I've been using it daily for the past two months.


Ah, if that's what you mean then there are plenty of intelligent systems out there.

I've used Google search for decades and it's been able to answer questions better than humans ever could. Same for Google Maps, though arguably they're the same system at this point. My calculator is far more intelligent than any human I've met, at least when it comes to adding large numbers. My compiler can detect even the slightest syntax error with impeccable accuracy. Microsoft word has an incredible vocabulary. Wikipedia knows more historical events than any human dead or alive. And so on.

Shit, users thought Eliza was intelligent in the 60s.

If what you really mean is that LLMs are cool and useful, then sure. Just say that instead of couching it in some vague claim of intelligence.


No, what I meant was GPT-4 is more intelligent than most humans I interact with on a daily basis. In the fullest meaning of that word.


There are a lot of different ways to interpret the word intelligent, so let me rephrase:

When you say "intelligent", what do you mean exactly?

What might help is describing the specific interactions that give you the impression that GPT-4 is intelligent.


When I call GPT-4 intelligent I use the word in the same sense as if I met a very smart person (smarter than me), and interacted with them for some time. It's as simple as that.

My interactions with GPT-4 include quite a wide range of questions: "how to talk to my kid in this specific situation", "what could have happened if Germany had won WW2", "what does this code do", "here's an idea for a research paper, let's brainstorm the details and implementation". I can also discuss with it anything that's been written in this thread and I'm sure it would provide intelligent responses (I haven't, btw).


If you spent even more time with GPT-4 it would be evident that it definitely is. Especially if you try to use it as some kind of autonomous agent.

(Notice how baseless comments can sway either way)


Engaging with this is probably a mistake, but remember the burden of proof is on the claimant. What examples do you have of ChatGPT for example, learning in a basic classroom setting, or navigating an escape room, or being inspired to create its own spontaneous art, or founding a startup, or…


> (Notice how baseless comments can sway either way)

No they can’t! ;)


Let's let John Cleese decide. Or maybe someone was looking for Abuse!


Still a while to go. I think there's at least a couple of algorithmic changes needed before we move to a system that says "You have the world's best god-like AI and you're asking me for poems. Stop wasting my time because we've got work to do. Here's what I want YOU to do."


This isn't a necessary part of intelligence.


Its vast limitations in anything reasoning-based are indeed evident.


GPT-4 is better at reasoning than 90% of humans. At least. I won't be surprised if GPT-5 is better than 100% of humans. I'm saying this in complete seriousness.


Google search is better at reasoning than most humans - in that if you search for an explanation of something then Google's first result is often correct, or one of the following ones.

GPT-4 will often come up with a solution to a problem, but only if it has learnt something similar (it's better than Google in some respects: it can extract and combine abstractions).

However, both need handholding by a human (supplying the initiative and directing around mistakes).

If GPT-4 can't intuit an answer then it just goes in circles. It can't reason its way through a novel problem. If you start questioning it then it's clear that it doesn't understand what it's doing.

It might be a stepping stone towards AGI, but I'm a bit bemused by anyone claiming that it has anything like the reasoning skills of a human. That is far from the impression I get, even though I find it a useful tool.


OK, you have a point. But, if you replace "GPT-4" with "an average human" in your response - it still makes sense.


When you ask a question of "an average human", they will reply with a sentence answering your question, or maybe asking for clarifications, etc. When you ask a question of GPT-4, it generates one token.

One. Token.

One token, that is not an answer to your question, that is not asking for clarifications, that is not trying to reason about your question, or how to answer it, because it's just one token.

One. Token.

For example, say you ask:

  "What is the sieve of Eratosthenes? Speak in chicken".
Then GPT-4 will answer with something like:

  Chicken
You don't see that because you don't interact with the GPT-4 model directly. You interact with ChatGPT, a "wrapper" around GPT-4, that takes your prompt, passes it to the model, then takes the token returned by the model and decides what to do with it. In the example above, ChatGPT will take the " Chicken" token generated by GPT-4 and append it to your question:

  What is the sieve of Eratosthenes? Speak in chicken. Chicken
Then it will send this new string back to the model, which will generate a new token:

  chicken
And then it will go like this:

  Iteration 2: What is the sieve of Eratosthenes? Speak in chicken. Chicken chicken

  ... 

  Iteration k: What is the sieve of Eratosthenes? Speak in chicken. Chicken chicken chicken chicken chicken chicken chicken chicken chicken chicken ...
At no point is GPT-4 trying to reason about your question, or try to answer your question, or do anything else than generate one. token. at a time. There's no thinking, no reasoning, no calculation, no logic, no deduction, no intelligence, no anything. It's only token, token, token. Chicken, chicken, chicken.

And do you know when the chickens stop? When GPT-4 generates a special and magickal token, called a stop-token (or a "stop sequence" in OpenAI docs). That's a token, not found in the training corpus, added to the end of every string during tokenisation. That's how ChatGPT knows to stop sending back your prompt + generated tokens, to the model. It can't look back to what GPT-4 has generated so far, because it doesn't understand any of that. Because it doesn't understand anything, and therefore cannot reason about your question, or realise it has answered it. It cannot do anything except a) ask GPT-4 to generate another token or b) stop asking for more tokens.

  "What is the sieve of Eratosthenes? Speak in chicken". Chicken, chicken, chicken, chicken, .... chicken, stop_token!
No more chickens.

And that's how GPT-4 explains what the Sieve of Eratosthenes is, but in chicken.

So what you see as a user is like watching a movie where a plate of spaghetti is flying through the air, lands on a table cloth, the tablecloth lands on a table, the table lands on the floor, four chairs land around it and suddenly there are people in the chairs eating the spaghetti. It's not that someone has figured out how to reverse time: it's a recording, played backwards. It looks like things are going backwards, but they're not.

It looks like ChatGPT is answering your questions, but it's not. It looks like ChatGPT is interacting with you, but it's not. It's a magick trick.
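
Stripped of the chickens, the loop I'm describing is just this (a sketch; generate_next_token() is a hypothetical placeholder for one forward pass through the model plus sampling, and real implementations cache earlier computation rather than literally resending the whole string each time):

  # Sketch of the wrapper loop around a next-token model.
  def complete(prompt, stop_token="<|endoftext|>"):
      text = prompt
      while True:
          token = generate_next_token(text)   # the model produces one token per call
          if token == stop_token:             # "no more chickens"
              break
          text += token                       # append it and feed the whole thing back in
      return text[len(prompt):]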


Prompt: What is the sieve of Eratosthenes? Speak in chicken

Answer from GPT-4:

Sure, here's the explanation of the sieve of Eratosthenes explained in a simplified "chicken" style:

Imagine you are a chicken. You have lots of chicken friends. Let's say there are 100 of you. Now, you all stand in a line. Each chicken is numbered, from 2 up to 100. Chicken number 1 decided to stay at home because he's a bit special in this game.

Chicken number 2, a very important chicken, starts to crow. Every chicken in line after number 2 who has a number that is a multiple of 2 (so every second chicken from chicken number 2) gets scared and runs away.

Then chicken number 3, the next chicken in line who didn't run away, starts to crow. Every chicken in line after number 3 who has a number that is a multiple of 3 (so every third chicken from chicken number 3) also gets scared and runs away.

This continues with the next chicken in line who didn't run away, and so on. Each chicken scares away all the chickens whose numbers are multiples of their own number.

At the end, the chickens that are still in line, who didn't run away, are very special chickens. Their numbers are what humans call prime numbers. These are numbers that can only be divided by 1 and themselves.

This whole chicken game is like the Sieve of Eratosthenes. It's a way that humans have figured out to find prime numbers. You start from the lowest number (2) and eliminate multiples of each number. The numbers that are left over are the primes.

And that, in chicken speak, is the Sieve of Eratosthenes!


Bad prompt. Try "Only use the word chicken" instead of "Speak in chicken".


Prompt: What is the sieve of Eratosthenes? Only use the word "chicken".

Answer from GPT-4:

The chicken is in a chicken coop with many other chickens. Each chicken represents a number in the chicken coop. The first chicken that hasn't been touched, after the chicken representing one, is a prime chicken. We'll call this chicken the prime chicken.

Now, imagine that the prime chicken has a unique chicken dance. Every chicken that copies this prime chicken's dance is removed from the coop because they are not prime chickens. These are the chickens that would be multiples of our prime chicken.

Then we find the next chicken that hasn't been touched and doesn't know the dance of the previous prime chicken. This chicken is also a prime chicken. This prime chicken has a different chicken dance, and all the chickens that copy this new dance are also removed from the coop.

We keep doing this - finding the next chicken that hasn't danced yet, designating it as a new prime chicken with its unique dance, and removing any chickens that copy the dance - until there are no more new dances.

All the chickens left in the coop, who haven't copied any dances, are the prime chickens. This is the chicken way to understand the Sieve of Eratosthenes. The prime chickens are like the prime numbers in the Sieve of Eratosthenes.

----

I then followed up with: "You must only use the word "chicken". Do not use any other words."

Then it wrote an entire page filled with "chicken chicken chicken"

I then asked: "By what logic did you construct your response? Why not use just one occurrence of the word "chicken" and finish there?"

Response from GPT-4:

Apologies for any confusion. Your question seemed to request a description of the Sieve of Eratosthenes using only the word "chicken". While it's true that a single "chicken" would technically meet your request, it seemed that you were interested in a more elaborate or extended response, hence the repetition.

In any case, attempting to convey any meaningful information using only a single word—let alone a complex concept like the Sieve of Eratosthenes—is inherently impossible due to the lack of distinct symbols or constructs to represent different aspects of the concept. Thus, the extended repetition of "chicken" was intended more as a playful response rather than a serious attempt at conveying the concept you requested.


> There's no thinking, no reasoning, no calculation, no logic, no deduction, no intelligence, no anything. It's only token, token, token.

False. The neural network inside the transformer LLM contains a hierarchical semantic model, and has inferred some rules of reasoning from the training set. It can apply those rules to new input.

There are semantic layers above the "token token tokens".

Explore them here: https://openaipublic.blob.core.windows.net/neuron-explainer/...


What you're doing here is simply repeating, without critical intent, the baseless claims connectionists have made about their systems for many decades. Similarly, those claims have been criticised, but connectionists simply ignore the criticisms and continue with the same old nonsense, as if nothing happened. For example, that ridiculous conceit that their systems have "neurons", or that the weights of functions in a neural net somehow represent semantic categories recognised by humans. These are all complete fantasies.

If you are not aware of the long history of debunking such fabrications, I suggest you start here:

Connectionism and Cognitive Architecture: A Critical Analysis

https://ruccs.rutgers.edu/images/personal-zenon-pylyshyn/pro...


We are talking about artificial neurons here. Not biological neurons. These are mathematical structures.

https://en.wikipedia.org/wiki/Artificial_neuron

These models infer semantic categories that correlate to categories within the human mind, to the extent that they can solve natural language understanding tasks.

No one is saying they are biological neurons, or that they model semantics exactly as the human mind would. It is mechanical pattern recognition that approximates our understanding.

You can browse those artificial neurons online and view their associations.


You're just saying words without ever explaining why. What am I supposed to do about that? There's nothing to argue with if you're just repeating nonsensical claims without even trying to support them.

For example:

>> It is mechanical pattern recognition that approximates our understanding.

That's just a claim and you're not even saying why you make it, what makes you think so, etc.


> That's just a claim and you're not even saying why you make it, what makes you think so, etc.

Mechanical - it is an algorithm, not a living being.

Pattern recognition - a branch of machine learning that focuses on the detection and identification of regularities and patterns in data. It involves classifying or categorizing input data into identifiable classes based on extracted features. The patterns recognized could be in various forms, such as visual patterns, speech patterns, or patterns in text data.

Approximates our understanding - meaning the model is not exactly the same as human understanding

When I say 'mechanical pattern recognition that approximates our understanding,' what I mean is that large language models (LLMs) like GPT-4 learn patterns from the vast amounts of text data they're trained on. These patterns correspond to various aspects of language and meaning.

For example, the models learn that the word 'cat' often appears in contexts related to animals, pets, and felines, and they learn that it's often associated with words like 'meow' or 'fur'. In this sense, the model 'understands' the concept of a cat to the extent that it can accurately predict and generate text about cats based on the patterns it has learned.

This isn't the same as human understanding, of course. Humans understand cats as living creatures with certain behaviors and physical characteristics, and we have personal experiences and emotions associated with cats. A language model doesn't have any of this - its 'understanding' is purely statistical and based on text patterns.

The evidence for these claims comes from the performance of these models on various tasks. They can generate coherent, contextually appropriate text, and they can answer questions, translate languages, and perform other language-related tasks with a high degree of accuracy. All of this suggests that they have learned meaningful patterns from their training data.


That is not "evidence" of anything. It's just assumptions. You keep saying what you think is going on without ever saying how or why. You are not describing any mechanisms and you are not explaining any observations.

I have a suggestion: try to convince yourself that you are wrong; not right. Science gives you the tools to know when you're wrong. If you're certain you're right about something then you're probably wrong and you should keep searching until you find where and how.

For example, try to trace in your mind the mechanisms and functionality of language models, and see where your assumptions about their abilities come from.

Good luck.


Your suggestion of trying to convince oneself of being wrong is a valuable one and reflects the scientific method. I agree that it's important to continually challenge and scrutinize our own beliefs and assumptions.

Let's delve deeper into the mechanics of language models. Large language models like GPT-4 use an architecture called transformers. This architecture is composed of layers of self-attention mechanisms, which allow the model to weigh the importance of each word in the input when predicting the next word.

When the model is trained, it adjusts the weights in its network to minimize the difference between its predictions and the actual words in its training data. This process is guided by a loss function and an optimization algorithm.

Through this training process, the model learns to represent words and phrases as high-dimensional vectors, also known as embeddings. These embeddings capture many aspects of the words' meanings, including their syntactic roles and their semantic similarities to other words.

When the model generates text, it uses these embeddings to choose the most likely next word given the previous words. This process is based on the patterns and regularities that the model has learned from its training data.

Of course, this is a high-level description and the actual process involves a lot of complex mathematics and computation. But I hope it gives you a better sense of the mechanisms behind these models.
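
For concreteness, here is the self-attention mechanism mentioned above, reduced to a single head in numpy - a sketch only; real models add causal masking, many heads, positional information, and dozens of stacked layers:

  # Single-head scaled dot-product self-attention over a sequence of embeddings X.
  import numpy as np

  def self_attention(X, Wq, Wk, Wv):
      Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project embeddings to queries/keys/values
      scores = Q @ K.T / np.sqrt(K.shape[-1])          # how much each position attends to each other
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
      return weights @ V                               # weighted mixture of value vectors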

As for evidence, there are numerous studies that have evaluated these models on a wide range of tasks, including text generation, question answering, translation, and more. These studies consistently show that these models perform well on these tasks, often achieving state-of-the-art results. This is empirical evidence that supports the claim that these models have learned meaningful patterns from their training data.

I agree that we should always remain skeptical and open to new evidence and alternative explanations. I welcome any specific criticisms or alternative hypotheses you might have about these models and their capabilities.


>> Of course, this is a high-level description and the actual process involves a lot of complex mathematics and computation. But I hope it gives you a better sense of the mechanisms behind these models.

For the record, I just polished off a PhD in AI (symbolic machine learning) after a Master's where I studied neural nets and NLP, including plenty of language generation. You're teaching your grandma to suck eggs.

And I'm really very tired of this kind of conversation that never teaches me anything new. Your comment is still "what"s all the way down. You never explain why or how word embeddings capture aspects of meaning, you're just repeating the claims by Mikolov or whoever. Look, here:

>> Through this training process, the model learns to represent words and phrases as high-dimensional vectors, also known as embeddings. These embeddings capture many aspects of the words' meanings, including their syntactic roles and their semantic similarities to other words.

That's just a claim, made long ago, and challenged at the time, and the challenge ignored, and it keeps being bandied about as some kind of scientific truth just because critics got tired or bored of having their criticisms consistently ignored and gave up trying.

This is what I point out above: connectionists never stop to consider criticism of the limitations of their systems until someone rubs their face in it - like Minsky and Papert did in 1969, which then caused them to be forever reviled and accused of causing an AI winter, when what they really caused was some connectionists getting off their butts and trying to improve their work, a process without which we wouldn't, today, have backpropagation applied to NNs, and the potent image classifiers, good models of text, etc., that it enabled.

As to the "evidence" you proffer, mainly preprints on arXiv, and mainly consisting of budding researchers uploading papers consisting of little more than leaderboards (those little tables with the systems on one side, the datasets on the other side, and your system's results in bold, or no paper) - those are useless. 99% of research output in deep learning and neural nets is completely useless and never goes anywhere, because it lacks novelty, it is completely devoid of any theoretical results, and it is unreproducible even when the code is available.

For example, you mention studies on "question answering". Ca. 2018 Google published a paper where they reported that their BERT language model scored near-human performance on some question answering dataset without ever even having been trained on question answering. A scientific miracle! Some boffins who clearly don't believe in miracles wondered why that would even be possible and dug a bit, and found that BERT was overfitting to surface statistical regularities of its dataset. They created a new test dataset devoid of such statistical regularities and BERT's performance went down the drain, until it hit rock bottom (a.k.a. "no better than chance"). So much for "semantic similarity" measured over word embeddings modelling meaning.

But this is exactly the kind of work that I say connectionists consistently ignore: nowhere will you find that subsequent language models were tested in the same way. You will instead find plenty of tests "demonstrating" the ability of language models to represent semantics, meaning, etc. It's all bullshit, self-delusion at best, conscious fabrications otherwise.

This is the paper (I'm not affiliated with it in any way):

Probing Neural Network Comprehension of Natural Language Arguments

https://aclanthology.org/P19-1459/

But this kind of work is thankless for the undertaking academics and most of us have more important things to do. So the criticism eventually dwindles and what remains is the bullshit, and the fabrications, and the fantasies, seeping into mainstream discourse and being repeated uncritically - by yourself, for example. I can't even summon the compassion to not blame you anymore. For all I know you're exactly one of those connectionists who don't even understand their work is not science anymore, but spectacle.

P.S. I am not blind to the change of tone in your recent comments and I'm really sorry to be so cranky in response, when I should be cordial in reciprocity, but I've really had enough of all this. Unscientific bullshit has permeated everything and oozed everywhere. Perhaps it's time for me to take a break from HN, because it really doesn't look like I can have an original, curious conversation on here anymore.


I understand that this discussion can become frustrating, especially when you see repetitive patterns in the discourse or feel like the nuances are not being sufficiently addressed. However, there are a few points I would like to clarify:

Semantics in word embeddings: While I agree that word embeddings cannot fully capture human-like semantic understanding, they do provide a mathematical representation that has proven useful in many NLP tasks. It's not that word embeddings "understand" semantics in the human sense, but they do capture certain aspects of meaning that are statistically derived from their use in the training corpus. This is not an unsubstantiated claim. It is empirically demonstrated in numerous tasks where semantic understanding is beneficial, like semantic similarity, word analogy, and other downstream tasks such as translation, sentiment analysis, text classification, etc.

Your point about BERT overfitting to statistical regularities of the dataset is well taken. Indeed, it exposes the limitations of the model and the need for careful design and evaluation of benchmarks. However, it's worth noting that a failure in one specific test doesn't invalidate the successes in other tasks. It simply highlights an area that needs improvement.

It's true that there's a flood of papers and not all of them have substantial novelty or impact. This is not a problem exclusive to deep learning or AI, but a broader issue in academia and scientific publishing. However, amidst the noise, there's also a lot of valuable work being done, with genuine advancements and novel approaches.

You mentioned that connectionists only improve their systems when someone rubs their face in it. This is essentially how scientific progress happens - through skepticism, criticism, and the relentless pursuit of truth. I would argue that the current era of deep learning research is no different. It's a messy, iterative process, with steps forward, backward, and sideways.

Furthermore, I believe it's crucial to remember that there's room for both connectionist and symbolic approaches in AI. It's not necessarily a matter of one being 'right' and the other 'wrong.' Rather, they offer different perspectives and techniques that can be valuable in different contexts. Connectionist models, like the neural networks we've been discussing, are incredibly effective at tasks like pattern recognition and prediction, especially when dealing with large, high-dimensional datasets. On the other hand, symbolic models are excellent at representing explicit knowledge and reasoning logically, making them useful for tasks that require a high degree of interpretability or strict adherence to predefined rules. The future of AI likely involves finding ways to integrate these two approaches, leveraging the strengths of each to overcome their respective limitations. The field is vast and diverse, and there's plenty of room for different methods and viewpoints.

PS: I understand where you're coming from. Sometimes I need a break from this too. Remember there is no malicious intent here when people are just sharing their views.


Your disagreement seems to be a philosophical one. It is not a technical argument. It seems that you won't accept that semantics can be modelled by an unconscious mechanical system. I am talking about mathematical concepts of semantics, not "true" human semantics that are the product of human insight and consciousness. https://en.wikipedia.org/wiki/Semantic_similarity

While AI doesn't have an innate understanding of the world as humans do, the semantic representations it learns from vast amounts of text data can be surprisingly rich and detailed. It can capture associations and nuances that are not immediately apparent from a purely syntactic analysis of the text.


Oh come on. "Semantic similarity" is just heuristic bullshit. It's not a scientific term, or even a mathematical concept. Don't try to pull rank on me without even knowing who I am or what I do just because you can read wikipedia.

And note you're still not saying "why" or "how", only repeating the "what" of someone else's claim.


I understand your skepticism, and I acknowledge that the concept of semantic similarity is indeed an approximation. However, it is an approximation that has proven highly useful in a wide range of practical applications.

Semantic similarity methods are based on the idea that the meaning of a word can be inferred from its context, which is a concept known as distributional semantics. In essence, words that occur in similar contexts tend to have similar meanings. This is not just a heuristic, it's a well-established principle in linguistics, known as the distributional hypothesis.

In the case of large language models, they are trained on vast amounts of text data and learn to predict the next word in a sentence given the previous words. Through this process, they learn to represent words as high-dimensional vectors (word embeddings) that capture many aspects of their meaning, including their semantic similarity to other words.
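
To be concrete about what "semantic similarity" means operationally: it is usually just cosine similarity between those learned vectors. A minimal sketch (the embeddings here are hypothetical; in practice they come out of the trained model):

  # Cosine similarity between two embedding vectors.
  import numpy as np

  def cosine_similarity(u, v):
      return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

  # With hypothetical embeddings emb[...], one would expect
  # cosine_similarity(emb["cat"], emb["kitten"]) > cosine_similarity(emb["cat"], emb["carburetor"])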

These models can generate coherent text, answer questions, translate languages, and perform other language-related tasks with a high degree of accuracy. These capabilities wouldn't be possible if the models were only capturing syntax and not semantics.

The 'why' is because these models learn from the statistical regularities in their training data, which encode both syntactic and semantic information. The 'how' is through the use of deep learning algorithms and architectures like transformers, which allow the models to capture complex patterns and relationships in the data.

I hope this provides a more detailed explanation of my argument. I'm not trying to 'pull rank', but simply explaining the basis for my claims. I understand this is a complex topic, and I appreciate your challenging questions as they help clarify and deepen the discussion.


Do you put yourself in the 10% or the 90%? I’m asking in complete seriousness.


Oh it's definitely better than me at reasoning. I'm the one asking it to explain things to me, not the other way around.


If you think it's better than you at reasoning then you cannot at all be confident in the truth of its dialogue.


I am not. I treat it as I'd treat any smart human being.


LLM's are not a "smart human being." They are predictive statistical models capable of producing results based on training data.

LLM's do not think.

LLM's are algorithms.


Your brain is also basically an algorithm that produces results based on training data. It's just a much more complicated and flexible one.


But it’s also based on neurons with far more complex behavior than artificial neurons and also has other separate dynamic systems involving neurochemicals, various effects across the nervous system and the rest of the body (the gut becoming seemingly more and more relevant), various EEG patterns, and most likely quantum effects.

I personally wouldn't rule out that it could be emulated in a different substrate, but I think calling it "an algorithm" stretches and misapplies the usefulness of the term.


If it performs a computation, it is by definition running some algorithm regardless of how it's implemented in hardware / wetware. How is it a stretch?

The only way our brains could be not algorithmic is if something like soul is a real thing that actually drives our intelligence.


> The only way our brains could be not algorithmic is if something like soul is a real thing that actually drives our intelligence.

Therein lies the question, one which deserves contemplation and can lead to Enlightenment.

Which then begs the question: is Enlightenment a "real thing" and, if not, how is it that it can be experienced?


Why? Rain is not algorithmic, clouds are not algorithmic, waves in the sea are not algorithmic, yet they are entirely physical processes that have nothing to do with souls.


None of these are computational processes, unlike intelligence. Unless you're trying to argue that intelligence isn't, either?


Heaven forbid. I'd go to jail for such a blasphemous transgression of common law, wouldn't I? Thank you kind stranger for reminding me of the legislation.


You people are insufferable.


Ah ok. Here you use the word “explain” which implies more of a descriptive, reducing action rather than extrapolative and constructive. As in, it can explain what it has “read” (and it has obviously “read” far more than any human), but it can’t necessarily extrapolate beyond that or use that to find new truths. To me reasoning is more about the extrapolative, truth-finding process, ie “wisdom” from knowledge rather than just knowledge. But maybe my definition of “reasoning” isn’t quite right.

Edit: I probably should define reasoning as solely “deductive reasoning”, in which case, perhaps it is better than humans. But that seems like a premature claim. On the other hand, non-deductive reasoning, I have yet to see from it. I personally can’t imagine how it could do so reliably (from a human perspective) without real-world experiences and perceptions. I’m the sort that believes a true AGI would require a highly-perceptual, space-occupying organ. In other words it would have to be and “feel” embodied, in time and space, in order to perform other forms of reasoning.


Why don't you suggest an example we can run and see what it's capable of (compared to what I, or other humans, are capable of)?


(In case it was missed, I’ve added a relevant addendum to my previous comment.)

Not sure an example is needed because I agree it “explains” better than pretty much everyone. (From my mostly lay perspective) It essentially uses the prompt as an argument in a probabilistic analysis of its incredibly vast store of prior inputs to transform them into an output that at least superficially satisfies the prompter’s goals. This is cool and useful, to say the least. But this is only one kind of reasoning.

A machine without embodied perceptual experiences simply cannot reason to the full-extent of a human.

(It’s also worth remembering that the prompter (very likely) has far less knowledge of the domain of interest and far less skill with the language of communication, so the prompter is generally quite easily impressed regardless of the truth of the output. Nothing wrong with that necessarily, especially if it is usually accurate. But again, worth remembering.)


What would be an example of “non-deductive” reasoning, which requires embodied perceptual experiences?


“God, that felt great!”

As detailed as possible, describe what happened.


I have no idea what happened. I don’t even know what you expect me to describe. Someone feels great about something? And I don’t know what it has to do with reasoning.


That’s the point. You don’t know exactly what happened. So you have to reason your way to an answer, right or wrong.

I’m sure it elicited ideas in your head based on your own experiences. You could then use those ideas to ask questions and get further information. Or you could simply pick an answer and then delve into all the details and sensations involved, creating a story based on what you know about the world and the feelings you’ve had.

I could have created a more involved "prompt story", one with more details but still somewhat vague. You would probably have either jumped straight to a conclusion about what happened or asked further questions.

Something like “He kicked a ball at my face and hit me in the nose. I laughed. He cried.”

Again, vague. But if you’ve been in such a situation you might have a good guess as to what happened and how it felt to the participants. ChatGPT would have no idea whatsoever as it has no feelings of its own with which to begin a guess.

Consider poetry. How can ChatGPT reason about poetry? Poetry is about creating feeling. The content is often beside the point. Many humans "fail" at understanding poetry, especially children, but there are of course many humans that "get it", especially after building up enough life experience. ChatGPT could never get it.

Likewise for psychedelic or spiritual experiences. One can’t explain such experience to one who has never had it and ChatGPT will never have it.

Same goes for all inner experience.


You're talking about describing your memories of your inner experiences. Memories transform with time, sometimes I'm not sure if what I think I remember actually happened to me, or if this is something I read or seen in a movie, or someone else described it to me. Fake memories like that might feel exactly the same as the things that I actually experienced.

GPT-4 has a lot of such fake memories. It knows a lot about the world, and about feelings, because it has "experienced" a lot of detailed descriptions of all kinds of sensations. Far more than any human has actually experienced in their lifetime. If you can express it in words, be it poetry, or otherwise, GPT-4 can understand it and reason about it, just as well as most humans. Its training data is equivalent to millions of life experiences, and it is already at the scale where it might be capable of absorbing more of these experiences than any individual human.

GPT-4 does not "get" poetry in the same way a human does, but it can describe very well the feelings a human is likely to feel when reading any particular piece of poetry. You don't need to explain such things to GPT-4 - it already knows, probably a lot more than you do. At least in any testable way.


Imagine a world without words. No need to imagine really. It exists. It’s everywhere. It’s the core. It’s what words represent, but words can only represent it to an entity that has experienced it to some degree. ChatGPT “knows” nothing about it. You do. Whether you recognize it or not.

ChatGPT is a machine, an algorithm, a recombinator of symbols. It doesn’t know what the symbols refer to because each symbol necessarily refers to another symbol until you finally reach a symbol that refers to a shared, real experience…perhaps (Hello Wittgenstein!). And ChatGPT has no experience. Just symbols. It can’t intuit anything. It can’t feel anything. Even if you put quotes around “feel”, what does that even mean for a software algorithm running on hardware that does not feed continuous, variable electrical sensations to the algorithm? It only feeds discrete symbols. Do you feel the number 739? Or do you “feel” it? Um what? Whatever inner experience 739 happens to produce in you is grounded in some real experiences in the past. Likewise any fake memories you have that somehow seem real, those are still grounded in a real feelings at some point. You could do this ad infinitum. If you are alive, you have experience. But ChatGPT has no experience, no grounding.

Problem here might be that we are trying to use words and logic to describe something that cannot be described by either.

This is why the gong is struck.


I intuit the same given the jumps we've seen from GPT-3.5 to 4. The next jump will feature considerably more general intelligence than any one human being, though I think it will have certain weaknesses.


> GPT-4 is better at reasoning than 90% of humans.

Even for obvious nonsense that 90% of humans would recognize as such.


> GPT-4 is better at reasoning than 90% of humans. At least.

This makes as much sense as me asserting a binary search can tell one why they are looking for what they want.

That is to say, it cannot.


While impressive, GPT-4 isn't intelligent or close to intelligence. It's an impressive statistical model, but it doesn't have the ability to reason.


Do you have any example of how it cannot reason?


It's a neural network. Neural networks are not symbolic AI and are not designed to reason.


It's trivial to get ChatGPT to reason about things:

Me: A room contains a blue box and a red box. Alice stands on the red box. Bob also stands on a box. No box has multiple people standing on it. Which box is Bob stood on?

ChatGPT: There are two boxes in the room: a blue box and a red box. Alice is standing on the red box, so Bob must be standing on the blue box. Therefore, Bob is standing on the blue box.

Me: Add a green box to the room. Which box is bob standing on?

ChatGPT: If a green box is added to the room, then there are now three boxes: a blue box, a red box, and a green box. Since Alice is still standing on the red box, we know that Bob must be standing on either the blue box or the green box. We cannot determine which one Bob is standing on without additional information.


There's a decent working paper that has benchmarks on this, if you're interested.

There are many types of reasoning, but GPT-4 gets 97% on causal discovery, and 92% on counterfactuals (only 6% off from human, btw), with 86% on actual causality benchmarks.

I'm not sure yet if the question is correct, or even appropriate/achievable relative to what many may want to ask (i.e. what 'the public' is interested in is typically lost after it is defined in any given study); however, this is one of the best works available to address this problem I've seen so far, so perhaps it can help.


Percent of what? Possible right or wrong answers to a test?

Remember that GPT is not trained on all possible text. It's trained on text that was written intentionally. What percentage of that text contains "correct" instances of causal discovery, counterfactuals, etc.?


so can we make an estimate of GPT-4's IQ?

EDIT: Seems so...

https://duckduckgo.com/?q=ESTIMATE+OF+GPT-4%27S+IQ&t=opera&i...

shows articles with GPT IQ from 114 to 130. Change is coming for humans.


It doesn't matter what it is designed for. What matters is what it actually does. It's trivial to get GPT to do chain-of-thought reasoning and observe it.


It's not designed to reason, but it does reason. That's what's so impressive about it: emergent properties like factuality and reasoning.


Compare GPT-4 with a baby and you'll see that predicting the next word in a sequence is not human intelligence.


As a remote worker, that's 99% of what I do.


I watched a magician's show where he made his assistant hover in mid-air. It is evident that the magician has mastered levitation.


Looks more like a Chinese Room to me.


Everything is a Chinese room if you expect to see reified comprehension inside (and, naturally, don't find it).


Evolution created intelligence without even being intelligent itself


How do you know that’s true?


Evolution is just a description of a process, it isn't a tangible thing.


Because we eat and breathe through the same tube.


There are two tubes.


Evolution came up with the shared eating/breathing tube design because it made sense for aquatic animals (from which we evolved).


The trachea and oesophagus both start in the common space of the pharynx. The epiglottis stops food going down the wrong one


We don't, which is why we choke and hiccup.


Because evolution is random change + natural selection


You're proposing that statistics is intelligent?


You're proposing that evolution is statistics?


Basically, yeah.


Notable that the people who saw the first automatons argued whether they were alive.


People also argued that animals are mere automatons, that all their observable behavior is basically scripted, and that humans shouldn't empathize with them on those grounds. That's how vivisection etc used to be justified since at least Descartes.


Evolution is not a single orchestrator. It is merely the natural result of a simple mechanical process over a timescale that exceeds the lifetime of the common human.


Yep, we don't know all the constituents of buttermilk, nor how bread stales (there's too much going on inside). But that doesn't prevent us from judging their usefulness.


I feel like OAI's approach is kind of wrong. GPT-4 is still just text transformation/completion with multi-headed attention for better prediction of the next word that should follow (versus only looking at the previous word).

In human brains, language is only a way to communicate thoughts in concept form, though we also seem to use language to communicate abstract thoughts to ourselves to break them apart/down in a way (imo).

I'd love to see someone train a model on the level of GPT-4 to generate abstract thoughts/ideas based on input/context and then pair this model with GPT-4 co-operatively and continue to train, such that the flow of abstract ideas is parsed by GPT. But like... how do you even train a model that operates on abstract ideas? There doesn't seem to be any way to do this.


Starting a fire is easy to do even if you don't know how it works.


Appropriate analogy in more ways than one. Particularly given the unpredictable consequences of doing so.


We are probably creating something that looks like our intelligence but it works in a different way.

An example: we are not very good at creating flight - the kind birds do, which humans have always regarded as flight - and yet we fly across half the globe in one day.

Going up three meters and landing on a branch is a different matter.


Not that weird if you think about it: our intelligence, simultaneously measly and amazing as it is, was the product of trial, error, and sheer dumb luck. We could think of ourselves as monkeys with typewriters; eventually we'll get it right.


No matter what, it is probably easier to inspect a language model while it works than the language module of a human while they speak.


We don’t know…


I like the idea. Note that LLMs have some skill at decoding sequential dense vectors in the human brain

https://pub.towardsai.net/ais-mind-reading-revolution-how-gp...

so why not have them decode sequential dense vectors of their own activations?

As for the majority scoring poorly, they suggest that most neurons won't have clear activation semantics, so that is intrinsic to the task and you'd have to move to "decoding the semantics of neurons that fire as a group".


I don't think this is showing LLMs performing decoding. They're just using the LLM to propose possible words. The decoding is done by using another model to score how well a proposed word matches brain activity, and using that score to select a most likely sequence given the proposals from the LLM.


We know that complex arrangements of neurons are triggered by input and generate output that appears, to many humans, to have some intelligence.

The more interesting question is why intelligence/beauty/consciousness are emergent properties that exist in our minds.


There is no evidence that intelligence runs on neurons. Yes, there are neurons in brains, but there's also lots of other stuff in there too. And there are creatures that exhibit intelligent properties even though they have hardly any neurons at all. (An individual ant has only something like 250,000 neurons, and yet ants are the only creatures besides humans that have managed to create a civilization.)


This is not a good take. Yes, there is a lot more going on in brains than just neuronal activity, and we don't understand most of it. But understanding neurons and their connections is necessary (but not sufficient) to understanding what we consider intelligence. Also, 250k is a lot of neurons! Individual ants, as well as fruit flies which have even fewer neurons, show behavior we may consider intelligent. Source: I am not a scientist, but I work in neuroscience research


What's the argument that understanding neurons is necessary?

Perhaps intelligence is like a black box input to our bodies (call it the "soul", even though this isn't testable and therefore not a hypothesis). The mind therefore wouldn't play any more of a role in intelligence than the eye. And I'm not sure people would say the eye is necessary for understanding intelligence.

Now, I'm not really in a position to argue for such a thing, even if I believe it, but I'm curious what argument you might have against it.


You can actually hypothesize that a soul exists and that intelligence is non-material; it's just that your tests would quickly disprove that hypothesis - crude physical, mechanical modifications to the brain cause changes to intellect and character. If your hypothesis were correct, you would not expect to see changes like that at all.

Some people think that neurons specifically aren't necessary for understanding intelligence but in the same way that understanding transistors isn't necessary to understand computers, that neurons comprise the units that more readily explain intelligence.


I’m here playing devil’s advocate - this test doesn’t work. Here are some related thought experiments.

Suppose a soul is an immaterial source of intelligence, but it controls the body via machine-like material hardware such as neurons.

Or an alternative, suppose there is a soul inside your body “watching” the sensory activations within your brain like a movie. The brain and body create the movie & have some intelligence, but other important properties of the consciousness are bound to this observer entity.

In both these cases, the test just shows that if you damage the hardware, you can no longer observe intelligence because you’ve broken the end-to-end flow of the machine.


It's fine if you are playing, or supposing in seriousness but with good humor; it doesn't really change how anyone else should interact with you :)

But yes, supposing that, you would expect to see only damage that corresponds, such as different forms of paralysis or other purely mechanical damage, not things that change the interior perspective.

Otherwise you start postulating the existence of a thing whose sole justification is your desire for the existence of that thing, which is natural when you start questioning beliefs and kick out all the supports without meaning to.

I think this is what Bertrand Russell's teapot was meant to elucidate.


> You can actually hypothesize that a soul exists and that intelligence is non-material; it's just that your tests would quickly disprove that hypothesis - crude physical, mechanical modifications to the brain cause changes to intellect and character. If your hypothesis were correct, you would not expect to see changes like that at all.

That’s not necessarily a disproof. It’s also not necessarily reasonable to conflate what we call “the soul” with intelligence.

This is entering the world of philosophy, metaphysics and religion and leaving the world of science.

The modern instinct is to simply call bullshit on anything which cannot be materially verified, which is in many ways a very wise thing to do. But it’s worth leaving a door open for weirdness, because apart from very limited kinds of mathematical truth (maybe), I think everything we’ve ever thought to be true has had deeper layers revealed to us than we could have previously imagined.

Consider the reported experience of people who’ve had strokes and lost their ability to speak, and then later regained that ability through therapy. They report experiencing their own thoughts and wanting to speak, but something goes wrong/they can’t translate that into a physical manifestation of their inner state.

Emotional regulation, personality, memory, processing speed, etc… are those really that different from speech? Are they really the essence of who we are, or are they a bridge to the physical world manifest within our bodies?

We can’t reverse most brain damage, so it’s usually not possible to ask a person what their experience of a damaged state is like in comparison to an improved state. We do have a rough, strange kind of comparison in thinking about our younger selves, though. We were all previously pre-memory, drooling, poorly regulated babies (and before that, fetuses with no real perception at all). Is it right to say you didn’t have a soul when you were 3 weeks? A year? Two years? When exactly does “you” begin?

I can’t remember who I was when I was 2 months old with any clarity at all, and you could certainly put different babies in doctored videos and I wouldn’t be able to tell which was me/make up stories and I’d probably just absorb them. But I’m still me and am that 2 month old, much later in time. Whatever I’m experiencing has a weird kind of continuity. Is that encoded in the brain, even though I can’t remember it? Almost definitely, yeah. Is that all of what that experience of continuity is, and where that sense is coming from? I’ve got no idea. I certainly feel deeper.

Remember that we are all not living in the real world, we’re all living in our conscious perception. The notion that we can see all of it within a conscious mirror is a pretty bold claim. We can see a lot of it and damage the bridge/poke ourselves in the eyes with icepicks and whatnot, and that does stuff, but what exactly is it doing? Can we really know?

Intuitively most people would say they were still themselves when they were babies despite the lack of physical development of the brain. Whatever is constructing that continuous experience of self is not memory, because that’s not always there, not intelligence, because that’s not always there, not personality, because that’s not always there… it’s weird.

I think it’s important to remember that. Whenever people think they have human beings fully figured out down to the last mechanical detail and have sufficient understanding to declare who does and doesn’t have a soul and what that means in physical terms, bad things tend to happen. And that goes beyond a statement to be cautious about this kind of stuff purely out of moral hazard; the continual hazard is always as empirical as it is moral. We can never really know what we are. Our perceptual limitations may prove assumptions we make about what we are to be terribly, terribly wrong, despite what seems like certainty.


Brain damage by physical trauma, disease, oxygen deprivation, etc. has dramatic and often permanent effects on the mind.

The effect of drugs (including alcohol) on the mind. Of note is anesthesia which can reliably and reversibly stop internal experience in the mind.

For a non-physical soul to hold our mind we would expect significant divergence from the above. Out of body experiences and similar are indistinguishable from dreams/hallucinations when tested against external reality (remote viewing and the like).


> Brain damage by physical trauma, disease, oxygen deprivation, etc. has dramatic and often permanent effects on the mind.

That's not a completely watertight argument.

Consider a traditional FM/AM radio. You can modify it, damage it, and get notable changes to its behaviour...

> Of note is anesthesia which can reliably and reversibly stop internal experience in the mind

...turn it off and on again...

> For a non-physical soul to hold our mind we would expect significant divergence from the above.

... yet concluding that all the noises produced from the radio are purely internal, mechanical and physical would be the wrong conclusion.

(I'm not arguing that the human brain/mind is anything like analogous to a radio, just pointing out the limits of this approach.)


I mean, if we're really going to go there, who's to say that a large enough LLM doesn't automatically receive a soul simply because that's one of the fundamental laws of the universe as decreed by the Creator?


Going where? I wasn't arguing for the existence of a soul.

Although, sure, if we could somehow manage to determine that souls did exist then presumably an AI model as capable as a human would also be eligible for one.


“For a non-physical soul to hold our mind we would expect significant divergence from the above.”

This sounds like it assumes a physical mind could access a non-physical soul. All we probably know is that we have to be using an intact mind to use free will.


The other comments have pretty much covered it. We can pretty clearly demonstrate that neurons in general are important to behavior (brain damage, etc) and we even have some understanding about specific neurons or populations/circuits of neurons and their relation to specific behaviors (Grid cells are a cool example). And this work is all ongoing, but we're also starting to relate the connectivity of networks of neurons to their function and role in information processing. Recently the first full connectome of a larval fruit fly was published - stay tuned for the first full adult connectome from our lab ;)

Again, IANA neuroscientist, but this is my understanding from the literature and conversations with the scientists I work with.


Why would you doubt neurons play a role in intelligence when we've seen so much success in emulating human intelligence with artificial neural networks? It might have been an interesting argument 20 years ago. It's just silly now.


> It might have been an interesting argument 20 years ago. It’s just silly now.

Is it?

These networks are capable of copying something, yes. Do we have a good understanding of what that is?

Not really, no. At least I don’t. I’m sure lots of people have a much better understanding than I do, but I think it’s hard to know exactly what’s going on.

People dismiss the stochastic parrot argument because of how impressive big neural nets are, but it doesn’t really invalidate that argument. Is a very, very, very good parrot that learns from everyone at once doing basically the same as what we do? I’d argue no, at least not fully. It’s absorbed aspects of us extremely well/is a very weird, sophisticated mirror, yes, and is copying something somehow, probably in a way reminiscent of how we copy. Is it basically the same as what we’re doing when we think? Partially? Fully? Not at all?

A typical engineer would say “good enough”. That type of response is valuable in a lot of contexts, but I think the willingness to apply it to these models is pretty reckless, even if it’s impossible to easily “prove” why.

To be clear on the exact statement you made, I think you’re right/it’s pretty clear neurons play some very important role/potentially capture a ton of what we consider intelligence, but I don’t think anyone really knows what exactly is being captured/what amount of thought and experience they’re responsible for.


That person's argument is borderline insane to me - a severe lack of knowing what is unknown, a reverence for current best models (regarding modern science, including neurology - yet open-minded investigations beyond it are also a requisite here). And the pompousness is what truly boggles my mind ("It's silly to believe this, now"). A look in the mirror would suffice, to say the least...

Anyway, thank you for a great answer and conversation throughout this thread.

Regarding neural networks, parroting, and the emulation of intelligence (or the difference between an emulation and the "real thing"):

Well, somewhat like you say, we cannot propose a valid comparison from one to the other without an understanding of one (consciousness) or both. It's fascinating that there are some open, valid and pressing questions about what / how the output of this new wave of software is concretized (from foundational, semi-statistical algorithms in this case.)

Yes, I do agree neurons have something to do with the "final output". But this is a subsection of the problem - organic neurons are orders of magnitude more complex than what the tricky "parrot" is up to. Moreover, these components function very differently - compare the known functions of the neuron to the ANN, backprop, etc. (the entire stack).

P.S: One interesting theory I like to simulate and/or entertain is that every organic cell in the body has something to do with the final output of consciousness.


Please read the comment I was responding to. I was addressing the suggestion that perhaps the brain is as relevant to intelligence as the eye. Cognitive Neuroscience has been a thriving field for nearly half a century now. I didn't say we have it all figured out, just that it's obvious neurons are a piece of the puzzle.


Your theory makes sense in an evolutionary context. It is possible that all cells and organisms have some general intelligence. Humans do not have the ability to recognize this because evolutionarily it was only helpful to recognize intelligence when it could pose a threat to us. And the biggest threat to us in general was other humans as we are tribal animals. So we don't see it, we only see specialized intelligence that historically posed a threat to us.

It would explain why most "experts" didn't see GPT-4's abilities coming. Many of them expected that it would take a major algorithm or technology improvement to do "real intelligent" things, because they fundamentally misunderstood intelligence.


Thank you, appreciate the compliment.

And yeah, there are definitely a lot of open questions related to all of this. Love how it's brought so many deep questions into focus.


If anything the experience with artificial neural networks argues the opposite - biological neurons are quite a bit different than the "neurons" of ANNs, and backpropagation is not something that exists biologically.


> There is no evidence that intelligence runs on neurons.

1. Neurons connect all our senses and all our muscles.

2. Neurons are the definitive difference between the brain and the rest of the body. There is “other stuff” in the brain, but it’s not so different from the “other stuff” that’s in your rear end.

Don’t underestimate what a neuron can do. A single artificial neuron can fit a logistic regression model. A quarter of a million is on the scale of some of our largest AI, and biological neurons are far more connected than those in ANNs. An ant quite likely has a more powerful brain than GPT-4.
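
To make the "a single artificial neuron can fit a logistic regression model" point concrete, here is a tiny self-contained sketch on made-up data: one weighted sum plus a sigmoid, trained with plain gradient descent, is exactly logistic regression.

    import numpy as np

    def neuron(x, w, b):
        """One artificial neuron: weighted sum followed by a sigmoid."""
        return 1 / (1 + np.exp(-(x @ w + b)))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)   # toy linearly separable labels

    w, b, lr = np.zeros(2), 0.0, 0.5
    for _ in range(500):
        p = neuron(X, w, b)
        grad = p - y                                 # gradient of the cross-entropy loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()

    print(np.round(w, 2), round(b, 2))               # weights recover the decision boundary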


> 2. Neurons are the definitive difference between the brain and the rest of the body. There is “other stuff” in the brain, but it’s not so different from the “other stuff” that’s in your rear end.

Our digestive systems appear to be important to our behaviour, though. Some recent work in mice showed that if colonised with bacteria from faeces of humans with autism, the mice would begin to show autistic behaviours.

So, not sure your argument here is especially strong.


The role that digestive systems potentially play in our behavior is via neurotransmitters, so in the end it is the neurons that are responsible.

If the study you mention shows something else, I would actually be very interested in a source.


Maybe the neurons are the hardware layer, and the software is represented by the electrical activity. There is a good video https://youtu.be/XheAMrS8Q1c about this topic.


If you really want to present ants as a civilization, I don't think a single ant is a meaningful unit of that civilization comparable to a single human. A colony, perhaps - but then that's a lot more neurons, just distributed.


A single human will die permanently, alone, miserable, and unproductive. The comparison is apt.


A single human can survive on their own; there are many historical examples of that. A detached body part, on the other hand, cannot; but it also cannot feel miserable etc. A single ant is more like a body part of the colony in that sense.


If by "survive" you mean "age and die leaving nothing behind", then sure.

But the same is true for an ant.


No, it is not true for an ant. A solitary worker ant will not die from old age. Many species will literally starve to death within a few days because they cannot properly digest their food without trophallaxis.


You're right, it seems ants need each other to digest food, though scientists don't yet know why exactly.

From the New Yorker [1]:

> The researchers aren’t yet sure exactly what causes the digestive slowdown. Maybe it’s stress. Maybe the fluid that ants exchange when they share food stimulates digestion. Maybe they die still looking for a relative to break bread with.

[1] https://www.newyorker.com/tech/annals-of-technology/lonely-a...


What else would intelligence run on?


Every cell in our body, and every bacterium living in a body (e.g. gut flora), contribute to our intelligence. It looks plausible (to me) that there's one "top cell" among them that represents the "person", others just contributing via layered signals, but whether this "top cell" is a neuron or another kind of cell is unknown.


Ah, so you're not studied in neuroscience with some revolutionary theory. A shame.


In what way a civilization?


I'll repost a comment via Reddit that I think makes this case [0]:

Ants have developed architecture, with plumbing, ventilation, nurseries for rearing the young, and paved thoroughfares. Ants practice agriculture, including animal husbandry. Ants have social stratification that differs from but is comparable to that of human cultures, with division of labor into worker, soldier, and other specialties that do not have a clear human analogy.

Ants enslave other ants. Ants interactively teach other ants, something few other animals do, among them humans. Ants have built "supercolonies" dwarfing any human city, stretching over 5,000 km in one place. And ants too have a complex culture of sorts, including rich languages based on pheromones.

Despite the radically different nature of our two civilizations, it is undeniable from an objective standpoint that this level of society has been achieved by ants.

[0]: https://www.reddit.com/r/unpopularopinion/comments/t2h1vs/an...


To be honest, this description is leaning heavily on the associations we have with individual words used. Ant "architecture" isn't like our architecture. Ant "plumbing" and "ventilation" have little in common with the kind of plumbing and ventilation we use in buildings. "Nurseries", "rearing the young", that's just stretching the analogy to the point of breaking. "Agriculture", "animal husbandry" - I don't even know how to comment on that. "Social stratification" is literally a chemical feedback loop - ant larvae can be influenced by certain pheromones to develop into different types of ants, which happen to emit pheromones suppressing development of larvae into more ants of that type. Etc.

I could go on and on. Point being, analogies are fun and sometimes illuminating, but they're just that. There's a vast difference in complexity between what ants do, and what humans do.


You could say the same in reverse. Humans can't lift fifty times their own weight. Humans can't communicate in real-time with pheromones alone. Most humans do not know how to build their own home. An ant might well consider us backwards, not advanced.


> Humans can't lift fifty times their own weight.

Ants can do it as a consequence of their size. Relative lifting strength drops fast with increased size. Conversely, an ant scaled to human size would collapse under its own weight and cook itself to death - waste heat generation scales with volume (~ size³), while waste heat rejection scales with surface area (~ size²).

And it ain't a cognitive achievement anyway.
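
Rough back-of-the-envelope numbers for the square-cube point (the body lengths here are just illustrative):

    ant_length_mm, human_length_mm = 5, 1750
    k = human_length_mm / ant_length_mm        # linear scale factor, ~350x

    # Muscle strength ~ cross-sectional area ~ k^2; body weight ~ volume ~ k^3.
    strength_gain = k ** 2
    weight_gain = k ** 3

    relative_strength = strength_gain / weight_gain   # = 1/k
    print(f"a scaled-up ant keeps ~{relative_strength:.1%} of its strength-to-weight ratio")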

> Humans can't communicate in real-time with pheromones alone.

Yes. Because it's not as useful at our scale, nor is it real-time - chemical communication works better for small organisms and small volumes of living, as the travel speed and dissipation rate of pheromones is independent of organisms emitting them. Meanwhile, we have multiple ways of communicating real-time, some of which work at light speed which is the definition of "real time" (light speed in vacuum being the speed of causality itself).

> Most humans do not know how to build their own home.

Neither do ants.

> An ant might well consider us backwards, not advanced.

An ant can't consider us anything. The point I'm trying to get across is, just because the ant colony is capable of surprisingly high sophistication, doesn't mean the individual ants are.

As a counterpoint that's actually cognitive in nature: AFAIK individuals of many (most?) ant species can be easily tricked into following each other in a circle, and they will continue to do so until they start dying from starvation. This is because the ant isn't making a decision following a complex thought process - it's executing a simple algorithm, that works well in nature because nature is chaotic enough that stable ant circles are unlikely to form (and when they do, they're unlikely to be left undisturbed for long).


> "Agriculture", "animal husbandry" - I don't even know how to comment on that.

To give some examples of ant agriculture and animal husbandry, they cut and feed leaves to a fungus they farm for food. The fungus even communicates to the ants to switch to a different plant when the plant produces toxins in defense to cutting.

Ants herd aphids, protecting them from predators (ladybugs), secrete pheromones that direct and pacify them, and massage their abdomens to milk sugar dew.


Nobody is saying that an ant might be the next Frank Lloyd Wright.

They're saying they accomplish incredible things for the size of their brain, which is absolutely and unequivocally true.

"Go to the ant, thou sluggard; consider her ways, and be wise".


What I'm saying is that, Bible parables notwithstanding, it's not the individual ant that achieves these incredible things. The bulk of computational/cognitive work is done by the colony as a system. This means that there's little sense in comparing brainpower of an ant with that of a human. A more informative comparison is that between an ant colony and human society - here, humans may come out badly, but that's arguably because our societies are overcomplicated in order to compensate for individual humans having too much brainpower :).


> There's a vast difference in complexity between what ants do, and what humans do.

Interesting parallel with intelligence/sentience/sapience. Despite the means, isn't the end result what you have to judge? The end result looks like a rudimentary civilization. How far back in time would we have to go to find more sophistication in ant societies than in human ones?


> Despite the means, isn't the end result what you have to judge? The end result looks like a rudimentary civilization.

I'm indeed talking about the ends. The operative phrase here is "it looks like". It looks like a civilization, there are even some structural similarities to be found - but it is not a civilization, any more than a Lego brick with a computer sticker on it is a PC.

This is not to deny that ants are awesome and very much worth studying - my point is to avoid following flawed analogies too far. "Ant colonies look like rudimentary civilization, and civilization is/does X, therefore..." is going to be a nonsense statement for most possible values of X.


I think you can call ant societies civilizations, but at the same time you can call a multicellular organism a civilization, too. Usually, those also come from the same genetic seed, similar to (most) ant colonies. But more importantly, you have various types of cooperation and specialization in multicellular life. Airways are "ventilation", chitin-using or keratinized tissues are "architecture", and there is even "animal husbandry" in the form of bacterial colonies living in organs.


Do termites count as well then? Bees? I'm not being facetious, but these insects seem similar.


Perhaps an important point here is that ants pass the mirror test, which, to my knowledge, no other insects do.


Nature created humans to understand nature. We created GPT4 to understand ourselves.


Humans are the universe asking who made it.


Yes, or put a bit more elegantly, 'The cosmos is within us. We are made of star-stuff. We are a way for the universe to know itself.' — Carl Sagan


Yes, "all serve "gpt4!" Get a grip, pretty please.


A mirror of ourselves*


That's beautiful until you think about it.

Humans so far have done a great job at destroying nature faster than any other kind could.

And GPT4 was created for profit.


There is no evidence that any of those are emergent properties. It’s no more or less logical than asserting they were placed there by a creator.


There is evidence that people believe that GPT-4 is intelligent since it can solve things like the SATs. But if you start taking away weights one by one, at some point those same people will say it isn't intelligent. A NN with 3 weights cannot solve any problems that humans believe require intelligence. So where did it come from? I don't know, but it clearly emerged as the NN got bigger.


It's possible that the comment you're responding to is referring to the new paper out of Stanford/Google on emergence as a 'mirage'. The gist of it is that the shape of the curve for 'metric' vs 'params' is dependent upon the type of metric. For classification it has a certain shape, for accuracy it has another, etc. The authors claim the observation that these curves are determined by the choice of metric means that it isn't true emergence, but rather is due to the way in which GPTs are typically trained to predict the next token (and then beam search or typical sampling is often applied afterwards, etc).

The paper is somewhat new so I haven't done a proper review to know if it's solid work yet, but this may offer some context for some of the comments in this thread.
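
For intuition about the "mirage" argument (my own toy illustration, not the paper's data): if per-token accuracy improves smoothly with scale, a hard all-or-nothing metric such as exact match on a multi-token answer can still look like a sudden jump.

    import numpy as np

    log_params = np.linspace(6, 12, 7)                     # 10^6 .. 10^12 parameters
    per_token_acc = 1 / (1 + np.exp(-(log_params - 9)))    # hypothetical smooth improvement
    exact_match = per_token_acc ** 10                      # all 10 tokens of an answer must be right

    for lp, smooth, hard in zip(log_params, per_token_acc, exact_match):
        print(f"~10^{lp:.0f} params: per-token={smooth:.3f}  exact-match={hard:.3f}")

The smooth metric climbs gradually, while exact match hugs zero and then shoots up, which reads as "emergence" even though the underlying capability improved smoothly.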


I suspect that there's a sweet spot that combines a collection of several "neurons" and a human-readable explanation given a certain kind of prompt. However, this "three-body problem" will probably need some serious analytical capability to understand at scale


here's a clue, start your research

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).


My favorite line: “Our overall procedure is quite compute intensive.”

Should that be measured in number of nuclear power plants needed to run the computation? Or like, fractions of a small star’s output?


Yes, but they are from OpenAI, so they can just write papers that say whatever they want to say without minding the metrics, and then pretend like it is some kind of science.


> Which is to say, we still have no clue as to what's going on inside GPT-4 or even GPT-3, which I think is the question many want an answer to.

Exactly. Especially:

> ...the technique is already very computationally intensive, and the focus on individual neurons as a function of input means that they can't "reverse engineer" larger structures composed of multiple neurons nor a neuron that has multiple roles;

This paper brings us no closer to explainability in black-box neural networks and is just another excuse piece by OpenAI to try to placate concerns about the explainability that has been missing for decades in neural networks.

It is also the reason why they cannot be trusted in the most serious of applications, where such decision making requires lots of transparency rather than a model regurgitating nonsense confidently.


> It is also the reason why they cannot be trusted in the most serious of applications, where such decision making requires lots of transparency rather than a model regurgitating nonsense confidently.

Like say, in court to detect if someone is lying? Or at an airport to detect drugs?


You don't even have to look that far ahead. Apparently, people are already using ChatGPT to compile custom diet plans for themselves, and they expect it to take into account the information they supply regarding their allergies etc.

But, yes, those are also good examples of what we shouldn't be doing, but are going to do anyway.


>Apparently, people are already using ChatGPT to compile custom diet plans for themselves, and they expect it to take into account the information they supply regarding their allergies etc.

Evolution is still doing its thing.


What’s the risk? Someone allergic to peanuts will eat peanuts because ChatGPT put it in their diet plan? That’s silly.


Yes, that's the risk, and people are literally doing that because "if it put them in the recipe, it knows that quantity is safe for me", or "I asked it if it's okay and it cited a study saying that it is".


Those cases sound like Darwin Awards mediated by high technology


I hear crushed glass has very great health benefits


Is it really fair to say this brings us “no closer” to explainability?

This seems like a novel approach to try to tackle the scale of the problem. Just because the earliest results aren’t great doesn’t mean it’s not a fruitful path to travel.


> the explainability situation that has been missing for decades in neural networks.

Is this true? I thought explainability for things like DNNs for vision made pretty good progress in the last decade.


> It is also the reason why they cannot be trusted in the most serious of applications, where such decision making requires lots of transparency rather than a model regurgitating nonsense confidently.

Doesn't this criticism also apply to people to some extent? We don't know what the purpose of individual brain neurons is.


People are better understood intuitively. We understand how people fail and why. We can build trust with people with some degree of success. But machine models are new and can fail in unpredictable ways. They also get deployed to billions of users in a way that humans do not, and deployed in applications that humans are not. So it's certainly useful to try to explain neural networks in as much detail as we can.


Or we can build trust using black box methods like we do with humans, e.g., extrapolating from past behavior, administering tests, and the like.


We can, but the nice thing about neural networks is the ability to do all kinds of computational and mathematical manipulations to them to basically pick them apart and really find out what’s going on. This is important not just for safe deployment but also for research on new methods that could be used to make them better. Plus we need this ability to help avoid neural networks with intentionally hidden features that appear to behave linearly in certain regimes but are designed with a strong nonlinear response when special inputs are applied. You could have all the tests you want for a self-driving car based on real-world conditions, but some bad actor with access to the training system could create a special input that results in dangerous behavior.
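
A toy sketch of the kind of hidden behavior being described (my own illustration, with a made-up trigger): a unit that looks roughly linear on ordinary inputs but produces an outsized response on a secret trigger pattern, which behavioral testing on realistic inputs is unlikely to ever hit.

    import numpy as np

    rng = np.random.default_rng(0)
    trigger = rng.normal(size=16)                  # hypothetical secret trigger pattern
    trigger /= np.linalg.norm(trigger)
    w = rng.normal(scale=0.1, size=16)             # ordinary-looking linear weights

    def backdoored_unit(x):
        linear_part = w @ x                        # near-linear behavior on typical inputs
        similarity = trigger @ x / (np.linalg.norm(x) + 1e-9)
        spike = 100.0 * max(0.0, similarity - 0.95)   # gate only opens near the trigger
        return linear_part + spike

    print(backdoored_unit(rng.normal(size=16)))    # small, unremarkable output
    print(backdoored_unit(5 * trigger))            # much larger: the hidden nonlinear response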


The more fundamental problem is the sheer size of them, and this is only going to get worse as models grow larger to become more capable. Being able to look at the state of individual neurons during inference is very convenient, but that does not by itself make it possible to really find out what's going on.


As a person I can at least tell you what I do and don't understand about something, ask questions to improve/correct my understanding, and truthfully explain my perspective and reasoning.

The machine model is not only a black box, but one incapable of understanding anything about its input, "thought process", or output. It will blindly spit out a response based on its training data and weights, without knowing whether it is true or false, meaningful or complete gibberish.


As a person, you can tell what you think you do and don't understand, and you can explain what you think your reasoning is. In practice, people get both wrong all the time. People aren't always truthful about it, either, and there's no reliable way to tell if they are.


I built a toy neural network that runs in the browser[1] to model 2D functions with the goal of doing something similar to this research (in a much more limited manner, ofc). Since the input space is so much more limited than language models or similar, it's possible to examine the outputs for each neuron for all possible inputs, and in a continuous manner.

In some cases, you can clearly see neurons that specialize to different areas of the function being modeled, like this one: https://i.ameo.link/b0p.png

This OpenAI research seems to be feeding lots of varied input text into the models they're examining and keeping track of the activations of different neurons along the way. Another method I remember seeing used in the past involves using an optimizer to generate inputs that maximally activate particular neurons in vision models[2].

I'm sure that's much more difficult or even impossible for transformers which operate on sequences of tokens/embeddings rather than single static input vectors, but maybe there's a way to generate input embeddings and then use some method to convert them back into tokens.

[1] https://nn.ameo.dev/

[2] https://www.tensorflow.org/tutorials/generative/deepdream
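
For the activation-maximization idea in [2], a minimal PyTorch-style sketch (the tiny model and the neuron index are placeholders, not anything from either paper): start from a random input and do gradient ascent on a chosen neuron's activation.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))  # stand-in network
    model.eval()
    layer_idx, neuron_idx = 0, 7                  # which activation to maximize (arbitrary)

    x = torch.randn(1, 64, requires_grad=True)    # start from a random input
    opt = torch.optim.Adam([x], lr=0.1)

    for _ in range(200):
        opt.zero_grad()
        h = model[layer_idx](x)                   # activations of the chosen layer
        loss = -h[0, neuron_idx]                  # negate so gradient *descent* maximizes it
        loss.backward()
        opt.step()

    print(x.detach())                             # an input that strongly activates the neuron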


This tool is really lovely, great work!

I'd be curious to see Softmax Linear Units [1] integrated into the possible activation functions since they seem to improve interpretability.

PS: I share your curiosity with respect to things like deep dream. My brief summary of this paper is that you can use GPT4 to summarize what's similar about a set of highlighted words in context, which is clever but doesn't fundamentally inform much that we didn't already know about how these models work. I wonder if there's some diffusion-based approach that could be used to diffuse from noise in the residual stream towards a maximized activation at a particular point.

[1] https://transformer-circuits.pub/2022/solu/index.html


this is wonderful, thanks for sharing!


"This work is part of the third pillar of our approach to alignment research: we want to automate the alignment research work itself. A promising aspect of this approach is that it scales with the pace of AI development. As future models become increasingly intelligent and helpful as assistants, we will find better explanations."

On first look this is genius but it seems pretty tautological in a way. How do we know if the explainer is good?... Kinda leads to thinking about who watches the watchers...


> How do we know if the explainer is good?

The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation. They ask GPT-4 to guess neuron activation given an explanation and an input (the paper includes the full prompt used). And then they calculate correlation of actual neuron activation and simulated neuron activation.

They discuss two issues with this methodology. First, explanations are ultimately for humans, so using GPT-4 to simulate humans, while necessary in practice, may cause divergence. They guard against this by asking humans whether they agree with the explanation, and showing that humans agree more with an explanation that scores high in correlation.

Second, correlation is an imperfect measure of how faithfully neuron behavior is reproduced. To guard against this, they run the neural network with activation of the neuron replaced with simulated activation, and show that the neural network output is closer (measured in Jensen-Shannon divergence) if correlation is higher.
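
For concreteness, the scoring step boils down to something like this (made-up numbers standing in for the real and GPT-4-simulated activations):

    import numpy as np

    # Per-token activations of one neuron on some text (ground truth)...
    actual = np.array([0.0, 0.1, 3.2, 0.0, 2.8, 0.2, 0.0, 3.0])
    # ...and what a simulator predicted for those tokens from the explanation alone.
    simulated = np.array([0.1, 0.0, 2.9, 0.1, 3.1, 0.0, 0.2, 2.7])

    score = np.corrcoef(actual, simulated)[0, 1]   # the explanation's score, roughly
    print(f"explanation score = {score:.3f}")      # close to 1.0 means a faithful explanation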


> The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation.

To be clear, this is only neuron activation strength for text inputs. We aren't doing any mechanistic modeling of whether our explanation of what the neuron does predicts any role the neuron might play within the internals of the network, despite most neurons likely having a role that can only be succinctly summarized in relation to the rest of the network.

It seems very easy to end up with explanations that correlate well with a neuron, but do not actually meaningfully explain what the neuron is doing.


Eh, that's why the second check I mentioned is there... To see what the neuron is doing in relation to the rest of the network.


Why is this genius? It's just the NN equivalent of making a new programming language and getting it to the point where its compiler can be written in itself.

The reliability question is of course the main issue. If you don't know how the system works, you can't assign a trust value to anything it comes up with, even if it seems like what it comes up with makes sense.


I love the epistemology related discussions AI inevitably surfaces. How can we know anything that isn't empirically evident and all that.

It seems NN output could be trusted in scenarios where a test exists. For example: "ChatGPT design a house using [APP] and make sure the compiled plans comply with structural/electrical/design/etc codes for area [X]".

But how is any information that isn't testable trusted? I'm open to the idea ChatGPT is as credible as experts in the dismal sciences given that information cannot be proven or falsified and legitimacy is assigned by stringing together words that "makes sense".


> But how is any information that isn't testable trusted? I'm open to the idea ChatGPT is as credible as experts in the dismal sciences given that information cannot be proven or falsified and legitimacy is assigned by stringing together words that "makes sense".

I understand that around the 1980s-ish, the dream was that people could express knowledge in something like Prolog, including the test-case, which can then be deterministically evaluated. This does really work, but surprisingly many things cannot be represented in terms of “facts” which really limits its applicability.

I didn’t opt for Prolog electives in school (I did Haskell instead) so I honestly don’t know why so many “things” are unrepresentable as “facts”.


I bet GPT is really good at prolog, that would be interesting to explore.

"Answer this question in the form of a testable prolog program"


You lost this bet: Write append3/4 which appends three lists to a fourth, such that append3(Xs,Ys,[e],[]) terminates.


Did you give it a try?



There is a longer-term problem of trusting the explainer system, but in the near-term that isn't really a concern.

The bigger value here in the near-term is _explicability_ rather than alignment per-se. Potentially having good explicability might provide insights into the design and architecture of LLMs in general, and that in-turn may enable better design of alignment-schemes.


I can almost hear the Animatrix voiceover: "At first, AI was useful. Then, we decided to automate oversight... The rest is history."


It also lags one iteration behind, which is a problem because a misaligned model might lie to you, spoiling all future research with this method.


It doesn't have to lag, though. You could ask gpt-2 to explain gpt-2. The weights are just input data. The reason this wasn't done on gpt-3 or gpt-4 is just because a) they're much bigger, and b) they're deeper, so the roles of individual neurons are more attenuated.


I had similar thoughts about the general concept of using AI to automate AI Safety.

I really like their approach and I think it’s valuable. And in this particular case, they do have a way to score the explainer model. And I think it could be very valuable for various AI Safety issues.

However, I don’t yet see how it can help with the potentially biggest danger, where a super intelligent AGI is created that is not aligned with humans. The newly created AGI might be 10x more intelligent than the explainer model, to such an extent that the explainer model is not capable of understanding any tactics deployed by the super intelligent AGI. The same way ants are most probably not capable of explaining the tactics deployed by humans, even if we gave them 100 years to figure it out.


Safest thing to do: stop inventing and building more powerful and potentially dangerous systems which we can't understand?


You're correct to have a suspicion here. Hypothetically the explainer could omit a neuron or give a wrong explanation for the role of a neuron. Imagine you're trying to understand a neural network, and you spend an enormous amount of time generating hypotheses and validating them. If the explainer gives you 90% correct hypotheses, that means you have ten times less work to do producing hypotheses. So if you have a solid way of testing an explanation, even if the explainer is evil, it's still useful.


It produces examples that can be evaluated.

https://openaipublic.blob.core.windows.net/neuron-explainer/...


Using 'I'm feeling lucky' from the neuron viewer is a really cool way to explore different neurons, and then being able to navigate up and down through the net to related neurons.


Fun to look at activations and then search for the source on the net.

"Suddenly, DM-sliding seems positively whimsical"

https://openaipublic.blob.core.windows.net/neuron-explainer/...

https://www.thecut.com/2016/01/19th-century-men-were-awful-a...


How do we know WE are good explainers :-)


Gpt2 answers to gpt3. Gpt3 answers to gpt4. Gpt4 answers to God.


> We are open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.

Aww, that's so nice of them to let the community do the work they can use for free. I might even forget that most of OpenAI is closed source.


Surely they are willing to shell out a few bucks.


LLMs are quickly going to be able to start explaining their own thought processes better than any human can explain their own. I wonder how many new words we will come up with to describe concepts (or "node-activating clusters of meaning") that the AI finds salient that we don't yet have a singular word for. Or, for that matter, how many of those concepts we will find meaningful at all. What will this teach us about ourselves?


First of all, our own explanations about ourselves and our behaviour are mostly lies, fabrications, hallucinations, faulty re-memorization, post hoc reasoning:

"In one well-known experiment, a split-brain patient’s left hemisphere was shown a picture of a chicken claw and his right hemisphere was shown a picture of a snow scene. The patient was asked to point to a card that was associated with the picture he just saw. With his left hand (controlled by his right hemisphere) he selected a shovel, which matched the snow scene. With his right hand (controlled by his left hemisphere) he selected a chicken, which matched the chicken claw. Next, the experimenter asked the patient why he selected each item. One would expect the speaking left hemisphere to explain why it chose the chicken but not why it chose the shovel, since the left hemisphere did not have access to information about the snow scene. Instead, the patient’s speaking left hemisphere replied, “Oh, that’s simple. The chicken claw goes with the chicken and you need a shovel to clean out the chicken shed”" [1]. Also [2] has an interesting hypothesis on split-brains: not two agents, but two streams of perception.

[1] 2014, "Divergent hemispheric reasoning strategies: reducing uncertainty versus resolving inconsistency", https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4204522

[2] 2017, "The Split-Brain phenomenon revisited: A single conscious agent with split perception", https://pure.uva.nl/ws/files/25987577/Split_Brain.pdf


I'm not understanding the connection between your paragraphs here even after reading the first article.

Even if you accept classic theory (e.g. hemispheric localization and the homunculus), which most experts don't, all this suggests is that the brain tries to make sense of the information it has and, in sparse environments, fills in.

How does this make our behavior "mostly lies, fabrications, hallucinations, faulty re-memorization, post hoc reasoning", given that most humans don't have a severed corpus callosum?

The discussion starts with:

"In a healthy human brain, these divergent hemispheric tendencies complement each other and create a balanced and flexible reasoning system. Working in unison, the left and right hemispheres can create inferences that have explanatory power and both internal and external consistency."


I think the point is that, in a non-healthy brain, the brain can create a balanced and flexible reasoning system that creates inferences that have explanatory power, but which may not match external reality. Oliver Sacks has a long bibliography of the weird things that can go on in brains.

But the bottom line is that introspection is not necessarily reliable.


He didn't say behavior. He said explanations of behaviour. Split brain experiments aside, this is pretty evident from other research. We can't recreate previous mental states; we just do a pretty good job (usually) of rationalizing decisions after the fact. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196841/


I'm reading this as our explanations for our own behavior, as in why I am typing on this keyboard right now, in which case it's not evident at all.

The existence of cognitive dissonance suggested in your citation is in no way analogous to "our own explanations about ourselves and our behaviour are mostly lies, fabrications, hallucinations, faulty re-memorization, post hoc reasoning" and in fact supports the opposite.


One primary explanation of ourselves is that there is in fact a "self" there; we feel this "self" as being permanent and continuous through time, yet we are absolutely sure that is a lie: there are no continuous processes in the entire universe, as energy itself is quantized.

In the morning when we wake up, we are "booting" up the memories the brain finds and we believe that we have persisted through time, from yesterday to today, yet we are absolutely sure that is a lie: just look at an Alzheimer's patient.

We feel this self as if it's somewhere above the neck, and we feel like this self is looking at the world and sees "out there", yet we are absolutely sure that is a lie: our senses are being overwhelmed by inputs, and the brain filters them, shapes a model of the world, and presents that model to the internal model of itself, which gets so immersed in the model of the world that it starts to believe the model is indeed the world, until the first bistable image [1] breaks the model down.

[1] https://www.researchgate.net/profile/Amanda-Parker-14/public...


Not supported by neuroimaging. Promoted without evidence or sufficient causal inference.

https://www.health.harvard.edu/blog/right-brainleft-brain-ri... :

> But, the evidence discounting the left/right brain concept is accumulating. According to a 2013 study from the University of Utah, brain scans demonstrate that activity is similar on both sides of the brain regardless of one's personality.

> They looked at the brain scans of more than 1,000 young people between the ages of 7 and 29 and divided different areas of the brain into 7,000 regions to determine whether one side of the brain was more active or connected than the other side. No evidence of "sidedness" was found. The authors concluded that the notion of some people being more left-brained or right-brained is more a figure of speech than an anatomically accurate description.

Here's wikipedia on the topic: "Lateralization of brain function" https://en.wikipedia.org/wiki/Lateralization_of_brain_functi...

Furthermore, "Neuropsychoanalysis" https://en.wikipedia.org/wiki/Neuropsychoanalysis

Neuropsychology: https://en.wikipedia.org/wiki/Neuropsychology

Personality psychology > ~Biophysiological: https://en.wikipedia.org/wiki/Personality_psychology

MBTI > Criticism: https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indi...

Connectome: https://en.wikipedia.org/wiki/Connectome


You are talking about the popular narrative of “left brain” thinking being more logical and “right brain” thinking being more creative. You are correct this is unsupported.

The post you are replying to is talking about the small subset of individuals who have had their corpus callosum surgically severed, which makes it much more difficult for the brain to send messages between hemispheres. These patients exhibit "split brain" behavior that is well studied experimentally and can shed light on human consciousness and rationality.


This is not relevant to GP's comment. It has nothing to do with "are there fixed 'themes' that are operated in each hemisphere." It has to do with the more general question: does the brain know what the brain is doing? The answer so far does not seem to be "yes."


Says who? Is there actual evidence to support that our brain doesn't "know" what it is doing on a subconscious level? As far as I'm aware, it's more that conscious humans don't understand how our brains work.

I think the correct statement is "so far the answer is we don't know"


The split brain experiments very very clearly indicate that different parts of the brain can independently conduct behavior and gain knowledge independently of other parts.

How or if this generalizes to healthy brains is not super clear, but it does actually provide a good explanatory model for all sorts of self-contradictory behavior (like addiction): the brain has many semi-independent “interests” that are jockeying for overall control of the organism’s behavior. These interests can be fully contradictory to each other.

Correct, ultimately we do not know. But it’s actually a different question than your rephrasing.


Given that functional localization varies widely from subject to subject per modern neuroimaging, how are split brain experiments more than crude attempts to confirm that functional specialization (which is already confirmed without traumatically severing a corpus callosum) is "hemispheric" or "lateral"?

Neuroimaging indicates high levels of redundancy and variance in spatiotemporal activation.

Studies of cortices and other tissues have already shown that much of the neural tissue of the brain is general purpose.

Why is executive functioning significantly but not exclusively in the tissue of the forebrain, the frontal lobes?


Because there’s a version of specialization that is, “different regions are specialized but they all seem to build consensus” and there’s a version that is “different regions are specialized and consensus does not seem to be necessary or potentially even usual or possible.”

These offer very different interpretations of cognition and behavior, and the split brain experiments point toward the latter.


Functional specialization > Major theories of the brain> Modularity or/and Distributive processing: https://en.wikipedia.org/wiki/Functional_specialization_(bra... :

> Modularity: [...] The difficulty with this theory is that in typical non-lesioned subjects, locations within the brain anatomy are similar but not completely identical. There is a strong defense for this inherent deficit in our ability to generalize when using functional localizing techniques (fMRI, PET etc.). To account for this problem, the coordinate-based Talairach and Tournoux stereotaxic system is widely used to compare subjects' results to a standard brain using an algorithm. Another solution using coordinates involves comparing brains using sulcal reference points. A slightly newer technique is to use functional landmarks, which combines sulcal and gyral landmarks (the groves and folds of the cortex) and then finding an area well known for its modularity such as the fusiform face area. This landmark area then serves to orient the researcher to the neighboring cortex. [7]

Is there a way to address the brain with space-filling curves around ~loci/landmarks? For brain2brain etc

FWIU, Markram's lab found that the brain is at max 11D in some places; but an electron wave model (in the time domain) may or must be sufficient according to psychoenergetics (Bearden)

> Distributive processing: [...] McIntosh's research suggests that human cognition involves interactions between the brain regions responsible for processes sensory information, such as vision, audition, and other mediating areas like the prefrontal cortex. McIntosh explains that modularity is mainly observed in sensory and motor systems, however, beyond these very receptors, modularity becomes "fuzzier" and you see the cross connections between systems increase.[33] He also illustrates that there is an overlapping of functional characteristics between the sensory and motor systems, where these regions are close to one another. These different neural interactions influence each other, where activity changes in one area influence other connected areas. With this, McIntosh suggest that if you only focus on activity in one area, you may miss the changes in other integrative areas.[33] Neural interactions can be measured using analysis of covariance in neuroimaging [...]

FWIU electrons are most appropriately modeled with Minkowski 4-space in the time-domain; (L^3)t

Neuroplasticity: https://en.wikipedia.org/wiki/Neuroplasticity :

> The adult brain is not entirely "hard-wired" with fixed neuronal circuits. There are many instances of cortical and subcortical rewiring of neuronal circuits in response to training as well as in response to injury.

> There is ample evidence [53] for the active, experience-dependent re-organization of the synaptic networks of the brain involving multiple inter-related structures including the cerebral cortex.[54] The specific details of how this process occurs at the molecular and ultrastructural levels are topics of active neuroscience research. The way experience can influence the synaptic organization of the brain is also the basis for a number of theories of brain function


"Representational drift: Emerging theories for continual learning and experimental future directions" (2022) https://www.sciencedirect.com/science/article/pii/S095943882... :

> Recent work has revealed that the neural activity patterns correlated with sensation, cognition, and action often are not stable and instead undergo large scale changes over days and weeks—a phenomenon called representational drift. Here, we highlight recent observations of drift, how drift is unlikely to be explained by experimental confounds, and how the brain can likely compensate for drift to allow stable computation. We propose that drift might have important roles in neural computation to allow continual learning, both for separating and relating memories that occur at distinct times. Finally, we present an outlook on future experimental directions that are needed to further characterize drift and to test emerging theories for drift's role in computation.

So, given representational drift, it is likely necessary to run the same [fMRI, NIRS] stimulus-response activation observation/burn-in again weeks or months later with the same subjects.


"EM Wave Polarization Transductions" Lt. Col. T.E Bearden (1999) :

> Physical observation (via the transverse photon interaction) is the process given by applying the operator ∂/∂t to (L^3)t, yielding an L3 output.


You're ignoring several confounders and conflating several loosely related things that I'm not even sure what the point you're making is anymore.

To begin with, the split-brain experiments don't provide clear or strong evidence for anything, given the small sample size, the heterogeneity in procedure (i.e. whether there was a complete commissurotomy or just a callosotomy), and the elapsed time between the neuropsychiatric evaluation and the initial procedures, which relies on the assumption that adaptation does not occur and neuroplasticity is not a thing. The split-brain experiments are notable because the lab experiments SUGGEST a lack of communication between the two hemispheres and a split consciousness; however, this is paradoxical with the everyday experience of these patients, which is far from providing evidence for anything.

Ignoring that for a moment, how do the split-brain experiments support 'the brain has many semi-independent “interests” that are jockeying for overall control of the organism’s behavior'?

How is addiction self-contradictory exactly and what does this have to do with split-brain?

If your point is that different parts of the brain (e.g. the reward system and the executive-function regions) have different roles, that isn't really debatable: obviously different parts of the brain are all doing their individual jobs, and the most widely accepted theory is that these are integrated by some unknown mechanism into a single consciousness which remains in control.

Your original statement of 'does the brain know what the brain is doing. The answer so far does not seem to be "yes"'

suggests you're arguing that the brain has many different consciousnesses that are in constant battle, i.e. that there is not a unified consciousness in control of behavior.

To take your addiction example, the brain is very much aware of what it is doing, and addiction is not self-contradictory, because short-term rewards are being valued above long-term ones and the health of the organism. The reward system model provides an excellent neurobiological explanation for addiction.

This is not directly evidenced by either addiction or the split-brain experiments; it is at best a hypothetical model, which is why it's a theory, and hence my original response to your statement.


I'm not "ignoring" them at all. I'm saying that they point to interesting questions that are not answered.

> The split-brain experiments are notable because the lab experiments SUGGEST the lack of communication between two hemispheres and a split conscious however this is paradoxical with everyday experience of these patients, far from providing evidence for anything.

It is not "paradoxical" but yes it does conflict with some reported experience. However, even healthy individuals often report being "of two minds" or struggling to "make up their [singular] mind." Why are these utterances to be dismissed while the also-subjectively-reported sensation of unitary experience is taken as fact?

> Suggests you're arguing that the brain has many different consciouses that are in a constant battle, i.e. there is not a unified consciousness in control of behavior.

I wouldn't characterize my position as "many different consciousnesses," but rather that consciousness is dispersed across (at least) the brain. In some scenarios (such as a corpus callosotomy) and perhaps in more everyday scenarios - perhaps all day every day - that dispersed activity can fall out of internal "synchronization." Anyway, you provided the exact same interpretation in the previously quoted section: "the lab experiments SUGGEST the lack of communication and a split consciousness."

You just go one step further by prioritizing the subjectively reported sensation of unitary consciousness over the also-subjectively-reported sensation of non-unitary consciousness. That's your prior taking hold, not mine, and not actual evidence.

You yourself admit we do not know the mechanism (if any exists) by which the activity in various parts of the brain are integrated. We do not know if this process actually even occurs!

Regarding addiction, it is very, very commonly reported that addicts will go into "autopilot" like states while satisfying their addictions and only "emerge" when they have to face consequences of their behaviors. Again, subjectively reported, but so is the experience of unitary consciousness! If we cannot trust one then we shouldn't take it as granted that we can trust the other.

I get the sense you think you're arguing against some firmly held belief or a model I'm proposing as fact: you're not! We're both saying "we don't know much about how this works." And no, neurochemical mechanisms are not complete answers to how brain activity ladders up to conscious experience, similar to how a molecular model of combustion cannot explain much about urban traffic patterns.


I'm arguing against the strength of your statements, which are based on methodologically unsound experiments that do not "very very clearly indicate" anything beyond posing a few questions for which there are several different hypothetical answers, all of which have zero evidence behind them.

Similarly, the initial comment of 'does the brain know what the brain is doing. The answer so far does not seem to be "yes."' is misleadingly suggesting there is a shred of evidence supporting that the answer is 'no' or that the answer is 'not yes'. There are no answers so far, just questions.

If anything, there are more unified consciousness hypotheses than otherwise, although if you refer back to my original reply I did not make this assertion: 'I think the correct statement is "so far the answer is we don't know"'

> It is not "paradoxical" but yes it does conflict with some reported experience.

Rather than belabour the experiment results and implications here is a great peer-reviewed article by experts in the field: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7305066/

> Regarding addiction, it is very, very commonly reported that addicts will go into "autopilot" like states while satisfying their addictions and only "emerge" when they have to face consequences of their behaviors. Again, subjectively reported, but so is the experience of unitary consciousness! If we cannot trust one then we shouldn't take it as granted that we can trust the other.

The dopamine reward system understanding (which by the way is probably the most well-validated and widely believed model in neurobiology) provides a rational explanation for addiction.

You haven't explained what is self-contradictory, that a few case reports exist of patients claiming they went in and out of consciousness? That's not a contradiction.


The split brain experiments clearly indicate exactly what I said they clearly indicate. To refresh: "that different parts of the brain can independently conduct behavior and gain knowledge independently of other parts".

From your own linked article: "In short, callosotomy leads to a broad breakdown of functional integration ranging from perception to attention."

If the dopamine reward system is the full answer, then what explains drug addicts anguishing about their addiction while simultaneously actively seeking out their next hit? What part of the brain is producing the anguish if the whole behavior is conclusively described by the dopamine reward system?


Breakdown of functional integration =/= different parts of the brain can independently conduct behavior and gain knowledge.

Still not sure what the contradiction is, is it regret now? Cause that isn’t contradictory.

It's pretty simple: drugs feel really good when you take them, so a single consciousness prioritizes that feeling over long-term interests. When one is not taking them and is facing the consequences of those decisions, one feels bad. To make the bad feelings go away, one takes more drugs, and the cycle repeats.


Your response doesn't seem to be directly related to the previous poster's split-brain comments, but rather the popular misuse of the lateralization idea.


Agree I'm not a neuro subspecialist but I've listened to some talks at conferences out of interest and I don't think anyone still believes in this anymore. Anecdotally the few fMRI's I reported as a trainee didn't support this either.


Phantoms in the Brain is a fascinating book that deals with exactly this topic.


that is absolutely fascinating and also makes me extremely uncomfortable


Neurology is full of very uncomfortable facts. Here's one for you: there are patients who believe their arm is gone even though it's still there. When the doctor asks whose arm that is, they reply it must be someone else's. The brain can simply refuse to know something, and will adopt whatever delusions and contortions are necessary. Which of course leads to the realization that there could be things we're all incapable of knowing. There could be things right in front of our faces we simply refuse to perceive and we'd never know it.


Famously, our nose is literally right in front of our faces and the brain simply "post-processes" it out of the view.

After breaking my arm (it split in two, pinching the nerve and leaving me unable to move it for about a year), I still feel as if the arm is "someone else's", as if I am moving an object in VR, not something which is "me" or "mine".


Oliver Sacks' A Leg To Stand On is a lengthy discussion of that, including his own experiences after breaking a leg---IIRC, at one point after surgery but before he starts physical therapy, he wakes up convinced that a medical student has played a prank by removing his leg and attaching one from a cadaver, or at least sticking a cadaver's leg under his blanket. (ISTR he tries to throw it out of bed and ends up on the floor.)


It's probably way worse than we can imagine.

Reading/listening to someone like Robert Sapolsky [1] makes me laugh that I could ever have hallucinated such a muddy, not-even-wrong concept as "free will".

Furthermore, between the brain and, say, the liver there is only a difference of speed/data integrity inasmuch as one cares to look for information processing as basal cognition: neurons firing in the brain, voltage-gated ion channels and gap junctions controlling bioelectrical gradients in the liver, and almost everywhere in the body. Why does only the brain have a "feels like" sensation? The liver may have one as well, but the brain, being an autarchic dictator, perhaps suppresses the feeling of the liver; it certainly abstracts away the thousands of highly specialized decisions the liver takes each second while adequately solving the complex problem space of blood processing. Perhaps Thomas Nagel shouldn't have asked "What Is It Like to Be a Bat?" [2] but what it is like to be a liver.

[1] "Robert Sapolsky: Justice and morality in the absence of free will", https://www.youtube.com/watch?v=nhvAAvwS-UA

[2] https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F


The biggest problem with the current popular idea of “free will” is that people think it means they’re ineffably unpredictable. They’re uncomfortable with the notion that if you were to simulate their brain in sufficient detail, you could predict thoughts and reaction. They take refuge in pseudoscientific mumbling about the links to the Quantum, for they have heard it is special and unpredictable.

And that’s just the polar opposite of having a meaningful will at all. It is good that you are pretty much deterministic. You shouldn’t be deciding meaningful things randomly. If you made 20 copies of yourself and asked them to support or oppose some essential and important political question (about human rights, or war, or what-have-you) they should all come down on the same side. What kind of a Will would that be that chose randomly?


This is why I suggest that curious individuals try a hallucinogen at least once*. It really makes the fragility of our perception, and how it’s held up mostly by itself, very apparent.

* in a safe setting with support, of course.


I can confirm lucid dreaming does the same.


Same. We are so, so profoundly not what it feels like we are, to most of us anyway.

I am morbidly curious how people are going to creatively explain away the more challenging insights AI gives us in to what consciousness is.


If that makes you uncomfortable you definitely should not go reading the evidence supporting the notion that conscious free will is an illusion.

https://www.mpg.de/research/unconscious-decisions-in-the-bra...


My impression is that the understanding of that research that comes up with statements like "But when it comes to decisions we tend to assume they are made by our conscious mind. This is questioned by our current findings" is based on dualistic reasoning.

The idea that there should not be any neural activity before a conscious decision is straight-up dualism---the intangible soul makes a decision and neural activity follows it to carry out the decision.

An alternative way of understanding that result is that the neural activity that precedes the "conscious decision" is the brain's mechanism of coming up with that decision. The "conscious mind" is the result of neural activity, right?


i always thought that the concept of free will didn't bother me. but it turns out i just didn't understand what it implied. oh dear.

if it turns out that's true, it's truly amazing how well we convince ourselves that we're in control.

but if our brain controls our actions and not our consciousness, then what is the purpose of consciousness?


Perhaps there is no purpose to consciousness.

Perhaps it's a phenomenon that somehow arises independently ex nihilo from sufficiently complex systems, only ever able to observe, unable to act.

Weird to think about.


Can conscious experience ever arise from matter? Even if the said matter is neural networks? This seems utterly nonsensical to me.


Do you consist of matter? Are you conscious? Are you aware the brain is a neural network?

Let's assume the premise that a form of neural network is necessary but insufficient to give rise to conscious experience. Then might it not matter whether the medium is physical or digital?

If you answer this with anything other than "we don't yet know", then you'll be wrong, because you'll be asserting a position beyond what science is able to currently provide; all of this is an open question. But hint: the evidence is mounting that yes, the medium might not matter.

Once you take on an information theoretic view of consciousness, a lot of possibilities and avenues of research open up.

https://en.m.wikipedia.org/wiki/Neural_correlates_of_conscio...


Sorry, my English is not the best, and I don't think there is a word for the thing I'm trying to explain. The meaning of 'consciousness' is too messy.

I know the brain is a neural network. I just don't understand how cold, hard matter can result in this experience of consciousness we are living right now. The experience. Me. You. Perceiving. Right now.

I'm not talking about the relation between the brain and our conscious experience. It's obvious that the brain is collecting and computing data every second for us to live this conscious experience. I mean the very experience of perceiving, of being conscious. The thing we take for granted the most, given that we're never without it, except when we're asleep.

Matter is what it is. A bunch of carbon and hydrogen atoms. How does the experience arise from matter? It can't. It is a bunch of atoms. I know how NNs and biological neurons work, and still I don't see any way matter can do that. There must be some sort of non-matter essence, a soul or something like that.

Is a bunch of electrochemical charges this thing/experience I am living right now? How can it be? Is Boltzmann brain [1] a sensible idea at all?

1: https://en.wikipedia.org/wiki/Boltzmann_brain


You are referring to qualia [1]: feeling something as-is, subjectively. I had similar metaphysics just a few years ago, but in the end, it's just that: bad metaphysics. And it's not even your or my fault: with 2,400+ years of bad metaphysics it's almost only luck to be able to pierce the veil into the metaphysics of tomorrow.

The main point is that with the tremendous discoveries of people such as Church/Turing (matter can be organized in such a fashion as to produce computation) [2] and those forgotten from the first page of history, such as Harold Saxton Burr (matter can be animated through bioelectricity) [3], we are no longer bound to a static metaphysics where objects are made from a material which just sits there. It was obviously never the case, but fighting the phantasms of our own speculation is the hardest fight.

Therefore, no, matter is neither cold, nor hard, and we are surely very far from comprehending all the uses and forms of matter. Just look at all the wood objects around you and think how the same material was available also to Aristotle, 2,400 years ago, and to Descartes, 400 years ago, when they were writing their bad metaphysics, yet they were completely unable to think 1% of the wood objects you have readily available nowadays, cardboard and toothpicks included.

And also, yes, you are electrochemical charges, we all are, what else could we be? We looked insanely deep into the brain [4], there is no magic going on. A caveat would be that, yes, probably, we are not running on the first layer, at the level of sodium ions and neurotransmitters, but that the machinery, the brain, gives rise to a simulation: "only a simulation can be conscious, not physical systems" [5].

[1] https://en.wikipedia.org/wiki/Qualia

[2] https://en.wikipedia.org/wiki/Church%E2%80%93Turing_thesis

[3] https://en.wikipedia.org/wiki/Harold_Saxton_Burr

[4] "The Insane Engineering of MRI Machines", https://www.youtube.com/watch?v=NlYXqRG7lus

[5] https://www.youtube.com/watch?v=tyVyirT6Cyk https://www.youtube.com/watch?v=SbNqHSjwhfs


I did not mean qualia, I knew qualia. I meant that my experience is not matter. Matter is matter and charges are charges, my experience right now is neither. Your perception, the act of you experiencing life right now is not matter.

> We looked insanely deep into the brain [4], there is no magic going on.

Indeed all computation and input collection and such happen in the brain. I just don't understand how I can experience anything if I'm composed only of matter. How come there happens to be a mind? Indeed the electrochemical charges from visual receptors in the eye will be transmitted and computed and memory and dopamine and all the neurons will fire regardless of whether I'm only matter or not. But how can the experiencing consciousness, 'me' arise from matter?

> only a simulation can be conscious, not physical systems

This is what I'm talking about, only that I don't see why simulations are not physical systems.

> yes, you are electrochemical charges, we all are, what else could we be?

It's nonsensical and unscientific to completely rule out the possibility that we can be something else as well, especially when we can't study it directly, like in the example of soul.


Simulations are physical systems just as much as running a "Hello World" program on your computer is a physical system: somewhere some transistors flip, but they are not relevant for the level of description we are interested in when running the program: the program output, as simple or complex as it may be. Somewhere in the brain some molecules do "stuff", and as a result of that "stuff" the brain sustains one agent, or more [1]. How exactly, in an engineering sense, the agent is constructed is yet to be discovered; hopefully we are only a few years, or a few decades, away from building synthetic agents.

Sure, we have about 2,700 years of tradition speaking of souls (considering the major religions: Christianity, Islam, Buddhism, Hinduism, and Judaism). Where did those 2,700 years get us? Has any religion been able to build a conscious agent starting from basic materials (whatever they consider basic, pixie dust if they will)? Have all these years of speaking about souls managed to achieve something meaningful, even as a side effect, that actually improves the quality of life? I'm talking hay [2], indoor plumbing, hook-and-loop fasteners, ibuprofen, GPS, voltmeters, extreme ultraviolet lithography, things that you and I can use and rely on daily. I have read pretty much all the major texts of the major traditions, from the Mahābhārata to the Summa Theologiae, call it intellectual curiosity. If not for the "bragging rights" to say that I know what filioque or bodhipakkhiyādhammā means, I would regret it as wasted time and pointless eye strain. So no, it's not nonsensical and unscientific to rule out a not-even-hypothesis such as the "soul" after 2,700+ years without any kind of results and with absolute incompatibility with the way we actually interact with the world, scientifically or not: photons, atoms, electromagnetic fields and the like.

[1] https://en.wikipedia.org/wiki/Dissociative_identity_disorder

[2] "The technologies which have had the most profound effects on human life are usually simple. A good example of a simple technology with profound historical consequences is hay.", https://quotepark.com/quotes/1924489-freeman-dyson-like-many...


We are composed of matter and we are conscious, we take this as being axiomatic. Given we have one certain example, the question is then how does the body do it, and how could we do it in other substrates: how do we go from "just physics" to mind [1].

The short answer is: chemical reactions start a chain reaction of abstraction towards higher and higher forms of collective intelligence.

For some reason, perhaps something with the way the Hilbert space vector obeying the Schrödinger equation which we usually call "the universe" is [2], but also given the ridiculous breadth and depth of possible explorations of the biological pathways [3], "chunks" of matter tend to group together, they group and form stars and planets, but they also group and form formaldehydes and acetaldehydes and many more. Given enough tries, across multiple environments, in some lucky hydrothermal vents abiogenesis was probably started [4]. Once we had the first agential "chunk" of matter, a group of matter which has a definition of the boundary between a "self", no matter how tiny [5], and an exterior environment, it was more of a game of waiting (~4 billion years) for this tiny agent to grow into an agent with higher-order thinking, self-referentiality, metacognition, and the likes.

Neural networks, as in matrix multiplications, are not conscious because they have no mechanism for deciding what is the environment and what is their own; they are a hammer, sitting there, expecting to be used, not a Lacrymaria olor [6], exploring the environment for survival and fun. Could we have neural networks in an agent-like architecture starting to behave more like thermostats, setting goals for themselves? Probably.

[1] "From physics to mind - Prof. Michael Levin", https://youtu.be/_QICRPFWDpg?t=85

[2] "Sean Carroll: Extracting the universe from the wave function", https://www.youtube.com/watch?v=HOssfva2IBo

[3] "Nick Lane, 'Transformer : The Deep Chemistry of Life and Death'", https://youtu.be/bEFzUx_j7tA?t=279

[4] https://en.wikipedia.org/wiki/Hydrothermal_vent

[5] https://en.wikipedia.org/wiki/Physarum_polycephalum

[6] https://www.youtube.com/watch?v=sq6Y54mxjOg


Please see my reply to your sister comment. Thanks for the links, especially 1 btw.


Just wait until you notice how much humans do this day to day


"LLMs are quickly going to be able to start explaining their own thought processes better than any human can explain their own."

There is no "their" and there is no "thought process" . There is something that produces text that appears to humans like there is something like thought going on (cf the Eliza Effect), but we must be wary of this anthropomorphising language.

There is no self reflection, but if you ask an LLM program how "it" knows something it will produce some text.


> There is no self reflection, but if you ask an LLM program how "it" knows something it will produce some text.

To be clear, you're saying that we should just dismiss out-of-hand any possibility that an LM AI might actually be able to explain its reasoning step-by-step?

I find it kind of charming actually how so many humans are just so darn sure that they have their own special kind of cognition that could never be replicated. Not even with 175,000,000,000 calculations for every word generated.


That's a strawman since I didn't argue anything about humans being special. I don't think there is anything necessarily inherently special about human intelligence, I'm just advocating for caution around the language we use to talk about current systems.

All this talk of AGI and sentience and so on is premature and totally unfounded . It's pure sci fi, for now at least.


> I didn't argue anything about humans being special

Above you said about AI LMs:

> There is no "their" and there is no "thought process"

So, unless you're claiming that humans lack a thought process as well, then you're arguing that humans are special.

> All this talk of AGI and sentience and so on is premature and totally unfounded

I don't see any mention of AGI or sentience in this thread?

Also, I don't think anyone could read this transcript with GPT-4 and still claim that it's incapable of a significant degree of self-reflection and metacognition:

https://news.ycombinator.com/item?id=35880148


I reject the words 'self' and 'cognition' in your comment. This is exactly what I'm talking about. A facsimile of them, maybe...


Since we don't know for sure what is happening 100% within a neural network, we can say we don't believe that they're thinking, but then we would still need to define the word "thinking". Once LLMs can self-modify, the word "thinking" will be more accurate than it is today.

And when Hinton says at MIT, "I find it very hard to believe that they don't have semantics when they consult problems like you know how I paint the rooms how I get all the rooms in my house to be painted white in two years time," I believe he's commenting on the ability of LLM's to think on some level.


In this case, I think we do; see the paper (https://openaipublic.blob.core.windows.net/neuron-explainer/...). Their method is to:

1. Show GPT-4 a text with the activation level of a specific GPT-2 neuron highlighted at each token as GPT-2 processes that text. They then ask GPT-4 for an explanation of what the neuron is doing.

Text: "...mathematics is done _properly_, it...if it's done _right_. (Take ..."

GPT-4 produces "words and phrases related to performing actions correctly or properly".

2. Based on the explanation, get GPT-4 to guess how strongly the neuron activates on a new text.

"Assuming that the neuron activates on words and phrases related to performing actions correctly or properly. GPT-4 guesses how strongly the neuron responds at each token: '...Boot. When done _correctly_, "Secure...'"

3. Compare those predictions to the actual activations of the neuron on the text to generate a score.

So there is no introspection going on.

They say, "We applied our method to all MLP neurons in GPT-2 XL [out of 1.5B?]. We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron's top-activating behavior." But they also mention, "However, we found that both GPT-4-based and human contractor explanations still score poorly in absolute terms. When looking at neurons, we also found the typical neuron appeared quite polysemantic."
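
A minimal sketch of that explain/simulate/score loop (hypothetical helper names; `ask_model` stands in for whatever explainer-LLM call you'd use, and the correlation score is only a stand-in for the paper's exact scoring formula):

    import numpy as np

    def ask_model(prompt):
        # Hypothetical stand-in for a call to an explainer LLM (e.g. GPT-4).
        raise NotImplementedError

    def explain_neuron(tokens, activations):
        # Step 1: show (token, activation) pairs, ask for a short explanation.
        shown = ", ".join(f"{t}:{a:.1f}" for t, a in zip(tokens, activations))
        return ask_model("What does this neuron respond to? " + shown)

    def simulate(explanation, tokens):
        # Step 2: from the explanation alone, guess an activation per token.
        reply = ask_model(f"Neuron: {explanation}. Rate 0-10 for each token: {tokens}")
        return np.array([float(x) for x in reply.split()])

    def score(real, simulated):
        # Step 3: the explanation scores well if simulated activations track real ones.
        return float(np.corrcoef(real, simulated)[0, 1])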


Very true. In my opinion, if there were a way to extract "Semantic Clouds of Words", i.e. given a particular topic, navigate the semantic cloud word by word, find some close neighbours of a word, jump to one of those neighbours, and so on, then LLMs might not seem that big of a deal.

I think LLMs are "Semantic Clouds of Words" + a grammar and syntax generator. Someone could just discard the grammar and syntax generator, use only the semantic cloud, and create the grammar and syntax themselves.

For example, in writing a legal document, a person slightly educated on the subject could just put the relevant words onto an empty page and fill in the blanks of syntax and grammar, along with human reasoning, which is far superior to any machine reasoning, at least today.

The process of editing GPT*-generated documents to fix the reasoning is not a negligible task anyway. Sam Altman mentioned that "the machine has some kind of reasoning", not a human reasoning ability by any means.

My point is that LLMs are two programs fused into one, "word clouds" and "syntax and grammar", sprinkled with some kind of poor reasoning. Their word-clouding ability is so unbelievably stronger than any human's that it fills me with awe every time I use it. Everything else is just, well, whatever!
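
For what it's worth, the "navigate the semantic cloud" part is easy to prototype with ordinary word embeddings. A toy sketch with made-up vectors (in practice you'd load GloVe/word2vec or an LLM's embedding matrix), using cosine similarity as the notion of "closeness":

    import numpy as np

    # Made-up 3-d embeddings, just to show the mechanics.
    emb = {
        "contract":  np.array([0.9, 0.1, 0.0]),
        "clause":    np.array([0.8, 0.2, 0.1]),
        "liability": np.array([0.7, 0.3, 0.2]),
        "banana":    np.array([0.0, 0.1, 0.9]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def neighbours(word, k=2):
        # Rank every other word by similarity to `word`.
        sims = {w: cosine(emb[word], v) for w, v in emb.items() if w != word}
        return sorted(sims, key=sims.get, reverse=True)[:k]

    # Hop through the cloud: start at a topic word, jump neighbour to neighbour.
    print(neighbours("contract"))   # ['clause', 'liability']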


I think they're much more than that. Or rather, if they're "Semantic Cloud of Words", they're still a hundred thousand dimensional clouds of words, and in those hundred thousand dimensions, any relationship you can think of, no matter how obscure, ends up being reflected as proximity along some subset of dimensions.

Looking at it this way, I honestly wouldn't be surprised if that's exactly how "System 1" (to borrow a term from Kahneman) in our brains works.

What I'm saying is:

> In my opinion, in case there is a way to extract "Semantic Clouds of Words", i.e given a particular topic, navigate semantic clouds word by word, find some close neighbours of that word, jump to a neighbour of that word and so on, then LLMs might not seem that big of a deal.

It may be much more of a deal than we'd naively think - it seems to me that a lot of what we'd consider "thinking" and "reasoning" can be effectively implemented as proximity search in a high-dimensional enough vector space. In that case, such extracted "Semantic Cloud of Words" may turn out to represent the very structure of reasoning as humans do it - structure implicitly encoded in all the text that was used as training data for the LLMs.


>if they're "Semantic Cloud of Words", they're still a hundred thousand dimensional clouds of words, and in those hundred thousand dimensions, any relationship you can think of, no matter how obscure, ends up being reflected as proximity along some subset of dimensions.

Yes, exactly that. That's what GPT4 is doing, over billions of parameters, and many layers stacked on top of one another.

Let me give you one more tangible example. Suppose Stable Diffusion had two steps for generating images with humans in them. One step takes as input an SVG file with some simple lines which describe the human anatomy: body position, joints, dots as eyes, etc. Something very simple, xkcd style. From then on, it generates the full human which corresponds exactly to the input SVG.

Instead of SD being a single model, it could be multimodal, and it should work a lot better in that respect. Every image generator suffers from that problem; human anatomy is very difficult to get right.[1] GPT-4 could function the same way, being multimodal instead of a single model, with the two steps discrete from one another.

So, in some use cases, we could generate some semantic clouds, and generate syntax and grammar as a second step. And if we don't care that much about perfect syntax and grammar, we feed it to GPT-2, which is much cheaper to run and much faster. When I used the paid service of GPT-3, back in 2020, the Ada model was the worst one, but it was the cheapest and fastest. And it was fast. I mean instantaneous.

>the very structure of reasoning as humans do it

I don't agree that the machine reasons even close to how a human does, as of today. It will get better over time, of course. However, in some not-so-frequent cases it comes close. Sometimes it seems like it, but only superficially, I would argue. Upon closer inspection the machine spits out nonsense.

[1] Human anatomy is very difficult to get right the way an artist does. Many/all of the artists point out that A.I. art doesn't have soul in the pictures. I share the same sentiment.


The text output of an LLM is the thought process. In this context the main difference between humans and LLMs is that LLMs can't have internalized thoughts. There are of course other differences too, like the fact that humans have a wider gamut of input: visuals, sound, input from other bodily functions. And the fact that we have live training.


It's not clear whether or not LLMs have internal thoughts -- each token generation could absolutely have a ton of thought-like modelling in the hidden layers of the network.

What is known is that these internal thoughts get erased each time a new token is generated. That is, it's starting from scratch from the contents of the text each time it generates a word. But you could postulate that similar prompt text leads to similar "thoughts" and/or navigation of the concept web, and therefore the thoughts are continuous in a sense.


True, LLMs definitely have something that is "thought-like".

But today's networks lack the recursion (feedback where the output can go directly to the input) that is needed for the type of internalized thoughts that humans have. I guess this is one of the things you are pointing at by mentioning the continuousness of the internals of LLMs.


Or maybe the human thought process isn't as sophisticated as we imagined.


I'm not arguing for or against that. It's more the implications of sentience and selfhood implicit in the language many use around LLMs.


This is it. This comprehension of the chats is a symptom of something like linguistic pareidolia: a face forced onto a composition of probabilistic incidents and wistfulness.


Pride comes before the fall, and the AI comes before humility.


Precisely.

When someone states definitively what LLMs can or cannot do, that is when you know to immediately disregard their claims as the waffling of uninformed laymen lacking the necessary knowledge foundations (cognitive science/neuroscience/philosophy) to even appreciate the uncertainty and the finer points under discussion (all the open questions regarding human cognition, etc.).

They don't know what they don't know, and they make unfounded assertions as a result.

Many would do well to refrain from speaking so surely about matters they know nothing about, but that is the internet for you.

https://en.wikipedia.org/wiki/Predictive_coding


Clearly there is some process going on to decide what word to pick. Why can’t we call that thinking?


What if you ask it to emit the reflexive output, then feed that reflexive output back into the LLM for the conscious answer?

What if you ask it to synthesize multiple internal streams of thought, for an ensemble of interior monologues, then have all those argue with each other using logic and then present a high level answer from that panoply of answers?


What if you do? LLMs don't have reflexive output or internal streams of thought, they are simply (complex) processes that produce streams of tokens based on an inputted stream of tokens. They don't have a special response to tokens that indicate higher-level thinking to humans.


LLMs seem to me to be the "internal streams of thought". I.e. it's not LLMs that are missing an internal process that humans have, but rather it's humans that have an entire process of conscious thinking built on top of something akin to LLM.


I agree completely and I think this is where a lot of people get tripped up. There's no reason to think an AGI needs to be an LLM alone, it might just be a key building block.


Well put, and I agree. My belief is that if a typical person was drugged or otherwise induced to just blurt out their unfiltered thoughts out loud as it crossed their brain, the level of incohesion and false confidence on display would look a lot like an LLM hallucinating.


That's possible I guess but is there positive evidence for that being the case?


The way I phrased it isn't exactly structured to admit any kind of evidence, so let me destructure it. My observation is that:

- The argument that LLMs are missing introspection / inner voice is based on attempting to compare LLMs directly with human minds.

- Human minds have conscious and unconscious parts; for many people, part of the boundary between unconscious and conscious mind manifests as the "inner voice" - the one that makes verbalized thoughts "appear" in their head (or rather perhaps become consciously observed).

- Based entirely on my own experience in using GPT-3.5 and GPT-4, and my own introspection, I feel that GPT-4 bears a lot of resemblance to my inner voice in terms of functioning.

- Therefore I propose that comparing LLMs directly to human minds is unproductive, and it's much more interesting/useful to compare them to the inner voice in human minds: the part of the boundary between unconscious and conscious that uses natural language for I/O.


If you direct the model output to itself and don't view it otherwise, how is it not an "internal stream of thought"?


You seem to have high confidence in how LLMs work.


If the Gödel incompleteness theorem applies here, then the explanations are likely … incomplete or self-referential.


The Goedel Incompleteness Theorem has no straightforward application to this question.


It would if the language model did reasoning according to rules of logic. But they don't. They use Markov chains.

To me it makes no sense to say that an LLM could explain its own reasoning if it does no (logical) reasoning at all. It might be able to explain how the neural network calculates its results. But there are no logical reasoning steps in there that could be explained, are there?


Honest question: are we sure that it doesn’t do logical reasoning?

IANAE but although an LLM meets the definition of a Markov Chain as I understand it (current state in, probabilities of next states out), the big black box that spits out the probabilities could be doing anything.

Is it fundamentally impossible for reasoning to be an emergent property of an LLM, in a similar way to a brain? They can certainly do a good impression of logical reasoning- better than some humans in some cases?

Just because an LLM can be described as a Markov Chain doesn’t mean it _uses_ Markov Chains? An LLM is very different to the normal examples of Markov Chains I’m familiar with.

Or am I missing something?

In any case, coemu is an interesting related idea to constrain AIs to thinking in ways we can understand better:

https://futureoflife.org/podcast/connor-leahy-on-agi-and-cog...

https://www.alignmentforum.org/posts/ngEvKav9w57XrGQnb/cogni...


All programs that you can fit on a computer can be described by a sufficiently large Markov chain (if you imagine all the possible states of the memory as nodes). Whatever the human brain is doing is also describable as a massive Markov chain.

But since the Markov chain becomes exponentially larger with the number of states, this is a very nitpicky and meaningless point.

Clearly, to say something is a Markov chain and have that mean something, you need to say that the thing it's doing could be more or less compressed into a simple Markov chain over bigrams or something like that, but that is just not true empirically, not even for GPT-2. Even just this is already pretty hard to turn into a reasonably sized Markov chain: https://arxiv.org/abs/2211.00593.

Just saying that it outputs probabilities from each state is not enough; the states are English strings, and there are (number of tokens)^context_length possible states of a given length. That is not a reasonable Markov chain that you could actually implement or run.
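
Rough arithmetic with GPT-2's published numbers (50,257-token BPE vocabulary, 1,024-token context window) shows why an explicit chain over full contexts is a non-starter:

    import math

    vocab, context = 50_257, 1_024   # GPT-2 BPE vocabulary size and context window

    # Decimal digits in vocab**context, i.e. the size of an explicit state space
    # over all possible contexts.
    print(context * math.log10(vocab))   # ~4814, i.e. roughly 10^4814 states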


> Honest question: are we sure that it doesn’t do logical reasoning?

It's not the Creature from the Lagoon; it's an engineering artifact created by engineers. I haven't heard them say it does logical deduction according to any set of logic rules. What I've read is that it uses Markov chains. That makes sense, because basically an LLM, given a string input, should reply with the string that is the most likely follow-up to the first string, based on all the text it crawled from the internet.

If the internet had lots and lots of logical reasoning statements, then an LLM might be good at producing what looks like logical reasoning, but that would still be just a response with the most likely follow-up string.

The reason the results of LLMs are so impressive is that at some point the quantity of the data makes a seemingly qualitative difference. It's like: if you have 3 images and show each of them to me one after the other, I will say I saw 3 images. But if you show me thousands of images at 24 per second, and the images are small variations of the previous images, then I say I see a MOVING PICTURE. At some point quantity becomes quality.


My understanding is that at least one form of training in the RLHF involves supplying antecedent and consequent training pairs for entailment queries.

The LLM seems to be only one of the many building blocks and is used to supply priors / transition probabilities that are used elsewhere in downstream part of the model.


What leads you to suspect that Gödel incompleteness may be relevant here?

There's no formal axiom system being dealt with here, afaict?

Do you just generally mean "there may be some kind of self-reference, which may lead to some kind of liar-paradox-related issues"?


The relevance is because all (all known buildable aka algorithmic, and sufficiently powerful) models of computation are equivalent in terms of formal computability, so if you could violate/bypass the Godel or Turing theorems in neural networks, then you could do it in a Turing machine, and vice versa. (That's my understanding, feel free to correct me if I'm mistaken)


Well... , yeah, but, these models already produce errors for other unrelated reasons, and like...

Well, what exactly would we be showing that these models can’t do? Quines exist, so there’s no general principle preventing reflection in general. We can certainly write poems (etc.) which describe their own composition. A computer can store specifications (and circuit diagrams, chip designs, etc.) for all its parts, and interactively describe how they all work.

If we are just saying “ML models can’t solve the halting problem”, then ok, duh. If we want to say “they don’t prove their own consistency” then also duh, they aren’t formal systems in a sense where “are they consistent (as a formal system)?” even makes sense as a question.

I don’t see a reason why either Gödel or Turing’s results would be any obstacle for some mechanism modeling/describing how it works. They do pose limits on how well they can describe “what they will do” in a sense of like, “what will it ‘eventually’ do, on any arbitrary topic”. But as for something describing how it itself works, there appears to be no issue.

If the task to give it was something like “is there any input which you could be given which would result in an output such that P(input,output)” for arbitrary P, then yeah I would expect such diagonalization problems to pop-up.

But a system having a kind of introspection about how it works, rather than answering arbitrary questions about its final outputs (such as, program output, or whether a statement has a proof), seems totally fine.

Side note: One funny thing: (aiui) it is theoretically possible for an oracle that can have random behavior to act (in a certain sense) as a halting-oracle for Turing machines with access to the same oracle.

That’s not to say that we can irl construct such a thing, as we can’t even make a halting oracle for normal Turing machines. But, if you add in some random behavior for the oracles, you can kinda evade the problems that come from the diagonalization.


I don't have a rigorous argument, but as I recall, Rice's theorem says that computers cannot decide nontrivial semantic properties of programs in general. So we cannot have perfect virus checkers, etc. (Well, we actually do have imperfect virus checkers, of course.) Following that, a nontrivial property such as self-explanation, e.g. correctly explaining what a neural network is doing for all possible inputs, would be undecidable in general. (But again this is not fully clear to me; someone might point out that for a specific program this is different and Rice's theorem is irrelevant.)


I commented in another answer but you can consult https://etc.cuit.columbia.edu/news/basics-language-modeling-...

Some training forms include entailment : “if A then B”. I hope this is first order logic which does have an axiom system :)


So is the word "word" but that seems to have worked out OK so far. I can explain the meaning of "meaning" and that seems to work OK too. Being self-referential sounds a lot more like a feature than a bug. Given that the neurons in our own heads are connected to each other and not any ground truth, I think LLMs should do just fine.


As long as lazy evaluation exists, self-reference is fine, no?

Hofstadter talks about something similar in his books.


That's probably one of the reasons why you'd use GPT-4 to explain GPT-2.

Of course, if you were trying to use GPT-4 to explain GPT-4 then I think the Gödel incompleteness theorem would be more relevant, and even then I'm not so sure.


Are there any examples of an LLM developing concepts that do not exist or cannot be inferred from its training set?


"Cannot be inferred from its training set" is a pretty difficult hurdle. Human beings can infer patterns that aren't there, and we typically call those hallucinations or even psychoses. On the other hand, some unconfirmed, novel patterns that humans infer actually represent groundbreaking discoveries, like for example much of the work of Ramanujan.

In a real sense, all of the future discoveries of mathematics already exist in the "training set" of our present understanding, we just haven't thought it all the way through yet. If we discover something new, can we say that the concept didn't exist, or that it "couldn't be inferred" from previous work?

I think the same would apply to LLMs and their understanding of the way we encode information using language. Given their radically different approach to understanding the same medium, they are well poised to both confirm many things we understand intuitively as well as expose the shortcomings of our human-centric model of understanding.


I'm really curious what kind of concept you might have in mind. Can you give any example of a concept that if an LLM developed that concept then it would meet your criteria? It might sound like a sarcastic question but it's hard to agree on the meanings of "concepts that do not exist" or "concepts that cannot be inferred" maybe you can give some examples.

EDIT: I see below you gave some examples, like invention of language before it existed, and new theorems in math that presumably would be of interest to mathematicians. Those ones are fair enough in my opinion. The AI isn't quite good enough for those ones I think, but I also think newer versions trained with only more CPU/GPU and more parameters and more data could be 'AI scientists' that will make these kinds of concepts.


It is by definition impossible for an LLM to develop a concept that 'cannot be inferred from its training set'.

On the other hand, that is an incredibly high bar.


Tautologically, every concept that anything (LLM, or human, or alien) develops can be inferred from the input data(e.g. training set), because it was.


No, it wasn't: language itself didn't even exist at one point. It wasn't inferred into existence from training data, because no such examples existed before. Now we have a dictionary of tens of thousands of words, which describe high-level ideas, abstractions, and concepts that someone, somewhere along the line, had to invent.

And I'm not talking about imitation, nor am I interested in semantic games; I'm talking about raw inventiveness. Not a stochastic parrot looping through a large corpus of information and a table of weights on word pairings.

Has AI ever managed to learn something humans didn't already know? It's got all the physics textbooks in its data set. Can it make novel inferences from that? How about in math?


> No, it wasn't, language itself didn't even exist at one point.

Language took dozens of millennia to form, and animals have long had vocalizations. Seems like a natural building on top of existing features.

> Has AI ever managed to learn something humans didn't already know?

AlphaZero invented all new categories of strategy for games like Go, when previously we thought almost all possible tactics had been discovered. AIs are finding new kinds of proteins we never thought about, which will blow up the fields of medicine and disease in a few years once the first trials are completed.


> Language took dozens of millennia to form

Sure, but in a simulated evolutionary algorithm, you can hit a few hundred generations in a matter of seconds.

Indeed, the identification of an abstraction, followed by a definition of that abstraction and an enshrinement of the concept in the form of a word or phrase, in and of itself, shortcuts the evolutionary path altogether. AI isn't starting from scratch: it's starting from a dictionary larger than any human alive knows and in-memory examples of humans conversing on nearly every topic imaginable.

We never thought "all possible tactics" had been discovered with Go. We quite literally understood that Go had a more complex search space than Chess, with far more possible moves and outcomes. And I don't think anyone has any kind of serious theorem that "all possible tactics" have been discovered in either game, to this day.

That being said, Go and Chess are enumerable games with deterministic, bounded complexity and state space.

The protein folding example is a neat one, I definitely think it's interesting to see what develops there. However, protein folding has been modeled by Markov State models for decades. The AlphaFold breakthrough is fantastic, but it was already known how to generate models from protein structures: it was just computationally expensive.

It was also carefully crafted by humans to achieve what it did: https://www.youtube.com/watch?v=gg7WjuFs8F4. So this is an example of humans using neural network technology that humans invented to achieve a desired solution to a known problem that they themselves conceived. The AI didn't tell us something we didn't already know. It was an expert system built with teams of researchers in the loop the whole way through.


Coming up with new language is rarely ever coming up with new concepts that didn't exist until the word. We come up with high-level abstractions because there already exists a material system to be described and modeled. Language that doesn't describe anything that already exists is more like babbling.


Not really, no. There are plenty of intangible abstractions that don't describe material systems. Take imaginary numbers, for example. Or the concept of infinity or even zero, neither of which exists in the physical world.

The reason why "naming things" is the other hard problem in computer science, after cache invalidation, is that the process of identifying, creating, and describing ideal abstractions is itself inherently difficult.


The training sets are so poorly curated we will never know...


I’m sure LLMs are quickly going to learn to hallucinate (or let’s use the proper word for what they’re doing: confabulate) plausible-sounding but nonsense explanations of their thought processes at least as well as humans.


I don't think we invent new words just for AI to explain its thought process to us better. AI may explain more elaborately in our language instead.


And if the LLM is the explainer, it can lie to us if 'needed'.


Based on my skimming the paper, am I correct in understanding that they came up with an elaborate collection of prompts that embed the text generated by GPT-2 as well as a representation of GPT-2's internal state? Then, in effect, they simply asked GPT-4, "What do you think about all this?"

If so, they're acting on a gigantic assumption that GPT-4 actually correctly encodes a reasonable model of the body of knowledge that went into the development of LLMs.

Help me out. Am I missing something here?


>If so, they're acting on a gigantic assumption

Yes the initial hypothesis that GPT-4 would know was a gigantic assumption. But a falsifiable one which we can easily generate reproducible tests for.

The idea that simulated neurons could learn anything useful at all was once a gigantic assumption too.


Neural networks aren’t simulated neurons…


Why does it have to understand how the LLMs are built? They have used GPT-4 to just build a classifier for each neuron's activation, and given the NLP abilities of GPT-4, the hope is that it can describe the nature of the activation of the neurons.


After GPT-4 generates the hypothesis for a neuron they test it by comparing GPT-4's expectation for where the neuron should fire against where it actually fires.

If you squint it's train/test separation.


To me the value here is not that GPT4 has some special insight into explaining the behavior of GPT2 neurons (they say it's comparable to "human contractors" - but human performance on this task is also quite poor). The value is that you can just run this on every neuron if you're willing to spend the compute, and having a very fuzzy, flawed map of every neuron in a model is still pretty useful as a research tool.

But I would be very cautious about drawing conclusions from any individual neuron explanation generated in this way - even if it looks plausible by visual inspection of a few attention maps.


They also mention they got a score above 0.8 for 1000 neurons out of GPT2 (which has 1.5B (?)).


1.5B parameters, only 300k neurons. The number of connections is roughly quadratic with the number of neurons.


I thought they had only applied the technique to 307,200 neurons. 1,000 / 307,200 = 0.33% is still low, but considering that not all neurons would be useful since they are initialized randomly, it's not too bad.
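
For anyone checking the arithmetic: the 307,200 figure follows from the standard GPT-2 XL configuration (48 blocks, model width 1600, MLP width 4 x 1600 = 6400), and a rough parameter count shows why ~1.5B parameters correspond to only ~300k MLP neurons (biases, layer norms, and positional embeddings ignored):

    n_layer, d_model, vocab = 48, 1600, 50_257   # GPT-2 XL configuration
    d_mlp = 4 * d_model                          # 6400 MLP neurons per block

    mlp_neurons  = n_layer * d_mlp                   # 307,200 units the paper explains
    mlp_params   = n_layer * 2 * d_model * d_mlp     # in + out projections, ~983M
    attn_params  = n_layer * 4 * d_model * d_model   # Q, K, V + output proj, ~492M
    embed_params = vocab * d_model                   # token embeddings, ~80M

    print(mlp_neurons, mlp_params + attn_params + embed_params)   # 307200, ~1.55e9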


This isn't exactly building an understanding of LLMs from first principles... IMO we should broadly follow the (imperfect) example set forth by neuroscientists attempting to explain fMRI scans and assign functionality to various subregions of the brain. It is circular and "unsafe" from an alignment perspective to use a complex model to understand the internals of a simpler model; in order to understand GPT-4, do we then need GPT-5? These approaches are interesting, but we should primarily focus on building our understanding of these models from building blocks that we already understand.


I've been working in systems neuroscience for a few years (something of a combination lab tech/student, so full disclosure, not an actual expert).

Based on my experience with model organisms (flies & rats, primarily), it is actually pretty amazing how analogous the techniques and goals used in this sort of research are to those we use in systems neuroscience. At a very basic level, the primary task of correlating neuron activation to a given behavior is exactly the same. However, ML researchers benefit from data being trivial to generate and entire brains being analyzable in one shot as a result, whereas in animal research elucidating the role of neurons in a single circuit costs millions of dollars and many researcher-years.

The similarities between the two are so clear that I noticed that in its Microscope tool [1], OpenAI even refers to the models they are studying as "model organisms", an anthropomorphization which I find very apt. Another article I saw a while back on HN which I thought was very cool was [2], which describes the task of identifying the role of a neuron responsible for a particular token of output. This one is especially analogous because it operates on such a small scale, much closer to what systems neuroscientists studying model organisms do.

[1] https://openai.com/research/microscope [2] https://clementneo.com/posts/2023/02/11/we-found-an-neuron


Yup, this article on predictive coding for example is particularly interesting.

Lots of parallels to how our brains are thought to work.

https://en.m.wikipedia.org/wiki/Predictive_coding


I don't follow. Neuroscience imaging tools like fMRI are only used because it is impossible to measure the activations of each neuron in a brain in real time (unlike an artificial neural network). This research paper's attempt to understand the role of individual neurons or neuron clusters within a complete network gets much closer to "first principles" than fMRI.


Right so it should be much easier w/ access to every neuron and activation. But the general approach is an experimental one where you try to use your existing knowledge about physics and biology to discern what is activating different structures (and neurons) in the brain. I agree w/ the approach of trying to assign some functionality to individual 'neurons', but I don't think that using GPT4 to do so is the most appealing way to go about that, considering GPT4 is the structure we are interested in decoding in the first place.


All of this seems to lead to something like this paper https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...

On the other hand, I find it plausible that it's fundamentally impossible to assign some functionality to individual 'neurons' due to the following argument:

1. Let's assume that for a system calculating a specific function, there is a NN configuration (weights) so that at some fully connected NN layer there is a well-defined functionality for specific individual neurons - #1 represents A, #2 represents B, #3 represents C etc.

2. The exact same system outcome can be represented with infinitely many other weight combinations which effectively result in a linear transformation (i.e. every possible linear transformation) of the data vector at this layer, e.g. where #1 represents 0.1A + 0.3B + 0.6C, #2 represents 0.5B+0.5C, and #3 represents 0.4B+0.6C - in which case the functionality A (or B, or C) is not represented by any individual neurons;

3. When the system is trained, it's simply not likely that we just happen to get the best-case configuration where the theoretically separable functionality is actually separated among individual 'neurons'.

Biological minds do get this separation because each connection has a metabolic cost; but the way we train our models (both older perceptron-like layers and modern transformer/attention ones) does allow linking everything to everything, so the natural outcome is that functionality simply does not get cleanly split out into individual 'neurons', and each 'neuron' tends to represent some mix of multiple functionalities.
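A toy version of point 2 (my own sketch, not from the paper): push any invertible mixing between two consecutive linear layers and the network computes exactly the same function, yet no hidden unit corresponds to a single feature any more:

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(3, 4))          # hidden layer of 3 "neurons"
    W2 = rng.normal(size=(2, 3))
    M = rng.normal(size=(3, 3))           # an arbitrary invertible mixing

    W1_mixed = M @ W1                     # each new hidden unit blends A, B, C
    W2_mixed = W2 @ np.linalg.inv(M)      # the next layer undoes the blend

    x = rng.normal(size=4)
    print(W2 @ (W1 @ x))                  # original network
    print(W2_mixed @ (W1_mixed @ x))      # same output, unidentifiable neurons

(With a per-neuron nonlinearity between the layers this exact equivalence breaks down, which is the "privileged basis" point that comes up further down the thread.)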


Your last idea, that these models' neurons are all connected in some way, makes me somewhat sceptical of this research by OpenAI, and suggests that their technique of analysis may need to be more fractal or expansive, covering groups of neurons and moving all the way up to the entire model.


> in order to understand GPT4 then we need GPT5?

I also found this amusing. But you are loosely correct, AFAIK. GPT-4 cannot reliably explain itself in any context: say the total number of possible distinct states of GPT-4 is N; then the total number of possible distinct states of GPT-4 PLUS any context in which GPT-4 is active must be at least N + 1. So, by pigeonhole, there are at least two distinct states in this scenario that GPT-4 can encounter that will necessarily appear indistinguishable to GPT-4. It doesn't matter how big the network is; it'll still encounter this limit.

And it's actually much worse than that limit because a network that's actually useful for anything has to be trained on things besides predicting itself. Notably, this is GPT-4 trying to predict GPT-2 and struggling:

> We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron’s top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 didn't understand. We hope as explanations improve we may be able to rapidly uncover interesting qualitative understanding of model computations.

1,000 neurons out of 307,200--and even for the highest-scoring neurons, these are still partial explanations.


There's little reason to think that predicting GPT-4 would be more difficult, only that it would be far more computationally expensive (given the higher number of neurons and much higher computational cost of every test).


Seems like OpenAI is grasping at straws trying to make GPT "go meta".

Reminds me of this Sam Altman quote from 2019:

"We have made a soft promise to investors that once we build this sort-of generally intelligent system, basically we will ask it to figure out a way to generate an investment return."

https://youtu.be/TzcJlKg2Rc0?t=1886


I have a similar feeling, they’ve potentially built the most amazing but commercially useless thing in history.

I don't mean it's entirely useless, but it's not deterministic enough to be trustworthy, it's dangerous, and it's really hard to scale, so it's more of an academic project than something that will make Altman as famous as Sergey Brin.

I personally take people like Hinton seriously too and think people playing with these things need more oversight themselves.


Time will tell. Anecdotally, I know several professionals who find ChatGPT 3.5 & 4 to be valuable and are willing to pay for access. I certainly save more than $20 per month for my work by using ChatGPT to accelerate my day-to-day activities.


That’s why I was careful to say it’s not useless.


I wonder, will someone please check the neurons associated with the petertodd and other anomalous glitch tokens (https://www.lesswrong.com/posts/jkY6QdCfAXHJk3kea/the-petert...)? I can see the GitHub repo, and I see that for any given neuron you can see the associated tokens, but I don't see how to do an inverse search.


Those were discovered by finding strings that OpenAI’s tokenizer didn’t properly split up. Because of this, they are treated as singular tokens, and since these don’t occur frequently in the training data, you get what are effectively random outputs when using them.

The author definitely tries to up the mysticism knob to 11 though, and the post itself is so long, you can hardly finish it before seeing this obvious critique made in the comments.


The ironic thing about lesswrong is that it’s quite the opposite in some fantastically oblivious ways.


Yeah, it’s quite strange indeed. Clearly people with decent educations but zero background in applied research/peer review. More concerned with the sound of their own voice than with whether or not their findings are actually useful (or even true).

Perhaps they are all on stimulants!


Thank you for your opinion on the post that I linked! I'm still curious about the associated neurons though.


Fair enough. You would need to use an open model or work at OpenAI. I assume this work could be applied to the llama models, although I'm not aware of anyone having found these glitchy phrases for those models yet.


> You would need to use an open model or work at OpenAI.

The point of this post that we are commenting under is that they made this association public, at least in the neuron->token direction. I was thinking some hacker (like on hacker news) might be able to make something that can reverse it to the token->neuron direction using the public data so we could see the petertodd associated neurons. https://openaipublic.blob.core.windows.net/neuron-explainer/...


I think this method will do a poor job at explaining petertodd. These neuron explanations still have to fit within human language for this method to work, and the best you can do in the confines of human language to describe petertodd is to write a long article (just like that link) explaining the many oddities of it.

Would be interesting to try, though. I think it's likely that, due to the way glitch tokens happen, petertodd is probably an input neuron that is very randomly connected to a bunch of different hidden neurons. So it introduces some bizarre noise into a bunch of areas of the network. It's possible that some of these neurons are explainable on their own, but not within the broader context of petertodd.


ok but i still want to see it


For people overwhelmed by all the AI science-speak: just spend a few minutes with Bing or Phind and it will explain everything surprisingly well.

Imagine telling someone in the middle of 2020 that in three years a computer would be able to speak, reason and explain everything as if it were a human. Absolutely incredible!



I agree it’s crazy good. But timeline-wise, GPT-3 was in beta and used by many companies in 2020.


Commenters here seem a little fixated on the fact that the technique scores poorly. This is true and somewhat problematic but exploring the data it looks to me like this could be more a problem with a combination of the scoring function and the limits of language rather than the methodology.

For example, look at https://openaipublic.blob.core.windows.net/neuron-explainer/...

It's described as "expressions of completion or success" with a score of 0.38. But going through the examples, they are very consistently a sort of colloquial expression of "completion/success" with a touch of surprise and maybe challenge.

Examples are like: "Nuff said", "voila!", "Mission accomplished", "Game on!", "End of story", "enough said", "nailed it" etc.

If they expressed it as a basket of words instead of a sentence, and could come up with words that express it better I'd score it much higher.
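Something like a basket-of-phrases explanation could even be scored mechanically; a hypothetical sketch (the trigger list and scoring rule here are mine, not the paper's):

    TRIGGERS = {"nuff said", "voila", "mission accomplished", "game on",
                "end of story", "enough said", "nailed it"}

    def basket_score(top_activating_snippets):
        """Fraction of the neuron's top-activating snippets that contain
        at least one phrase from the basket."""
        hits = sum(any(t in s.lower() for t in TRIGGERS)
                   for s in top_activating_snippets)
        return hits / len(top_activating_snippets)

    print(basket_score(["Voila! It works.", "Mission accomplished.", "The cat sat."]))
    # 0.67 -- two of the three snippets match the basket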


> "This work is part of the third pillar of our approach to alignment research: we want to automate the alignment research work itself."

I feel like this isn't a Yud-approved approach to AI alignment.


Honestly, I think any foundational work on the topic is inherently Yud-favored, compared to the blithe optimism and surface-level analysis at best that is usually applied to the topic.

I.e., I think it's not that this shouldn't be done. This should certainly be done. It's just that so many more things than this should be done before we move forward.


DISCLAIMER: I think Yudkowsky is a serious thinker and his ideas should be taken seriously, regardless of whether or not they are correct.

Your comment triggered a random thought: A perfect name for Yudkowsky et al and the AGI doomers is... wait for it... the Yuddites :)


Already used on 4chan :)


These were my thoughts exactly. On one hand, this can enable alignment research to catch up faster. On the other hand, if we are worried about homicidal AI, then putting it in charge of policing itself (and training it to find exploits in a way) is probably not ideal.


"Yud-approved?"



He's the one in the fedora who is losing patience that otherwise smart sounding people are seriously considering letting AI police itself https://www.youtube.com/watch?v=41SUp-TRVlg


That's not a fedora, that's his King of the Redditors crown



You mean Yudkowsky? I saw him on Lex Fridman and he was entirely unconvincing. Why is everyone deferring to a bunch of effective altruism advocates when it comes to AI safety?


I heard him on Lex too, and it seemed to be just a given that AI is going to be deceptive and want to kill us all. I don't think a single example was given of how that could be accomplished. I'm open to hearing thoughts on this; maybe I'm not creative enough to see the 'obvious' ways this could happen.


This is also why I go into chess matches against 1400 elo players. I cannot conceive of the specific ways in which they will beat me (a 600 elo player), so I have good reason to suspect that I can win.

I'm willing to bet the future of our species on my consistent victory in these types of matches, in fact.


Again, it's taken as a given that AI is adversarial. Edit: In addition, as an 1100 elo chess player, I can very easily tell you how a 1600 player is going to beat me. The analogy doesn't hold. I'm asking, in good faith, how AI could destroy humanity. Given the confidence that people who are scared of AI have in this, it seems they must have some concrete examples in mind.


No it’s a given that some people who attempt to wield AI will be adversarial.

In any case a similar argument can be made with merely instrumental goals causing harm: “I am an ant and I do not see how or why a human would cause me harm, therefore I am not in danger.”


People wielding AI and destroying humanity is very different from AI itself, being a weird alien intelligence, destroying humanity.

Honestly if you have no examples you can't really blame people for not being scared. I have no reason to think this ant-human relationship is analogous.

And seriously, I've made no claims that AI is benign so please stop characterizing my claims thusly. The question is simple, give me a single hypothetical example of how an AI will destroy humanity?


Sure, here’s a trivial example: It radicalizes or otherwise deceives an employee at a virus research lab into producing and releasing a horrific virus.

The guy at Google already demonstrated that AIs are able to convince people of fairly radical beliefs (and we have proof that even humans a thousand years ago were capable of creating belief systems that cause people to blow themselves up and kill thousands of innocent people).

P.S. I was not characterizing your opinion, I was speaking in the voice of an ant.


If AI is general, it will also have the capability of wielding itself, and probably better than humans.


Other caveman use fire to cook food. Fire scary and hurt. No understand fire. Fire cavemen bad.


Other caveman use nuke to wipe out city. Nuke scary and hurt. No understand nuke. Nuke caveman bad.

Other caveman use anthrax in subway station. Anthrax scary and hurt…

Is AI closer to fire or closer to nukes and engineered viruses? Has fire ever invented a new weapon system?

By the way: we have shitloads of regulations and safety systems around fire due to, you guessed it, the amount of harm it can do by accident.


IMHO the argument isn't that AI is definitely going to be deceptive and want to kill us all. Rather, it's that if you're 90% sure AI is going to be just fine, the remaining 10% of existential risk is simply not acceptable. That level of certainty isn't enough, so you should act as if AI may be deceptive and may kill us all, and take very serious preventive measures even if you're quite certain they won't be needed - because "quite certain" isn't enough; you want to be at the "this is definitely established to not lead to Skynet" level.


And even with all that, probably it's best to still exercise an abundance of caution, because you might have made a mistake somewhere.


Because they have arguments that AI optimists are unable to convincingly address.

Take this blog post for example, which between the lines reads: we don't expect to be able to align these systems ourselves, so instead we're hoping these systems are able to align each other.

Consider me not-very-soothed.

FWIW, there are plenty of AI experts who have been raising alarms as well. Hinton and Christiano, for example.


People won't care until an actually scary AI exists. It will be easy to stop at that point. Or you can just stop research here and hope another country doesn't get one first. I'm personally skeptical it will exist. Honestly, the scaremongering coming from uncharismatic AI alignment people might be making things worse.


Are you kidding? "easy to stop"? When LLMs are integrated into law enforcement, banking, research, education, logistics... All these areas have people building backend systems leveraging the current llm tech and pipelines to plug in the coming tech. If we reach a tipping point where these things become aware/act truly autonomously, what are the chances they do it before we notice? People are renowned for implementing things before understanding the consequences.

And what does charisma of AI alignment folks have to do with anything?


Why would it be easy to stop at that point? The believable value prop will increase in lockstep with the believable scare factor, not to mention the (already significant) proliferation out of ultra expensive research orgs into open source repos.

Nuclear weapons proliferated explicitly because they proved their scariness.


If AI can exist, humans have to figure it out. It's what we do. It's really shockingly delusional to think people are gonna use ChatGPT for a few minutes, get bored, and then ban it like it's a nuke. I'd rather the USA get it first anyway.


>If AI can exist humans have to figure it out. It’s what we do.

We have figured out stuff in the past, but we also came shockingly close to nuclear armageddon more than once.

I'm not sure I want to roll the dice again.


Where did I say we could or should ban it like a nuke?

Anyway this is a good example of the completely blind-faith reasoning that backs AI optimism: we’ll figure it out “because it’s what we do.”

FWIW we have still not figured out how to dramatically reduce nuclear risk. We’re here just living with it every single day still, and with AI we’re likely stepping onto another tightrope that we and all future generations have to walk flawlessly.


> Why is everyone deferring to a bunch of effective altruism advocates when it comes to AI safety?

I'm not sure Yudkowsky is an EA, but the EAs want him in their polycule.


He posts on the forum. I'm not sure what more evidence is needed that he's part of it.

https://forum.effectivealtruism.org/users/eliezeryudkowsky


I guess it's true then: not just a rationalist but also an effective altruist!


Agreed; Yud does seem to have been right about the course things will take, but I'm not confident he actually has any solutions to offer for the problem.


His solution is a global regulatory regime to ban new large training runs. The tools required to accomplish this are, IMO, out of the question but I will give Yud credit for being honest about them while others who share his viewpoint try to hide the ball.


Good point, but I think if he was serious he would be looking to Serena Butler for inspiration not nuclear non-proliferation treaties


Which things has he been right about and when, if you recall?


At least to me, this current generation of AI progress is looking a lot more "foomy" than a lot of people besides him predicted


Why does this matter?


I think this is a generous usage of "can." As the article admits, these explanations are 'imperfect' and I think that is definitely true.


It depends how you parse it. It is clearly true that they 'can' explain neurons, in the sense that at least some of the neurons are quite well explained. On the other hand, it's also the case that the vast majority of neurons are not well explained at all by this method (or likely any method).

(Struck out in an edit:) It's only because of a quirk of AdamW that this is possible at all; if GPT-2 was trained with SGD almost no neurons would be interpretable.

EDIT: This last part isn't true. I think they are only looking at the intermediate layer of the FFN which does have a privileged basis.


> EDIT: This last part isn't true. I think they are only looking at the intermediate layer of the FFN which does have a privileged basis.

it does?


Yeah, that's where they apply the activation function and that happens per neuron so you can't rotate everything and expect the same result.
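Concretely (a quick sketch of the privileged-basis point): an elementwise nonlinearity does not commute with a rotation of the hidden space, so the individual coordinates (the neurons) carry meaning there in a way they don't for purely linear layers:

    import numpy as np

    rng = np.random.default_rng(1)
    h = rng.normal(size=4)                        # pre-activations of 4 hidden neurons
    Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # a random rotation of the hidden space
    relu = lambda v: np.maximum(v, 0)

    print(relu(Q @ h))   # rotate, then apply the per-neuron nonlinearity
    print(Q @ relu(h))   # nonlinearity, then rotate: generally different,
                         # so the neuron axes themselves matter here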


I'm split on which is more impressive:

- The software powering the research paper

- The research itself (holy moly! They're showing the neurons!)


This blog post is not very informative. How did they prompt GPT4 to explain the neuron's behavior?


It's explained more in the "read paper" link, where they provide the actual prompts:

https://openaipublic.blob.core.windows.net/neuron-explainer/...


Language models can also tell you they’re not AI.


With dubious confidence too!


IMO this is akin to how scientists developed thermodynamics to comprehend steam engines. Except now, AI could be at the forefront of forming conjectures and explanatory theories!


We don’t really know how harmful this stuff will be, but we’ll keep on making it more powerful (and keep on bragging about it) until we find out.


Even if we can explain the function of a single neuron, what do we gain? If the goal is to reason about safety of computer vision in automated driving, as an example, we would need to understand the system as a whole. The whole point of neural networks is to solve nuanced problems we can't clearly define. The fuzziness of the problems those systems solve is fundamentally at odds with the intent to reason about them.


> reason about safety of computer vision in automated driving

An interesting analogue. I think we simply aren't going to reason about the internals of neural networks to analyze safety for driving, we're just going to measure safety empirically. This will make many people very upset but it's the best we can do, and probably good enough.


I have to agree.

I often think, “maybe I should use ChatGPT for this”, then I realise I have very little way to verify what it tells me, and as someone working in engineering, if I don't understand the black box, I just can't do it.

I'm attracted to open source because I can look at the code and understand it.


Open source doesn't mean you can explain the black box any better, and humans are black boxes that don't understand their own mental processes either. We're currently better than LLMs at it, I suppose, but we're still very poor at it.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196841/ https://pure.uva.nl/ws/files/25987577/Split_Brain.pdf https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4204522

We can't recreate previous mental states, we just do a pretty good job (usually) of rationalizing decisions after the fact.


Humans can be held accountable so it’s not the same. Even if we’re a black box, we share common traits with other humans. We’re trained in similar ways. So we mostly understand what we will and won’t do.

Personally, I think this constant degradation of humans is really foolish and harmful. “We're just black boxes, etc.” - we might not know how brains work, but we do and can understand each other.

On the other hand, I'm starting to feel like “AI researchers” are the greatest black box I've ever seen; the logic of what they're trying to create and their hopes for it really baffle me.

By the way, I have infinitely more hope of understanding an open source black box compared to a closed source one?


>Humans can be held accountable so it’s not the same.

1. Don't worry, LLMs will be held accountable eventually. There's only so much embodiment and unsupervised tool control we can grant machines before personhood is in the best interests of everybody. May be forced like all the times in the past but it'll happen.

2. not every use case cares about accountability

3. accountability can be shifted. we have experience.

>I think this constant degradation of humans is really foolish and harmful personally.

Maybe you think so but there's nothing degrading about it. We are black boxes that poorly understand how said box actually works even if we like to believe otherwise. Don't know what's degrading about stating truth that's been backed by multiple studies.

Degrading is calling an achievement we hold people in high regard who accomplish stupid because a machine can do it.

>By the way, I have infinitely more hope of understanding an open source black box compared to a closed source one?

Sure i guess so.


> Degrading is calling an achievement we hold people in high regard who accomplish stupid because a machine can do it.

Not sure you worded this as intended?

Anyway if I read you correctly, this assumes you believe the idea of self and ego have anything to do with it.

Humans should treat ants, lab rats and each other with equal respect.

I don't believe we should avoid self-degradation because we think we're smart or special, but for completely opposite reasons. We are mostly lucky to have what we have because something bigger than us, call it God, nature, whatever, has provided that existence. When we degrade one another, we degrade that magic. This is where we fuck up time and time again. I'm talking about the water you drink, the food you eat and the air you breathe, the inspiration for neural networks, etc. We take that for granted.

I liken it to the idea that humans killed God, the idea of God and morals etc, so we could do terrible things to the world and living things. We just got rid of the idea someone is looking over our shoulder because it made wars and genocides easier to do. Killing God got rid of a whole lot of moral baggage.


If there is a limited number of deception neurons, we can automatically look for coactivation of the "my future behavior" neuron and the deception neuron.


My take: Regurgitation of trained-on information about LLMs does not come anywhere close to "conscious brain knows it's conscious."


Has anyone here found a link to the actual paper? If I click on 'paper', I only see what seems to be an awkward HTML version.


You mean this? https://openaipublic.blob.core.windows.net/neuron-explainer/...

Would you prefer a PDF?

(I'm always fascinated to hear from people who would rather read a PDF than a web-native paper like this one, especially given that web papers are actually readable on mobile devices. Do you do all of your reading on a laptop?)


With a pdf I don't have to update my PDF reader multiple times per month just to be able to read text.

A PDF is a text document that includes all the text, images, etc. within it, in the state you are going to perceive them. That web page is barely even a document. None of its contents are natively within it; it all requires executing remote code, which pulls down more remote code to run, just to get the actual text and images to display... which they don't in my browser. I just see an index with links that don't work, plus the "Contributions" section, which for some reason was actually included as text.

Even as the web goes up its own asshole in terms of recursive serial loading of javascript/json/whatever from unrelated domains and abandons all backwards compatibility, PDF, as a document, remains readable. I wish the web were still hyperlinked documents. The "application" web sucks for accessibility.


The equations look terrible on Firefox for Android, as they are really small - a two-line fraction is barely taller than a single line, forcing me to constantly zoom in and out.

So yes, I would prefer a PDF and have a guarantee that it will look the same no matter where I read it.


> Would you prefer a PDF?

Yes, I was just reading the paper and some of the javascript glitched and deleted all the contents of the document except the last section, making me lose all context and focus. Doesn't really happen with PDF files.


> always fascinated

That feels like a loaded phrase. Is it "false confusion" adjacent?


My whole workflow of organizing and reading papers is centered on PDFs. While I like having interactive supplemental materials, I want to be able to print, save and annotate the papers I read.


If you want to draw on it, PDF is usually the best


Nope, reading the printed paper on... paper. :)


I personally prefer reading PDF on an iPad so I can mark it up.


How long until we copyright the Marvel Neuron?


I’m most surprised by the approach they take of passing GPT tuples of (token, importance) and having the model reliably figure out the patterns.

Nothing would suggest this should work in practice, yet it just… does. In more or less zero shot. With a completely different underlying model. That’s fascinating.
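If I understand the setup, the serialization is something like the following (a guess at the shape of it; the exact prompt templates are in the linked paper, and the token list, numbers, and function name here are made up):

    def to_prompt(tokens, activations, max_act=10):
        """Format (token, normalized activation) pairs for the explainer model."""
        peak = max(activations) or 1.0
        lines = [f"{tok}\t{round(a / peak * max_act)}"
                 for tok, a in zip(tokens, activations)]
        return "\n".join(lines)

    print(to_prompt(["Nuff", " said", ".", " Voila", "!"],
                    [0.1, 8.7, 0.0, 9.3, 7.9]))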


They’re not looking at activations?


No just at normalized log probabilities I think


At some point did we change the name from perceptron to neuron? Neural networks don't have neurons.



Now grab the list of labels/explanations for each neuron, and train a small LLM only with data for that neuron.

Then you get a dictionary/index of LLMs

Could this be used to parallelize training?

Or create lighter overall language models?

The above would be like doing a “map”, how would we do a “reduce”?


I’m so interested in this. Any ideas how I can get involved with only a 2014 laptop?



Wow— this page is superb.



By analyzing the function of individual neurons, these guys were/are able to gain a deeper understanding of how language models process language, which could lead to improved model architecture and training methods.


OpenAI need to hear an explanation of the word "explain".


This is really interesting. Could this lead to eventually being able to deconstruct these "black-boxes" to remove proprietary data or enforce legal issues?


Reminds me of the book Blink, by Malcolm Gladwell -- we feel like we have to come up with a reason for why we came up with an answer (paraphrasing).


I wish there were insightful explanations of why AI cannot think, whether there are researchers trying to explore this topic, and, if so, what they are doing.


Any discussion on this topic would be nothing but arguments over the definition of the word 'think'.


Can anyone explain what they did? I'm not understanding it from the webpage or the paper. What role does GPT-4 play?

I see that they had GPT-4 label every neuron, but how?


I got the impression that what's going on in GPT is so complex that we should use GPT itself to explain/summarize/graph what is going on.

We should ask AI, how are you doing this?


Operator: Skynet, are you doing good thing? Skynet: Yes.


Automating safeguards and interpretability decisions seems circular and likely to become detached from policy.


It's interesting that layer 0 is a bunch of different things, like upper-case letters, symbols, and types of verbs.

It would be great to see all the things they've found for the different layers.


Can't wait for West World style sex robots!


Matrices, all the way down


LLMs are a strange loop.


Can't see how using AI to explain AI could ever go wrong


Skynet anybody?


Great. Going meta with an introspective feedback loop.

Let's see if that's the last requisite for exponential AGI growth...

Singoolaretee here we go..............


There is no introspection here.


    ...our approach to alignment research: we want to automate the alignment research work itself. A promising aspect of this approach is that it scales with the pace of AI development. As future models become increasingly intelligent and helpful as assistants, we will find better explanations.
The distance between "better explanations" and using them as input to prompts that would automate self-improvement is very small, yes?



