I'm quite worried by this post. After reading the first paragraph it is obvious that this is AI generated, or at least heavily edited by AI. Read some of the other comments on this post for examples of how this is obvious.
The author is a professor of Computer Science at Yale, as well as a former researcher at Microsoft and IBM. You would think this person has the writing skills to produce this article themselves. There is no excuse to use AI for writing this. It makes the piece much harder to read, and it always leaves me wondering whether I don't understand the point the author is trying to make or whether there wasn't one to begin with.
Overall I'm just annoyed by how often I see people use genAI to write stuff for them. Do they think people won't realize it is generated by AI? Do they just not care? Or has it become socially acceptable to write emails, articles, and memos with AI? Just give me the f***g prompt. Then at least I don't have to deal with reading the AI slop.
Not to mention how ironic it is to use AI to write an article about AI in education.
PS: I'm not trying to attack the author. This is becoming a widespread issue and I don't want to single him out.
You're seeing behind the curtain. It turns out that professors at Yale are morons, just like everyone else. As are the people that work at Microsoft and IBM. Almost everyone in the world is.
We have a lot of good marketing to try and persuade people that everyone isn't a moron, but ultimately you can't change the fact that they are. That's why you'll notice that 99% of work in every organisation is done by a handful of people.
People don't care; in fact the average person finds LLM writing to be some of the best, since it's so good at effortlessly conveying information (of what value..?). That ease of course comes with the loss of the real intent of the author. No longer their words, bearing no mark of effort and discernment, it becomes assimilated into the corpus of all that unthinking, unfeeling machination that lingers breathlessly with no pulse. No life exists in those empty words, no one who will take the stand to defend them, to hold them as true by the fire they alight in oneself. No, the LLM provides a cold flame, only pretty to look at, providing only the image of warmth. But perhaps the LLM reflects the soul of the modern person, who, in becoming more machine-like, comes to find their home amongst the shadows of that dim blue light.
I do not understand why so many people are so supremely confident that they can accurately identify AI-assisted writing. The "tells" they cite generally strike me as unremarkable features of normal human writing.
it definitely can, but the question is for how long and to what extent; historically, players with power and money will always want more and so things tilt in that direction....
In my experience LLMs have a hard time working with text grids like this. They seem to find columns harder to "detect" than rows, probably because the input presents the grid to them as one giant row, if that makes sense.
It has the same problem with playing chess.
But I'm not sure there is a data format they could work with for this kind of game. Currently it seems like LLMs can't really handle spatial problems. This should actually be something that can be fixed, though (pretty sure I saw an article about it on HN recently).
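To make the row/column asymmetry concrete, here is a toy sketch (my own illustration, nothing from any particular model): once a grid is flattened into the single stream the model actually reads, a row is a contiguous span, but a column has to be reassembled from positions scattered across the whole stream.

```python
# A 3x3 grid, serialized the way it enters an LLM's context: one flat string.
grid = ["ABC",
        "DEF",
        "GHI"]
flat = "\n".join(grid)

# A row is a contiguous slice of the stream...
row_1 = flat.split("\n")[1]                              # "DEF"

# ...but a column must be gathered by striding across the whole string.
col_1 = "".join(line[1] for line in flat.split("\n"))    # "BEH", collected from 3 spots

print(row_1, col_1)
```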
Good point. The architectural solution that comes to mind is 2D text position embeddings, i.e. adding sine/cosine pairs for two coordinates to each token embedding instead of one. Apparently people have done it before: https://arxiv.org/abs/2409.19700v2
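A minimal sketch of what that could look like (my own toy construction, not necessarily the exact scheme in the linked paper): half of the embedding dimensions carry a sinusoidal code for the x coordinate and half for y, and the result is added to the token embedding just like an ordinary 1D position embedding.

```python
import numpy as np

def sincos_1d(pos, dim):
    """Standard 1D sinusoidal encoding for a single coordinate."""
    i = np.arange(dim // 2)
    freqs = 1.0 / (10000 ** (2 * i / dim))
    angles = pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def sincos_2d(x, y, dim):
    """2D variant: first half of the vector encodes x, second half encodes y."""
    half = dim // 2
    return np.concatenate([sincos_1d(x, half), sincos_1d(y, half)])

# Position embedding for the token at column 3, row 7 of a text grid,
# to be added to the token embedding like a normal position embedding.
pe = sincos_2d(x=3, y=7, dim=64)
print(pe.shape)  # (64,)
```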
I think I remember one of the original ViT papers saying something about 2D embeddings on image patches not actually increasing performance on image recognition or segmentation, so it’s kind of interesting that it helps with text!
> We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4).
Although it looks like that was just ImageNet so maybe this isn't that surprising.
They seem to have used a fixed input resolution for each model, so the learnable 1D position embeddings are equivalent to learnable 2D position embeddings where every grid position gets its own embedding. It's when different images may have a different number of tokens per row that the correspondence between 1D index and 2D position gets broken and a 2D-aware position embedding can be expected to produce different results.
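A tiny sketch of that equivalence (toy numbers, roughly ViT-Base-shaped, chosen by me for illustration): with a fixed HxW patch grid, a learnable 1D table indexed by `row * W + col` gives every grid position its own vector, which is exactly what a learnable 2D table would do; the two only diverge once the number of tokens per row can vary between inputs.

```python
import numpy as np

H, W, D = 14, 14, 768                     # fixed patch grid and embedding dim
table_1d = np.random.randn(H * W, D)      # "1D" learnable position embeddings
table_2d = table_1d.reshape(H, W, D)      # the same parameters viewed as a 2D table

row, col = 5, 9
idx = row * W + col                       # flattened index used by the 1D scheme
assert np.array_equal(table_1d[idx], table_2d[row, col])
```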
Transformers can easily be trained or designed to handle grids; it's just that off-the-shelf LLMs haven't been, particularly (although they will have seen some grid-shaped data).
Vision transformers effectively encode a grid of pixel patches. It's ultimately a matter of ensuring the position encoding incorporates both the X and Y positions.
For LLMs we only have one axis of position and - more importantly - the vast majority of training data is only oriented that way.
Yes, that's why ChatGPT can look at an image and change the style, or edit things in the image. The image itself is converted to tokens and passed to the LLM.
LLMs can be used as an agent to do all sorts of clever things, but it doesn’t mean the LLM is actually handling the original data format.
I’ve created MCP servers that can scrape websites but that doesn’t mean the LLM itself can make HTTP calls.
The reason I make this distinction is that someone claimed LLMs can read images. But they don't. They act as an agent for another model that reads images and creates metadata from it; the LLM then turns that metadata into natural language.
The LLM itself doesn’t see any pixels. It sees textual information that another model has provided.
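To sketch the distinction with the HTTP example above (hypothetical host code, no particular framework; `call_llm` is a made-up stand-in, stubbed here so the sketch runs): the model only ever emits and consumes text, while the host process performs the actual request.

```python
import json
import urllib.request

def call_llm(messages):
    """Stand-in for any chat-completion API (stubbed with a canned reply)."""
    if messages[-1]["role"] == "user":
        # The model can only *describe* the call it wants, as text.
        return json.dumps({"tool": "fetch_url", "args": {"url": "https://example.com"}})
    return "Summary based on the fetched text: ..."

def fetch_url(url: str) -> str:
    # The host process, not the model, opens the socket and makes the HTTP call.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")[:4000]

messages = [{"role": "user", "content": "Summarize https://example.com"}]
request = json.loads(call_llm(messages))        # model emits a tool request as text
page_text = fetch_url(**request["args"])        # host executes the actual HTTP call
messages.append({"role": "tool", "content": page_text})
print(call_llm(messages))                       # model only ever sees the returned text
```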
Edit: reading more about this online, it seems LLMs can work with pixel level data. I had no idea that was possible.
No problem. Again, if it happened the way you described (which it did, until GPT-4o recently), the LLM wouldn't have been able to edit images. You can't get a textual description of an image and reconstruct it perfectly just from that, with one part edited.
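As a rough illustration of "the image is converted to tokens" (a generic ViT-style patch embedding sketch; the actual pipeline inside GPT-4o isn't public, so treat the specifics as assumptions):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x 3 image into flattened non-overlapping patches."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, C)
            .transpose(0, 2, 1, 3, 4)
            .reshape(rows * cols, patch * patch * C))

image = np.random.rand(224, 224, 3)           # stand-in for a real image
patches = patchify(image)                     # (196, 768): 196 "image tokens"
proj = np.random.randn(patches.shape[1], 1024)
image_tokens = patches @ proj                 # linear projections of raw pixel patches
print(image_tokens.shape)                     # (196, 1024), fed to the transformer
                                              # interleaved with text token embeddings
```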
A few days ago I tested Claude Code by completely vibe coding a simple stock tracker web app in Python with Streamlit. It worked incredibly well, until it didn't. There seems to be a critical project size beyond which it just can't fix bugs anymore.
Just tried this with Gemini CLI, and the critical project size it works well for seems to be quite a bit bigger. Where Claude Code started to get lost, I simply told Gemini CLI to "Analyze the codebase and fix all bugs". And after telling it to fix a few more bugs, the application simply works.
I wonder how much of this had to do with the context window size? Gemini's window is 5x larger than Claude's.
I’ve been using Claude for a side project for the past few weeks and I find that we really get into a groove planning or debugging something and then by the time we are ready to implement, we’ve run out of context window space. Despite my best efforts to write good /compact instructions, when it’s ready to roll again some of the nuance is lost and the implementation suffers.
I’m looking forward to testing if that’s solved by the larger Gemini context window.
I definitely think the bigger context window helps. The code quality quite visibly drops across all models I've used as the context fills up, well before the hard limit. The editor tooling also makes a difference—Claude Code pollutes its own context window with miscellaneous file accesses and tool calls as it tries to figure out what to do. Even if it's more manual effort to manage the files that are in-context with Aider, I find the results to be much more consistent when I'm able to micromanage the context.
Approaching the context window limit in Claude Code, having it start to make more and worse mistakes, then seeing it try to compact the context and keep going, is a major "if you find yourself in a hole, stop digging" situation.
I've found that I can quickly get a new AI session up to speed by adding critical context that it's missing. In my largest codebase it's usually a couple of critical functions. Once they have the key context, they can do the rest. This of course doesn't work when you can't view their thinking process and interrupt it to supply the context they are missing. Opacity doesn't work unless the agent does the right thing every time.
I thought I read that best practice was to start a new session every time you work on a new feature / task. That’s what I’ve been doing. I also often ask Claude to update my readme and claude.md with details about architecture or how something works.
As for /compact, if I'm nearing the end of my context window (around 15%) and am still in the middle of something, I'll give /compact very specific details about how and what to compact. Let's say we are debugging an error - I might write something along the lines of "This session is about to close and we will continue debugging in the next session. We will be debugging this error message [error message…]. Outline everything we've tried that didn't work, make suggestions about what to try next, and outline any architecture or files that will be critical for this work. Everything else from earlier on in this session can be ignored." I've had decent success with that. More so on debugging than trying to hand off all the details of a feature that's being implemented.
Reminder: you need context space for compact, so leave a little head room.
The best approach is never to get remotely close to the point where it auto-compacts. Type /clear often, and set up docs, macros, etc. to make it easy to build the context you need for new tasks quickly.
If you see that 20% remaining warning, something has gone badly wrong and results will probably not get better until you clear the context and start again.
Current best practice for Claude Code is to have heavy lifting done by Gemini Pro 2.5 or o3/o3pro. There are ways to do this pretty seamlessly now because of MCP support (see Repo Prompt as an example.) Sometimes you can also just use Claude but it requires iterations of planning, integration while logging everything, then repeat.
I haven't looked at this Gemini CLI thing yet, but if it's open source it seems like any model can be plugged in here?
I can see a pathway where LLMs are commodities. Every big tech company right now both wants their LLM to be the winner and the others to die, but they also really, really would prefer a commodity world to one where a competitor is the winner.
If the future use looks more like CLI agents, I'm not sure how some fancy UI wrapper is going to result in a winner take all. OpenAI is winning right now with user count by pure brand name with ChatGPT, but ChatGPT clearly is an inferior UI for real work.
I think there are different niches. AI works extremely well for Web prototyping because a lot of that work is superficial. Back in the 90s we had Delphi, where you could make GUI applications with a few clicks as opposed to writing tons of things by hand. The only reason we don't have that for the Web is its decentralized nature: every framework vendor has their own vision and their own plan for future updates, so a lot of the work is figuring out how to marry the latest version of component X with the specific version of component Y because it is required by component Z. LLMs can do that in a breeze.
But in many other niches (say embedded), the workflow is different. You add a feature, you get weird readings. You start modelling in your head how the timing would work, doing some combination of tracing and breakpoints to narrow down your hypotheses, then try them out, and figure out what works best. I can't see CLI agents doing that kind of work; it depends too much on hunches.
Sort of like autonomous driving: most highway driving is extremely repetitive and easy to automate, so it got automated. But going on a mountain road in heavy rain, while using your judgment to back off when other drivers start doing dangerous stuff, is still purely up to humans.
Ask the AI to document each module in a 100-line markdown file. These should be very high level, containing no detail, just pointers to the relevant files for the AI to explore on its own. With such a doc as the starting point, the AI will have the context it needs to work on any module.
If a module just can't be documented this way in under 100 lines, it's a good time to refactor. Chances are that if Claude's context window isn't enough to work with a particular module, a human dev can't manage it either. It's all about pointing your LLM precisely at the context that matters.
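A hypothetical example of what one of those module docs might look like (module and file names invented for illustration):

```
# payments module

Purpose: charge cards and record invoices. Subscription logic does NOT
live here (see billing/).

Entry points:
- payments/api.py     - HTTP handlers, thin wrappers only
- payments/charge.py  - retry/idempotency logic, start here for bugs
- payments/models.py  - Invoice and Charge ORM models

Invariants:
- Every Charge row references exactly one Invoice.
- All external calls go through payments/gateway.py, never directly.

For anything else, read the files above rather than loading more context.
```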
Yeah, but this collapses under any real complexity, there is likely an extreme amount of redundant code, and it would probably be twice as memory-efficient if you just wrote it yourself.
I'm actually interested to see whether we get a larger-than-usual rise in demand for DRAM because more software is vibe coded than not, or at least involves some form of vibe coding.
Yeah, and it's variable; it can happen at 250k, 500k, or later. When you interrogate it, the issue usually comes down to it being laser-focused or stuck on one specific thing, and it's very hard to turn it around.
For lack of a better comparison, it feels like the AI is on a spectrum...
Claude seems to have trouble with extracting code snippets to add to the context as the session gets longer and longer. I've seen it get stuck in a loop simply trying to use sed/rg/etc to get just a few lines out of a file and eventually give up.
> this is a preprint that has not been peer reviewed.
This conversation is peer review...
You don't need a conference for something to be peer reviewed, you only need... peers...
In fact, this paper is getting more peer review than most works. Conferences are notoriously noisy as reviewers often don't care and are happy to point out criticisms. All works have valid criticisms... Finding criticisms is the easy part. The hard part is figuring out if these invalidate the claims or not.
Honest question: does the opinion of Gary Marcus still count? His criticism seems more philosophical than scientific. It's hard for me to see what he builds on, or how he reasons, to reach his conclusions.
I think this is a fair assessment, but reason and intelligence don't really have an established control or control group. If you build a test and say "It's not intelligent because it can't..." and someone goes out and adds that feature in, is it suddenly intelligent now?
If we make a physics break through tomorrow is there any LLM that is going to retain that knowledge permanently as part of its core or will they all need to be re-trained? Can we make a model that is as smart as a 5th grader without shoving the whole corpus of human knowledge into it, folding it over twice and then training it back out?
The current crop of tech doesn't get us to AGI, and the focus on making it "better" is for the most part a fool's errand. The real winners in this race are going to be those who hold the keys to optimization: short retraining times, smaller models (with less upfront data), optimized for lower-performance systems.
I actually agree with this. Time and again, I can see that LLMs do not really understand my questions, let alone being able to perform logical deductions beyond in-distribution answers. What I’m really wondering is whether Marcus’s way of criticizing LLMs is valid.
I don't know but the standard reply to all of Gary Marcus' criticisms is that they don't count because it's Gary Marcus, which of course is a big honking ad-hominem.
What gets me, and the author talks about it in the post, is that people will readily attribute correct answers to "it's in the training set", but nobody says anything about incorrect answers that are in the training set. LLMs get stuff in the training set wrong all the time, yet nobody uses that as evidence that they can't be leaning too hard on memorization for the complex questions they do get right.
It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.
> It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.
Both of those can be true at the same time, though. They memorize a lot of things, but it's fuzzy, and when they remember wrong they cannot fix it via reasoning.
It's more than fuzzy: they are packing exabytes, perhaps zettabytes, of training data into a few terabytes. Without any reasoning ability it must be divine intervention that they ever get anything right...
It is divine intervention if you believe human minds are the product of a divine creator. Most of the attribution of miraculous reasoning ability on the part of LLMs I would attribute to pareidolia on the part of their human evaluators. I don’t think we’re much closer at all to having an AI which can replace an average minimum wage full-time worker, who will work largely unsupervised but ask their manager for help when needed, without screwing anything up.
We have LLMs that can produce copious text but cannot stop themselves from attempting to solve a problem they have no idea how to solve and making a mess of things as a result. This puts an LLM on the level of an overly enthusiastic toddler at best.
LLMs are trained on hundreds of terabytes of data, a few petabytes at most. You are off by 3 to 6 orders of magnitude in your estimate of the training data. They aren't literally trained on "all the data of the internet"; that would be a divergent nightmare. Catastrophic forgetting is still a problem with neural networks and ML algorithms in general. Humans are probably trained on less than half an exabyte of data, given the ~1Gbps of sensory data we receive in a lifetime. That's still ~20 petabytes of data by age 5. A 400B parameter LLM with 100 examples per parameter would equal about 640 TB (F16 parameters) of training data. That's the order of magnitude of current models.
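Rough arithmetic behind the sensory-bandwidth figures above, as a back-of-the-envelope check using the same ~1 Gbps assumption:

```python
BYTES_PER_SEC = 1e9 / 8             # ~1 Gbps of sensory input, in bytes per second
SECONDS_PER_YEAR = 3600 * 24 * 365

by_age_5 = BYTES_PER_SEC * SECONDS_PER_YEAR * 5
lifetime = BYTES_PER_SEC * SECONDS_PER_YEAR * 80

print(f"{by_age_5 / 1e15:.0f} PB by age 5")      # ~20 PB
print(f"{lifetime / 1e18:.2f} EB in ~80 years")  # ~0.3 EB, under half an exabyte
```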
Do you hypothesize that they see more wrong examples than right? If they are reasoning and can sort it out, why is there concern about model collapse, and why does the data even need to be scrubbed before training?
Sorry - what do you mean by yud-cult? Searching google didn’t help me (as far as I can tell) - I view LW from an outside perspective as well, but don’t understand the reference
They're referring to the founder of that website, Eliezer Yudkowsky, who is controversial due to his 2023 Time article that called for a complete halt on the development of AI.
Yudkowsky is controversial for much more than an article from 2023.
Yudkowsky lacks credentials and MIRI and its adjacents have proven to be incestuous organizations when it comes to the rationalist cottage industry, one that has a serious problem with sexual abuse and literal cults.
It was not so much the call for a complete halt that caused controversy, but rather this part of his piece in Time (my emphasis):
"Make it explicit in international diplomacy that preventing AI extinction scenarios is considered a priority above preventing a full nuclear exchange, and that allied nuclear countries are willing to run some risk of nuclear exchange if that's what it takes to reduce the risk of large AI training runs.
That's the kind of policy change that would cause my partner and I to hold each other, and say to each other that a miracle happened, and now there's a chance that maybe [our daughter] will live."
No, he's controversial because he runs an online sex cult centered around the idea that 1. if you do a specific kind of math in your head instead of regular thinking you'll automatically be correct about everything 2. computers can do math faster than you 3. therefore computers are going to take over the world and enslave you 4. therefore you should move to Berkeley and live in a group home.
I understand that, but the guidelines don't get relaxed because the topic seems important to you. It's common for people to think that a particular issue is so important that the normal rules shouldn't apply or should be interpreted differently in that case, but we can't run the site like that.