I’m not sure if this is just an “on mobile” thing, but I can’t find any reference to ISO 27001 or SOC2 at that datacentres URL. Taking your word for it being there previously, this seems like a major red flag! Faking these certs is no joke, and silently removing references to that after being called out would be even more of a bad look.
@ybceo you seemed to represent this org based on your previous comments, is the parent commenter missing something here?
I get it, really I do. But do the HOAs really need financial enforcement mechanisms intended to seriously harm people, and to punish them as judge, jury and executioner? An HOA’s legal job is to maintain the common-interest property and enforce the CC&Rs. It is not an HOA’s job to extract enormous sums of money out of its members, even annoying ones. The right lever to pull to get some rich person partying at 4am and trashing the place (for example) to stop is for the HOA to file for a court injunction after repeated violations; once a judge orders “no loud music 10 pm - 7 am”, the next 4 am party becomes contempt of court, which is a problem for the cops, not the HOA. Hell, 4 a.m. noise is a municipal nuisance and probably a crime; people should be calling the cops every time it happens. Individual members could even sue the owner in small-claims court for private nuisance, where judges can issue even more injunctions or award damages.
All this to say, you don’t need to take people’s money to get them to stop doing bad stuff. But you do need to take people’s money to get rich, and to hurt people. This new legislation should be deeply concerning to people interested in the latter, and IMO shouldn’t really be a concern to people interested in the former.
I don't know where you live, but calling the cops over a noise nuisance hasn't worked in most US cities for a long time. E.g. with LAPD you will be lucky if cops show up within 4 hours, and if they do show up they are not going to ticket anybody. And there is nothing you can do about it. "Petty" crime is a free-for-all in any city with a "restorative justice" DA. So we need to use other means to slow the degradation of our quality of life.
>But do the HOAs really need financial enforcement mechanisms intended to seriously harm people, and to punish them as judge, jury and executioner?
No, they don't. But to be fair, your local enforcement agencies have the same power to unilaterally fine people insane amounts of money. So in a technical sense it makes sense that HOAs would have the same unilateral power to screw people.
1) Governments are often much easier to sway. You can get a newspaper or TV station involved. You can show up to open meetings. You can campaign against the incumbents. While you can probably technically do some of that against rogue HOA boards, it's going to be a lot harder.
2) Governments are usually large enough not to make things a personal vendetta. That's clearly not always true; I'm only talking about trends. Meanwhile, the HOA members are your neighbors, by definition. Get on the wrong side of them and they can easily get involved in everything you do.
Ah, got it. You were saying neither party should do that. I interpreted that as HOAs should also be allowed to do that. I see what you're saying now, though.
You have to phrase it properly. One time when a neighbor had a school-/work-night party that lasted until after midnight, I went over and asked them to wrap it up. When they didn't, I called the police non-emergency line and asked them to go break it up. When we were still awake from noise an hour later, I called the police again, and told them that in 15 minutes I was going back over there myself. They asked me to please not do that, and took care of it within the next 10 minutes.
They were ambivalent about dealing with noise, but were happy to stave off a riot.
In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:
- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.
- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing. They’re using some kind of Gated SwiGLU activation, which the card talks about as being "unconventional" because of the clamping and whatever residual connections that implies. Again, not using any of Deepseek’s “shared experts” (for general patterns) + “routed experts” (for specialization) architectural improvements, Qwen’s load-balancing strategies, etc. (A rough sketch of this kind of top-k routing follows right after this list.)
- the most interesting thing IMO is probably their quantization solution. They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we’ve also got Unsloth with their famous 1.58bit quants :)
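As promised above, here's a minimal, hypothetical sketch of the Top-4-of-128 routing from the MoE bullet; the hidden size, router weights, and token batch below are placeholders, not the real gpt-oss values.

```python
import numpy as np

# Only n_experts=128 and top_k=4 come from the model card; everything else is made up.
N_EXPERTS, TOP_K = 128, 4
D_MODEL = 2880  # placeholder hidden size

def route(hidden: np.ndarray, router_w: np.ndarray):
    """Each token picks its 4 highest-scoring experts and weights them by a
    softmax over just those 4 logits (a common MoE convention)."""
    logits = hidden @ router_w                                   # [tokens, 128]
    topk = np.argpartition(-logits, TOP_K, axis=-1)[:, :TOP_K]   # indices of the top 4 experts
    sel = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(sel - sel.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return topk, gates  # which experts each token visits, and with what weight

tokens = np.random.randn(16, D_MODEL)
router_w = np.random.randn(D_MODEL, N_EXPERTS)
experts, gates = route(tokens, router_w)
print(experts.shape, gates.shape)  # (16, 4) (16, 4): 4 of 128 experts active per token
```

Only the expert MLPs are sparse; attention, embeddings, and the router itself always run, which is presumably why about 5.1B of the 116.8B parameters end up active rather than exactly 4/128 of them.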
All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.
I would guess the “secret sauce” here is distillation: pretraining on an extremely high quality synthetic dataset from the prompted output of their state of the art models like o3 rather than generic internet text. A number of research results have shown that highly curated technical problem solving data is unreasonably effective at boosting smaller models’ performance.
This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.
It behooves them to keep the best stuff internal, or at least greatly limit any API usage to avoid giving the goods away to other labs they are racing with.
Which, presumably, is the reason they removed 4.5 from the API... mostly the only people willing to pay that much for that model were their competitors. (I mean, I would pay even more than they were charging, but I imagine even if I scale out my use cases--which, for just me, are mostly satisfied by being trapped in their UI--it would be a pittance vs. the simpler stuff people keep using.)
Or, you could say, OpenAI has some real technical advancements in stuff besides attention architecture. GQA8 and alternating SWA 128 / full attention do all seem conventional. Basically they are showing us that "there's no secret sauce in model arch, you guys just suck at mid/post-training", or they want us to believe this.
Kimi K2 paper said that the model sparsity scales up with parameters pretty well (MoE sparsity scaling law, as they call it, basically calling Llama 4 MoE "done wrong"). Hence K2 has 128:1 sparsity.
You are right. I mis-remembered the sparsity part of K2. For the "done wrong" part, I was thinking about how Scout -> Maverick -> Behemoth doesn't scale sparsity according to any formula (less sparse -> sparse -> less sparse).
It's convenient to be able to attribute success to things only OpenAI could've done with the combo of their early start and VC money – licensing content, hiring subject matter experts, etc. Essentially the "soft" stuff that a mature organization can do.
I think their MXFP4 release is a bit of a gift since they obviously used and tuned this extensively as a result of cost-optimization at scale - something the open source model providers aren't doing too much, and also somewhat of a competitive advantage.
Unsloth's special quants are amazing but I've found there to be lots of trade-offs vs. full precision, particularly when striving for the best first-shot attempts - which is by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up, both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models at reduced quantization. If OpenAI is doing this in production, that is interesting.
>They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model to fit on a single 80GB GPU, which is pretty cool
They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.
The native FP4 is one of the most interesting architectural aspects here IMO, as going below FP8 is known to come with accuracy tradeoffs. I'm curious how they navigated this and how FP8 weights (if they exist) would have performed.
One thing to note is that MXFP4 is a block-scaled format, at 4.25 bits per weight: each block of 32 FP4 values (E2M1, i.e. 1 mantissa and 2 exponent bits plus a sign) shares an 8-bit scale, which lets it cover a much wider dynamic range than raw FP4 alone.
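Here's a toy sketch of block-scaled 4-bit quantization in that spirit; the power-of-two scale choice and nearest-value rounding are simplified illustrations, not the exact OCP MX spec.

```python
import numpy as np

BLOCK = 32  # each block of 32 FP4 (E2M1) values shares one 8-bit scale
# Non-negative magnitudes representable by E2M1 (sign bit handled separately)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_mxfp4(w: np.ndarray) -> np.ndarray:
    """Toy block-scaled quantize/dequantize: pick a power-of-two scale so the
    block's largest value fits on the grid, then snap each weight to the nearest
    representable magnitude."""
    out = w.copy()
    for start in range(0, len(w), BLOCK):
        blk = w[start:start + BLOCK]
        scale = 2.0 ** np.ceil(np.log2(np.abs(blk).max() / FP4_GRID[-1] + 1e-12))
        mags = np.abs(blk / scale)
        idx = np.abs(FP4_GRID[None, :] - mags[:, None]).argmin(axis=1)
        out[start:start + BLOCK] = np.sign(blk) * FP4_GRID[idx] * scale
    return out

w = np.random.randn(128).astype(np.float32)
print(np.abs(w - fake_mxfp4(w)).max())  # worst-case rounding error across the 4 blocks
# Storage: 32 * 4 bits + one 8-bit scale = 136 bits per block -> 4.25 bits/weight.
```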
I don't know how to ask this without being direct and dumb: Where do I get a layman's introduction to LLMs that could work me up to understanding every term and concept you just discussed? Either specific videos, or if nothing else, a reliable Youtube channel?
What I’ve sometimes done when trying to make sense of recent LLM research is give the paper and related documents to ChatGPT, Claude, or Gemini and ask them to explain the specific terms I don’t understand. If I don’t understand their explanations or want to know more, I ask follow-ups. Doing this in voice mode works better for me than text chat does.
When I just want a full summary without necessarily understanding all the details, I have an audio overview made on NotebookLM and listen to the podcast while I’m exercising or cleaning. I did that a few days ago with the recent Anthropic paper on persona vectors, and it worked great.
One has to be aware of the possibility of hallucinations, of course. But I have not encountered any hallucinations in these sorts of interactions with the current leading models. Questions like "what does 'embedding space' mean in the abstract of this paper?" yield answers that, in my experience, make sense in the context and check out when compared with other sources. I would be more cautious if I were using smaller models or if I were asking questions about obscure information without supporting context.
Also, most of my questions are not about specific facts but about higher-level concepts. For ML-related topics, at least, the responses check out.
There is a great 3blue1brown video, but it’s pretty much impossible by now to cover the entire landscape of research. I bet gpt-oss has some great explanations though ;)
Try Microsoft's "Generative AI for Beginners" repo on GitHub. The early chapters in particular give a good grounding of LLM architecture without too many assumptions of background knowledge. The video version of the series is good too.
Also: attention sinks (although implemented as extra trained logits used in attention softmax rather than attending to e.g. a prepended special token).
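A minimal sketch of what that looks like, assuming the "sink" is just one extra learned logit that takes part in the softmax but contributes no value vector (shapes and numbers here are purely illustrative):

```python
import numpy as np

def attention_weights_with_sink(scores: np.ndarray, sink_logit: float) -> np.ndarray:
    """Standard attention softmax plus one extra trained 'sink' logit. The sink
    soaks up probability mass (so a head can effectively attend 'nowhere')
    without any value vector being mixed in, unlike the prepended-token variant."""
    logits = np.concatenate([scores, [sink_logit]])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[:-1]  # weights over real tokens; the sink's share is simply dropped

scores = np.array([2.0, 0.5, -1.0])  # query-key scores for 3 tokens
print(attention_weights_with_sink(scores, sink_logit=3.0))
# The weights sum to < 1: the remainder went to the sink instead of being forced onto tokens.
```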
Editorial comment: It’s a bit weird to see AI-written (at least partially; you can see the usual em-dashes, it’s-not-X-it’s-Y) blog posts like this detract from an author’s true writing style, which in this case I found significantly more pleasant to read. Read his first ever post, and compare it to this one and many of the other recent posts: https://www.seuros.com/blog/noflylist-how-noflylist-got-clea...
I’m not much of a conspiracy theorist, but I could imagine a blog post almost identical to this one being generated in response to a prompt like “write a first-person narrative about: a cloud provider abruptly deleting a decade-old account and all associated data without warning. Include a plot twist”.
I literally cannot tell if this story is something that really happened or not. It scares me a little, because if this was a real problem and I was in the author’s shoes, I would want people to believe me.
Not AI-generated. Not everyone is born writing flawless English.
If it sounds like an LLM, maybe it is because people like me had to learn how to write clearly from LLMs because English is not our first language.
I could’ve written it in my native tongue, but then someone else would have complained that that’s not how English is structured.
Also, the story is real. Just because it is well-structured doesn't mean it's fiction. Yes, I used AI to re-sort it, but I can assure you that no AI will generate the Piers Morgan reference.
It’s easy to be fooled, myself included it seems :)
For context, here’s a handful of the ChatGPT cues I see.
- “wasn’t just my backup—it was my clean room for open‑source development”
- “wasn’t standard AWS incompetence; this was something else entirely”
- “you’re not being targeted; you’re being algorithmically categorized”
- “isn’t a system failure; the architecture and promises are sound”
- “This isn’t just about my account. It’s about what happens when […]”
- “This wasn’t my production infrastructure […] it was my launch pad for updating other infrastructure”
- “The cloud isn’t your friend. It’s a business”
I counted about THIRTY em-dashes, which any frequent generative AI user would understand to be a major tell. It’s got an average word count in each sentence of around 11 (try to write with only 11 words in each sentence, and you’ll see why this is silly), and much of the article consists of brief, punchy sentences separated by periods or question marks, which is the classic ChatGPT prose style. For crying out loud, it even has a table with quippy one-word cell contents at the end of the article, like what ChatGPT generates 9/10 times when asked for a comparison of two things.
It’s just disappointing. The author is undermining his own credibility for what would otherwise be a very real problem, and again, his real writing style when you read his actual written work is great.
I got the sense the author wrote the post in collaboration with LLMs as a way of processing the experience:
> I was alone. Nobody understood the weight of losing a decade of work. But I had ChatGPT, Claude, and Grok to talk to
> To everyone who worked on these AIs, who contributed to their training data—thank you. Without you, this post might have been a very different kind of message.
It sounded like, had the author written it entirely themselves, the post might have conveyed a message they didn't think was constructive.
It's not the case in this article, but we're starting to see a lot of comments and blog posts from people who have something worthwhile to contribute, but who aren't confident enough in their English proficiency to post their thoughts without running them through an LLM for cleanup.
IMHO that's a good thing, something that should be encouraged. Counting the em dashes is just an exercise in missing the forest for the trees. Accusing someone of posting AI slop without evidence should be treated no differently here than accusing them of shilling or sock-puppetry. In other words, it should be prohibited by the site guidelines.
>I counted about THIRTY em-dashes, which any frequent generative AI user would understand to be a major tell.
Dude, plenty of people write with em-dashes and semicolons; I personally use them constantly (and I don't use LLMs at all). Em-dashes are trivial to type on macOS (Alt+Shift+Dash) and even on Windows; I used to have the alt code (Alt+0151) committed to muscle memory, but now I just use the Mac version with an AutoHotkey script. I get being wary of LLM spam now that it's pretty much everywhere, but this is not the "tell" you think it is.
To be clear, you're free to dislike this writing style, but I'm 100% confident that it has been common since long before LLMs were in widespread usage.
> his real writing style when you read his actual written work is great.
You're doubling down on this not being "his real writing style" despite acknowledging you were wrong about this being written by ChatGPT?
Yeah, that bullshit has been debunked all over the net for quite a while now; there are lots of us old farts who write exactly like that, and none of us is an “AI”. Idunno, maybe on aggregate we're so prolific that the material they're training the LLMs on contains a disproportionate amount of our style or something.
Anyway, your talk about the author “undermining his own credibility” for writing like many humans just because some simplistic fake “intelligences” also write like those humans is totally wrong. The solution is not for people to have to change their writing style, but for you gullibles to stop thinking this is “a tell” — because it isn't.
Isn't that part of AI, simply because that's how the patterns work, and how we're taught to write in the writing classes?
BTW, I actually use the em-dash symbol very frequently myself; on a Mac and on Android it's very easy through the standard keyboard, with Option-Shift-Dash being the shortcut on the Mac.
Not sure why you're downvoted - this was exactly my thought reading it. I spend a significant portion of my time reading human-generated and AI-generated long-form writing and I can very easily see the AI stuff.
But maybe it doesn't matter any more? Most people can't.
I started doing some experimentation with this new Deep Think agent, and after five prompts I reached my daily usage limit. For $250 USD/mo, that’s what you’ll be getting, folks.
It’s just bizarrely uncompetitive with o3-pro and Grok 4 Heavy. Anecdotally (from my experience) this was the one feature that enthusiasts in the AI community were interested in to justify the exorbitant price of Google’s Ultra subscription. I find it astonishing that the same company providing free usage of their top models to everybody via AI Studio is nickel-and-diming their actual customers like that.
Performance-wise? So far, I couldn’t even tell. I provided it with a challenging organizational problem that my business was facing, with the relevant context, and it proposed a lucid and well-thought-out solution that was consistent with our internal discussions on the matter. But o3 came to an equally effective conclusion for a fraction of the cost, even if its report was less “cohesive”. I guess I’ll have to wait until tomorrow to learn more.
They might not have been ready/optimized for production, but still wanted to release it before the EU AI Act's Aug 2 deadline; this way they have 2 years for compliance. So the strategy of aggressively rate-limiting a few users makes sense.
Several years ago I thought a good litmus test for mastery of coding was not being able to find a solution using internet search, nor to get well-written questions about esoteric coding problems answered on StackOverflow. For a while, I would post a question and answer my own question after I solved the problem, for posterity (or AI bots). I always loved getting the "I've been working on this for 3 days and you saved my life" comments.
I've been working on a challenging problem all this week and all the AI copilot models are worthless at helping me. Mastery in coding is being alone, when nobody else and no AI copilot can help you, and you have to dig deep into generalization, synthesis, and creativity.
(I thought to myself, at least it will be a little while longer before I'm replaced with AI coding agents.)
Your post misses the fact that 99% of programming is repetitive plumbing and that the overwhelming majority of developers, even Ivy League graduates, suck at coding and problem solving.
Thus, AI is a great productivity tool if you know how to use it for the overwhelming majority of problems out there. And it's a boost even for those that are not even good at the craft as well.
This whole narrative of "okay, but it can't replace me in this or that situation" is honestly somewhere between an obvious touché (why would you think AI would replace rather than empower those who know their craft?) and stale Luddism.
Even IF that were true (and I'd argue that it is NOT, and it's people who believe that and act that way who produce the tangled messes of spiderweb code that are utterly opaque to public searches and AI analysis -- the supposed "1%"), if even as low as 1% of the code I interacted with was the kind of code that required really deep thought and analysis, it could easily balloon to take up as much time as the other "99%".
Oh, and Ned Ludd was right, by the way. Weavers WERE replaced by the powered loom. It is in the interest of capital to replace you if they are able to, not to complement you, and furthermore, the teeth of capital have gotten sharper over time, and its appetite more voracious.
Capital is also willing to have vastly lower quality and burden the remaining labor with more toil in exchange for even lower costs. Velocity will rise, quality will fall, toil will increase leading to more burnout but there will be more expendable bodies to cycle through the slop cleanup farm.
Even if most of the code you write is solving repetitive plumbing tasks, today's models are incredibly bad at API design taste. IMO designing software in a way that minimizes side effects and is easy to change and test is more than 1% of software engineering.
Lately most of the code I write has been through LLMs and I find them an enormous productivity booster overall, but despite the benchmarks they're not expert human level quite yet, and they need a LOT of coaxing to produce production quality code.
As far as things LLMs are bad at, I think it's mainly the long tail. I'm not sure there's one singular thing that >1% of programmers work on that LLMs suck at, but I think there are thousands of different weird sub-specialties that almost no one is working on and very little public code exists for, thus LLMs are not good at them yet.
Try using any AI tool to write a working realtime GI (global illumination) implementation. I've been working on a novel implementation for 60fps/1080p GI, and every time I use Copilot or Claude to even try fixing a minor bug or troubleshoot, it nukes entire functions and rewrites them using garbled shader code and old syntax/methods.
Puts things into stark perspective for me.
PS. no amount of prompt engineering will save you in this endeavour.
I've started to come to the conclusion that only greenfield projects consist of repetitive plumbing. Legacy software is like plumbing if all the pipes were tied into a knot. The edge cases, ambiguous naming, hacky solutions, etc. all make for a miserable experience, both for humans and AIs.
They're remarkably useless on stuff they've seen but not had up-weighted in the training set. Even the best ones (Opus 4 running hot, Qwen and K2 will surprise you fairly often) are a net liability in some obscure thing.
Probably the starkest example of this is build system stuff: it's really obvious which ones have seen a bunch of `nixpkgs`, and even the best ones seem to really struggle with Bazel and sometimes CMake!
The absolute prestige high-end ones running flat out burning 100+ dollars a day and it's a lift on pre-SEO Google/SO I think... but it's not like a blowout vs. a working search index. Back when all the source, all the docs, and all the troubleshooting for any topic on the whole Internet were all above the fold on Google? It was kinda like this: type a question in the magic box and working-ish code pops out. Same at a glory-days FAANG with the internal mega-grep.
I think there's a whole cohort or two who think that "type in the magic box and code comes out" is new. It's not new, we just didn't have it for 5-10 years.
I have similar issues with support from companies that heavily push AI and self-serve models and make human support hard. I'm very accomplished and highly capable. If I feel the need to turn to support, the chances the solution is in a KB are very slim, same with AI. It'll be a very specific situation with a very specific need.
There are a lot of internal KBs companies keep to themselves in their ticketing systems - it would be interesting to estimate how much good data is in there that could in the future be used to train more advanced (or maybe more niche or specific) AI models.
They COULD be great AI chatbots with good data, but in general what is being deployed is just crap: the cheapest app they can get set up as fast as possible so they can check a box, with no real concern for making it good. I had a talk with my last CEO in January about a similar project; I gave a plan on how to do it right. They instead tried to half-ass it and it completely failed. But they don't care, because the board member that wanted it is happy because it exists.
This has been my thought for a long time - unless there is some breakthrough in AI algo I feel like we are going to hit a "creativity wall" for coding (and some other tasks).
Of the thousands of responses I have read from the top LLMs in the last couple of years, throwing writing, coding, problem-solving, mathematical questions and whatnot at them, I have never seen one that was creative.
It's somewhat easier to perceive the lack of creativity with stable diffusion. I'm not talking about the missing-limb or extra-finger glitches. With a bit of experience looking through generated images, our brain eventually perceives the absolute lack of creativity; an artist would probably spot it without prior experience with generative AI pieces. With LLMs it takes a bit longer.
Anecdotal and baseless, I guess. But papers have been published in which researchers in various fields of science couldn't get the best LLMs to solve any unsolved problem. I recently came across a paper stating bluntly that all LLMs tested were unable to conceptualize or derive laws that generalize whatsoever, e.g. formulas.
We are being duped; it doesn't help sell $200 monthly subscriptions (soon for even more) if marketers admit there is absolutely zero reasoning going on with these stochastic machines on steroids.
I deeply wish the circus ends soon, so that we can start focusing on what LLMs are excellent at and well fitted to do better and faster than humans.
Everyone I talked to who is knowledgeable in machine learning and/or deep learning, and who had no reason to pretend of course, agreed an LLM is a stochastic machine. That it is coupled with other very good NLP techniques doesn't change that.
That is why even the best models today can miss the shot by a large margin and then hit a decent match. Again, back to the creativity issue: if it was done before, a good input to a model well trained on good data will output good data likely to match the best (matching) answer ever produced. Some NLP to make it sound unique is not equal to creativity.
Currently I'm porting the Playwright / Puppeteer client API to run in a Chrome extension without using the Chrome DevTools Protocol (CDP). Since there are hundreds of Chrome extension AI copilots using Playwright with CDP, with all the problems that come with that, I believe my library can be very useful. VS Code Copilot Chat with any model keeps trying to evaluate strings in the content script using chrome.scripting.executeScript with eval('') and new Function(''), which violates the CSP policy in MV3. The use case is novel, but calling executeScript is common, and this policy has been enforced for the last 2.5 years and was available for a couple of years before that. Worse, it will convert my valid function definitions to eval() and new Function('') even though that has nothing to do with the prompt.
> It’s just bizarrely uncompetitive with o3-pro and Grok 4 Heavy.
In my experience Grok 4 and 4 Heavy have been crap. Who cares how many requests you get with it when the response is terrible. Worst LLM money I’ve spent this year and I’ve spent a lot.
It's interesting how multi-dimensional LLM capabilities have proven to be.
OpenAI reasoning models (o1-pro, o3, o3-pro) have been the strongest, in my experience, at harder problems, like finding race conditions in intricate concurrency code, yet they still lag behind even the initial sonnet 3.5 release for writing basic usable code.
The OpenAI models are kind of like CS grads who can solve complex math problems but can't write a decent React component without yadda-yadda-ing half of it, while the Anthropic models will crank out many files of decent, reasonably usable code while frequently missing subtleties and forgetting the bigger picture.
I hear this repeated so many times I feel like it's a narrative pushed by the sellers. A year ago you could ask for a glass of wine filled to the brim and you just wouldn't get it. It wasn't garbage in, garbage out; it was sensibility in, garbage out.
The line where chatbots stop being sensible and start outputting garbage is moving, but more slowly than the average Joe would guess. You only notice it when you have an intuition about the answer before you see it, which requires a lot of experience across a range of complexity. Persistent newbies are the best spotters, because they ask obvious basic questions while also asking for stuff beyond what geniuses could solve, and only by getting a garbage answer and enduring the process of realizing it's actually garbage do they build a wider picture of AI than even most power users, who tend to have more balanced queries.
But the same doesn’t happen with other tools. I’ll give the exact same prompt to all of the LLMs I have access to and look at the responses for the best one. Grok is consistently the worst. So if it’s garbage in, garbage out, why are the other ones so much better at dealing with my garbage?
It's not particularly interesting if Deep Think comes to the same (correct) conclusion on a single problem as o3 but costs more. You could ask GPT-3.5 and GPT-4 what 1+1 equals and would get the same response, with GPT-4 costing more, but this doesn't tell us much about model capability or value.
It would be more interesting to know if it can handle problems that o3 can't do, or if it is 'correct' more often than o3-pro on these sorts of problems.
i.e. if o3 is correct 90% of the time, but Deep Think is correct 91% of the time on challenging organisational problems, it will be worth paying $250 for an extra 1% certainty (assuming the problem is high-value / high-risk enough).
By finding and testing problems that o3 can't do on Deep Think, and also testing the reverse? Or by large benchmarks comparing a whole suite of questions with known answers.
Problems that both get correct will be easy to find and don't say much about comparative performance. That's why some of the benchmarks listed in the article (e.g. Humanity's Last Exam / AIME 2025) are potentially more insightful than one person's report on testing one question (which they don't provide) where both models replied with the same answer.
> I find it astonishing that the same company providing free usage of their top models to everybody via AI Studio is nickel-and-diming their actual customers like that.
I agree that’s not a good posture, but it is entirely unsurprising.
Google is probably not profiting from AI Ultra customers either, and grabbing all that sweet usage data from the free tier of AI Studio is what matters most to improve their models.
Giving free access to the best models allows Google to capture market share among the most demanding users, which are precisely the ones that will be charged more in the future. In a certain sense, it’s a great way for Google to use its huge idle server capacity nowadays.
I'm burning well over 10 million tokens a day on the free tier. 99% of the input is freely available data; the rest is useless. I never provided any feedback. Sure, there is some telemetry; they can have it.
I doubt I'm an isolated case. This Gemini gig will cost Google a lot; they pushed it onto all Android phones around the globe. I can't wait to see what happens when they have to admit that not many people will pay over 20 bucks for "AI", and I would pay well over 20 bucks just to see the faces of the C-suite next year when someone dares to explain in simple terms that there is absolutely no way to recoup the datacenter investment and that powering the whole thing will cost the company 10 times that.
Similar complaints are happening all over reddit with the Claude Code $200/mo plan and Cursor. The companies with deep VC funding have been subsidizing usage for a year now, but we're starting to see that bleed off.
I think the primary concern of this industry right now is how, relative to the current latest generation models, we simultaneously need intelligence to increase, cost to decrease, effective context windows to increase, and token bandwidths to increase. All four of these things are real bottlenecks to unlocking the "next level" of these tools for software engineering usage.
Google isn't going to make billions on solving advanced math exams.
Agreed, and big context windows are key to mass adoption in wider use cases beyond chatbots (random ex: in knowledge management apps, being able to parse the entire note library/section and hook it into global AI search), but those use cases are decidedly not areas where $200 per month subscriptions can work.
I'd hazard that cost and context windows are the two key metrics to bridge that chasm with acceptable results... As for software engineering though, that cohort will be demanding on all fronts for the foreseeable future, especially because there's a bit of a competitive element. Nobody wants to be the vibecoder using sub-par tools compared to everyone else showing off their GitHub results and making sexy blog posts about it on HN.
Outside of code, the current RAG strategy is throw shit tons of unstructured text at it that has been found using vector search. Some companies are doing better, but the default rag pipelines are... kind of garbage.
For example, a chat bot doing recipe work should have a RAG DB that, by default, returns entire recipes. A vector DB is actually not the solution here; any number of traditional DBs (relational or even a document store) would work fine. Sure, do a vector search across the recipe texts, but then fetch the entire recipe from someplace else. Current RAG solutions can do this, but the majority of RAG deployments I have seen don't bother; they just abuse large context windows.
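A toy sketch of that pattern, using the vector hit only to identify the recipe and then pulling the complete document from a plain store; the dict "database" and random embeddings are stand-ins for a real document store and embedding model.

```python
import numpy as np

recipe_db = {  # pretend this is a relational table or document store
    "r1": "Weeknight dal: red lentils, turmeric, ... (full recipe text)",
    "r2": "Chana masala: chickpeas, tomatoes, ... (full recipe text)",
    "r3": "Beef stew: chuck roast, carrots, ... (full recipe text)",
}
recipe_vecs = {rid: np.random.randn(8) for rid in recipe_db}  # fake embeddings

def retrieve_whole_recipes(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Cosine-similarity search over recipe embeddings, but the thing returned
    to the prompt builder is the complete recipe, never a stitched-together chunk."""
    scores = {
        rid: float(vec @ query_vec / (np.linalg.norm(vec) * np.linalg.norm(query_vec)))
        for rid, vec in recipe_vecs.items()
    }
    top_ids = sorted(scores, key=scores.get, reverse=True)[:k]
    return [recipe_db[rid] for rid in top_ids]

context = "\n\n---\n\n".join(retrieve_whole_recipes(np.random.randn(8)))
print(context[:80])  # a couple of whole recipes, ready to drop into the prompt
```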
That context-stuffing approach looks like it works, except what you actually have in your context window is 15 different recipes all stitched together. Or, if you put an entire recipe book into the context (which is perfectly doable nowadays!), you'll end up with the chatbot mixing up ingredients and proportions between recipes, because you just voluntarily polluted its context with irrelevant info.
Large context windows allow for sloppy practices that end up making for worse results. Kind of like when we decided web servers needed 16 cores and gigs of RAM to run IBM Websphere back in the early 2000s, to serve up mostly static pages. The availability of massive servers taught bad habits (huge complicated XML deployment and configuration files, oodles of processes communicating with each other to serve a single page, etc).
Meanwhile, in the modern world, I've run mission-critical, high-throughput services for giant companies on a k8s cluster consisting of 3 machines, each with 0.25 CPU and a couple hundred megs of RAM allocated.
IMO: Context engineering is a fascinating topic because it starts approaching the metaphysical abstract idea of what LLMs even are.
If you believe that an LLM is a digital brain, then it follows that their limitation in capabilities today are a result of their limited characteristics (namely: coherent context windows). If we increase context windows (and intelligence), we can simply pack more data into the context, ask specific questions, and let the LLM figure it out.
However, if you have a more grounded belief that, at best, LLMs are just one part of a more heterogeneous digital brain, it follows that maybe their limitations are actually a result of how we're feeding them data. That we need to be smarter about context engineering, we need to do roundtrips with the LLM to narrow down what the context should be, and it needs targeted context to maximize the quality of its output.
The second situation feels so much harder, but more likely. IMO: This fundamental schism is the single reason why ASI won't be achieved on any timeframe worth making a prediction about. LLMs are just one part of the puzzle.
We all talk a lot about #2, but until we get a really good grip on #1, I think we as a field are going to hit a progress wall.
The problem is we have not been able to separate out knowledge embedded in parameters with model capability, famously even if you don't want a model to write code, throwing a bunch of code at a model makes it a better model. (Also famously, even if someone never grows up to work with math day to day, learning math makes them better at all sorts of related logical thinking tasks.)
Also there is plenty of research showing performance degrades as we stuff more and more into context. This is why even the best models have limits on tool call performance when naively throwing 15+ JSON schemas at it. (The technique to use RAG to determine which tool call schema to feed into the context window is super cool!)
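For what it's worth, the tool-schema version of that trick can be as simple as ranking tool descriptions by embedding similarity and only putting the top few schemas into the prompt. Everything below (the tool names, random vectors in place of real embeddings) is a made-up stand-in.

```python
import numpy as np

def pick_tools(query_vec: np.ndarray, tool_vecs: dict, k: int = 3) -> list[str]:
    """tool_vecs maps tool name -> embedding of its description; only the k most
    similar tools get their JSON schemas included in the context window."""
    sims = {
        name: float(v @ query_vec / (np.linalg.norm(v) * np.linalg.norm(query_vec)))
        for name, v in tool_vecs.items()
    }
    return sorted(sims, key=sims.get, reverse=True)[:k]

tools = {name: np.random.randn(16) for name in
         ["create_ticket", "search_docs", "run_sql", "send_email", "get_weather"]}
print(pick_tools(np.random.randn(16), tools))  # only these schemas go into the prompt
```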
I wonder if the next phase for leveraging LLMs against large sets of contextual, proprietary data (code repositories & knowledge bases come to mind) is going to look more like smaller models highly (and regularly) trained/fine-tuned against that proprietary data (that is maybe delegated tasks by the ultra-sized internet scale omni-brain models)
If I'm asking Sonnet to agentically make this signin button green, does it really matter that it can also write haikus about the Japanese landscape? That links back to your point: we don't have a grip, nearly at all, on how much this crosstalk between problem domains matters. Maybe it actually does matter? But certainly most of it doesn't.
We're so far from the endgame on these technologies. A part of me really feels like we're wasting too much effort and money on training ASI ultra internet scale models. I'm never going to pay $200+/mo for even a much smarter Claude; what I need is a system that knows my company's code like the back of its hand, knows my company's patterns, technologies, and even business (Jira boards, Google docs, etc), and extrapolates from that. That would be worth thousands a month; but what I'm describing isn't going to be solved by a 195 IQ gigabrain, and it also doesn't feel like we're going to get there with context engineering.
It's also a question of general vs specialized tools. If LLMs are being used in a limited capacity, such as retrieving recipes, then a limited environment where it only has the ability to retrieve complete recipes via RAG may be ideal in the literal sense of the word. There really is nothing better than the perfect specialized tool for a specialized job.
I did embedded work for years. A 100 MHz CPU with 1-cycle SRAM latency and a bare-metal OS can do as much as a 600 MHz CPU hitting DRAM and running a preemptive OS.
Big, coherent context windows are key to almost all use-cases. The whole house of cards RAG implementations most platforms are using right now are pretty bad. You start asking around about how to implement RAG and you realize: No one knows, the architecture and outcomes at every company are pretty bad, the most common words you hear are "yeah it pretty much works ok i guess".
> Similar complaints are happening all over reddit with the Claude Code $200/mo
I would imagine 95% of people never get anywhere near to hitting their CC usage. The people who are getting rate-limited have ten windows open, are auto-accepting edits, and YOLO'ing any kind of coherent code quality in their codebase.
Model routing is deceptively hard though. It has halting problem characteristics: often only the smartest model is smart enough to accurately determine a task's difficulty. And if you need the smartest model to reliably classify the prompt, it's cheaper to just let it handle the prompt directly.
This is why model pickers persist despite no one liking them.
The problem is that input token cost dominates output token cost for the majority of tasks.
Once you've given the model your prompt and are reading the first output token for classification, you've already paid most of the cost of just prompting it directly.
That said, there could definitely be exceptions for short prompts where output costs dominate input costs. But these aren't usually the interesting use cases.
No, you're talking about costs to the user, which are oversimplifications of the costs that providers bear. One output token with a million input tokens is incredibly cheap for providers.
Input tokens usually dominate output tokens by a lot more than 2x though. It’s often 10x or more input. It can even easily be 100x or more. Again in realistic workflows.
Caching does help the situation, but you always at least pay the initial cache write. And prompts need to be structured carefully to be cacheable. It’s not a free lunch.
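As a quick back-of-the-envelope illustration of that imbalance, with made-up per-million-token prices rather than any provider's real rates:

```python
# Hypothetical prices and token counts, chosen only to show the shape of the math.
PRICE_IN, PRICE_OUT = 2.00, 8.00   # $ per million tokens (placeholders)
n_in, n_out = 50_000, 1_000        # e.g. a coding prompt with several pasted files

cost_in = n_in / 1e6 * PRICE_IN    # $0.100
cost_out = n_out / 1e6 * PRICE_OUT # $0.008
print(f"input ${cost_in:.3f} vs output ${cost_out:.3f}")  # input is ~12x the output cost
```

Even with output priced 4x higher per token here, the sheer input volume dominates the bill, which is why a router that has to re-read the whole prompt rarely saves much.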
For me personally (using mostly for coding and project planning) it's nearly always the case, including with thinking models. I'm usually pasting in a bunch of files, screenshots, etc., and having long conversations. Input nearly always heavily dominates output.
I don't disagree that there are hard problems which use short prompts, like math homework problems etc., but they mostly aren't what I would categorize as "real work". But of course I can only speak to my own experience /shrug.
Yeah coding is definitely a situation where context is usually very very large. But at the same time in those situations something like Sonnet is fine.
"I'm sorry but that wasn't a very interesting question you just asked. I'll spare you the credit and have a cheaper model answer that for you for free. Come back when you have something actually challenging."
Actually, why not? Recognizing problem complexity as a first step is really crucial for such expensive "experts". Humans do the same.
And a question to the knowledgeable: does a simple/stupid question cost more in terms of resources than a complex problem, in terms of power consumption?
IIRC that isn't possible under current models, at least in general, for multiple reasons, including that attention cannot attend to future tokens, the fact that they are existential logic, that they are really NLP and not NLU, etc...
Even proof mining and the Harrop formula have to exclude disjunction and existential quantification to stay away from intuitionist math.
IID in PAC/ML implies PEM which is also intentionally existential quantification.
This is the most gentle introduction I know of, but remember LLMs are fundamentally set shattering, and produce disjoint sets also.
We are just at reactive model based systems now, much work is needed to even approach this if it ever is even possible.
Hmm, I needed Claude 4’s help to parse your response. The critique was not too kind to your abbreviated arguments that current systems are not able to gauge the complexity of a prompt and the resources needed to address a question.
It feels like the rant of someone upset that their decades of formal logic approach to AI become a dead end.
I see this semi-regularly: futile attempts at handwaving away the obvious intelligence by some formal argument that is either irrelevant or inapplicable. Everything from thermodynamics — which applies to human brains too — to information theory.
Grey-bearded academics clinging to anything that might float to rescue their investment into ineffective approaches.
PS: This argument seems to be that LLMs “can’t think ahead” when all evidence is that they clearly can! I don’t know exactly what words I’ll be typing into this comment textbox seconds or minutes from now but I can — hopefully obviously — think intelligent thoughts and plan ahead.
PPS: The em-dashes were inserted automatically by my iPhone, not a chat bot. I assure you that I am a mostly human person.
I usually ignore ad hominem attacks but I am trying to convey a kindness here.
Who do you think is going to be successful: those who realize the limitations and strengths of a system and leverage them, or those who are complacent, with an unwarranted self-satisfaction accompanied by unawareness of the actual risks or deficiencies of a particular system?
IMHO they are always going to be too complex to know everything about a models of this size, but there are areas we do know their limits or the limits of computation in general.
But feel free to stay on your high horse and call people names and see how well that works out for you.
> ... Recognizing problem complexity as a first step...
Well, I don't think it's easy or even generally possible to recognize a problem's complexity. Imagine you ask for a solution to a simply expressed statement like: find an n > 2 where z^n = x^n + y^n. The answer you will receive will be based on a model trained on this well-known problem, but if it's not in the model, it could be impossible to measure its complexity.
I know this is a joke but I have been able to lower my costs by routing my prompts to a smaller model to determine if I need to send it to a larger model or not.
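A minimal sketch of that kind of router, using the OpenAI Python client purely for illustration; the model names, grading prompt, and escalation threshold are placeholders, not recommendations.

```python
from openai import OpenAI

client = OpenAI()
CHEAP, EXPENSIVE = "gpt-4o-mini", "o3"  # placeholder model names

def answer(prompt: str) -> str:
    # Step 1: have the cheap model grade difficulty (1 = trivial, 5 = needs deep reasoning).
    grade = client.chat.completions.create(
        model=CHEAP,
        messages=[{
            "role": "user",
            "content": ("Rate the difficulty of answering the following prompt from 1 "
                        "(trivial) to 5 (needs deep reasoning). Reply with one digit only.\n\n"
                        + prompt),
        }],
    ).choices[0].message.content.strip()

    # Step 2: only escalate to the expensive model when the grade is high.
    model = EXPENSIVE if grade in {"4", "5"} else CHEAP
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The halting-problem-style caveat upthread still applies: the cheap grader will sometimes misjudge difficulty, so this only pays off when the occasional misroute is tolerable.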
It doesn't, it's not "1000 Gemini Pro" requests for free, Google misled everyone. It's 1000 Gemini requests, Flash included. You get like 5-7 Gemini Pro requests before you get limited.
Your link shows the free tier gets 100 Pro requests per day.
That matches my experience with a free account. With Gemini CLI it doesn't seem to matter if I log in with a Google Account or use an API key from AI Studio with billing disabled.
Yesterday I had two coding sessions in Gemini CLI with a total of 73 requests to Pro with no rate limiting.
I've found the free version swaps away from Pro incredibly fast. Our company has Gemini but can't even get that; we were being asked to do everything by API key.
They're using it as a major inducement to upgrade to AI Ultra. I mean, the image and video stuff is neat, but adds no value for the vast majority of AI subscribers, so right now this is the most notable benefit of paying 12x more.
FWIW, Google seems to be having some severe issues with oddball, perhaps malfunctioning quota systems. I'm regularly finding extraordinarily little use of gemini-cli is hitting the purported 1000 request limit, when in reality I've done less than 10.
I faced the exact same problem with the API. It seems that it doesn't throttle early enough, then may accumulate the cool-off periods, making it impossible to determine when to fire requests again.
Also, I noticed Gemini (even Flash) has Google Search support, but only via the web UI or the native mobile app. Via the API, that would require SERP access via an MCP of some sort, even with Gemini Pro.
Oh, and some models are regularly facing outages. 503s are not uncommon. No SLA page, no alerts, nothing.
The reasoning feature is buggy; even if disabled, it sometimes triggers anyway.
It occurred to me the other day that Google probably has the best engineers, given how well Gemini performs, where it's coming from, and its context window that is uniquely large compared to any other model. But it is likely operated by managers coming from AWS, where shipping half-baked, barely tested software was all it took to get a bonus.
I'm not in the AI sceptic camp (LLMs can be useful for some tasks, and I use them often), but this is the big issue at the moment.
In order for agentic AI to replace (for example) a software engineer, we need a big step up in capability, around an order of magnitude. These chain of thought models do get a bit closer to that, although in my opinion we're still a way away.
However, at the same time we need about an order of magnitude decrease in price. These models are expensive even at the current price tokens are sold at, which seems to be below the actual cost. And these massive CoT models are taking us in completely the wrong direction in terms of cost.
I'd be interested in tests involving tasks with large amounts of context. Parallel thinking could conceivably be useful for a variety of specific problem types. Having more context than any specific chain of thought can reasonably attend to might be one of them.
I have Ultra. Will not be renewing it. Useless; at least have global limits and let people decide how they want to use them. If I have tokens left, why can't I use them for code assist?
It turns out that AI at this level is very expensive to run (capex, energy).
My bet is that AI itself won't figure out how to overcome these constraints and reach escape velocity.
Perhaps this will be the incentive to finally get fusion working. Big tech megacorps are flush with cash and could fund this research many times over at current rates. E.g. NIF is several billion dollars; Google alone has almost $100B in the bank.
Mainframes are the only viable way to build computers. Micro processors will never figure out how to get small and fast enough for personal computers to reach escape velocity.
Hardware typically gets faster and cheaper over time. Unless we hit hard a wall because of physics then I don't see any reason that won't continue to be true.
The factories to make the better chips are themselves increasingly expensive; this is acceptable when their cost of construction can be amortised over more devices, but we're already at the point where the global poor get smartphones before safe water, so further factory cost increases can't really be assumed to be amortised better.
That said, current LLMs are not compute constrained, they're RAM and bandwidth constrained, so a (relatively) cheap factory that's dedicated just to filling a datacenter with hardware designed specifically for one particular AI architecture, that's something I think is plausible. As @tome accidentally reminded me about recently, the not-Musk Groq (https://groq.com/) is all about this.
Renewables will keep getting more efficient and cheaper to install, batteries will continue to get cheaper, at some point they'll crack fusion. Prices go negative or zero in several places already (West Texas wind energy overnight, solar in Chile). The question seems less "when will we get abundant, virtually free clean energy" and more "will we do it in time to avoid climate collapse".
Our minds are incredibly energy efficient; that leads me to believe it is possible to figure out, but it might be a human rather than an AI that gives us something more akin to a biological solution.
This could fix my main gripe with The Matrix. ”Humans are used as batteries” always felt off, but it totally would make sense if the human brains have uniquely energy efficient pattern matching abilities that an emerging AI organism would harvest. That would also strengthen the spiritual humanist subtext.
That's because the original script did actually have the human farms being used for brainpower for the machines. They changed it to "batteries" because they thought audiences at the time wouldn't understand it!
Gemini is consistently the only model that can reason over long context in dynamic domains for me. Deep Think just did that reviewing an insane amount of Claude Code logs - for a meta analysis task of the underlying implementation. Laughable to think Grok could do that.
I find it amusingly ironic how one comment under yours is pointing out that there’s a mistake in the model output, and the other comment under yours trusts that it’s correct but says that it isn’t “real reasoning” anyways because it knows the algorithm. There’s probably something about moving goalposts to be said here
If both criteria A and B need to be satisfied for something to be true, it’s not moving the goalposts for one person to point out A is not true, and another person to point out that B is not true.
What kind of focus do biopharma companies put on their stock prices? If a company like the one you described had a great treatment option that could genuinely help people and was raking in money by the boatload, is that “enough” for them as a “winning” business strategy regardless of how outside investors might perceive it?
Biotech companies raise money by selling shares. That's why they go public so early compared to any other sector. The more suppressed your share price is, the harder it is to raise the money you need to do research and clinical trials.
Selling $1B of drugs might not necessarily mean they have sufficient free cash flow to do the things they want to do.
It leaves them vulnerable to takeover, for one. They have over $1B cash right now to pay for clinical trials in other markets as well as new indications but their valuation is about 5x that. Someone could leverage the $1B as part of a hostile takeover.
I find the situation the big LLM players find themselves in quite ironic. Sam Altman promised (edit: under duress, from a twitter poll gone wrong) to release an open source model at the level of o3-mini to catch up to the perceived OSS supremacy of Deepseek/Qwen. Now Qwen3’s release makes a model that’s “only” equivalent to o3-mini effectively dead on arrival, both socially and economically.
I don't think they will ever do an open-source release, because then the curtains would be pulled back and people would see that they're not actually state of the art. Llama 4 already sort of tanked Meta's reputation; if OpenAI did that, it'd decimate the value of their company.
If they do open-source something, I expect them to open-source some existing model (maybe something useless like GPT-3.5) rather than providing something new.
OAI in general seems to be treading water at best.
Still topping a lot of leaderboards, but with a severely reduced rep: chaotic naming, the „ClosedAI“ image, being undercut on pricing, competitors with much better licensing/open weights, Stargate talk about Europe, Claude being seen as superior for coding, etc. Nothing end-of-the-world, but a lot of lukewarm misses.
If I were an investor whose financials basically require magical returns from them to justify valuations, I’d be worried.
OpenAI has the business development side entirely fleshed out and that’s not nothing. They’ve done a lot of turns tuning models for things their customers use.