I have no ambient lighting.
I keep my window open or the CO2 level gets bad.
If I turn on lights, every fucking insect in the forest will come into my room.
Or I can get a fresh breeze while I'm on my PC in the evening.
Hmm, wouldn't it sacrifice a better answer in some cases (not sure how many though)?
I'd be surprised if they hadn't specifically trained for structured "correct" output for this, in addition to picking the next token following the structure.
In my experience (I've put hundreds of billions of tokens through structured outputs over the last 18 months), I think the answer is yes, but only in edge cases.
It generally happens when the grammar is highly constrained, for example if a boolean is expected next.
If the model assigns a low probability to both true and false coming next, then the sampling strategy will pick whichever one happens to score highest. Most tokens have very similar probabilities close to 0 most of the time, and if you're picking between two of these then the result will often feel random.
It's always the result of a bad prompt though: if you improve the prompt so the model understands the task better, there will be a clear difference in the scores the tokens get, and the output seems less random.
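To make the edge case concrete, here's a toy sketch of what the constrained choice looks like when neither boolean is favoured (not any particular inference stack, just the arithmetic):

```python
# Toy illustration: raw next-token probabilities where neither boolean is
# strongly favoured; both sit near zero while prose tokens dominate.
probs = {"true": 0.0011, "false": 0.0009, "The": 0.41, "It": 0.22, "other": 0.3680}

# A grammar expecting a boolean masks everything except the two legal
# continuations, then renormalises over what's left.
allowed = {tok: p for tok, p in probs.items() if tok in ("true", "false")}
total = sum(allowed.values())
renormalised = {tok: p / total for tok, p in allowed.items()}

print(renormalised)  # roughly {'true': 0.55, 'false': 0.45}: essentially a coin flip
```

Greedy decoding would always emit "true" here and sampling would flip a slightly weighted coin; either way the output doesn't reflect any real decision by the model, which is why it feels random.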
It's not just the prompt that matters, it's also field order (and a bunch of other things).
Imagine you're asking your model to give you a list of tasks mentioned in a meeting, along with a boolean indicating whether the task is done. If you put the boolean first, the model must decide both what the task is and whether it is done at the same time. If you put the task description first, the model can separate that work into two distinct steps.
There are more tricks like this. It's really worth thinking about which calculations you delegate to the model and which you do in code, and how you integrate the two.
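A sketch of the two orderings from the meeting-tasks example (the field names are invented, and exact behaviour varies by structured-output implementation, but most emit fields in the order they're declared):

```python
# Hypothetical output schemas for the meeting-tasks example above.

# Harder: the model has to commit to done / not done before it has even
# written down which task it's talking about.
BOOL_FIRST = {"done": "boolean", "task": "string"}

# Easier: the model writes the task description first, then judges
# completion with that description already sitting in its context.
TASK_FIRST = {"task": "string", "done": "boolean"}
```

Same information in the output either way, but the second ordering lets the model do the two pieces of work one after the other instead of simultaneously.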
Grammars work best when aligned with the prompt. That is, if your prompt gives you the right format of answer 80% of the time, the grammar will take you to 100%. If it gives you the right answer 1% of the time, the grammar will give you syntactically correct garbage.
Sampling is already constrained with temperature, top_k, top_p, top_a, typical_p, min_p, entropy_penalty, smoothing etc.; filtering tokens to the ones valid according to the grammar is just yet another filter in that list. It does make sense and can be used for producing programming language output as well: what's the point in generating, or bothering with, output you know up front is invalid? Better to filter it out and allow only valid completions.
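In a toy decoding step, the grammar check really is just one more mask applied to the logits alongside the usual samplers (a sketch, not how any specific engine implements it):

```python
import numpy as np

def decode_step(logits, temperature=0.8, top_k=40, grammar_mask=None):
    """Toy sampler: grammar filtering slots into the same pipeline as
    temperature and top_k (top_p, min_p etc. would be applied the same way)."""
    logits = np.array(logits, dtype=float) / temperature

    # Grammar: forbid tokens that cannot legally come next (True = allowed).
    if grammar_mask is not None:
        logits[~np.array(grammar_mask)] = -np.inf

    # top_k: keep only the k highest-scoring tokens that remain.
    if top_k is not None and top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]
        logits[logits < cutoff] = -np.inf

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```

Filtering to grammar-valid tokens up front just means you never pay for completions you'd have to throw away anyway.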
No, that's a rumor lots of people have been taking at face value.
If you do the math, inference is very lucrative.
Here someone deployed a big model; the cost works out to $0.20/1M tokens:
https://lmsys.org/blog/2025-05-05-large-scale-ep/
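Rough back-of-envelope (the serving cost is the figure from that post; the sale price below is a made-up placeholder, not any provider's actual pricing):

```python
# Back-of-envelope only. Cost per 1M tokens is from the linked post;
# the price is a hypothetical placeholder, not a real provider's rate card.
cost_per_m_tokens = 0.20    # $ per 1M tokens served (lmsys post above)
price_per_m_tokens = 2.00   # $ per 1M tokens charged (assumed)

gross_margin = 1 - cost_per_m_tokens / price_per_m_tokens
print(f"gross margin on inference: {gross_margin:.0%}")  # 90%
```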
The article Zitron links says Cursor has single-digit millions in monthly cash burn with about $1B in the bank (as of August). Assuming that is true, they are losing money but have a long runway.
That article says "Anysphere runs pretty lean with around 150 employees and has a single digit monthly cash burn, a source tells me." That would be total cash burn, i.e., net losses. If their AWS bill is bigger than that it's because they are making up for part of it with revenue.
Ed's mentioned ARR in previous articles and it's not a "generally accepted accounting principle". They cherry pick the highest monthly revenue number and multiply that by 12, but that's not their actual annual revenue.
"Cherry pick the highest" is misleading. If your revenue is growing 10% a month for a year straight and is not seasonal, picking any other than the most recent month to annualize would make no sense.
If a company's revenue in January is $100 and it grows by 10% every month, the December revenue is $285. The year's revenue would be about $2,138, but ARR in December would be $3,423. That's 1.6x the actual revenue.
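For anyone who wants to check the arithmetic:

```python
# $100 in January, growing 10% every month.
monthly = [100 * 1.10**i for i in range(12)]

december = monthly[-1]         # about 285
actual_annual = sum(monthly)   # about 2,138
december_arr = december * 12   # about 3,423 and change

print(round(december), round(actual_annual), round(december_arr / actual_annual, 1))
# 285 2138 1.6
```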
ARR could be a useful tool to help predict future revenue, but why not simply report actual revenue and note that it might increase next year? I have found most articles to be unclear about what ARR actually represents.
Why is the calendar year the relevant unit? If you insist on years, then for the year running June to June, $2,138 would be misleadingly small.
The point of ARR is to give an up to date measure on a rapidly changing number. If you only report projected calendar year revenue, then on January 1 you switch from reporting 2025 annual revenue to 2026 projected revenue, a huge and confusing jump. Why not just report ARR every month? It's basically just a way of reporting monthly revenue — take the number you get and divide it by 12.
I am really skeptical that people are being bamboozled by this in some significant way. Zitron does far more confusing things with numbers in the name of critique.
You're correct: ARR can be misleading and can be computed over any 12-month period (I just chose a calendar year to simplify). But the problem is AI companies tend to release only their latest ARR, and only selectively, which I believe is misleading in the opposite direction.
The "annual" just means that the unit of time is a year. It doesn't mean that it is recurring annually. You can call it Annualized Monthly Recurring Revenue if it makes you feel better.
Well people like Sam Altman have not been entirely honest and there's a reason they're not sharing their actual revenue numbers. If they could show they were growing 10% every month they would.
Eh, when you have a company that’s growing, picking the highest and annualizing it is sensible. If we had a mature company with highly seasonal revenue it would be dishonest.
I mean I think there are instances where OpenAI's revenue is seasonal. Lots of students using it during the school year and cancelling it during summer.
I think you missed the forest for the trees. I am sure the student population has some dropoff during the summer months, but the point is that for businesses growing month over month, which most of these have been since creation, you take the highest (latest) number and annualize it.
I am also willing to bet that the student dropoff is not pronounced. I am thinking more of a business that sells beach umbrellas: they make a lot of sales in the summer months and next to nothing in the winter months. Annualizing a peak month there would be dishonest.
I thought a human would be a considerable step up in complexity, but I asked it first for a pelican[0] and then for a rat[1] to get out of the bird world, and it did a great job on both.
But just for thrills I also asked for a "punk rocker"[2] and the result, while not perfect, is leaps and bounds above anything from the last generation.
0 -- ok, here's the first hurdle! It's giving me "something went wrong" when I try to get a share link on any of my artifacts. So for now it'll have to be a "trust me bro" and I'll try to edit this comment soon.
I never understood the point of the pelican-on-a-bicycle exercise:
LLM coding agents don't have any way to see the output.
It means the only thing this test is testing is the ability of the LLM to memorise.
Because it exercises thinking about a pelican riding a bike (not a common image) and then describing that using SVG. It's quite nice imho and seems to scale with the power of the model. I'm sure Simon has some actual reasons though.
I wouldn't say any LLMs are good at it. But it doesn't really matter, it's not a serious thing. It's the equivalent of "hello world" - or whatever your personal "hello world" is - whenever you get your hands on a new language.
Coordinates and shapes of the elements used to form a pelican.
If you think about how LLMs ingest their data, they have no way to know how to form a pelican in SVG.
I bet their ability to form a pelican results purely from someone already having done it before.
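To be clear about what I mean by "form a pelican in SVG": picking primitives and coordinates so the composition reads as a pelican, something like this crude hand-written example (mine, not model output):

```python
# A crude hand-written illustration of what "a pelican in SVG" involves:
# choosing primitive shapes and coordinates so that the composition reads
# as a bird with a big beak.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="220" height="200">
  <ellipse cx="100" cy="120" rx="45" ry="30" fill="white" stroke="black"/>  <!-- body -->
  <circle cx="140" cy="80" r="18" fill="white" stroke="black"/>             <!-- head -->
  <polygon points="155,78 210,92 155,98" fill="orange"/>                    <!-- beak/pouch -->
  <circle cx="145" cy="75" r="3" fill="black"/>                             <!-- eye -->
</svg>"""

with open("pelican.svg", "w", encoding="utf-8") as f:
    f.write(svg)
```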
> If you think about how LLMs ingest their data, they have no way to know how to form a pelican in SVG.
It's called generalization and yes, they do. I bet you could find plenty of examples of it working on something that truly isn't "present in the training data".
It's funny, you're so convinced that it's not possible without direct memorization, but you forgot to account for emergent behaviors (which are frankly all over the place in LLMs; where have you been?).
At any rate, the pelican thing from simonw is clearly just for fun at this point.
With Dark Mode.