
I have no ambient lighting. I keep my window open or the CO2 level gets bad. If I turn the lights on, every fucking insect in the forest comes into my room. Or I can keep them off and get a fresh breeze while being at my PC in the evening.

With Dark Mode.


Inference doesn't return a single token, it returns a probability for every token. You then select the highest-scoring token that is allowed according to the compiler.
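Roughly, the mechanics look like this (a toy sketch with a made-up four-token vocabulary, not any real inference API):

    import math

    # Toy sketch of constrained decoding: the model scores every token,
    # and we only pick among the tokens the grammar/compiler allows here.
    def softmax(logits):
        m = max(logits.values())
        exps = {tok: math.exp(v - m) for tok, v in logits.items()}
        total = sum(exps.values())
        return {tok: e / total for tok, e in exps.items()}

    def pick_constrained(logits, allowed):
        probs = softmax(logits)
        return max(allowed, key=lambda tok: probs[tok])

    # Made-up logits: the model "wants" to write prose, but only tokens that
    # can start a JSON document are syntactically valid at this position.
    logits = {'"Sure': 5.1, "{": 1.2, "[": 0.3, "Hello": 2.7}
    print(pick_constrained(logits, allowed={"{", "["}))  # -> {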

Hmm, wouldn't it sacrifice a better answer in some cases (not sure how many though)?

I'd be surprised if they hadn't specifically trained for structured "correct" output here, in addition to picking the next token that follows the structure.


In my experience (I've put hundreds of billions of tokens through structured outputs over the last 18 months), I think the answer is yes, but only in edge cases.

It generally happens when the grammar is highly constrained, for example if a boolean is expected next.

If the model assigns a low probability to both true and false coming next, then the sampling strategy will pick whichever one happens to score highest. Most tokens have very similar probabilities close to 0 most of the time, and if you're picking between two of these then the result will often feel random.
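A toy illustration of that renormalization effect (numbers invented for the example):

    # Suppose the model puts almost no probability on either boolean; the
    # grammar then renormalizes over just {true, false}, and the pick looks
    # decisive even though it is essentially noise.
    probs = {"I": 0.62, " The": 0.21, "true": 0.0004, "false": 0.0003}
    allowed = ["true", "false"]

    mass = sum(probs[t] for t in allowed)
    renormalized = {t: probs[t] / mass for t in allowed}
    print(renormalized)  # {'true': ~0.57, 'false': ~0.43}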

It's always the result of a bad prompt though: if you improve the prompt so that the model understands the task better, there will be a clear difference in the scores the tokens get, and the result seems less random.


It's not just the prompt that matters, it's also field order (and a bunch of other things).

Imagine you're asking your model to give you a list of tasks mentioned in a meeting, along with a boolean indicating whether the task is done. If you put the boolean first, the model must decide both what the task is and whether it is done at the same time. If you put the task description first, the model can separate that work into two distinct steps.
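A concrete sketch of the ordering trick (hypothetical schemas, not any particular library's API):

    # With constrained decoding the model emits fields in schema order,
    # so field order decides what it has to commit to first.
    boolean_first = {
        "done": "boolean",   # must judge done-ness before the task is even written out
        "task": "string",
    }

    task_first = {
        "task": "string",    # writes the task description first...
        "done": "boolean",   # ...then judges done-ness with that text already in context
    }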

There are more tricks like this. It's really worth thinking about which calculations you delegate to the model and which you do in code, and how you integrate the two.


Grammars work best when aligned with the prompt. That is, if your prompt gives you the right format of answer 80% of the time, the grammar will take you to 100%. If it gives you the right answer 1% of the time, the grammar will give you syntactically correct garbage.

Sampling is already constrained with temperature, top_k, top_p, top_a, typical_p, min_p, entropy_penalty, smoothing, etc. – filtering tokens down to the ones a grammar allows is just one more knob. It makes sense and can be used for producing programming-language output as well: what's the point in generating output you know up front is invalid? Better to filter it out and allow only valid completions.
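A rough sketch of how a grammar filter composes with those knobs (toy vocabulary and made-up logits; real samplers differ in the details):

    import math, random

    def sample(logits, allowed, temperature=0.8, top_p=0.9):
        # 1. Grammar filter: drop tokens that are invalid at this step.
        logits = {t: v for t, v in logits.items() if t in allowed}
        # 2. Temperature + softmax over the survivors.
        m = max(logits.values())
        exps = {t: math.exp((v - m) / temperature) for t, v in logits.items()}
        total = sum(exps.values())
        probs = {t: e / total for t, e in exps.items()}
        # 3. Nucleus (top_p) truncation, then sample from what remains.
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        tokens, weights = zip(*kept)
        return random.choices(tokens, weights=weights)[0]

    logits = {'"name"': 2.3, '"id"': 1.9, "Sure": 4.0, "}": 0.5}
    print(sample(logits, allowed={'"name"', '"id"', "}"}))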

The "better answer" wouldnt had respected the schema in this case.

No, that's a rumor lots of people have been taking at face value. If you do the math, inference is very lucrative. Here someone deployed a big model and the cost comes out to $0.20 per 1M tokens: https://lmsys.org/blog/2025-05-05-large-scale-ep/

I could find an ARR for Cursor of $500M. Why does the article say that Cursor is losing money with this spending number?


The article Zitron links says Cursor has single-digit millions of cash burn with about $1B in the bank (as of August). Assuming that is true, they are losing money but have a long runway.

https://www.newcomer.co/p/cursors-popularity-has-come-at-a


Single-digit cash burn on AWS, which the article says is only a small part of its compute, with the majority coming from Anthropic.


That article says "Anysphere runs pretty lean with around 150 employees and has a single digit monthly cash burn, a source tells me." That would be total cash burn, i.e., net losses. If their AWS bill is bigger than that it's because they are making up for part of it with revenue.


Ah, gotcha. I misunderstood your comment.


Ed's mentioned ARR in previous articles and it's not a "generally accepted accounting principle". They cherry pick the highest monthly revenue number and multiply that by 12, but that's not their actual annual revenue.


"Cherry pick the highest" is misleading. If your revenue is growing 10% a month for a year straight and is not seasonal, picking any other than the most recent month to annualize would make no sense.


If a company's revenue in January is $100 and it grows by 10% every month, the December revenue is $285. The year's revenue would be about $2,138, but ARR in December would be $3,423. That's 1.6x the actual revenue.
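For anyone who wants to check the arithmetic, a throwaway calculation:

    # $100 in January, growing 10% month over month.
    monthly = [100 * 1.1 ** i for i in range(12)]
    december = monthly[-1]           # ~$285
    annual_revenue = sum(monthly)    # ~$2,138 for the calendar year
    december_arr = december * 12     # ~$3,423, about 1.6x the actual year
    print(int(december), int(annual_revenue), int(december_arr))  # 285 2138 3423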

ARR could be a useful tool to help predict future revenue, but why not simply report actual revenue and suggest it might increase in the next year? I have found most articles to be unclear to the reader about what ARR actually represents.


Why is the calendar year the relevant unit? If you insist on years, then if you consider the year from June to June, $2,138 would be misleadingly small.

The point of ARR is to give an up-to-date measure of a rapidly changing number. If you only report projected calendar-year revenue, then on January 1 you switch from reporting 2025 annual revenue to 2026 projected revenue, a huge and confusing jump. Why not just report ARR every month? It's basically just a way of reporting monthly revenue: take the number you get and divide it by 12.

I am really skeptical that people are being bamboozled by this in some significant way. Zitron does far more confusing things with numbers in the name of critique.


You're correct, ARR can be misleading and can be computed over any 12-month period (I just chose a calendar year to simplify), but the problem is that AI companies tend to only release their latest ARR, and only selectively, which I believe is misleading in the opposite direction.


Because that's a part of the generally accepted accounting principles: https://www.rightrev.com/gaap-revenue-vs-arr/

Nobody considers a year from June to June because that would be misleading.


That is an article explaining why ARR is useful and important despite not being the same thing as GAAP revenue.


How can you talk about ARR if you only look at 1 year?


It's useful for financial planning. Less useful for overall financial reporting given how volatile it is.


I should've been more clear. How can you talk about ARR if you only have 1 year?

How do you know it's recurring? What data do you have (historic) that makes you believe the revenue will happen again?

Is this based on signed contracts etc so you have some guarantees?


The "annual" just means that the unit of time is a year. It doesn't mean that it is recurring annually. You can call it Annualized Monthly Recurring Revenue if it makes you feel better.


Well people like Sam Altman have not been entirely honest and there's a reason they're not sharing their actual revenue numbers. If they could show they were growing 10% every month they would.


They are sharing their actual revenue numbers. That's what ARR is. Take the number and divide it by 12 and that's monthly revenue.


It's literally not. ARR is the highest monthly revenue times 12. Dividing it by 12 doesn't get you the actual, on-the-books monthly revenue numbers.


Eh, when you have a company that’s growing, picking the highest and annualizing it is sensible. If we had a mature company with highly seasonal revenue it would be dishonest.


I mean I think there are instances where OpenAI's revenue is seasonal. Lots of students using it during the school year and cancelling it during summer.


The graph that was widely shared to make this claim was from OpenRouter and did not represent ChatGPT usage in any way.


I think you missed the forest for the trees. I am sure the student population has some dropoff during the summer months, but the point is that for businesses growing month over month, which most of these have been since creation, you take the highest (latest) number and annualize it.

I am also willing to bet that the student dropoff is not that pronounced. I am thinking more of a business that sells beach umbrellas: lots of sales in the summer months and next to nothing in the winter. Annualizing a peak month there would be dishonest.


Then why aren't AI companies reporting their actual monthly revenues?


They are. That is what ARR is.


Overall, people prefer writing with lambdas, some use the SQL-like syntax, but there is a strong preference for lambdas.


Just compare it with a human on a bicycle and you'll see that LLMs are weirdly good at drawing pelicans in SVG, but not humans.


I thought a human would be a considerable step up in complexity, but I asked it first for a pelican[0] and then for a rat[1] to get out of the bird world, and it did a great job on both.

But just for thrills I also asked for a "punk rocker"[2] and the result--while not perfect--is leaps and bounds above anything from the last generation.

0 -- ok, here's the first hurdle! It's giving me "something went wrong" when I try to get a share link on any of my artifacts. So for now it'll have to be a "trust me bro" and I'll try to edit this comment soon.


I never understood the point of the pelican on a bicycle exercise: LLM coding agents don't have any way to see the output. It means the only thing this test measures is the LLM's ability to memorise.

Edit: just to show my point, a regular human on a bicycle is way worse with the same model: https://i.imgur.com/flxSJI9.png


Because it exercises thinking about a pelican riding a bike (not common) and then describing that using SVG. It's quite nice imho and seems to scale with the power of the model. I'm sure Simon has some actual reasons though.


> Because it exercises thinking about a pelican riding a bike (not common)

It is extremely common, since it's used on every single LLM to bench it.

And there is no logic to it: LLMs are never trained on graphics tasks, and they don't see the output of the code.


I mean real-world examples of a pelican riding a bike are not common. It's common in benchmarking LLMs, but that's not what I meant.


The only thing it exercises is the ability of the model to recall its pelican-on-bicycle and other SVG training data.


It's more for fun than as a benchmark.


It also measures something LLMs are good at, probably due to cheating.


I wouldn't say any LLMs are good at it. But it doesn't really matter, it's not a serious thing. It's the equivalent of "hello world" - or whatever your personal "hello world" is - whenever you get your hands on a new language.


Memorise what exactly?


The coordinates and shapes of the elements used to form a pelican. If you think about how LLMs ingest their data, they have no way to know how to form a pelican in SVG.

I bet their ability to form a pelican comes purely from the fact that someone has already done it before.


> If you think about how LLMs ingest their data, they have no way to know how to form a pelican in SVG.

It's called generalization and yes, they do. I bet you could find plenty of examples of it working on something that truly isn't "present in the training data".

It's funny, you're so convinced that it's not possible without direct memorization but forgot to account for emergent behaviors (which are frankly all over the place in LLMs; where have you been?).

At any rate, the pelican thing from simonw is clearly just for fun at this point.


You can zoom out and then it's screenshotable.


The em dashes exist in ChatGPT output because existing human text contains them, like journal articles.


Remember, they only measured that the less time you spend on a task, the less you remember it.

