Hacker News | karpathy's comments

Yes I noticed a few of these around. The LLM is a little too willing to give out grades for comments that were good/bad in a bit more general sense, even if they weren't making strong predictions specifically. Another thing I noticed is that the LLM has a very impressive recognition of the various usernames and who they belong to, and I think shows a little bit of a bias in its evaluations based on the identity of the person. I tuned the prompt a little bit based on some low-hanging fruit mistakes but I think one can most likely iterate it quite a bit further.

I think you were getting at this, but in case others didn't know: cstross is a famous sci-fi author and futurist :)

Thank you

It will work great with a 40GB GPU, probably a bit less than 2x slower. These are micro models of a few billion parameters at most and fit easily during both training and inference.


How low can this go? Can this run on a 5090 card (32GiB)?


Set nproc_per_node=1 instead of 8 (or run the training script directly instead of using torchrun) and set device_batch_size=4 instead of 32. You may be able to use 8 with a 5090, but it didn't work on my 4090. However it's way slower than expected, one H100 isn't 250x the 4090, so I'm not sure it's training correctly. I'll let it run overnight and see if the outputs make any sense, maybe the metrics are not accurate in this config.
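
One possible reason for the slowdown is plain batch arithmetic. The sketch below assumes the trainer keeps a fixed total batch size in tokens and makes up the difference with gradient accumulation. The 2048-token sequence length and the 524,288-token total batch are illustrative assumptions, not values from this thread.

```python
def grad_accum_steps(total_batch_tokens: int, device_batch_size: int,
                     seq_len: int, num_gpus: int) -> int:
    """Forward/backward passes needed per optimizer step to reach the total batch."""
    tokens_per_pass = device_batch_size * seq_len * num_gpus
    if total_batch_tokens % tokens_per_pass != 0:
        raise ValueError("total batch must be a multiple of the tokens per pass")
    return total_batch_tokens // tokens_per_pass


# Assumed values, for illustration only.
TOTAL_BATCH_TOKENS = 524_288
SEQ_LEN = 2048

eight_gpu = grad_accum_steps(TOTAL_BATCH_TOKENS, 32, SEQ_LEN, 8)
one_gpu = grad_accum_steps(TOTAL_BATCH_TOKENS, 4, SEQ_LEN, 1)
print(eight_gpu, one_gpu)  # 1 64
```

Under these assumptions a single GPU with device_batch_size=4 needs 64 passes per optimizer step instead of 1, so a slowdown far beyond the raw H100-vs-4090 gap would not by itself mean the run is broken.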


Still under development, remaining work includes tuning nanochat (current state being solid v0.1) and finalizing the in-between projects so that students can "unlock" all complexity that hides underneath: `torch.Tensor`, `torch.dist`, `.backward()`, `.compile()`, etc. And then the more ops heavy aspects.


What's the pricing for the course/EurekaLabs? P.s. thanks for all you're doing


Sorry I thought it would be clear and could have clarified that the code itself is just a joke illustrating the point, as an exaggeration. This was the thread if anyone is interested

https://chatgpt.com/share/68e82db9-7a28-8007-9a99-bc6f0010d1...


This part from the first try made me laugh:

      if random.random() < 0.01:
          logging.warning("This feels wrong. Aborting just in case.")
          return None


I actually laughed when I read that. This one got me, too. The casual validation of its paranoia gives me Marvin the Paranoid Android vibes.

  try:
      result = a / b
      if math.isnan(result):
          raise ArithmeticError("Result is NaN. I knew this would happen.")


I think that’s the funniest joke I’ve ever seen an LLM make. Which probably means it’s copied from somewhere.


"Why is a laser beam like goldfish? Because neither one can whistle." - Mike, The Moon is a Harsh Mistress


Fantastic book, just read it. Surprised no movie has been made.


If you haven't read Ursula Le Guin's "The Dispossessed", check it out too.

It's like a fine wine pairing for "The Moon is a Harsh Mistress."


The protagonists are libertarians with teenage harems, who fake an election and team up with a sex pest. That's extremely reductive to the point of parody, but that will likely be the media coverage of it the moment someone reads the women and politics in the book.

If you completely excise anything too distasteful for a current-day blockbuster, but want a film about a space mining colony uprising you might as well just adapt the game Red Faction instead: have the brave heroes blasting away with abandon at corpo guards, mad genetic experimenters and mercenaries and the media coverage can talk about how it's a genius deconstruction of Elon Musk's Martian dream or whatever.


You’d think some filmmaker would have run with the dystopian theme. The accuracy of the book’s predictions is impressive, even the location of the North American Space Defense Command. The biggest miss was people using wired telephones everywhere.


I liked it when I was 17 but have soured on it later after re-reading.

The only reason their libertarian revolution succeeds is because they have a centralised computer that secretly does everything for them.


> I liked it when I was 17

Same with pretty much every sci-fi movie and book from my youth. The movies that weren't rendered ridiculous by the invention of the cellphone were done in by the hairstyles or fashion.


If you're an extensive user of ChatGPT, or if you can give it some material about yourself like say, a resume or a LinkedIn profile, ask it to roast you. It will be very specific to the content you give it. Be warned, it can be brutal.


Whoa dude! It was brutal, but highly constructive! Actually extremely helpful (and quite funny, though I have a high sense of humor about things so others might not appreciate some of it :-D)

This was my favorite line after asking it to review my resume and roast me:

> Structure & Flow: “Like Kubernetes YAML — powerful, but not human-readable.”

Some other good ones:

> Content & Tone: “You’re a CTO — stop talking like a sysadmin with a thesaurus.”

> Overall Impression: “This resume is a technical symphony… that goes on for too many movements.”

I've got some resume work to do haha


They meant roast you, not your resume.


So rehash of top comments in /r/roastme?


I came back to this comment just to thank you - I started off with Claude, feeding it my personal site, my résumé, the HN roast of me, etc. and it was super funny.

But then, I veered that same conversation into asking for GTM (go to market) advice, and it was actually really good. It actually felt tailored to me (unsurprisingly) and a lot more useful.

As always, I don't know whether this is a very light form of "ai psychosis" haha but still, super grateful for the advice. Cheers


Periodic reminder that there’s also HN Wrapped. [0]

[0]: https://hn-wrapped.kadoa.com


ooooh boy, gotta mentally prepare myself for this one

<press enter>

damn these ai's are good!

<begins shopping for new username>


"The user will start a comment with 'I'm a social libertarian but...' only to be immediately downvoted by both libertarians and socialists. The irony will not be lost on them, just everyone else."

I can't say I'm not impressed. That's very funny


>You voted with your feet and moved to Western Europe for better well-being, but you still won't vote with your cursor and use a browser other than Edge.

I love this and hate this at the same time.


Absolutely hilarious, and gives me some self awareness tbh


Spot on and I don't even mind.


It would not be shocking if LLMs are legitimately better at making jokes about tasks they are extensively trained on.


Years and years ago, the MongoDB Java driver had something like this to skip logging sometimes in one of its error handling routines.

   } catch (Exception e) {
                if (!((_ok) ? true : (Math.random() > 0.1))) {
                    return res;
                }

                final StringBuilder logError = (new StringBuilder("Server seen down: ")).append(_addr);

                /* edited for brevity: log the error */
 
https://github.com/mongodb/mongo-java-driver/blob/1d2e6faa80...
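
The nested negation and ternary in that condition are easy to misread, so here is a rough Python equivalent of just the decision logic. The names `server_ok` and `rng` are mine, not the driver's, and reading `_ok` as "the server was previously considered up" is an assumption.

```python
import random


def should_log_error(server_ok: bool, rng=random.random) -> bool:
    """Mirror of the Java condition: always log when server_ok is true;
    otherwise log only when the random draw is above 0.1 (about 90% of the time)."""
    return True if server_ok else rng() > 0.1
```

In other words, the early `return res` only fires in roughly one out of ten cases where `_ok` is already false.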


One of my earlier jobs a decade ago involved doing pipeline development and Jenkins administration for the on-site developer lab on one of the NRO projects, and I inserted a random build failure code snippet to test that pipelines could recover from builds that failed for unpredictable reasons, like a network error rather than anything actually wrong with the build. I had to do this on the real system because we didn't have funds for a staging environment for the dev environment, and naturally I forgot to get rid of it when I was done. So builds randomly failed for years after that before I remembered and fixed it.


If we’re talking about funny error msgs, a buddy of mine got this yesterday in salesforce. It’s not _that_ funny but pretty funny for Salesforce.

System.DmlException: Insert failed. First exception on row 0; first error: UNKNOWN_EXCEPTION, Something is very wrong: []


I think there’s always a danger of these foundational model companies doing RLHF on non-expert users, and this feels like a case of that.

The AIs in general feel really focused on making the user happy - your example, and another one is how they love adding emojis to the stdout and over-commenting simple code.


This feels like RLVR, not RLHF.

With RLVR, the LLM is trained to pursue "verified rewards." On coding tasks, the reward is usually something like the percentage of passing tests.

Let's say you have some code that iterates over a set of files and does processing on them. The way a normal dev would write it, an exception in that code would crash the entire program. If you swallow and log the exception, however, you can continue processing the remaining files. This is an easy way to get "number of files successfully processed" up, without actually making your code any better.


> This is an easy way to get "number of files successfully processed" up, without actually making your code any better.

Well, it depends a bit on what your goal is.

Sometimes the user wants to eg backup as many files as possible from a failing hard drive, and doesn't want to fail the whole process just because one item is broken.


You're right, but the way to achieve this is to allow the error to propagate at the file level, then catch it one function above and continue to the next one.

However, LLM generated code will often, at least in my experience, avoid raising any errors at all, in any case. This is undesirable, because some errors should result in a complete failure - for example, errors which are not transient or environment related but a bug. And in any case, a LLM will prefer turning these single file errors into warnings, though the way I see it, they are errors. They just don't need to abort the process, but errors nonetheless.
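
A minimal sketch of that structure, assuming a hypothetical `FileProcessingError` for expected per-file failures: the file-level function raises, the caller one level up records the error and moves on, and anything unexpected (a real bug) still propagates and aborts the run.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


class FileProcessingError(Exception):
    """An expected, per-file failure such as an unreadable input."""


def process_file(path: Path) -> int:
    """Return the number of bytes processed; raise on failure."""
    try:
        data = path.read_bytes()
    except OSError as exc:
        # Let the error propagate to the caller instead of swallowing it here.
        raise FileProcessingError(f"cannot read {path}") from exc
    return len(data)


def process_all(paths):
    """Process every file, collecting per-file errors instead of hiding them."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = process_file(path)
        except FileProcessingError as exc:
            # Logged as an error, not a warning; it just does not stop the loop.
            logger.error("failed: %s", exc)
            failures[path] = exc
    return results, failures
```

Only `FileProcessingError` is caught in the loop, so a `TypeError` or similar bug inside `process_file` would still crash the program.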


Yes, that's cleaner.

> And in any case, a LLM will prefer turning these single file errors into warnings, though the way I see it, they are errors.

Well, in general they are something that the caller should have opportunity to deal with.

In some cases, aborting back to the caller at the first problem is the best course of action. In some other cases, going forward and taking note of the problems is best.

In some systems, you might even want to tell the caller about failures (and successes) as they occur, instead of waiting until the end.

It's all very similar to the different options people have available when their boss sends them on an errand and something goes wrong. A good underling uses their best judgement to pick the right way to cope with problems; but computer programs don't have that, so we need to be explicit.

See https://en.wikipedia.org/wiki/Mission-type_tactics for a related concept in the military.
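
One way to report failures and successes as they occur is a generator that hands each outcome to the caller immediately and leaves the decision to continue or stop with them. This is only a sketch; `run_each` and `task` are hypothetical names.

```python
def run_each(items, task):
    """Yield (item, outcome) as soon as each item finishes.

    outcome is the task's return value, or the Exception it raised.
    """
    for item in items:
        try:
            outcome = task(item)
        except Exception as exc:
            outcome = exc
        yield item, outcome


for item, outcome in run_each([1, 0, 2], lambda x: 10 // x):
    if isinstance(outcome, Exception):
        print(f"{item}: failed with {type(outcome).__name__}")
    else:
        print(f"{item}: ok ({outcome})")
```

The caller can `break` out of the loop at the first failure, or keep going and tally the problems, without the worker function having to guess which policy is wanted.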


And more advanced users are more likely to opt out of training on their data, Google gets around it with a free api period where you can't opt out and I think from did some of that too, through partnerships with tool companies, but not sure if you can ever opt out there.


*grok, not 'from'


'over-commenting simple code' is preparing it for future agent work. pay attention to those comments to learn how you can better scaffold for agents.


They do seem to leave otherwise useless comments for themselves. E.g., on the level of

  // Return the result
  return result;

I find this quite frustrating when reading/reviewing code generated by AI, but have started to appreciate that it does make subsequent changes by LLMs work better.

It makes me wonder if we'll end up in a place where IDEs hide comments by default (similar to how imports are often collapsed by default/automatically managed), or introduce some way of distinguishing between a more valuable human written comment and LLM boilerplate comments.


They should have a step to remove those sorts of comments, they only add noise to the code.


This is stunning English: "Perfect setup for satire. Here’s a Python function that fully commits to the bit — a traumatically over-trained LLM trying to divide numbers while avoiding any conceivable danger:" "Traumatically over-trained", while scoring zero google hits, is an amazingly good description. How can it intuitively know what "traumatic over-training" should mean for LLMs without ever having been taught the concept?


I don't know. It's a classic LLM-ism. "Traumatically over-X" is probably a common enough phrase. The prompt says, "I don't know what labs are doing to these poor LLMs during RL," so the model connects that to some form of trauma. The training is traumatic, so the model is traumatically over-trained.

It sounds fine and flows nicely, but it doesn't quite make sense. Too much training over-fits an LLM; that's not what we're describing. Bad training might traumatize a model, but bad how? A creative response would suggest an answer to that question—perhaps the model has been made paranoid, scarred by repeat exposure to the subtlest and most severe bugs ever discovered—but the LLM isn't being creative. Its response has that spongy, plastic LLM texture that comes from the model rephrasing its prompt to provide a sycophantic preamble for the thing that was actually being asked for. It uses new words for the same old idea, and a bit of the precision is lost during the translation.


Eh, you are rationalizing. The phrase "traumatically over-X" is extremely rare. Any problem is easy after you've seen the solution. :) The solution "traumatically over-trained LLM" to the problem "What description best fits karpathy's description?" is certainly not easy to find. Connecting RL, poor LLMs, extreme fear, and welfare to excess training and severe lasting emotional pain is pretty darn impressive. E.g., I know exactly what situation karpathy describes, but I couldn't in a million years put it into writing as succinctly and as precisely as the LLM.


> The phrase "traumatically over-X" is extremely rare.

There are plenty of "over-x" phrases in English associated with trauma or harm. Do a web search in quotes for "traumatic over{extension/exertion/stimulation}" (off the top of my head) and you'll get direct hits. And this isn't a Markov chain; it doesn't have to pull n-grams directly from its training material. That it could glue trauma and training into "traumatic over-training" is deeply unsurprising to me.

> I couldn't in a million years put it into writing as succinctly and as precisely as the LLM.

If that's the case, then (with respect) that may be down to your skills as a writer. The LLM puts it decently enough, but it's not very expressive and it doesn't add anything.

> Connecting RL, poor LLMs, extreme fear, and welfare to excess training and severe lasting emotional pain is pretty darn impressive

Is it? Really, we're just analogizing it to an abused pet. You over-train your dog, so it gets traumatized. The LLM connects the ideas and then synthesizes a lukewarm sentence to capture that connection at the cost of losing a degree of precision, because LLMs aren't animals. Models are good at those vector-embedding-style conceptual connections—I won't begrudge them that. Expressive use of language and fine-grained reasoning, though? Not so much.


Hard to know but if you could express "traumatically" as a number, and "over-trained" as a number, it seems like we'd expect "traumatically" + "over-trained" to be close to "traumatically over-trained" as a number. LLMs work in mysterious ways.


LLMs operate at token level, not word. it doesn't operate in terms of "traumatic", "over-training", "over" or "training", but rather "tr" "aum" "at" "ic, ", etc.


I think you are confusing tokens with vectors/embedding/parameters.

king and rex (king in latin) map to different tokens but will map to very similar vectors.
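
A toy illustration of that distinction: token IDs are arbitrary integers, while similarity lives in the vectors they map to. All IDs and 3-dimensional vectors below are invented for the example and do not come from any real model.

```python
import math

# Invented data: token IDs and embeddings are made up for illustration.
TOKEN_IDS = {"king": 101, "rex": 202, "banana": 303}
EMBEDDINGS = {
    101: [0.90, 0.80, 0.10],
    202: [0.85, 0.75, 0.15],
    303: [0.10, 0.20, 0.95],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def similarity(word_a, word_b):
    """Look up each word's token ID, then compare the embeddings."""
    return cosine(EMBEDDINGS[TOKEN_IDS[word_a]], EMBEDDINGS[TOKEN_IDS[word_b]])


print(similarity("king", "rex") > similarity("king", "banana"))  # True
```

The IDs 101 and 202 share nothing, yet their vectors point in nearly the same direction, which is the sense in which different tokens can still be "close" inside the model.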


> it doesn't operate in terms of "traumatic", "over-training", "over" or "training", but rather "tr" "aum" "at" "ic, ", etc.

And "毛片免费观看" (Free porn movies), "天天中彩票能" (Win the lottery every day), "热这里只有精品" (Hot, only fine products here) etc[1].

[1]: https://news.ycombinator.com/item?id=45483924


Weird thing I've noticed.

Some LLMs can output nerd font glyphs and others can't.

If I recall grok code fast can but codex and sonnet can't


“Traumatic overtraining” does have hits though. My guess is that “traumatically” is a rarely used adverb, and “traumatic” is much more common. Possibly it completed traumatic into an adverb and then linked to overtraining which is in the training data. I dunno how these things work though.


You need to read more if you think that's stunning English


The same way that you and I think up a word and what it might mean without being taught the concept.

Adverb + verb


But the machines cannot possibly have the magic brain-juice!


> How can it intuitively know what "traumatic over-training" should mean for LLMs without ever having been taught the concept?

Because, and this is a hot take, LLMs have emergent intelligence


Or language has patterns


Kind of interesting it didn't add type hints though! You'd think for all that paranoia it would at least add type hints.



It was a great joke, that's why I posted it


<3


Omg long post. TLDR from an LLM for anyone interested

Speed your audio up 2–3× with ffmpeg before sending it to OpenAI’s gpt-4o-transcribe: the shorter file uses fewer input-tokens, cuts costs by roughly a third, and processes faster with little quality loss (4× is too fast). A sample yt-dlp → ffmpeg → curl script shows the workflow.

;)
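
For anyone who wants the speed-up step without the full script, here is a small sketch that only builds the ffmpeg command. It relies on ffmpeg's `atempo` audio filter; splitting factors above 2.0 into chained stages is a conservative choice because, as far as I know, older ffmpeg builds capped a single stage at 2.0. The helper names and file names are hypothetical.

```python
def atempo_chain(speed: float) -> str:
    """Build an ffmpeg audio filter string for the requested speed-up."""
    if speed < 1.0:
        raise ValueError("this sketch only handles speed-ups (speed >= 1.0)")
    stages = []
    remaining = speed
    # Split large factors into 2.0x stages plus a remainder.
    while remaining > 2.0:
        stages.append(2.0)
        remaining /= 2.0
    stages.append(remaining)
    return ",".join(f"atempo={stage:g}" for stage in stages)


def ffmpeg_speedup_cmd(src: str, dst: str, speed: float) -> list:
    """Argument list suitable for subprocess.run(); nothing is executed here."""
    return ["ffmpeg", "-i", src, "-filter:a", atempo_chain(speed), dst]


print(ffmpeg_speedup_cmd("talk.m4a", "talk_fast.m4a", 3.0))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would produce the shortened file to upload for transcription.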


Hahaha. Okay, okay... I will watch it now ;)

(Thanks for your good sense of humor)


I like that your post deliberately gets to the point first and then (optionally) expands later, I think it's a good and generally underutilized format. I often advise people to structure their emails in the same way, e.g. first just cutting to the chase with the specific ask, then giving more context optionally below.

It's not my intention to bloat information or delivery but I also don't super know how to follow this format especially in this kind of talk. Because it's not so much about relaying specific information (like your final script here), but more as a collection of prompts back to the audience as things to think about.

My companion tweet to this video on X had a brief TLDR/Summary included where I tried, but I didn't super think it was very reflective of the talk, it was more about topics covered.

Anyway, I am overall a big fan of doing more compute at the "creation time" to compress other people's time during "consumption time" and I think it's the respectful and kind thing to do.


I watched your talk. There are so many more interesting ideas in there that resonated with me that the summary (unsurprisingly) skipped over. I'm glad I watched it!

LLMs as the operating system, the way you interface with vibe-coding (smaller chunks) and the idea that maybe we haven't found the "GUI for AI" yet are all things I've pondered and discussed with people. You articulated them well.

I think some formats, like a talk, don't lend themselves easily to meaningful summaries. It's about giving the audience things to think about, to your point. With storytelling the whole is more than the sum of its parts, and that's why we still do it.

My post is, at the end of the day, really more about a neat trick to optimize transcriptions. This particular video might be a great example of why you may not always want to do that :)

Anyway, thanks for the time and thanks for the talk!


> I often advise people to structure their emails [..]

I frequently do the same, and eventually someone sent me this HBR article summarizing the concept nicely as "bottom line up front". It's a good primer for those interested.

https://hbr.org/2016/11/how-to-write-email-with-military-pre...


This is the sort of content I want to see in Tweets and LinkedIn posts.

I have been thinking for a while how do you make good use of the short space in those places.

LLM did well here.


that's a really good summary :)


Fun demo of an early idea was posted by Oriol just yesterday :)

https://x.com/OriolVinyalsML/status/1935005985070084197


My takeaway from the demo is less that "it's different each time", but more a "it can be different for different users and their styles of operating" - a poweruser can now see a different Settings UI than a basic user, and it can be generated realtime based on the persona context of the user.

Example use case (chosen specifically for tech): An IDE UI that starts basic, and exposes functionality over time as the human developer's skills grow.


On one hand, I'm incredibly impressed by the technology behind that demo. On the other hand, I can't think of many things that would piss me off more than a non-deterministic operating system.

I like my tools to be predictable. Google search trying to predict that I want the image or shopping tag based on my query already drives me crazy. If my entire operating system did that, I'm pretty sure I'd throw my computer out a window.


> incredibly impressed by the technology behind that demo

An LLM generating some HTML?


At a speed that feels completely seamless to navigate through. Yeah, I'm pretty impressed by that.


Read the code that is actually being generated. It's only the content of the page, which itself is loaded progressively.

It takes 2 seconds to generate an extremely basic 300 characters page of content. Again, what is impressive here?

It's not fast, it gives the illusion of being fast.


I know what it's doing and I'm impressed. If you understand what it's doing and aren't impressed, that's cool too. I think we just see things differently and I doubt either of us will convince the other one to change their mind on this


it's impressive but it seems like a crappier UX? that none of the patterns can really be memorized


I feel like one quickly hits a similar partial observability problem as with e.g. light sensors. How often do you wave around annoyed because the light turned off.

To get _truly_ self driving UIs you need to read the mind of your users. It's some heavy tailed distribution all the way down. Interesting research problem on its own.

We already have adaptive UIs (profiles in VSC anyone? Vim, Emacs?); they're mostly under-utilized because they take time to set up and most people are not better at designing their own workflow than the sane default.


This is crazy cool, even if not necessarily the best use case for this idea


I would bet good money that many of the functions they chose not to drill down into (such as settings -> volume) do nothing at all or cause an error.

It's a frontend generator. It's fast. That's cool. But it is being pitched as a functioning OS generator and I can't help but think it isn't, given the failure rates for those sorts of tasks. Further, the success rates for HTML generation probably _are_ good enough for a Holmes-esque (perhaps too harsh) rugpull (again, too harsh) demo.

A cool glimpse into what the future might look like in any case.


That looks both cool and infuriating


Having different documents come up every time you go into the documents directory seems hellishly terrible.


It's a brand of terribleness I've somewhat gotten used to, opening Google Drive every time, when it takes me to the "Suggested" tab. I can't recall a single time when it had the document I care about anywhere close to the top.

There's still nothing that beats the UX of Norton Commander.


Ah yes, my operating system, most definitely a place I want to stick the Hallucinotron-3000 so that every click I make yields a completely different UI that has absolutely 0 bearing to reality. We're truly entering the "Software 3.0" days (can't wait for the imbeciles shoving AI everywhere to start overusing that dogshit, made-up marketing term incessantly)


"Please don't fulminate."

"Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."

"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

"Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize."

https://news.ycombinator.com/newsguidelines.html


Maybe we can collect all of this salt and operate a Thorium reactor with it, this in turn can then power AI.


We'll need to boil a few more lakes before we get to that stage I'm afraid, who needs water when you can have your AI hallucinate some for you after all?


Who needs water when all these hot takes come from sources so dense, they're about to collapse into black holes.


Is me not wanting the UI of my OS to shift with every mouse click a hot take? If me wanting to have the consistent "When I click here, X happens" behavior instead of the "I click here and I'm Feeling Lucky happens" behavior is equal to me being dense, so be it I guess.


No. But you interpreting and evaluating the demo in question as suggesting the things you described - frankly, yes. It takes a deep gravity well to miss a point this clear from this close.

It's a tech demo. It shows you it's possible to do these things live, in real time (and to back Karpathy's point about tech spread patterns, it's accessible to you and me right now). It's not saying it's a good idea - but there are obvious seeds of good ideas there. For one, it shows you a vision of an OS or software you can trivially extend yourself on the fly. "I wish it did X", bam, it does. And no one says it has to be non-deterministic each time you press some button. It can just fill what's missing and make additions permanent, fully deterministic after creation.


I kind of say it in words (agreeing with you), but I agree the versioning is a bit of a confusing analogy because it usually also implies some kind of improvement, when I’m just trying to distinguish them as very different software categories.


What do you think about structured outputs / JSON mode / constrained decoding / whatever you wish to call it?

To me, it's a criminally underused tool. While "raw" LLMs are cool, they're annoying to use as anything but chatbots, as their output is unpredictable and basically impossible to parse programmatically.

Structured outputs solve that problem neatly. In a way, they're "neural networks without the training". They can be used to solve similar problems as traditional neural networks, things like image classification or extracting information from messy text, but all they require is a Zod or Pydantic type definition and a prompt. No renting GPUs, labeling data and tuning hyperparameters necessary.

They often also improve LLM performance significantly. Imagine you're trying to extract calories per 100g of product, but some products give you calories per serving and a serving size, calories per pound etc. The naive way to do this is a prompt like "give me calories per 100g", but that forces the LLM to do arithmetic, and LLMs are bad at arithmetic. With structured outputs, you just give it the fifteen different formats that you expect to see as alternatives, and use some simple Python to turn them all into calories per 100g on the backend side.
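
A minimal sketch of that backend step, with plain dataclasses standing in for the Zod or Pydantic types and only three of the possible formats. The class and field names are hypothetical; the idea is that the model only picks a format and copies numbers, and the arithmetic happens in code.

```python
from dataclasses import dataclass
from typing import Union

GRAMS_PER_POUND = 453.59237


@dataclass
class PerHundredGrams:
    kcal: float


@dataclass
class PerServing:
    kcal: float
    serving_grams: float


@dataclass
class PerPound:
    kcal: float


CalorieInfo = Union[PerHundredGrams, PerServing, PerPound]


def kcal_per_100g(info: CalorieInfo) -> float:
    """Normalize any of the accepted formats to kcal per 100 g."""
    if isinstance(info, PerHundredGrams):
        return info.kcal
    if isinstance(info, PerServing):
        return info.kcal / info.serving_grams * 100
    if isinstance(info, PerPound):
        return info.kcal / GRAMS_PER_POUND * 100
    raise TypeError(f"unsupported format: {type(info).__name__}")


print(kcal_per_100g(PerServing(kcal=150, serving_grams=30)))  # 500.0
```

Adding another label format means adding one dataclass and one branch, with no change to the prompt's arithmetic burden.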


Even more than that. With Structured Outputs we essentially control layout of the response, so we can force LLM to go through different parts of the completion in a predefined order.

One way teams exploit that - force LLM to go through a predefined task-specific checklist before answering. This custom hard-coded chain of thought boosts the accuracy and makes reasoning more auditable.
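
As a rough illustration of the ordering trick, the sketch below builds a JSON-Schema-style dict whose checklist fields precede the final answer, plus a check that a reply followed that order. The field names are invented, and whether a given provider emits fields in schema order is an assumption to verify against its documentation.

```python
import json

# Hypothetical checklist fields; the final answer comes last on purpose.
FIELD_ORDER = ["units_identified", "serving_size_found", "reasoning", "answer"]


def build_schema(field_order):
    """A JSON-Schema-style dict whose property order mirrors field_order."""
    return {
        "type": "object",
        "properties": {name: {"type": "string"} for name in field_order},
        "required": list(field_order),
        "additionalProperties": False,
    }


def checklist_came_first(reply_text, field_order):
    """True if the reply has exactly the expected keys in the expected order."""
    reply = json.loads(reply_text)  # dicts preserve the key order of the JSON text
    return list(reply.keys()) == list(field_order)
```

Because the checklist values are generated before `answer`, they are also available afterwards for auditing why a particular answer was given.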


I also think that structured outputs are criminally underused, but they aren't perfect... and per your example, they might not even be good, because I've done something similar.

I was trying to make a decent cocktail recipe database, and scraped the text of cocktails from about 1400 webpages. Note that this was just the text of the cocktail recipe, and cocktail recipes are comparatively small. I sent the text to an LLM for JSON structuring, and the LLM routinely miscategorized liquor types. It also failed to normalize measurements with explicit instructions and the temperature set to zero. I gave up.
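
For the measurement part specifically, one option is to have the model copy the quantity verbatim as a string and do the normalization in code. A small sketch using the standard library, with a hypothetical `parse_quantity` helper that handles whole numbers, fractions, and mixed numbers such as `1 1/2`:

```python
from fractions import Fraction


def parse_quantity(text: str) -> Fraction:
    """Parse '2', '3/4', or '1 1/2' into an exact Fraction."""
    parts = text.split()
    if not parts:
        raise ValueError("empty quantity")
    # A mixed number is just the sum of its whitespace-separated parts.
    return sum((Fraction(part) for part in parts), Fraction(0))


print(float(parse_quantity("1 1/2")))  # 1.5
```

This does not help with miscategorized liquor types, but it takes one source of error away from the model.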


have you tried schema-aligned parsing yet?

the idea is that instead of using JSON.parse, we create a custom Type.parse for each type you define.

so if you want a:

   class Job { company: string[] }
And the LLM happens to output:

   { "company": "Amazon" }
We can upcast "Amazon" -> ["Amazon"] since you indicated that in your schema.

https://www.boundaryml.com/blog/schema-aligned-parsing

and since it's only post-processing, the technique will work on every model :)

for example, on BFCL benchmarks, we got SAP + GPT3.5 to beat out GPT4o ( https://www.boundaryml.com/blog/sota-function-calling )
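
To make the upcast concrete, here is a toy version of the idea in Python. It is not BAML's implementation, only a sketch of schema-driven coercion: the schema says `company` is a list of strings, so a bare string gets wrapped in a list. The `coerce` and `parse_job` helpers are hypothetical.

```python
import json
from typing import get_args, get_origin


def coerce(value, expected):
    """Coerce a parsed JSON value toward the expected type where it is unambiguous."""
    if get_origin(expected) is list:
        (item_type,) = get_args(expected)
        # Upcast a single value to a one-element list.
        items = value if isinstance(value, list) else [value]
        return [coerce(item, item_type) for item in items]
    if expected is str and not isinstance(value, str):
        return str(value)
    return value


JOB_SCHEMA = {"company": list[str]}


def parse_job(raw: str) -> dict:
    """Parse the model output and align it with JOB_SCHEMA."""
    data = json.loads(raw)
    return {key: coerce(data[key], expected) for key, expected in JOB_SCHEMA.items()}


print(parse_job('{"company": "Amazon"}'))  # {'company': ['Amazon']}
```

A real implementation would also need to handle missing keys, nested classes, and malformed JSON; this only shows the single coercion from the example.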


Interesting! I was using function calling in OpenAI and JSON mode in Ollama with zod. I may revisit the project with SAP.


    so if you want a:

       class Job { company: string[] }

    We can upcast "Amazon" -> ["Amazon"] since you indicated that in your schema.
Congratulations! You've discovered Applicative Lifting.


it's a bit more nuanced than applicative lifting. part of SAP is that, but there's also supporting strings that don't have quotation marks, supporting recursive types, supporting unescaped quotes like: `"hi i wanted to say "hi""`, supporting markdown blocks inside of things that look like "json", etc.

but applicative lifting is a big part of it as well!

gloochat.notion.site/benefits-of-baml
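
The markdown-block case is the easiest of those to sketch. The snippet below is a simplified stand-in, not the SAP parser: it pulls the first fenced block out of a reply, if there is one, and hands the contents to the standard JSON parser. `extract_json` is a hypothetical helper.

```python
import json
import re

FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example

# Optional "json" language tag, then the block body up to the closing fence.
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(reply: str):
    """Parse JSON from a reply that may wrap it in a markdown code block."""
    match = _FENCED.search(reply)
    payload = match.group(1) if match else reply
    return json.loads(payload)
```

Unquoted strings and unescaped inner quotes need an error-tolerant parser and are out of scope for this sketch.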


Ok. Tried it, I'm not super impressed.

    Client: Ollama (phi4) - 90164ms. StopReason: stop. Tokens(in/out): 365/396
    ---PROMPT---
    user: Extract from this content:
    Grave Digger: 
     Ingredients
    
    - 1 1/2 ounces vanilla-infused brandy*
    
    - 3/4 ounce coffee liqueur
    
    - 1/2 ounce Grand Marnier
    
    - 1 ounce espresso, freshly brewed
    
    - Garnish: whipped cream
    
    - Garnish: oreo cookies, crushed
    
    Steps
    
    1.  Add all ingredients into a shaker with ice and shake until
        well-chilled.
    
    2.  Strain into a coupe.
    
    3.  Top with whipped cream and crushed Oreo cookies (discarding cream in
        center).
    
    *Vanilla-infused brandy: Cut 2 fresh vanilla pods lengthwise and place
    into a 750 mL bottle of brandy. Let sit for 3 to 5 days, shaking
    occasionally. Strain out pods and store.
    
    
    
    IngredientType
    ----
    - Alcohol
    - Sweetener
    - Sour
    - Aromatic
    - Bittering_agent
    - Food
    - Dilution
    
    LiquorType
    ----
    - Gin
    - Vodka
    - Rum
    - Whiskey
    - Tequila
    - Mezcal
    - Brandy
    - Cognac
    - Liqueur
    - Wine
    - Absinthe
    
    Answer in JSON using this schema:
    {
      Name: string,
      Ingredients: [
        {
          Text: string,
          Type: IngredientType,
          Liquor_type: LiquorType or null,
          Name_brand: string or null,
          Unit_of_measure: string,
          Measurement_or_unit_count: string,
        }
      ],
      Steps: [
        {
          Number: int,
          Instruction: string,
        }
      ],
      Garnishes: string[],
    }
    
    ---LLM REPLY---
    ```json
    {
      "Name": "Grave Digger",
      "Ingredients": [
        {
          "Text": "vanilla-infused brandy*",
          "Type": "Alcohol",
          "Liquor_type": "Brandy",
          "Name_brand": null,
          "Unit_of_measure": "ounces",
          "Measurement_or_unit_count": "1 1/2"
        },
        {
          "Text": "coffee liqueur",
          "Type": "Liqueur",
          "Liquor_type": "Liqueur",
          "Name_brand": null,
          "Unit_of_measure": "ounce",
          "Measurement_or_unit_count": "3/4"
        },
        {
          "Text": "Grand Marnier",
          "Type": "Liqueur",
          "Liquor_type": "Liqueur",
          "Name_brand": "Grand Marnier",
          "Unit_of_measure": "ounce",
          "Measurement_or_unit_count": "1/2"
        },
        {
          "Text": "espresso, freshly brewed",
          "Type": "Bittering_agent",
          "Liquor_type": null,
          "Name_brand": null,
          "Unit_of_measure": "ounces",
          "Measurement_or_unit_count": "1"
        }
      ],
      "Steps": [
        {
          "Number": 1,
          "Instruction": "Add all ingredients into a shaker with ice and shake until well-chilled."
        },
        {
          "Number": 2,
          "Instruction": "Strain into a coupe."
        },
        {
          "Number": 3,
          "Instruction": "Top with whipped cream and crushed Oreo cookies (discarding cream in center)."
        }
      ],
      "Garnishes": [
        "whipped cream",
        "oreo cookies, crushed"
      ]
    }
    ```
    ---Parsed Response (class Recipe)---
    {
      "Name": "Grave Digger",
      "Ingredients": [
        {
          "Text": "vanilla-infused brandy*",
          "Type": "Alcohol",
          "Liquor_type": "Brandy",
          "Name_brand": null,
          "Unit_of_measure": "ounces",
          "Measurement_or_unit_count": "1 1/2"
        },
        {
          "Text": "espresso, freshly brewed",
          "Type": "Bittering_agent",
          "Liquor_type": null,
          "Name_brand": null,
          "Unit_of_measure": "ounces",
          "Measurement_or_unit_count": "1"
        }
      ],
      "Steps": [
        {
          "Number": 1,
          "Instruction": "Add all ingredients into a shaker with ice and shake until well-chilled."
        },
        {
          "Number": 2,
          "Instruction": "Strain into a coupe."
        },
        {
          "Number": 3,
          "Instruction": "Top with whipped cream and crushed Oreo cookies (discarding cream in center)."
        }
      ],
      "Garnishes": [
        "whipped cream",
        "oreo cookies, crushed"
      ]
    }
Processed Recipe: { Name: 'Grave Digger', Ingredients: [ { Text: 'vanilla-infused brandy*', Type: 'Alcohol', Liquor_type: 'Brandy', Name_brand: null, Unit_of_measure: 'ounces', Measurement_or_unit_count: '1 1/2' }, { Text: 'espresso, freshly brewed', Type: 'Bittering_agent', Liquor_type: null, Name_brand: null, Unit_of_measure: 'ounces', Measurement_or_unit_count: '1' } ], Steps: [ { Number: 1, Instruction: 'Add all ingredients into a shaker with ice and shake until well-chilled.' }, { Number: 2, Instruction: 'Strain into a coupe.' }, { Number: 3, Instruction: 'Top with whipped cream and crushed Oreo cookies (discarding cream in center).' } ], Garnishes: [ 'whipped cream', 'oreo cookies, crushed' ] }

So, yeah, the main issue is that it dropped some ingredients that were present in the original LLM reply. Separately, the original LLM reply misclassified the `Type` field for `coffee liqueur`, which should have been `Alcohol`.


appreciate you trying it. the reason it dropped the data was due to your type system not being understood by the LLM you're using.

the model replied with

    {
      "Text": "coffee liqueur",
      "Type": "Liqueur",
      "Liquor_type": "Liqueur",
      "Name_brand": null,
      "Unit_of_measure": "ounce",
      "Measurement_or_unit_count": "3/4"
    },

but you expected a

    {
      Text: string,
      Type: IngredientType,
      Liquor_type: LiquorType or null,
      Name_brand: string or null,
      Unit_of_measure: string,
      Measurement_or_unit_count: string,
    }

there's no way to cast `Liqueur` -> `IngredientType`. but since the data model is an `Ingredient[]` we attempted to give you as many ingredients as possible.
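
To make the dropping behavior concrete, here is a minimal Python sketch of lenient list parsing. It is not the library's actual parser, only an illustration of the idea under stated assumptions: `parse_ingredients_leniently` is a hypothetical helper, and the items are trimmed down to the `Text` and `Type` fields from the reply above.

```python
from enum import Enum


class IngredientType(Enum):
    # Values copied from the IngredientType list in the prompt.
    ALCOHOL = "Alcohol"
    SWEETENER = "Sweetener"
    SOUR = "Sour"
    AROMATIC = "Aromatic"
    BITTERING_AGENT = "Bittering_agent"
    FOOD = "Food"
    DILUTION = "Dilution"


def parse_ingredients_leniently(raw_items):
    """Hypothetical helper: keep items whose Type casts to the enum, collect the rest."""
    kept, dropped = [], []
    for item in raw_items:
        try:
            IngredientType(item.get("Type"))  # raises ValueError for "Liqueur"
        except ValueError:
            dropped.append(item)
        else:
            kept.append(item)
    return kept, dropped


# Trimmed-down version of the LLM reply (only Text and Type are shown).
raw = [
    {"Text": "vanilla-infused brandy*", "Type": "Alcohol"},
    {"Text": "coffee liqueur", "Type": "Liqueur"},
    {"Text": "Grand Marnier", "Type": "Liqueur"},
    {"Text": "espresso, freshly brewed", "Type": "Bittering_agent"},
]

kept, dropped = parse_ingredients_leniently(raw)
print("kept:   ", [i["Text"] for i in kept])
print("dropped:", [i["Text"] for i in dropped])
```

The two items typed as `Liqueur` land in `dropped`, which matches the two ingredients missing from the parsed response.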

The model itself being wrong isn't something we can do much about. that depends on 2 things (the capabilities of the model, and the prompt you pass in).

If you wanted to capture all of the items with more rigor you could write it in this way:

    class Recipe {
        name string
        ingredients Ingredient[]
        num_ingredients int
        ...

        // add a constraint on the type
        @@assert(counts_match, {{ this.ingredients|length == this.num_ingredients }})
    }

And then if you want to be very wild, put this in your prompt:

    {{ ctx.output_format }}
    No quotes around strings

And it'll do some cool stuff
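
For anyone who wants the intent of the `counts_match` assertion spelled out, here is a rough Python equivalent. It is only a sketch: the dict layout and the `counts_match` function are assumptions for illustration, not how the assertion is evaluated internally.

```python
def counts_match(recipe: dict) -> bool:
    """Mirror of the check: the parsed list length must equal the count the model reported."""
    return len(recipe["ingredients"]) == recipe["num_ingredients"]


# The model says there are 4 ingredients and all 4 survive parsing.
full = {
    "name": "Grave Digger",
    "ingredients": ["vanilla-infused brandy*", "coffee liqueur", "Grand Marnier", "espresso"],
    "num_ingredients": 4,
}

# Same reply, but two ingredients were dropped during parsing.
lossy = {**full, "ingredients": ["vanilla-infused brandy*", "espresso"]}

print(counts_match(full))
print(counts_match(lossy))
```

With the count in the schema, a partially parsed list fails the check instead of being returned silently.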


if you share your prompt with me on promptfiddle.com i can play around with it and see how i can make it better!


Which LLM?


note the per 100g prompt might lead the llm to reach for the part of its training distribution that is actually written in terms of the 100g standard and just lead to different recall rather than a suboptimal calculation based on non-standardized per 100g training examples.


The versioning makes sense to me. Software has a cycle where a new tool is created to solve a problem, and the problem winds up being meaty enough, and the tool effective enough, that the exploration of the problem space the tool unlocks is essentially a new category/skill/whatever.

computers -> assembly -> HLL -> web -> cloud -> AI

Nothing on that list has disappeared, but the work has changed enough to warrant a few major versions imo.


For me it's even simpler:

V1.0: describing solutions to specific problems directly, precisely, for machines to execute.

V2.0: giving machine examples of good and bad answers to specific problems we don't know how to describe precisely, for machine to generalize from and solve such indirectly specified problem.

V3.0: telling machine what to do in plain language, for it to figure out and solve.

V2 was coded in V1 style, as a solution to problem of "build a tool that can solve problems defined as examples". V3 was created by feeding everything and the kitchen sink into V2 at the same time, so it learns to solve the problem of being general-purpose tool.
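
A toy Python sketch of the three styles, using sentiment as the example task. Everything here is made up for illustration (the function names, the word-count "training", the example reviews), and the V3 function only builds a prompt; it does not call any model.

```python
from collections import Counter


# V1.0: a human writes the rule directly.
def v1_is_positive(text: str) -> bool:
    lowered = text.lower()
    return "good" in lowered or "great" in lowered


# V2.0: a human supplies labeled examples; the rule is derived from them.
def v2_train(examples):
    pos, neg = Counter(), Counter()
    for text, label in examples:
        (pos if label else neg).update(text.lower().split())

    def classify(text: str) -> bool:
        # Score each word by how much more often it appeared in positive examples.
        score = sum(pos[w] - neg[w] for w in text.lower().split())
        return score > 0

    return classify


# V3.0: a human states the task in plain language; a model would do the rest.
def v3_prompt(text: str) -> str:
    return f"Is the following review positive? Answer yes or no.\n\n{text}"


examples = [
    ("loved it", True),
    ("really loved the food", True),
    ("hated it", False),
    ("terrible food and hated the service", False),
]
v2_is_positive = v2_train(examples)

print(v1_is_positive("Great stuff"))
print(v2_is_positive("loved the service"))
print(v2_is_positive("hated the food"))
print(v3_prompt("loved the service"))
```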


That's less a versioning of software and more a versioning of AI's role in software. None -> Partial -> Total. It's a valid scale with regard to AI's role specifically, but I think Karpathy was intending to make a point about software as a whole, and even the details of how that middle "Partial" era evolves.


What are some predictions people are anticipating for V4?

My Hail Mary is it’s going to be groups of machines gathering real world data, creating their own protocols or forms of language isolated to their own systems in order to optimize that particular system’s workflow and data storage.


But that means AGI is going to write itself


> versioning is a bit confusing analogy because it usually additionally implies some kind of improvement

Exactly what I felt. Semver-like naming analogies bring their own set of implicit meanings, like major versions having to necessarily supersede or replace the previous version; that is, they don't account for coexistence beyond planning migration paths. This expectation however doesn't correspond with the rest of the talk, so I thought I might point it out. Thanks for taking the time to reply!


Andrej, maybe Software 3.0 is not written in spoken language like code or prompts. Software 3.0 is recorded in behavior, a behavior that today's software lacks. That behavior is written and consumed by machine and annotated by human interaction. Skipping to 3.0 is premature, but Software 2.0 is a ramp.


Would this also be more of a push towards robotics and getting physical AI in our everyday lives?


Very insightful! How you would describe boiling an egg is different than how a machine would describe it to another machine.


Funny that you should use boiling an egg as an example. https://www.nature.com/articles/s44172-024-00334-w


no no, it actually is a good analogy in 2 ways:

1) it is a breaking change from the prior version

2) it is an improvement in that, in its ideal/ultimate form, it is a full superset of capabilities of the previous version


Btw I notice many pretty bad errors in this transcription of the talk. The actual video will be up soon I hope.


Ah sorry! I'm going to downweight this thread now.

There's so much demand around this, people are just super eager to get the information. I can understand why, because it was my favorite talk as well :)


The video is now up and on the front page of HN:

https://news.ycombinator.com/item?id=44314423




Submitters: "Please submit the original source. If a post reports on something found on another site, submit the latter." - https://news.ycombinator.com/newsguidelines.html


right, can you please move this whole discussion thread over there then, to avoid the duplicate conversation?


Please don't think we haven't put any thought into this. It's not always a straightforward decision.

In this case the determining factor is that the original submission was not the verbatim speech that the speaker gave, and the speaker himself complained that some of the transcription was inaccurate.

This then caused significant meta-discussion about the inaccuracy of the transcription. For the other comments that were about the content itself, we can't be sure if those comments were made in response to parts of the talk that were accurately or inaccurately transcribed.

We understand there may be overlapping comments, but that's a price we have to pay given the other considerations. No decision would be perfect, and it's not the same as usual "dupe" scenarios. We have to be a bit flexible when the situation is different from the norm in important ways.


How soon? I am contemplating whether to read this errorful transcript or wait for the video


anything you'd want fixed immediately? happy to do so – or even take this down if you wish. it's your talk.


Is this because it was recorded with AI tooling rather than a traditional note taker?


it was an audio recording, transcribed with speech to text models. there's definitely some errors and words lost. I also tried to emphasize this


Thanks for the clarification. Bit ironic given the talk’s subject. It is quite a bit of effort, but there’s something to say for going through and manually writing up the transcript like a journalist. Sometimes you can’t beat human effort ;)


What about a middle ground? Speech-to-text AI with manual corrections?


That’s a great approach! That’s what I meant to convey if I had been a bit more articulate. I assume journalists do exactly that. Takes away some laborious work while retaining accuracy.

