
The larger issue, of course, is that eccentric individuals and niche special-interest groups are able to use the planning process and the legal system to jam up all sorts of infrastructure projects in America, from simple turn lanes all the way to high-speed rail. This is not the only reason America has trouble building infrastructure, but it is an important one. See Ezra Klein and Derek Thompson's new book Abundance for a long-form analysis of this… or, for a contrast with the US's "lawyerly society" (and, of course, the disadvantages of leaning too far in the other direction), Dan Wang's Breakneck: China's Quest to Engineer the Future, which just came out.

Both are excellent books and will probably appeal to a lot of Hacker News folks with an engineering/builder mindset.


Freakonomics interviewed Dan Wang about his book Breakneck back in September, see episode #647. It's a very interesting lens through which to view both societies, worth a listen!

One of the things I liked about her interview was how she candidly says her strengths are less in opening up new areas or proving new theorems and more in reworking and clarifying existing areas (i.e. Lurie's work) with cleaner approaches and new proofs, to make them more accessible and therefore more useful.

This seems to me to be admirable, and perhaps under-appreciated, although it is probably valued more in mathematics than in most other fields: mathematicians prize simplicity and clarity of exposition for its own sake, and unfamiliar mathematics is just so hard to read. Her north-star goal of making her field accessible to mathematics undergraduates was a nice one.

I would like to learn category theory properly one day, at least to that kind of "advanced undergraduate" level she mentions. It's always seemed to me, when dipping into it, that it should be easier to understand than it is, if that makes sense - the terminology, notation, and abstraction are forbidding, but the core of "objects with arrows between them" also has the feeling of something that a (very smart) child could understand. Time to take another crack at it, perhaps?


> I would like to learn category theory properly one day, at least to that kind of "advanced undergraduate" level she mentions.

As someone who tried to learn category theory, and then did a mathematics degree, I think anyone who wants to properly learn category theory would benefit greatly from learning the surrounding mathematics first. The nontrivial examples in category theory come from group theory, ring theory, linear algebra, algebraic topology, etc.

For example, Set/Group/Ring have initial and final objects, but Field does not. Why? Really understanding requires at least some knowledge of ring/field theory.

What is an example of a nontrivial functor? The fundamental group is one. But appreciating the fundamental group requires ~3 semesters of math (analysis, topology, group theory, algebraic topology).

Why are opposite categories useful? They can greatly simplify arguments. For example, in linear algebra, it is easier to show that the row rank and column rank of a matrix are equal by showing that the dual/transpose operator is a functor from the opposite category.


Agreed. In addition to yours, notions like limits/colimits, equalisers/coequalisers, kernels/cokernels, and epi/monic are very hard to grasp the motivation for without a breadth of mathematical experience in other areas.

It's like learning a language strictly from its grammar while having zero vocabulary.


I should have mentioned in my post that I have an applied math master's and a solid amount of analysis and linear algebra, with some group theory, set theory, and a smattering of topology (although no algebraic topology). So I'm not coming to this with nothing, although I don't have the very deep well of abstract algebra training that a pure mathematician coming to category theory would have.

Although, it feels like category theory _ought_ to be approachable without all those years of advanced training in those other areas of math. Set theory is, up to a point. But maybe that isn't true and you're restricted to trivial examples unless you know groups and rings and fields etc.?


You could take a look at Topology: A Categorical Approach by Bradley, Bryson and Terilla.

It's a crisp, slim book, presenting topology categorically (so the title is appropriate). It both deepens the undergraduate-level understanding of topology and serves as an extended example of how category theory is actually used to clarify the conceptual structure of a mathematical field, so it's a way to see how the flesh is put on the bare bones of the categorical concepts.

It's also available for free online:

https://topology.mitpress.mit.edu/


Actually, this is a better source for it, because it includes a PDF of the table of contents and links to supplementary videos:

https://jterilla.github.io/TopologyBook/


You might also find the work of David I. Spivak (no relation to the _Calculus on Manifolds_ Spivak) helpful in this endeavor.

John Baez (who is distantly related to Joan Baez, if memory serves) has also written a lot of introductory category theory and applied category theory.


Oh thanks, I will take a look. I’ve read some of John Baez’s things but mostly on mathematical physics, which was my undergrad. I didn’t know he’d written on category theory.

I think he and Joan Baez are actually first cousins!


I'm maybe too close to the problem to evaluate well (studied foundational math) but I know that Lawvere and Schanuel's book "Conceptual Mathematics" has been fairly well-regarded as a path into category theory.

> it is just so hard to read unfamiliar mathematics

I have completely given up on trying to learn anything about math from Wikipedia. It’s been overrun by mathematicians apparently catering to other mathematicians and that’s not the point of an encyclopedia.

It’s hostile and pointless. If you want a technically correct site make your own.


It appears that they have.

Can we have ours back then? I wish I still remembered enough math to tackle one of them.

He mentions he has nobody reporting to him. That sounds like he’s really a staff engineer with a vanity CTO title, plus a lot of sway in strategic decision making.

It’s not a guaranteed recipe for disaster, but it depends critically on his relationship with whoever actually manages the engineering org. If they don’t pull in the same direction, things go south very quickly and you end up with a little civil war.

Either way it's a red flag and I wouldn't work there. Another red flag is that he wrote this blog post at all. Given how clearly negative the reaction to it was going to be, it's a strong signal he doesn't really think things through and has an ego wrapped up in his "coding" prowess and ability to circumvent process. People mention Woz as an example of a technical co-founder in a non-management role, but he is a humble guy and wouldn't brag like this.


It's a nice analogy, and I think I'll use it in future.

If you want another one, think of painting. An "Old Master" painter like Rembrandt or Rubens or Botticelli would have had a large workshop with a team of assistants, who would not only do a lot of the work like stretching canvases or mixing the paints, but would also - under the master's direction - actually do a lot of the painting too. You might have the master sketch out the composition, and then paint the key faces (and, most of all, the eyes) and then the assistants would fill in areas like drapery, landscape, etc.

This changed in the Romantic period towards the end of the 1700s, with the idea of the individual artist, working alone in a moment of creative inspiration and producing a single work of genius from start to finish. Caspar David Friedrich or JMW Turner come to mind here.

Some programmers want to be Turner and control the whole work and feel their creativity is threatened if a machine can now do parts of it as well as they could. I'd rather be Rembrandt and sketch out the outline, paint the eyes, and leave the rest to junior engineers... or an AI Agent. It's a matter of preference.


> I'd rather be Rembrandt and sketch out the outline, paint the eyes, and leave the rest to junior engineers

What you're not mentioning is that code isn't an end product. It's the blueprint for one. The end product is the process running and solving some need.

What makes software great is how easy it is to refine. The whole point of software engineering is to ensure confidence that the blueprint is good, and that the cost of changes is not enormous. It's not about coding quickly, throwing it over the wall, and being done.

The process you outline would be like noting down a few riffs, fully composing a few minutes (measures?), and then having a few random people complete the full symphony. It's not a matter of having a lot of sheet music; it's a matter of having good music. The sheet music is important because it helps transmit the ideas to the conductor, who then trains the orchestra. But the audience doesn't care about it.

Same with software: users don't care about the code, but they do care about bugs and missing features. Acting on that feedback requires good code. If you can get good code with your process, it's all good. But I'm still waiting for the proof.


Does anyone know / care to speculate how they actually make this work, in terms of the LLM call loop? Specifically: does it call back to the LLM after each keystroke sending it the new state of the interactive tool, or does it batch keystrokes up? If the former, isn’t that very slow? If the latter, won’t that cause it to make mistakes with a tool it hasn’t used before?

I think this is the PR that implemented the feature: https://github.com/google-gemini/gemini-cli/pull/6694

> feat(shell): enable interactive commands with virtual terminal


I don't think anyone can reasonably argue against Claude Code being the most full-featured and pleasant to use of the CLI coding agent tools. Maybe some people like the Codex user experience for idiosyncratic reasons, but it (like Gemini CLI) still feels to me rather thrown together - a Claude Clone with a lot of rough edges.

But these CLI tools are still fairly thin wrappers around an LLM. Remember: they're "just an LLM in a while loop with access to tool calls." (I exaggerate, and I love Claude Code's more advanced features like "skills" as much as anyone, but at the core, that's what they are.) The real issue at stake is which LLM behind the agent is better: is GPT-5 or Sonnet 4.5 stronger at coding? On that, I think opinion is split.

Incidentally, you can run Claude Code with GPT-5 if you want a fair(er) comparison. You need a proxy like LiteLLM and you will have to use the OpenAI API and pay per token, but it's not hard to do and quite interesting. I haven't used it enough to make a good comparison, however.


> but it (like Gemini CLI) still feels to me rather thrown together - a Claude Clone with a lot of rough edges.

I think this is because they see it as a checkbox whereas Anthropic sees it as a primary feature. OpenAI and Google just have to invest enough to kill Anthropic off and then decide what their own vision of coding agents looks like.


You can run the Claude code router and choose the model you want (including based on dynamic conditions)


Can you say more? Link?



Thick or thin, the wrapper that saves users from manually copying and pasting code around is material to the tool being used and useful. Plus, the system prompt is custom to each tool and greatly affects how well the tool works.


You can actually use Codex right from Claude Code as an MCP server, without that proxy stuff, and it works really well, especially for review or for solving things Claude couldn't. Best of both worlds!


What would you say is an example of one of those “middle” tasks it can help with?


An example I just found worked very well with fine-tuning: I wanted to extract any frame that contained a full-screen presentation slide from various videos I've archived, only when it's full-screen, and also not capture videos, and some other constraints.

Naturally I reached for CLIP+ViT, which got me a ~60% success rate out of the box. Then I created a tiny training script that read `dataset/{slide,no_slide}` and trained a new head on it. After adding ~100 samples of each, the success rate landed at 95%, which was good enough to call it done and circle back to iterate once I have more data.

I ended up with a 2.2 KB "head_weights.safetensors" that increased the accuracy by ~35%, which felt really nice.
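
For anyone curious what that looks like concretely, here is a rough sketch of the shape of it (not the parent's actual script): freeze a pretrained CLIP vision tower, embed the images in dataset/{slide,no_slide}, and train a tiny linear head on those embeddings. It assumes Hugging Face's transformers CLIP classes and the safetensors package; the model name, hyperparameters, and file layout are my own placeholders.

    # Sketch only: freeze CLIP, embed labeled frames, train a tiny binary head.
    # Assumes dataset/slide/*.jpg and dataset/no_slide/*.jpg exist.
    from pathlib import Path
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor
    from safetensors.torch import save_file

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(paths):
        images = [Image.open(p).convert("RGB") for p in paths]
        inputs = proc(images=images, return_tensors="pt").to(device)
        with torch.no_grad():
            return clip.get_image_features(**inputs)  # frozen CLIP embeddings

    paths, labels = [], []
    for label, cls in enumerate(["no_slide", "slide"]):
        for p in sorted(Path("dataset", cls).glob("*.jpg")):
            paths.append(p)
            labels.append(label)

    X = embed(paths)
    y = torch.tensor(labels, dtype=torch.float32, device=device)

    head = torch.nn.Linear(X.shape[1], 1).to(device)  # the only trainable part
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(200):
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            head(X).squeeze(-1), y)
        loss.backward()
        opt.step()

    # Save just the head weights: a few KB, like the file mentioned above.
    save_file(head.state_dict(), "head_weights.safetensors")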


I agree with all of this. Many's the time I've had to tell developers I work with: "don't just look at the mean/median, look at a graph of the full distribution!... then slice your distribution a lot of different ways by all the tags/facets you have and look again at the slices." Often you find that a shift in the mean or median was driven by one particular class of data points that skewed the whole thing. (Looking at you, NVDA.) This is usually a little lecture I give in the context of performance engineering, where it's API response times or whatever, but it applies everywhere.

At the same time - and I think you agree with this and it's probably implicit in your comment - we have to beware of anecdata as well. "Two of my friends asked me for money" means very little, except that your friend group is having a rough time. The meso-scale, your "high resolution 2d data", is where to look if you want a textured picture of what's really going on while at the same time avoiding observer bias. Unfortunately, that kind of data is not always easy to get, or to interpret.


You forgot mcp-everything!

Yes, it's a mess, and there will be a lot of churn, you're not wrong, but there are foundational concepts underneath it all that you can learn and then it's easy to fit insert-new-feature into your mental model. (Or you can just ignore the new features, and roll your own tools. Some people here do that with a lot of success.)

The foundational mental model to get the hang of is really just:

* An LLM

* ...called in a loop

* ...maintaining a history of stuff it's done in the session (the "context")

* ...with access to tool calls to do things. Like, read files, write files, call bash, etc.

Some people call this "the agentic loop." Call it what you want, you can write it in 100 lines of Python. I encourage every programmer I talk to who is remotely curious about LLMs to try that. It is a lightbulb moment.
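
If you want to see how little is involved, here's a minimal sketch of that loop, assuming the Anthropic Python SDK and a single bash tool. The model name is a placeholder and all the safety/permission machinery is deliberately left out; it's an illustration, not something to run unsupervised.

    # A minimal "agentic loop": an LLM, called repeatedly, with one bash tool.
    # Sketch only - no permission prompts, error handling, or streaming.
    import subprocess
    import anthropic

    client = anthropic.Anthropic()

    TOOLS = [{
        "name": "bash",
        "description": "Run a shell command and return its output.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }]

    def run_agent(user_prompt: str) -> None:
        messages = [{"role": "user", "content": user_prompt}]  # the "context"
        while True:
            response = client.messages.create(
                model="claude-sonnet-4-5",  # placeholder; use whatever model you like
                max_tokens=4096,
                tools=TOOLS,
                messages=messages,
            )
            messages.append({"role": "assistant", "content": response.content})
            for block in response.content:
                if block.type == "text":
                    print(block.text)
            if response.stop_reason != "tool_use":
                break  # the model is done; nothing left to execute
            results = []
            for block in response.content:
                if block.type == "tool_use" and block.name == "bash":
                    out = subprocess.run(block.input["command"], shell=True,
                                         capture_output=True, text=True)
                    results.append({"type": "tool_result",
                                    "tool_use_id": block.id,
                                    "content": out.stdout + out.stderr})
            messages.append({"role": "user", "content": results})

    run_agent("List the Python files in this directory and summarize each one.")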

Once you've written your own basic agent, if a new tool comes along, you can easily demystify it by thinking about how you'd implement it yourself. For example, Claude Skills are really just:

1) Skills are just a bunch of files with instructions for the LLM in them.

2) Search for the available "skills" on startup and put all the short descriptions into the context so the LLM knows about them.

3) Also tell the LLM how to "use" a skill. Claude just uses the `bash` tool for that.

4) When Claude wants to use a skill, it uses the "call bash" tool to read in the skill files, then does the thing described in them.

and that's more or less it, glossing over a lot of things that are important but not foundational like ensuring granular tool permissions, etc.
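
To make step 2 concrete, here's a toy sketch of gathering that list. The directory layout and "first line is the description" convention are my own simplification, not Claude's actual skill format, which carries a bit more metadata:

    # Toy version of step 2: collect one-line skill descriptions for the base prompt.
    # Assumes each skill lives at skills/<name>/SKILL.md with its description on
    # the first line (a simplification of the real format).
    from pathlib import Path

    def skills_preamble(root: str = "skills") -> str:
        lines = ["You have the following skills. To use one, read its SKILL.md",
                 "with the bash tool and follow the instructions inside:"]
        for skill_md in sorted(Path(root).glob("*/SKILL.md")):
            description = skill_md.read_text().splitlines()[0]
            lines.append(f"- {skill_md.parent.name}: {description} ({skill_md})")
        return "\n".join(lines)

    # The result gets appended to the system prompt, so the model knows what
    # skills exist and where to find them, without their full text in context.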


> You forgot mcp-everything!

One great thing about the MCP craze is that it has given vendors a motivation to expose APIs they didn't offer before. A real example: Notion's public REST API lacks support for duplicating pages. Yes, their web UI can do it by calling their private REST API, but their private APIs are complex, undocumented, and could stop working at any time with no notice. Then they added it to their MCP server - and MCP is just a JSON-RPC API, so you aren't limited to invoking it from an LLM agent; you can also invoke it from your favourite scripting language with no LLM involved at all.
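
For anyone who hasn't poked at this: here's a rough sketch of driving a local stdio MCP server from plain Python with no LLM anywhere. The method names (initialize, tools/list, tools/call) are from the MCP spec, but the server command and tool name are placeholders, the initialize params are stripped down, and a real client does a bit more handshaking and framing than this.

    # Rough sketch: talk to a stdio MCP server as newline-delimited JSON-RPC.
    # Server command, tool name, and protocol version are placeholders.
    import json
    import subprocess

    server = subprocess.Popen(["some-mcp-server"],  # placeholder command
                              stdin=subprocess.PIPE,
                              stdout=subprocess.PIPE, text=True)

    def send(method, params=None, id=None):
        msg = {"jsonrpc": "2.0", "method": method, "params": params or {}}
        if id is not None:
            msg["id"] = id
        server.stdin.write(json.dumps(msg) + "\n")
        server.stdin.flush()
        return json.loads(server.stdout.readline()) if id is not None else None

    send("initialize", {"protocolVersion": "2025-06-18", "capabilities": {},
                        "clientInfo": {"name": "script", "version": "0.0"}}, id=1)
    send("notifications/initialized")                    # notification, no reply
    print(send("tools/list", id=2))                      # what does it expose?
    print(send("tools/call", {"name": "duplicate_page",  # hypothetical tool name
                              "arguments": {"page_id": "..."}}, id=3))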


I remember reading in one of Simon Willison's recent blog posts his half-joking point that MCP got so much traction so fast because adding a remote MCP server allowed tech management at big companies whose C-suite is asking them for an "AI Strategy" to show that they were doing something. I'm sure that is a little bit true - a project framed as "make our API better and more open and well-documented" would likely never have got off the ground at many such places. But that is exactly what this is, really.

At least it's something we all reap the benefits of, even if MCP is really mostly just an api wrapper dressed up as "Advanced AI Technology."


Amazing example. AI turns the begrudging, third-rate API UX into a must-win agent UX.


and we all win!


Well, I bet Notion simply forgot that some of those APIs were still private. I started developing with the Notion API on the first day it was released. They ship constant updates and I have seen lots of improvement. There is just no reason they would intentionally make the duplicate-page call available via MCP but not via the API.

PS. Just want to say, Notion MCP is still very buggy. It can't handle code blocks or large pages very well.


> There is just no reason they would intentionally make the duplicate-page call available via MCP but not via the API.

I have no idea what is going on inside Notion, but if I had to guess: the web UI (including the private REST API which backs it), the public REST API, and the AI features are separate teams, separate PMs, separate budgets - so it is totally unsurprising they don't all have the same feature set. Of course, if parity were an executive priority, they could get there - but I can only assume it isn't.


Pretty true, and definitely a good exercise. But if we're going to actually use these things in practice, you need more: things like prompt caching, capabilities/constraints, etc. It's pretty dangerous to let an agent go hog wild in an unprotected environment.


Oh sure! And if I were talking someone through building a barebones agent, I'd definitely tag on a warning along the lines of "but don't actually use this without XYZ!" That said, you can add prompt caching by just setting a couple of parameters in the API calls to the LLM. I agree constraints are a much more complex topic, although even in my 100-line example I am able to fit in a user approval step before file-write or bash actions.


when you say prompt caching, does it mean cache the thing you send to the llm or the thing you get back?

sounds like prompt is what you send, and caching is important here because what you send is derived from previous responses from llm calls earlier?

sorry to sound dense, I struggle to understand where and how in the mental model the non-determinism of a response is dealt with. is it just that it's all cached?


Not dense to ask questions! There are two separate concepts in play:

1) Maintaining the state of the "conversation" history with the LLM. LLMs are stateless, so you have to store the entire series of interactions on the client side in your agent (every user prompt, every LLM response, every tool call, every tool call result). You then send the entire previous conversation history to the LLM every time you call it, so it can "see" what has already happened. In a basic agent, it's essentially just a big list of strings, and you pass it into the LLM api on every LLM call.

2) "Prompt caching", which is a clever optimization in the LLM infrastructure to take advantage of the fact that most LLM interactions involve processing a lot of unchanging past conversation history, plus a little bit of new text at the end. Understanding it requires understanding the internals of LLM transformer architecture, but the essence of it is that you can save a lot of GPU compute time by caching previous result states that then become intermediate states for the next LLM call. You cache on the entire history: the base prompt, the user's messages, the LLM's responses, the LLM's tool calls, everything. As a user of an LLM API, you don't have to worry about how any of it works under the hood, you just have to enable it. The reason to turn it on is that it dramatically reduces response time and cost.
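
To make that concrete: point 1 is just a growing list you pass back on every call, and point 2 (in Anthropic's API, at least) is a cache_control marker on the big, stable prefix. A hedged sketch assuming the Anthropic Python SDK; the model name and prompt text are placeholders, and the docs are the authority on exactly which blocks can carry the marker:

    # Sketch: the client-side conversation list plus a prompt-caching marker.
    import anthropic

    client = anthropic.Anthropic()

    base_prompt = "You are a coding agent... (tool docs, skills list, etc.)"
    conversation_history = [{"role": "user", "content": "Refactor utils.py"}]

    response = client.messages.create(
        model="claude-sonnet-4-5",                   # placeholder model name
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": base_prompt,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }],
        messages=conversation_history,               # the full history, every call
    )

    # Append the reply so the next call "remembers" it; the cached prefix means
    # the provider doesn't recompute work for the unchanged earlier tokens.
    conversation_history.append({"role": "assistant", "content": response.content})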

Hope that clarifies!


Very helpful. It helps me better understand the specifics behind each call and response, the internal units and whether those units are sent and received "live" from the LLM or come from a traditional db or cache store.

I'm personally just curious how far, clever, insightful, any given product is "on top of" the foundation models. I'm not in it deep enough to make claims one way or the other.

So this shines a little more light, thanks!


This recent comment https://news.ycombinator.com/item?id=45598670 by @simonw really helped drive home the point that LLMs are really being fed an array of strings.


Why wouldn't you turn on prompt caching? There must be a reason why it's a toggle rather than just being on for everything.


Writing to the cache is more expensive than a request with caching disabled. So it only makes economic sense to do it when you know you're going to use the cached results. See https://docs.claude.com/en/docs/build-with-claude/prompt-cac...


When you know the context is a one-and-done. Caching costs more than just running the prompt, but less than running the prompt twice.


> Some people call this "the agentic loop." Call it what you want, you can write it in 100 lines of Python

That description sounds a lot like PocketFlow, an AI/LLM development framework based on a loop that's about 100 lines of python:

https://github.com/The-Pocket/PocketFlow

(I'm not at all affiliated with Pocket Flow, I just recall watching a demo of it)


You have a great way of demystifying things. Thanks for the insights here!

Do you think a non-programmer could realistically build a full app using vibe coding?

What fundamentals would you say are essential to understand first?

For context, I’m in finance, but about 8 years ago I built a full app with Angular/Ionic (live on Play Store, under review on Apple Store at that time) after doing a Coursera specialization. That was my first startup attempt, I haven’t coded since.

My current idea is to combine ChatGPT prompts with Lovable to get something built, then fine-tune and iterate using Roo Code (VS plugin).

I’d love to try again with vibe coding. Any resources or directions you’d recommend?


If your app just has to display stuff, there are no-code kits available that can help you out. No vibe coding needed.

If your app has to do something useful, your app just exploded in complexity and corner cases that you will have to account for and debug. Also, if it does anything interesting that the LLM has not yet seen a hundred thousand times, you will hit the manual button quite quickly.

Claude especially (with all its deserved praise) confabulates so much while claiming absolute authority in corner cases that it can become annoying.


That makes sense, I can see how once things get complex or novel, the LLMs start to struggle. I don't think my app is doing anything complex.

For now, my MVP is pretty simple: a small app for people to listen to soundscapes for focus and relaxation. Even if no one uses it, at least it's going to be useful to me, and it will be a fun experiment!

I’m thinking of starting with React + Supabase (through Lovable), that should cover most of what I need early on. Once it’s out of the survival stage, I’ll look into adding more complex functionality.

Curious, in your experience, what’s the best way to keep things reliable when starting simple like this? And are there any good resources you can point to?


You can make that. The only AI coding tools I have liked are OpenAI Codex and Claude Code. I would start by working with one of them to create a design document in Markdown to plan the project. Then I would close the app to reset context, tell it to read that file, and have it create an implementation plan for the project in various phases. Then I would reset context again and have it start implementing. I don't always like that many steps, but for a new user it can help show ways to use the tools.


That's good advice, thank you!

I already have a feature list and a basic PRD, and I’m working through the main wireframes right now.

What I’m still figuring out is the planning and architecture side, how to go from that high-level outline to a solid structure for the app. I’d rather move step by step, testing things gradually, than get buried under too much code where I don’t understand anything.

I’m even considering taking a few React courses along the way just to get a better grasp of what’s happening under the hood.

Do you know of any good resources or examples that could help guide this kind of approach? On how to break this down, what documents to have?


I've always wanted to make an app like this. I think you could do a lot with procedural generation and some clever DSP.


Learning how to get it to run build steps was a big boost in my initial productivity when learning the CLI tools.


Maybe React Native, if you like React.


> Do you think a non-programmer could realistically build a full app using vibe coding?

For personal or professional use?

If you want to make it public I would say 0% realistic. The bugs, security concerns, performance problems etc you would be unable to fix are impossible to enumerate.

But even if you just had a simple login and kept people's emails and passwords, you could very easily have insecure DBs and weak protection against simple things like SQL injection, etc.

You would not want to be the face of "vibe coder gives away data of 10k users"


Ideally, I want this to grow into a proper startup. I’m starting solo for now, but as things progress, I’d like to bring in more people. I’m not a tech, product or design person, but AI gives me hope that I can at least get an MVP out and onboard a few early users.

For auth, I’ll be using Supabase, and for the MVP stage I think Lovable should be good enough to build and test with maybe a few hundred users. If there’s traction and things start working, that’s when I’d plan to harden the stack and get proper security and code reviews in place.


One of the issues AI coding has is that it's in some ways very inhuman. The bugs that are introduced are very hard to pick up, because humans wouldn't write the code that way and hence wouldn't make those mistakes.

If you then introduce other devs, you have two paths. Either they build on top of the vibe coding, which is going to leave you vulnerable to those bugs and honestly make their life a misery, as they are working on top of work that missed the basic decisions that would help it grow. (Imagine a non-architect built your house: the walls might be straight, but he didn't know to level the floor, or to add the right concrete to support the weight of a second floor.)

Or the other path: they rebuild your entire app correctly, with the only advantage being that the MVP and its users showed some viability for the idea. But the time it will take to rewrite it means that, in a fast-moving space like startups, someone can quickly overtake you.

It's a risky proposition, and it means you are not going to create a very adequate base for the people you might hire.

I would still recommend against it. Think of AI as more like WebMD: it can help someone who is already a doctor, but it will confuse, and potentially hurt, those without enough training to know what to look for.


Really depends on the app you want to build.

If I were vibe coding I wouldn't use Lovable but Claude Code. You can run it in your terminal.

And I would ask it to use NextAuth, Next.js, and Prisma (or another ORM), and connect it with SQLite or an external managed MariaDB server (for easy development you can start with SQLite; for deployment to Vercel you need an external database).

People here shit on Next.js, but due to its extensive documentation and usage the LLMs are very good at building with it, and since it forces a certain structure it generally produces decently structured code that is workable for a developer.

Also, Vercel is very easy to deploy to: just connect GitHub and you are done.

Make sure to use Git properly and commit per feature - even better, branch per feature - so you can easily revert to old versions if Claude messes up.

Before starting, spend some time sparring with the GPT-5 thinking model to create a database schema that's future-proof. It might be a challenge here to find the right balance between over-engineering and simplicity.

One caveat: be careful about letting Claude run migrations on your production database. It can accidentally destroy it. So only run Claude Code against test databases.


Thanks a lot for all the pointers.

I’m not 100% set on Lovable yet. Right now I’m using Stitch AI to build out the wireframes. The main reason I was leaning toward Lovable is that it seems pretty good at UI design and layout.

How does Claude do on that front? Can it handle good UI structure or does it usually need some help from a design tool?

Also, is it possible to get mobile apps out of a Next.js setup?

My thought was to start with the web version, and later maybe wrap it using Cordova (or Capacitor) like I did years ago with Ionic to get Android/iOS versions. Just wondering if that’s still a sensible path today.


It's great at design; you can also do it in the Claude Code chat UI and then, when you are happy, copy-paste it into the CLI.

> Call it what you want, you can write it in 100 lines of Python. I encourage every programmer I talk to who is remotely curious about LLMs to try that. It is a lightbulb moment.

Definitely want to try this out. Any resources / etc. on getting started?


This is the classic blog post, by Thorsten Ball, from way back in the AI Stone Age (April this year): https://ampcode.com/how-to-build-an-agent

It uses Go, which is more verbose than Python would be, so he takes 300 lines to do it. Also, his edit_file tool could be a lot simpler (I just make my minimal agent "edit" files by overwriting the entire existing file).

I keep meaning to write a similar blog post with Python, as I think it makes it even clearer how simple the stripped-down essence of a coding agent can be. There is magic, but it all lives in the LLM, not the agent software.
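
For illustration, the "overwrite the whole file" version of an edit tool really is about a dozen lines. The name and schema below are my own, not from the linked post:

    # The lazy edit tool: "editing" a file just means overwriting it entirely.
    from pathlib import Path

    WRITE_FILE_TOOL = {
        "name": "write_file",
        "description": "Create or overwrite a file with the given contents.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"},
                           "contents": {"type": "string"}},
            "required": ["path", "contents"],
        },
    }

    def handle_write_file(tool_input: dict) -> str:
        path = Path(tool_input["path"])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(tool_input["contents"])
        return f"wrote {len(tool_input['contents'])} characters to {path}"

    # Crude, but the model just regenerates the whole file when it wants a change,
    # which keeps the agent code trivial.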


> I keep meaning to write a similar blog post with Python...

Just have your agent do it.


I could, but I'm actually rather snobbish about my writing and don't believe in having LLMs write first drafts (for proofreading and editing, they're great).

(I am not snobbish about my code. If it works and is solid and maintainable I don't care if I wrote it or not. Some people seem to feel a sense of loss when an LLM writes code for them, because of The Craft or whatever. That's not me; I don't have my identity wrapped up in my code. Maybe I did when I was more junior, but I've been in this game long enough to just let it go.)


I highly relate to this. Code works or it doesn’t. My writing feels a lot more like self expression. I agree that’s harder to “let go” to an agent.


I wrote a post here with zero abstractions. It's all self-contained and runs locally.

https://ravinkumar.com/GenAiGuidebook/language_models/Agents... https://github.com/canyon289/ai_agent_basics/blob/main/noteb...


It's also a very fun project: you can set up a small LLM with Ollama or LM Studio and get it working quickly. Using MCP, it's very quick to make it actually useful.

I’ve done this a few times (pre and post MCP) and learned a lot each time.


Might as well include agent2agent in there: https://developers.googleblog.com/en/a2a-a-new-era-of-agent-...


How does it call upon the correct skill from a vast library of skills at the right time? Is this where RAG via embeddings / vector search come in? My mental model is still weak in this area, I admit.


I think it has a compact table of contents of all the skills it can call preloaded into context. It's not RAG; it navigates based on references between files, like a coding agent.


This is correct. It just puts a list of skills into context as part of the base prompt. The list must be compact because the whole point of skills is to reduce context bloat by keeping all the details out of context until they are needed. So the list will just be something like: 1) skill name, 2) short (like one sentence) description of what the skill is for, 3) where to find the skill (file path, basically) when it wants to read it in.


It's all just prompt stuffing in the end.


Tool use is only good with structured/constrained generation


You'll need to expand on what you mean, I'm afraid.


I think, from my experience, what they mean is that tool use is only as good as your model's ability to stick to a given answer template/grammar. For example, if it does tool calling using a JSON format, it needs to stick to that format, not hallucinate extra fields, and use the existing fields properly. This has worked for a few years and LLMs are getting better and better, but the more tools you have, and the more parameters your callable functions can have, the higher the risk of errors. You also have systems that constrain the inference itself, for example with the outlines package, by changing the way tokens are sampled (this way you can force a model to stick to a template/grammar, but that can also degrade results in some other ways).


I see, thanks for channeling the GP! Yeah, like you say, I just don't think getting the tool call template right is really a problem anymore, at least with the big-labs SotA models that most of us use for coding agents. Claude Sonnet, Gemini, GPT-5 and friends have been heavily heavily RL-ed into being really good at tool calls, and it's all built into the providers' apis now so you never even see the magic where the tool call is parsed out of the raw response. To be honest, when I first read about tools calls with LLMs I thought, "that'll never work reliably, it'll mess up the syntax sometimes." But in practice, it does work. (Or, to be more precise, if the LLM ever does mess up the grammar, you never know because it's able to seamlessly retry and correct without it ever being visible at the user-facing api layer.) Claude Code plugged into Sonnet (or even Haiku) might do hundreds of tool calls in an hour of work without missing a beat. One of the many surprises of the last few years.


Trite, and wrong. Stalin died of a stroke at 74. To take just two more examples, Mao and Franco both died at 82, also of natural causes.

