Let me be clear first: I don't dislike LLMs. I query them, trigger agents to do stuff where I roughly know what the end goal is, and use them to analyze small parts of an application.
That said, every time I give it something a little more complex than a single-file script, it fails me horribly. Either the code is really bad, or the approach is as bad as that of someone who doesn't really know what to do, or it plainly starts doing things that I explicitly said not to do in the initial prompt.
I have sometimes asked my LLM-fan coworkers to come and help when that happens, and they can't "fix it" either, but somehow I am the one doing it wrong due to a "wrong prompt" or "lack of correct context".
I have created a lot of AGENTS.md files, dropped files into the context window... Nothing.
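To give a sense of it, this is roughly the kind of thing I put in those files. An illustrative sketch, not a real file from my project; the build command and paths are made up:

    # AGENTS.md (illustrative example)

    ## Hard rules
    - Run the existing test suite before declaring a task done.
    - Do NOT add new dependencies without asking first.
    - Do NOT reformat files you are not otherwise changing.

    ## Project context
    - Build and test with: make test
    - Follow the patterns in the module you are editing; do not
      invent new abstractions.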
When I need to do greenfield stuff or PoCs, it delivers fast, but applying it inside an existing big application fails.
The only place where I feel as "productive" as other people claim to be is when I do stuff in languages or technologies I don't know at all, but then again, I also don't know whether the working code I get at the end is broken in ways I'm not aware of.
Are any of you guys really using LLMs to create full features in big enterprise apps?
But other than that, what I’ve found to be the most important is static tooling. Do you have rules that require tests to be run? Do you have linters and code formatters that enforce your standards? Are you using well-known tools (build tools, dependency management tools, etc.) or is it all bespoke?
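Concretely, the kind of gate I mean looks like this. A minimal sketch assuming a Python project that uses ruff and pytest; substitute whatever your stack uses. The agent’s rules then say “run this and get exit code 0 before you’re done”:

    # check.py -- minimal gate script (sketch; assumes ruff and pytest
    # are installed in the project environment)
    import subprocess
    import sys

    CHECKS = [
        ["ruff", "check", "."],              # linter: enforce the standards
        ["ruff", "format", "--check", "."],  # formatter in check-only mode
        ["pytest", "-q"],                    # require the tests to actually run
    ]

    def main() -> int:
        for cmd in CHECKS:
            print("$", " ".join(cmd))
            if subprocess.run(cmd).returncode != 0:
                print("FAILED:", " ".join(cmd), file=sys.stderr)
                return 1
        print("all checks passed")
        return 0

    if __name__ == "__main__":
        sys.exit(main())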
But the less sexy answer is that no, you can’t drop an agent cold into a big codebase and expect it to perform miracles. You need to build out agentic flows as a process that you iterate and improve on. If you prompt an agent and it gets it wrong, evaluate why and build out the tools so next time it won’t get it wrong. You slowly level up the capabilities of the tool by improving it over time.
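As a sketch of what “build out the flow” can look like in practice: feed the concrete failure output back into the next round instead of re-prompting from scratch. Everything here is hypothetical; run_agent is a placeholder for however you actually invoke your agent, and check.py is the gate script from above:

    # Hypothetical check-and-retry loop around a coding agent.
    import subprocess

    def run_agent(prompt: str) -> None:
        # Placeholder: call your agent's CLI or API with `prompt` here.
        raise NotImplementedError("invoke your coding agent here")

    def run_checks() -> subprocess.CompletedProcess:
        # The same gate the humans use: lint, format check, tests.
        return subprocess.run(
            ["python", "check.py"], capture_output=True, text=True
        )

    def agent_with_feedback(task: str, max_rounds: int = 3) -> bool:
        prompt = task
        for _ in range(max_rounds):
            run_agent(prompt)
            result = run_checks()
            if result.returncode == 0:
                return True  # gate passed; accept the change
            # Feed the concrete failure back so the next attempt
            # starts from the actual error, not a guess.
            prompt = (task + "\n\nThe checks failed with:\n"
                      + result.stdout + result.stderr + "\nFix this.")
        return False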
I can’t emphasize enough the difference between agents, though. I’ve been doing a lot of A/B tests of Copilot against other agents and it’s wild how bad it is, even backed by the same models.