Just curious what area you work in? Python, or some kind of web service / JavaScript? I'm sure the LLMs are reasonably good for that - or for updating .csv files (you mention spreadsheets).
I write code to drive hardware, in an unusual programming style. The company pays for Augment (which is now based on o4, which is supposed to be really good?!?). It's great when I type print_debug( - at that point it often guesses right as to which local variables or parameters I want to debug, but not always. And it can often get the loop iteration part correct if I need to, for example, loop through a vector. The couple of times I asked it to write a unit test? Sure, it got the basic function call / lambda setup correct, but the test itself was useless. And a bunch of times, it brings back code I was experimenting with 3 months ago and never kept / committed, just because I'm at the same spot in the same file.
I do believe that some people are having reasonable outcomes, but it's not "out of the box" - and it's faster for me to write the code I need to write than to try 25 different prompt variations.
A lot of Python in a monorepo. Monorepos have an advantage right now because the LLM can pretty much look through the entire repo. But I'm also applying LLMs to eliminate a lot of roles that are obsolete, not just using them to code.
Thanks for sharing your perspective with ACTUAL details, unlike most people who have gotten bad results.
Sadly, hardware programming is probably going to lag or never be figured out, because there's just not enough info to train on. This might change in the future when/if reasoning models get better, but there's no guarantee of that.
> Augment uses many models, including ones that we train ourselves. Each interaction you have with Augment will touch multiple models. Our perspective is that the choice of models is an implementation detail, and the user does not need to stay current with the latest developments in the world of AI models to fully take advantage of our platform.
Which IMO is... a cop out, a terrible take, and just... slimy. I would not trust a company like this with my money. For all you know, they are running your prompts against a shitty open source model running on a 3090 in their closet. The lack of transparency here is concerning.
You might be getting bad results for a few reasons:
- your prompts are not specific enough
- your context is poisoned. How strategically are you providing context in the prompt? A good trick is to give the LLM an existing file as an example of how you want it to produce the output and tell it "Do X in the style of Y.file" (there's a rough sketch of this below). Don't forget that with the latest models and huge context windows you could very well provide entire subdirectories as context (although I would still recommend being pretty targeted)
- the model/tool you're using sucks
- you work in a problem domain that LLMs are genuinely bad at
Note: your company is paying a subscription to a service that doesn't allow you to bring your own keys. They have an incentive to optimize costs and make sure you're not costing them a lot of money. This could lead to worse results.
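To make the "Do X in the style of Y.file" trick concrete, here is a rough sketch assuming a plain OpenAI-style Python client - the file name, endpoint, and prompt wording are made up, and an agent or IDE plugin effectively does the same thing for you when you attach a file as context:

    # Sketch only: assumes the openai Python package and an OPENAI_API_KEY
    # in the environment; "handlers/user_handler.py" is a hypothetical file.
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()

    # Read an existing file to use as the style reference ("Y.file").
    style_example = Path("handlers/user_handler.py").read_text()

    prompt = (
        "Write a handler for the /orders endpoint in the style of the file below. "
        "Match its structure, naming, and error handling.\n\n"
        "--- handlers/user_handler.py ---\n"
        + style_example
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)

The point is the shape of the request: an explicit instruction plus a concrete example of the output you want, instead of a bare "write me X".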
I suggest this as the bare minimum for the HN community when discussing their bad results with LLMs and coding:
- what is your problem domain?
- show us your favorite prompt
- what model and tools are you using?
- are you using it as a chat or an agent?
- are you bringing your own keys or using a service?
- what did you supply in context when you got the bad result?
- how did you supply context? copy paste? file locations? attachments?
- what prompt did you use when you got the bad result?
I'm genuinely surprised when someone complaining about LLM results provides even 2 of those things in their comment.
Most of the cynics would not provide even half of this because it'd be embarrassing and reveal that they have no idea what they are talking about.
But how is AI supposed to replace anyone when you either have to get lucky or have to correctly set up all these things you write about first? Who will do all that, and who will pay for it?
So your critique of AI is that it can't read your mind and figure out what to do?
> But how is AI supposed to replace anyone when you either have to get lucky or have to correctly set up all these things you write about first? Who will do all that, and who will pay for it?
I mean... I'm doing it and getting paid for it, so...
Yes, because AGI is advertised (or reviled) as exactly that: you plug it in and it figures everything else out by itself, with no need for the training and management that humans require.
In other words, did the AI actually replace you in this case? Do you expect it to? Because people clearly do expect it, which is why we have discussions like this one.