How are you getting these results? Even with grounding in sources, careful context engineering, and whatever other technique comes to mind, we are just getting sloppy junk out of every model we have tried.
The sketchy part is that LLMs are super good at faking confidence and expertise, all while randomly injecting subtle but critical hallucinations. This ruins basically all significant output. Double-checking and babysitting the results is a huge time and energy sink, and human post-processing negates nearly all of the benefits.
It's not like there is zero benefit to it, but I am genuinely curious how you get consistently correct output for a "complicated subject matter like insurance".
I genuinely think the biggest issue with LLM tools is that most people expect magic, because first attempts at some simple things feel magical. However, the tools take an insane amount of time to build expertise in. What is confusing is that SWEs in general spend immense amounts of time learning the tools of the trade, but this seems to escape a lot of people when it comes to LLMs. On my team, every developer is using LLMs all day, every day. On average, based on sprint retros, each developer spends no less than an hour each day experimenting/learning/reading about how to make them work. The realization we made early on is that when it comes to LLMs there are two large groups:
- a group that sees them as invaluable tools capable of being an immense productivity multiplier
- a group that tried things here and there and gave up
We collectively decided that we want to be in the first group and were willing to put in the time to get there.
I'm persisting. I've been using LLMs quite a bit for the last year, and they're now where I start with any new project. Throughout that time I've been doing constant experimentation and have made significant workflow improvements.
I've found that they're a moderate productivity increase, i.e. on a par with, say, using a different language, using a faster CI system, or breaking down some bureaucracy. Noticeable, worth it, but not entirely transformational.
I only really get useful output from them when I'm holding _most_ of the context that I'd be holding if writing the code, and that's a limiting factor on how useful they can be. I can delegate things that are easy, but I'm hand-holding enough that I can't realistically parallelise my work that much more than I already do (I'm fairly good at context switching already).
How are you measuring increased productivity? Honest question, because I've seen teams claim more code, but I've also seen teams say they're seeing more unnecessary churn (which is more code).
I'm interested in business outcomes, is more code or perceived velocity translating into benefits to the business? This is really hard to measure though because in pretty much any startup or growing company you'll see better business outcomes, but it's hard to find evidence for the counterfactual.
Same as we have for a decade before LLMs: story points. We move faster now, and we have automated stuff we could never automate before. Same project, largely the same team since 2016; we just get a lot more shit done, a lot more.
Hehe, not snarky at all, great question. This was heavily discussed, but in order to measure productivity gains we kept the estimations the same as before. As my colleague put it, you don't estimate based on a "10x developer", so we applied the same concept. Now that everyone is "on board" we are phasing this out.
Thanks. I'm probably a kook, but I've never wanted to put tasks unrelated to product or user-visible features (tests, code cleanup, etc.) on the board with story points; I just fold that into the related user work (mainly to avoid some product person thinking they "own" it and can make technical decisions).
So the product velocity didn't exactly go up, but you are now producing less technical debt (hopefully) at a similar velocity. Sounds reasonable.
I'm glad you're more productive, although I would question this result both in terms of objectivity (story points are typically very subjective) and in terms of capturing all the externalities of the LLM workflow. It's easy to have "build the thing", "fix the thing", "remove tech debt in the thing", and "replace the thing" be four separate projects, each with story points, where "build the better thing" would have been one, and that kind of churn is something there is evidence of with LLM development.
Don't you think it would be better to get that expertise in actual system design, software engineering, and all the programming-related fields? By getting ChatGPT to write code, we'll eventually lose the skill to sit and craft code the way we have all these years. After all, the brain's neural pathways only remember what you put to work daily.
- lots of experimentation; specifically, I have spent hours and hours redoing the exact same feature (my record is 23 times).
- if something "doesn't work" I immediately create a task to investigate and understand it. Even for the smallest thing that bothers me, I will spend hours figuring out why it might have happened (this is sometimes frustrating) and how to prevent it from happening again (this is fun)
My colleague describes the process as a JavaScript developer trying to learn Rust while tripping on mushrooms :)
> It's not like there is zero benefit to it, but I am genuinely curious how you get consistently correct output for a "complicated subject matter like insurance".
Most likely by trying to get a promotion or bonus now and getting the hell out of Dodge before anyone notices those subtle landmines left behind :-)
Cynical, but maybe not wrong. We are plenty familiar with ignoring technical debt and letting it pile up. Dodgy LLM code seems like more of that.
Just like tech debt, there's a time for rushing. And if you're really getting good results from LLMs, that's fabulous.
I don't have a final position on LLMs, but it has only been two days since I worked with a colleague who definitely had no idea how to proceed when they were off the "happy path" of LLM use, so I'm sure there are plenty of people getting left behind.
Wow, the bad faith is quite strong here. As it turns out, small to mid-sized insurance companies have some ridiculously poorly architected front ends.
Not everyone is the biggest cat in town with infinite money and expertise. I have no intention of leaving anytime soon, so I am confident that the code the AI generated (after confirming with our guy who is the insurance OG) is a solid improvement over what was there before.
The bad faith is super strong when it's being swamped by a lot more bad faith driven by greed. I'm not talking about you, but about all these companies with overnight valuations in the billions and their PR machines.
As to your example, frankly, I would have started with that very important caveat: an initial situation defined by very poor quality. It's a very valid angle, as a lot of the code out there today is of very low quality, and if AI can take a 1/10 or 2/10 and make it a 5/10 or 6/10, then yes, everyone benefits.
A lot of programmers that say that LLMs are awesome tend to be inexperienced, not good programmers, or just gloss over the significant amount of extra work that using LLMs requires.
Programmers tend to overestimate their knowledge of non-programming domains, so the OP is probably just not understanding that there are serious issues with the LLM's output for complicated subject matters like insurance.
It depends a lot. I use it for one-off scripts, particularly for anything Microsoft 365 related (expanding SharePoint drives, analyzing AWS usage, general IT stuff). Where there is a lot of heavy, context-dependent business logic it will fail, since there's too much context for it to be successful.
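To give a rough idea of the kind of one-off script I mean, here is a simplified sketch (not one of our actual scripts; the metric and grouping are just assumptions): summarizing last month's AWS spend per service with boto3's Cost Explorer client. This is exactly the sweet spot: small, self-contained, and easy to check by eye.

    # Hedged sketch of a one-off "analyze AWS usage" script (illustrative only).
    # Assumes AWS credentials with ce:GetCostAndUsage permission are configured.
    import boto3
    from datetime import date, timedelta

    ce = boto3.client("ce")  # Cost Explorer

    end = date.today().replace(day=1)                  # first day of this month
    start = (end - timedelta(days=1)).replace(day=1)   # first day of last month

    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )

    # Print last month's cost broken down by service.
    for group in resp["ResultsByTime"][0]["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{service}: ${amount:,.2f}")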
I work in custom software where the gap in non-LLM users and those who at least roughly know how to use it is huge.
It largely depends on the prompt, though. Our ChatGPT account is shared, so I get to take a gander at the other usages, and it's pretty easy to see: "okay, this person is asking the wrong thing". The prompt and the context have a major impact on the quality of the response.
In my particular line of work, it's much more useful than not. But I've been focusing on helping build the right prompts with the right context, which makes many tasks actually feasible where before they would have been way out of scope for our clients' budgets.
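As a concrete, made-up illustration of what "the right prompt with the right context" means in practice (the business rules, data shapes, and model name below are placeholders, not our real setup), the difference is usually whether the model is given the rules it needs or has to invent them:

    # Hedged sketch: the same request, asked with the business context it depends on.
    # All domain details here are invented for illustration; "gpt-4o" is a placeholder.
    from openai import OpenAI

    client = OpenAI()

    # What someone might type on its own: no context, so the model has to guess.
    bare_request = "Write a function that calculates the renewal premium."

    # What actually works: spell out the rules, the input shape, and the expected output.
    context = """
    You are helping with our policy-admin system.
    Business rules (assume these; do not invent others):
    - renewal premium = base premium * rate factor for the policyholder's risk tier
    - risk tiers: 'low' -> 0.95, 'standard' -> 1.00, 'high' -> 1.20
    - premiums are integers in cents; round half up
    Input: a dict with 'base_premium_cents' and 'risk_tier'.
    Output: a single Python function with type hints and a docstring.
    """

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": bare_request},
        ],
    )
    print(resp.choices[0].message.content)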