> At test time use this final LLM to produce multiple versions until one passes the criteria; for the cost of an hour of a software engineer, you can have an LLM produce millions of different implementations.
If you're saying that it takes one software engineer one hour to produce comprehensive criteria that would allow this whole pipeline to work for a non-trivial software engineering task, this is where we violently disagree.
For this reason, I don't believe I'll be convinced by any additional citations or research, only by an actual demonstration of this working end-to-end with minimal human involvement (or at least, meaningfully less human involvement than it would take to just have engineers do the work).
edit: Put another way, what you describe here looks to me to be throwing a huge number of "virtual" low-skilled junior developers at the task and optimizing until you can be confident that one of them will produce a good-enough result. My contention is that this is not a valid methodology for reproducing/replacing the work of senior software engineers.
That's not what I'm saying at all. I'm saying that there's a trend showing that you can improve LLM performance significantly by having it generate multiple responses until it produces one that meets some criteria.
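For concreteness, here's a minimal sketch of what I mean (both helpers are hypothetical stand-ins; the criteria could be as simple as "the project's test suite passes"):

```python
# Rough sketch of "sample until one passes the criteria" (rejection sampling).
# `generate_candidate` and `passes_criteria` are hypothetical stand-ins: the
# first would call whatever code model you use, the second could e.g. compile
# the patch and run the project's test suite against it.

def generate_candidate(spec: str, seed: int) -> str:
    """Ask the model for one implementation of `spec` (hypothetical)."""
    raise NotImplementedError

def passes_criteria(candidate: str) -> bool:
    """Lint, build, and test `candidate`; True if everything passes (hypothetical)."""
    raise NotImplementedError

def sample_until_pass(spec: str, budget: int = 1_000) -> str | None:
    for seed in range(budget):
        candidate = generate_candidate(spec, seed)
        if passes_criteria(candidate):
            return candidate
    return None  # budget exhausted without a passing candidate
```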
Formalizing the criteria is not as hard as you're making it out to be. You can have an LLM listen to a conversation with the "customer", ask follow-up questions, and define a clear spec just like a normal engineer. If you doubt it, open up ChatGPT, tell it you're working on X, ask it to ask you clarifying questions, then have it come up with a few proposed plans and tell it which plan to follow.
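If it's easier to picture as code, something like this (just a sketch; `chat` is a hypothetical wrapper around whatever chat model you use, and the prompt wording is illustrative):

```python
# Sketch of the spec-elicitation step: the model interviews the "customer"
# before any code is written. `chat` is a hypothetical helper wrapping
# whatever chat model you use; the prompt is the point, not the API.

def chat(messages: list[dict]) -> str:
    """Send the conversation to a chat model and return its reply (hypothetical)."""
    raise NotImplementedError

SYSTEM_PROMPT = (
    "You are a software engineer gathering requirements. Before proposing "
    "anything, ask me clarifying questions one at a time. Once you have "
    "enough detail, write a concise spec and propose 2-3 implementation "
    "plans for me to choose between."
)

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "I'm working on X."},
]
first_question = chat(conversation)  # the model starts by asking, not coding
```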
> That's not what I'm saying at all. I'm saying that there's a trend showing that you can improve LLM performance significantly by having it generate multiple responses until it produces one that meets some criteria.
I apologize for misinterpreting what you were saying -- I was clearly taking "for the cost of an hour of a software engineer" to mean something that you didn't intend.
> As an example, huggingface just posted an article showing this for math, where with some sampling you can get a 3B model to outperform a 70B one
This is not relevant to our discussion. Again, I'm reasonably sure that I'm not going to be convinced by any research demonstrating that X new tech can increase Y metric by Z%.
> Formalizing the criteria is not as hard as you're making it out to be. You can have an LLM listen to a conversation with the "customer", ask follow-up questions, and define a clear spec just like a normal engineer. If you doubt it, open up ChatGPT, tell it you're working on X, ask it to ask you clarifying questions, then have it come up with a few proposed plans and tell it which plan to follow.
This is much more relevant to our discussion. Do you honestly feel this is an accurate representation of how you'd define the requirements for the pipeline you outlined in your post above? Keep in mind that we're talking about having LLMs work on already-existing large codebases, and I conceded earlier that writing boilerplate/base code for a brand new project is something that LLMs are already quite good at.
Have you worked as a software engineer for a long time? I don't want to assume anything, but all of your points thus far read to me like they're coming from a place of not having worked in software much.
> Have you worked as a software engineer for a long time? I don't want to assume anything, but all of your points thus far read to me like they're coming from a place of not having worked in software much.
Yes, I've been a software engineer working in deep learning for over 10 years, including as an early employee at a leading computer vision company and as founder / CTO of another startup that built multiple large products that ended up getting acquired.
> I apologize for misinterpreting what you were saying -- I was clearly taking "for the cost of an hour of a software engineer" to mean something that you didn't intend.
I meant that, unlike a software engineer, the LLM can do a lot more iterations on the problem given the same budget. So if your boss comes and says "build me a new dashboard page", it can generate 1000s of iterations and use a human-aligned reward model to rank them based on which one your boss might like best (that's what test-time compute / sampling at inference does).
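Roughly this loop, as a sketch only; both helpers are hypothetical placeholders for a code model and a human-preference reward model:

```python
# Sketch of best-of-N with a reward model: generate many candidates and keep
# the one a human-preference model scores highest. `generate_candidate` and
# `reward_model_score` are hypothetical placeholders, not a real API.

def generate_candidate(spec: str, seed: int) -> str:
    """One candidate implementation of the dashboard (hypothetical)."""
    raise NotImplementedError

def reward_model_score(candidate: str) -> float:
    """Higher = more likely to be preferred by a human reviewer (hypothetical)."""
    raise NotImplementedError

def best_of_n(spec: str, n: int = 1_000) -> str:
    candidates = [generate_candidate(spec, seed) for seed in range(n)]
    return max(candidates, key=reward_model_score)
```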
> This is not relevant to our discussion. Again, I'm reasonably sure that I'm not going to be convinced by any research demonstrating that X new tech can increase Y metric by Z%.
> This is much more relevant to our discussion. Do you honestly feel this is an accurate representation of how you'd define the requirements for the pipeline you outlined in your post above? Keep in mind that we're talking about having LLMs work on already-existing large codebases,
I'm saying this will be solved pretty soon. Working with large codebases doesn't work well right now because last year's models had shorter context and were not trained to deal with anything longer than a few thousand tokens. Training these models is expensive, so all of the coding assistant tools like Cursor / Devin are sitting around and waiting for the next iteration of models from Anthropic / OpenAI / Google to fix this issue. We will most likely have announcements of new long-context LLMs in the next 1-2 weeks from Google / OpenAI / DeepSeek / Qwen that will make major improvements on large codebases.
I'd also add that we probably don't want huge sprawling code bases. When the cost of a small custom app that solves just your problem goes to zero, we'll have way more tiny apps / microservices that are much easier to maintain and replace when needed.
Maybe I'm not making myself clear, but when I said "demonstrating that X new tech can increase Y metric by Z%" that of course included reproduction of results. Again, this is not relevant to what I'm saying.
I'll repeat some of what I've said in several posts above, but hopefully I can be clearer about my position: while LLMs can generate code, I don't believe they can satisfactorily replace the work of a senior software engineer. I believe this because I don't think there's any viable path from (A) an LLM generating some code to (B) a well-designed, complete, maintainable system being produced that can be arbitrarily improved and extended, with meaningfully less human time required. I believe this holds true no matter how powerful the LLM in (A) gets, how much it's trained, how long its context is, etc., which is why showing me research or coding benchmarks or huggingface links or some random twitter post is likely not going to change my mind.
> I'd also add that we probably don't want huge sprawling code bases
That's nice, but the reality is that there are lots of monoliths out there, including new ones being built every day. Microservices, while solving some of the problems that monoliths introduce, also have their own problems. Again, your claims reek of inexperience.
Edit: forgot the most important point, which is that you sort of dodged my question about whether you really think that "ask ChatGPT" is sufficient to generate requirements or validation criteria.