
I am working on a project with ~200k LoC, entirely written with AI codegen.

These days I use Codex, with GPT-5-Codex + $200 Pro subscription. I code all day every day and haven't yet seen a single rate limiting issue.

We've come a long way. Just 3-4 months ago, LLMs would make a huge mess when faced with a large codebase. They would have massive problems with files over 1k LoC (I know, files should never grow this big).

Until recently, I had to religiously provide the right context to the model to get good results. Codex does not need it anymore.

Heck, even UI seems to be a solved problem now with shadcn/ui + MCP.

My personal workflow when building bigger new features:

1. Describe problem with lots of details (often recording 20-60 mins of voice, transcribe)

2. Prompt the model to create a PRD

3. CHECK the PRD, improve and enrich it - this can take hours

4. Actually have the AI agent generate the code and lots of tests

5. Use AI code review tools like CodeRabbit, or recently the /review function of Codex, iterate a few times

6. Check and verify manually - oftentimes there are still a few minor bugs in the implementation, but they can be fixed quickly - sometimes I just create a list of what I found and pass it back for fixing

With this workflow, I am getting extraordinary results.

AMA.



And I assume there's no actual product that customers are using that we could also demo? Because only 1 out of every 20 or so claims of awesomeness actually has a demoable product to back up those claims. The 1 who does usually has immediate problems. Like an invisible text box rendered over the submit button on their Contact Us page preventing an onClick event for that button.

In case it wasn't obvious, I have gone from rabidly bullish on AI to very bearish over the last 18 months. Because I haven't found one instance where AI is running the show and things aren't falling apart in not-always-obvious ways.


I'm kind of in the same boat although the timeline is more compressed. People claim they're more productive and that AI is capable of building large systems but I've yet to see any actual evidence of this. And the people who make these claims also seem to end up spending a ton of time prompting to the point where I wonder if it would have been faster for them to write the code manually, maybe with copilot's inline completions.


I created these demos using real data, real API connections, and real databases, with 100% AI-written code, at http://betpredictor.io and https://pix2code.com; however, they barely work. At this point, I'm fixing 90% or more of every recommendation the AI gives. With your code base being this large, you can be guaranteed that the AI will not know what needs to be edited. Still, I haven't written a single line of code by hand.


I can't reach either site.


pix2code screenshot doesn't load.


Neither site works bro.


It is true that AI-generated UIs tend to be... weird. In weird ways. Sometimes they are consistent and work as intended, but oftentimes they reveal weird behaviors.

Or at least this was true until recently. GPT-5 is consistently delivering more coherent and better working UIs, provided I use it with shadcn or alternative component libraries.

So while you can generate a lot of code very fast, testing UX and UI is still manual work - at least for me.

I am pretty sure AI should not run the show. It is a sophisticated tool, but it is not a showrunner - not yet.


Nothing much weird about the SwiftUI UIs GPT-5-codex generates for me. And it adapts well to building reusable/extensible components and using my existing components instead of constantly reinventing, because it is good at reading a lot of code before putting in work.

It is also good at refactoring to consolidate existing code for reusability, which makes it easier to extend and change UI in the future. Now I worry less about writing new UI or copy/pasting UI because I know I can do the refactoring easily to consolidate.


If you tell it to use a standard component library, the UIs should be mostly as coherent as the library.


Let me summarise your comment in a few words: show me the money. If nobody is buying anything, there is no incremental value creation or augmentation of existing value in the economy that didn't already exist.


It's not the goal to have AI running the show. There's babysitting required, but it works pretty well tbh.

Note: using it for my B2B e-commerce


What is your opinion on the "right level of detail" we should use when creating technical documents the LLM will use to implement features?

When I started leaning heavily into LLMs I was using really detailed documentation. Not '20 minutes of voice recordings', but my specification documents would easily hit hundreds of lines even for simple features.

The result was decent, but extremely frustrating. Because it would often deliver 80% to 90% but the final 10% to 20% it could never get right.

So, what I naturally started doing was to care less about the details of the implementation and focus on the behavior I want. And this led me to simpler prompts, to the point that I don't feel the need to create a specification document anymore. I just use the plan mode in Claude Code and it is good enough for me.

One way that I started to think about this was that really specific documentation was almost as if I was 'over-fitting' my solution over other technically viable solutions the model could come up with. One example: if I want to sort an array, I could either ask to "sort the array" or "merge sort the array", and by forcing a merge sort I may end up with a worse solution. Admittedly, sort is a pretty simple and unlikely example, but this could happen with any topic. You may ask the model to use a hash set when a better solution would be a bloom filter.

Given all that, do you think investing so much time into your prompts provides a good ROI compared with the alternative of not really min-maxing every single prompt?


I 100% agree with the over-fitting part.

I tend to provide detailed PRDs, because even if the first couple of iterations of the coding agent are not perfect, it tends to be easier to get there (as opposed to having a vague prompt and moving on from there).

What I do sometimes is an experimental run - especially when I am stuck. I express my high-level vision, and just have the LLM code it to see what happens. I do not do it often, but it has sometimes helped me get out of being mentally stuck with some part of the application.

Funnily enough, I am facing this problem right now, and your post might just have reminded me that sometimes a quick experiment can be better than 2 days of overthinking about the problem...


This mirrors my experience with AI so far - I've arrived at mostly using the plan and implement modes in Claude Code with complete but concise instructions about the behavior I want with maybe a few guide rails for the direction I'd like to see the implementation path take. Use cases and examples seem to work well.

I kind of assumed that Claude Code is doing most of the things described in this document under the hood (but I really have no idea).


"The result was decent, but extremely frustrating. Because it would often deliver 80% to 90% but the final 10% to 20% it could never get right."

This is everyone's experience if they don't have a vested interest in LLMs, or if their domain is low risk (e.g., not regulated).


If it's working for you I have to assume that you are an expert in the domain, know the stack inside and out and have built out non-AI automated testing in your deployment pipeline.

And yes Step 3 is what no one does. And that's not limited to AI. I built a 20+ year career mostly around step 3 (after being biomed UNIX/Network tech support, sysadmin and programmer for 6 years).


Yes, I have over 2 decades of programming experience, 15 years working professionally. With my co-founder we built an entire B2B SaaS, coding everything from scratch, did product, support, marketing, sales...

Now I am building something new but in a very familiar domain. I agree my workflow would not work for your average "vibe coder".


> Heck, even UI seems to be a solved problem now with shadcn/ui + MCP.

I'm interested in hearing more about this - any resource you can point me at or do you mind elaborating a bit? TIA!


Basically, you install the shadcn MCP server as described here: https://ui.shadcn.com/docs/mcp

If you use Codex, convert the config to toml:

    [mcp_servers.shadcn]
    command = "npx"
    args = ["shadcn@latest", "mcp"]

Now with the MCP server, you can instruct the coding agent to use shadcn. I often do "If you need to add new UI elements, make sure to use shadcn and the shadcn component registry to find the best-fitting component"

The genius move is that the shadcn components are all based on Tailwind and get COPIED to your project. 95% of the time, the created UI views are just pixel-perfect, spacing is right, everything looks good enough. You can take it from here to personalize it more using the coding agent.
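To make "copied to your project" concrete, here is a minimal sketch of what a generated view tends to look like; the page itself and the component choices are made up for illustration, but the local @/components/ui import paths are shadcn's default convention:

    // app/settings/page.tsx -- illustrative only, not taken from the project above.
    // Button and Card live in your own repo under components/ui/, so the agent
    // (and you) can edit them directly instead of fighting a package API.
    import { Button } from "@/components/ui/button";
    import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card";

    export default function SettingsPage() {
      return (
        <Card className="max-w-md">
          <CardHeader>
            <CardTitle>Notifications</CardTitle>
          </CardHeader>
          <CardContent className="flex justify-end">
            <Button variant="outline">Save changes</Button>
          </CardContent>
        </Card>
      );
    }

Because the files are yours, follow-up prompts like "tighten the spacing" just edit components/ui/*.tsx instead of wrapping a third-party library.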


I've had success here by simply telling Codex which components to use. I initially imported all the shadcn components into my project and then I just say things like "Create a card component that includes a scrollview component and in the scrollview add a table with a dropdown component in the third column"...and Codex just knows how to add the shadcn components. This is without internet access turned on by the way.


Telling which component to use works perfectly too, if you want a very specific look.


> 1. Describe problem with lots of details (often recording 20-60 mins of voice, transcribe)

I just ask it to give me instructions for a coding agent and give it a small description of what I want to do; it looks at my code and details what I described as best it can, and usually that is enough to let Junie (JetBrains AI) run with it.

I can't personally justify $200 a month; I would need to see seriously strong results for that much. I use AI piecemeal because it has always been the best way to use it. I still want to understand the codebase. When things break, it's mostly on you to figure out what broke.


A small description can be extrapolated to a large feature, but then you have to accept the AI filling in the gaps. Sometimes that is cool, oftentimes it misses the mark. I do not always record that much, but if I have a vague idea that I want to verbalize, I use recording. Then I take the transcript and create the PRD based on it. Then I iterate a few more times on the PRD - which yields much better results.


I can recommend one more thing: tell the LLM frequently to "ask me clarifying questions". It's simple, but the effect is quite dramatic, it really cuts down on ambiguity and wrong directions without having to think about every little thing ahead of time.


When do you do that? You give it the PRD and tell it to ask clarifying questions? Will definitely try that.


The "ask my clarifying questions" can be incredibly useful. It often will ask me things I hadn't thought of that were relevant, and it often suggests very interesting features.

As for when/where to do it? You can experiment. I do it after step 1.


Before or after.

"Here is roughly what I want, ask me clarifying questions"

Now I pick and choose and have a good idea if my assumptions and the LLMs assumptions align.


yeah if you read our create_plan prompt, it sets up a 3+ phase back and forth soliciting clarifying questions before the plan is built!


Don't want to come off as combative, but if you code every day with Codex you must not be pushing very hard; I can hit the weekly quota in <36 hours. The quota is real, and if you're multi-piloting you will 100% hit it before the week is over.


On the Pro tier? Plus/Team is only suitable for evaluating the tool and occasional help

Btw one thing that helps conserve context/tokens is to use GPT 5 Pro to read entire files (it will read more than Codex will, though Codex is good at digging) and generate plans for Codex to execute. Tools like RepoPrompt help with this (though it also looks pretty complicated)


Yes, the $200 tier. I do use GPT5/Gemini 2.5 to generate plans that I hand off to codex, that's actually how I keep my agents super busy.


Bracing myself for the inevitability of keeping 3-5 Pro subscriptions at once


I thought about it, but I don't think it's necessary. Grok-4-fast is actually quite a good model, you can just set up a routing proxy in front of codex and route easy queries to it, and for maybe $50/mo you'll probably never hit your GPT plan quota.
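For the curious, a minimal sketch of such a routing proxy; it assumes both providers expose an OpenAI-compatible /v1/chat/completions endpoint and that your coding tool can be pointed at localhost, and the URLs, model name, and "easy query" heuristic are all placeholders (streaming is not handled):

    // routing-proxy.ts -- send "easy" requests to a cheaper model, the rest upstream.
    import http from "node:http";

    const CHEAP = {
      url: "https://api.x.ai/v1/chat/completions", // assumed OpenAI-compatible endpoint
      key: process.env.XAI_API_KEY,
      model: "grok-4-fast", // placeholder model name
    };
    const STRONG = {
      url: "https://api.openai.com/v1/chat/completions",
      key: process.env.OPENAI_API_KEY,
      model: null, // keep whatever model the client asked for
    };

    // Crude heuristic: short prompts without code blocks count as "easy".
    function pickUpstream(body: { messages?: unknown }) {
      const text = JSON.stringify(body.messages ?? "");
      return text.length < 4000 && !text.includes("```") ? CHEAP : STRONG;
    }

    http
      .createServer(async (req, res) => {
        const chunks: Buffer[] = [];
        for await (const chunk of req) chunks.push(chunk as Buffer);
        const body = JSON.parse(Buffer.concat(chunks).toString() || "{}");

        const upstream = pickUpstream(body);
        if (upstream.model) body.model = upstream.model;

        const r = await fetch(upstream.url, {
          method: "POST",
          headers: {
            "content-type": "application/json",
            authorization: `Bearer ${upstream.key}`,
          },
          body: JSON.stringify(body),
        });

        res.writeHead(r.status, { "content-type": "application/json" });
        res.end(await r.text());
      })
      .listen(8787);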


Maybe, but I'd rather pay for consistent access to state of the art quality even if it's slower (which hasn't mattered much while parallelizing)


Fair enough. I spend entire days working on the product, but obviously there are lots of times I am not running Codex - when reviewing PRDs, testing, talking to users, even posting on HN is good for the quota ;)


This sounds very similar to my workflow. Do you have pre-commits or CI beyond testing? I’ve started thinking about my codebase as an RL environment with the pre-commits as hyperparameters. It’s fascinating seeing what coding patterns emerge as a result.


I think pre-commit is essential. I enforce conventional commits (+ a hook which limits commit length to 50 chars) and for Python, ruff with many options enabled. Perhaps the most important one is to enforce complexity limits. That will catch a lot of basic mistakes. Any sanity checks that you can make deterministic are a good idea. You could even add unit tests to pre-commit, but I think it's fine to have the model run pytest separately.
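A minimal sketch of that kind of ruff setup; the rule selection and the complexity threshold below are illustrative choices, not the exact config:

    # pyproject.toml (sketch)
    [tool.ruff.lint]
    # pycodestyle/pyflakes errors, unused arguments, and mccabe complexity
    select = ["E", "F", "ARG", "C901"]

    [tool.ruff.lint.mccabe]
    # fail the hook when a function's cyclomatic complexity exceeds this
    max-complexity = 10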

The models tend to be very good about syntax, but this sort of linting will often catch dead code like unused variables or arguments.

You do need to rule-prompt that the agent may need to run pre-commit multiple times to verify the changes worked, or to re-add files to the commit. Also, frustratingly, you need to be explicit that pre-commit might fail and that it should fix the errors (otherwise it will sometimes run and just say "I ran pre-commit!"). For commits there are some other guardrails, like blanket-denying git add <wildcard>.

Claude will sometimes complain via its internal monologue when it fails a ton of linter checks and is forced to write complete docstrings for everything. Sometimes you need to nudge it to not give up, and then it will act excited when the number of errors goes down.


Very solid advice. I need to experiment more with the pre-commit stuff, I am a bit tired of reminding the model to actually run tests / checks. They seem to be as lazy about testing as your average junior dev ;)


Yes, I do have automated linting (a bit of a PITA at this scale). On the CI side I am using Github Actions - it does the job, but haven't put much work into it yet.

Generally I have observed that using a statically typed language like TypeScript helps catch issues early on. I had much worse results with Ruby.
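A toy illustration of that point; the shape below is made up, but it shows the class of mistake that tsc --noEmit catches before anything runs:

    // An agent-generated typo such as `order.totalcents` is a compile error here,
    // whereas the equivalent Ruby would only fail when that code path executes.
    type Order = { id: string; totalCents: number };

    export function formatTotal(order: Order): string {
      return `$${(order.totalCents / 100).toFixed(2)}`;
    }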


>I am working on a project with ~200k LoC, entirely written with AI codegen.

I'd love to see the codebase if you can share. My experience with LLM code generation is extensive (I've tried all of the popular models and tools, though I generally favor Claude Code with Opus and Sonnet), and my time working with them leads me to suspect that your ~200k LoC project could be solved in only about 10k LoC. Their solutions are unnecessarily complex (I'm guessing because they don't "know" the problem in the way a human does), and that compounds over time. At this point, I would guess my most common instruction to these tools is to simplify the solution. Even when that's part of the plan.


Which of these steps do you think/wish could be automated further? For most of the latter ones, it seems like throwing independent AI reviewers at them could almost fully automate them, maybe with a "notify me" option if there's something they aren't confident about. Could PRD review be made more efficient if it was able to color-code by level of uncertainty? For 1, could you point it to a feed of customer feedback or something and just have the day's draft PRD up and waiting for you when you wake up each morning?


There is definitely way too much plumbing and going back and forth.

But one thing that MUST get better soon is having the AI agent verify its own code. There are a few solutions in place, e.g. using an MCP server to give access to the browser, but these tend to be brittle and slow. And for some reason, the AI agents do not like calling these tools too much, so you kinda have to force them every time.

PRD review can be done, but AI cannot fill the missing gaps the same way a human can. Usually, when I create a new PRD, it is because I have a certain vision in my head. For that reason, the process of reviewing the PRD can be optimized by maybe 20%. Or maybe I just struggle to see how tools could make me faster at reading and commenting on / editing the PRD.


Agents __SHOULD NOT__ verify their own code. They know they wrote it, and they act biased. You should have a separate agent with instructions to red team the hell out of a commit, be strict, but not nitpick/bikeshed, and you should actually run multiple review agents with slightly different areas of focus since if you try to run one agent for everything it'll miss lots of stuff. A panel of security, performance, business correctness and architecture/elegance agents (armed with a good covering set of code context + the diff) will harden a PR very quickly.
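As a rough illustration of that panel idea, here is a minimal sketch using the OpenAI Node SDK; the prompts, the model name, and the git diff range are placeholders, not a prescription:

    // review-panel.ts -- run several narrowly-focused review agents over one diff.
    import { execSync } from "node:child_process";
    import OpenAI from "openai";

    const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
    const diff = execSync("git diff main...HEAD", { encoding: "utf8" });

    const panels = [
      "You are a security reviewer. Flag injection, authz, and secret-handling issues only.",
      "You are a performance reviewer. Flag N+1 queries, hot-path allocations, missing indexes.",
      "You are a correctness reviewer. Flag logic that contradicts the spec or misses edge cases.",
      "You are an architecture reviewer. Flag duplication and abstractions that will not extend well.",
    ];

    async function main() {
      const reviews = await Promise.all(
        panels.map((system) =>
          client.chat.completions.create({
            model: "gpt-5", // placeholder; use whatever model you review with
            messages: [
              { role: "system", content: `${system} Be strict, but do not nitpick style.` },
              { role: "user", content: `Review this diff:\n\n${diff}` },
            ],
          }),
        ),
      );
      reviews.forEach((r, i) =>
        console.log(`\n=== ${panels[i].split(".")[0]} ===\n${r.choices[0].message.content}`),
      );
    }

    main();

Running the reviewers in parallel keeps each one's context small, which is exactly why they catch things a single do-everything reviewer misses.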


Codex uses this principle - /review runs in a subthread, does not see previous context, only git diff. This is what I am using. Or I open Cursor to review code written by GPT-5 using Sonnet.


Do you have examples of this working, or any best practices on how to orchestrate it efficiently? It sounds like the right thing to do, but it doesn't seem like the tech is quite to the point where this could work in practice yet, unless I missed it. I imagine multiple agents would churn through too many tokens and have a hard time coming to a consensus.


I've been doing this with Gemini 2.5 for about 6 months now. It works quite well; it doesn't catch big architectural issues 100% of the time, but it's very good at line/module-level logic issues and anti-patterns.


Have you considered or tried adding steps to create / review an engineering design doc? Jumping straight from PRD to a huge code change seems scary. Granted, given that it's fast and cheap to throw code away and start over, maybe engineering design is a thing of the past. But still, it seems like it would be useful to have it delineate the high-level decisions and tradeoffs before jumping straight into code; once the code is generated it's harder to think about alternative approaches.


It depends. But let me explain.

Adding an additional layer slows things down. So the tradeoff must be worth it.

Personally, I would go without a design doc, unless you work on a mission-critical feature humans MUST specify or deeply understand. But this is my gut speaking, I need to give it a try!


Yeah I'd love to hear more about that. Like the way I imagine things working currently is "get requirement", "implement requirement", more or less following existing patterns and not doing too much thinking or changing of the existing structure.

But what I'd love to see is, if it has an engineering design step, could it step back and say "we're starting to see this system evolve to a place where a <CQRS, event-sourcing, server-driven-state-machine, etc> might be a better architectural match, and so here's a proposal to evolve things in that direction as a first step."

Something like Kent Beck's "for each desired change, make the change easy (warning: this may be hard), then make the easy change." If we can get to a point where AI tools can make those kinds of tradeoffs, that's where I think things get slightly dangerous.

OTOH if AI models are writing all the code, and AI models have contexts that far exceed what humans can keep in their head at once, then maybe for these agents everything is an easy change. In which case, well, I guess having human SWEs in the loop would do more harm than good at that point.


I have LLMs write and review design docs. Usually I prompt to describe the doc, the structure, what tradeoffs are especially important, etc. Then an LLM writes the doc. I spot check it. A separate LLM reviews it according to my criteria. Once everything has been covered in first draft form I review it manually, and then the cycle continues a few times. A lot of this can be done in a few minutes. The manual review is the slowest part.


How does it compare to Cursor with Claude? I've been really impressed with how well Cursor works, but I'm always interested in leveling up if there are better tools, considering how fast this space is moving. Can you comment on how Codex performs vs Cursor?


Claude Code is Claude Code, whether you use it in Cursor or not

Codex and Claude code are neck and neck, but we made the decision to go all in on opus 4, as there are compounding returns in optimizing prompts and building intuition for a specific model

That said I have tested these prompts on codex, amp, opencode, even grok 4 fast via codebuff, and they still work decently well

But they are heavily optimized from our work with opus in particular


What do you mean by "compounding returns" here?


What platform are you developing for, web?

Did you start with Cursor and move to Codex or only ever Codex?


Not OP, but I use Codex for back-end, scripting, and SQL, and Claude Code for most front-end. I have found that when one faces a challenge, the other can often punch through and solve the problem. I even have them work together (moving thoughts and markdown plans back and forth), and that works wonders.

My progression: Cursor in '24, Roo Code in mid '25, Claude Code in Q2 '25, Codex CLI in Q3 '25.


Cursor for me until 3-4 weeks ago, now Codex CLI most of the time.

These tools change all the time, very quickly. Important to stay open to change though.


Yes, it is a web project with next.js + Typescript + Tailwind + Postgres (Prisma).

I started with Cursor, since it offers a well-rounded IDE with everything you need. It also used to be the best tool for the job. These days Codex + GPT-5-Codex is king. But I sometimes go back to Cursor, especially when reading / editing the PRDs or if I need the occasional second opinion from Claude.


Hey, this sounds a lot like what we have been doing. We would love to chat with you, and share notes if you are up for it!

Drop us an email at navan.chauhan[at]strongdm.com


This just won't work beyond a one-person team


Then I will adapt and expand. Have done it before.

I am not giving universal solutions. I am sharing MY solution.


What is the % breakdown of LOC for tests vs application code?


200k LoC + 80k LoC for tests.

I have roughly 2k tests now, but should probably spend a couple of days before production release to double that.


Are you vibe coding or have the 200k LoC been human reviewed?


I would not call it vibe coding. But I do not check all changed lines of code either.

In my opinion, and this is really my opinion, in the age of coding with AI, code review is changing as well. If you speed up how much code can be produced, you need to speed up code review accordingly.

I use automated tools most of the time AND I do very thorough manual testing. I am thinking about a more sophisticated testing setup, including integration tests via using a headless browser. It definitely is a field where tooling needs to catch up.
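For what it's worth, the headless-browser setup can start very small. Here is a minimal Playwright sketch (the route, labels, and success signal are made up for illustration, not the project's actual tests):

    // e2e/signup.spec.ts -- a smoke test the agent can run after UI changes.
    import { test, expect } from "@playwright/test";

    test("signup page renders and submits", async ({ page }) => {
      await page.goto("http://localhost:3000/signup");

      // The form should be visible before we interact with it.
      await expect(page.getByRole("heading", { name: "Create account" })).toBeVisible();

      await page.getByLabel("Email").fill("test@example.com");
      await page.getByLabel("Password").fill("correct horse battery staple");
      await page.getByRole("button", { name: "Sign up" }).click();

      // A redirect to the dashboard is the success signal here.
      await expect(page).toHaveURL(/\/dashboard/);
    });

A handful of these would also catch the "invisible element over the submit button" class of bug mentioned upthread.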


> code review is changing as well.

Hard disagree but you do you.


It’s unbelievable right? I’m flabbergasted that there are engineers like this shipping code.


We've all been waiting for the other shoe to drop. Everyone points out that reviewing code is more difficult than writing it. The natural question is, if AI is generating thousands of lines of code per day, how do you keep up with reviewing it all?

The answer: you don't!

Seems like this reality will become increasingly justified and embraced in the months to come. Really though it feels like a natural progression of the package manager driven "dependency hell" style of development, except now it's your literal business logic that's essentially a dependency that has never been reviewed.


I don't believe they've shipped yet, based on their comments.


Tools change, standards do not.

My process is probably more robust than simply reviewing each line of code. But hey, I am not against doing it, if that is your policy. I had worked the old-fashioned way for over 15 years, I know exactly what pitfalls to watch out for.


And this, my friends, is why software engineering is going down the drain. We've made our profession a joke. Can you imagine an architect or civil engineer speaking like this? These kinds of people make me want to change to a completely new discipline.


Strong feelings are fair, but the architect analogy cuts the other way. Architects and civil engineers do not eyeball every rebar or hand compute every load. They probably use way more automation than you would think.

I do not claim this is vibe coding, and I do not ship unreviewed changes to safety critical systems (in case this is what people think). I claim that in 2025 reviewing every single changed line is not the only way to achieve quality at the scale that AI codegen enables. The unit of review is shifting from lines to specifications.


[flagged]


You can't attack another user like that here.

Since you've continued to break the site guidelines right after we asked you to stop, I've banned the account.

If you don't want to be banned, you're welcome to email [email protected] and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.


You were never an engineer. I'm 18 years into my career on the web and in games, and I was never an engineer. It's blind people leading blind people, and you're somewhere in the middle, based on the 2013 patterns that got you to this point and 2024 advancements called "Vibe Coding", and you get paid $$ to make it work.

Building a bridge from steel that lasts 100 years and carries real living people in the tens or hundreds of thousands per day without failing under massive weather spikes is engineering.


What does PRD mean? I never heard that acronym before.


Product Requirements Document

It is a fairly standardized way of capturing the essence of a new feature. It covers the most important aspects of what the feature is about: the goals, the success criteria, even implementation details where it makes sense.

If there is interest, I can share the outline/template of my PRDs.


I'd be very interested



Wow, very nice. Thank you. That's very well thought out.

I'm particularly intrigued by the large bold letters: "Success must be verifiable by the AI / LLM that will be writing the code later, using tools like Codex or Cursor."

May I ask, what your testing strategy is like?

I think you've encapsulated a good best practices workflow here in a nice condensed way.

I'd also be interested to know how you handle documentation but don't want to bombard you with too many questions


I added that line because otherwise the LLM would generate goals that are not verifiable in development (e.g. certain pages rendering in <300ms - this is not something you can test on your local machine).

Documentation is a different topic - I have not yet found how to do it correctly. But I am reading about it and might soon test some ideas to co-generate documentation based on the PRD and the actual code. The challenge being, the code normally evolves and drifts away from the original PRD.


I think the only way to keep documentation up-to-date is to have it as part of the PR review process. Knowledge needs to evolve with code.

We're working on this at https://dosu.dev/ (open to feedback!)



can you expand on how you use shadcn UI with MCP?


I add the MCP server (https://ui.shadcn.com/docs/mcp)

Then I instruct the coding agent to use shadcn / choose the right component from shadcn component registry

The MCP server has a search / discovery tool, and it can also fetch individual components. If you tell the AI agent to use a specific component, it will fetch it (reference doc here: https://ui.shadcn.com/docs/components)


Can we see it?


No, because everyone that claims to have coded some amazing software with AI Code Generator 3000 never seems to share their project. Curious.


Book a demo! Really, it will not be self-service just yet, because it requires a bit of hand-holding in the beginning.

But I am working on making a solid self-service signup experience - might need a couple of weeks to get it done.


But you claim to have AI write it for you? It can't even do a signup page?


lol, my guy invented programming with more steps


Please don't cross into personal attack. Also, please don't post snark to HN threads. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.


Programming has always had these steps, but traditionally people with different roles would do different parts of it, like gathering requirements, creating product concept, creating development tickets, coding, testing and so on.


200k lines of slop? And zero product to show for it…

This is starting to feel like crypto to me. Not the light use of AI for work that most of us sane people see, but these ridiculous claims of hundreds of thousands of lines of code with amazing results and zero substance backing them up. It is like the amazing scalability claims of some new blockchain which never materialize.

The only solace is that the only people getting scammed here are those paying money for the tools.


It is more than 200k lines of slop. 200k lines of code slop, and 80k lines of test slop.



