The proxy pattern here is clever - essentially treating the LLM context window as an untrusted execution environment and doing credential injection at a layer it can't touch.
One thing I've noticed building with Claude Code is that it's pretty aggressive about reading .env files and config when it has access. The proxy approach sidesteps that entirely since there's nothing sensitive to find in the first place.
Wonder if the Anthropic team has considered building something like this into the sandbox itself - a secrets store that the model can "use" but never "read".
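For anyone curious what that layer looks like, here's a minimal sketch in TypeScript (not Anthropic's implementation - the upstream host and UPSTREAM_TOKEN are made up). The credential lives only in the proxy process; the agent is pointed at localhost and never sees it:

    // Local proxy that injects a secret the agent can "use" but never read.
    import http from "node:http";
    import https from "node:https";

    const UPSTREAM_HOST = "api.example.com";    // hypothetical upstream API
    const TOKEN = process.env.UPSTREAM_TOKEN!;  // lives only in the proxy's env

    http.createServer((req, res) => {
      const upstream = https.request(
        {
          host: UPSTREAM_HOST,
          path: req.url,
          method: req.method,
          headers: { ...req.headers, host: UPSTREAM_HOST,
                     authorization: `Bearer ${TOKEN}` },  // injected here, outside the context window
        },
        (up) => { res.writeHead(up.statusCode ?? 502, up.headers); up.pipe(res); }
      );
      req.pipe(upstream);
    }).listen(8899);  // the agent only ever talks to http://localhost:8899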
The accounting pain nicbou mentioned is real. Bank reconciliation seems simple - two lists, match them - but then you hit timing differences where something cleared on different dates in each system, or description mismatches where the bank shows "PAYPAL *ACME" but you recorded "Acme Ltd - Invoice 4521".
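To make it concrete, a toy matching rule ends up looking something like this sketch (the 3-day window and shared-token heuristic are illustrative, not any real accounting tool's logic), and it still misses plenty of real cases:

    type Txn = { date: string; amount: number; desc: string };

    // Normalise "PAYPAL *ACME" and "Acme Ltd - Invoice 4521" into comparable tokens.
    const normalise = (s: string) =>
      s.toLowerCase().replace(/[^a-z0-9 ]/g, " ").replace(/\s+/g, " ").trim();

    const daysBetween = (a: string, b: string) =>
      Math.abs(Date.parse(a) - Date.parse(b)) / 86_400_000;

    function matches(bank: Txn, ledger: Txn): boolean {
      if (bank.amount !== ledger.amount) return false;             // amounts must agree exactly
      if (daysBetween(bank.date, ledger.date) > 3) return false;   // timing tolerance
      const bankTokens = normalise(bank.desc).split(" ");
      const ledgerDesc = normalise(ledger.desc);
      return bankTokens.some((t) => t.length > 3 && ledgerDesc.includes(t));
    }

    matches(
      { date: "2024-05-03", amount: -120, desc: "PAYPAL *ACME" },
      { date: "2024-05-01", amount: -120, desc: "Acme Ltd - Invoice 4521" }
    ); // true - cleared two days apart, matched on the shared "acme" token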
Transaction categorisation is arguably worse because there's no universal standard. What one accountant calls "Office Expenses" another puts in "General Admin" - both correct for their context. Any automation that works for one client's books tends to break when you switch to another.
The timezone handling alone makes Temporal worth the wait. I've lost count of how many bugs I've shipped because Date silently converts everything to local time when you least expect it.
The ZonedDateTime type is the real win here - finally a way to say "this is 3pm in New York" and have it stay 3pm in New York when you serialize and deserialize it. With Date you'd have to store the timezone separately and reconstruct it yourself, which everyone gets wrong eventually.
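A quick sketch with the @js-temporal/polyfill (Temporal is still a TC39 proposal, so the import changes once it ships natively) of what "stays 3pm in New York" looks like in practice:

    import { Temporal } from "@js-temporal/polyfill";

    const meeting = Temporal.ZonedDateTime.from(
      "2025-03-14T15:00:00[America/New_York]"
    );

    const wire = meeting.toString();
    // "2025-03-14T15:00:00-04:00[America/New_York]" - offset AND zone survive serialisation

    const restored = Temporal.ZonedDateTime.from(wire);
    console.log(restored.hour, restored.timeZoneId);  // 15 "America/New_York"

    // The old way: the zone is gone, and getHours() answers in whatever
    // timezone the reading machine happens to be in.
    new Date("2025-03-14T15:00:00-04:00").getHours();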
Only downside I can see is the learning curve. Date was bad but it was consistently bad in ways we all memorized. Temporal is better but also much larger - lots of types to choose between.
The DoorDash pizza arbitrage comparison is apt. Both cases expose the same fundamental thing: venture-subsidised pricing creates artificial market conditions that clever people will exploit.
What I find interesting is how long these windows stay open. You'd think someone at Stamps.com or UPS would notice the pricing anomaly, but large organisations are often too siloed. The team setting international rates probably doesn't talk to whoever monitors small parcel economics.
The author mentions making a few hundred dollars - but the real question is scalability. At what volume does this become attractive enough for the postal services to close the loophole? There's probably a sweet spot between "not worth their attention" and "actually profitable."
Building CodeIQ - an AI tool that automates transaction coding for accountants and bookkeepers.
The interesting technical bit: it analyses your historic general ledger to reverse-engineer how you specifically categorise transactions. So instead of generic rules, it learns your firm's actual patterns - "oh, they always code Costa Coffee to Staff Welfare, not Refreshments" - that kind of thing.
Posts directly to Xero, QuickBooks, Sage, and Pandle. The VAT handling turned out to be surprisingly gnarly (UK tax rules are... something).
Been working on it about 6 months now. Still figuring out the right balance between automation confidence and "just flag this for human review".
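The core idea, stripped of everything that makes it actually work (illustrative sketch only, not the real pipeline): count how each payee was historically coded, suggest the majority account, and only auto-post above a confidence threshold.

    type LedgerLine = { payee: string; account: string };

    function learnPatterns(history: LedgerLine[]) {
      // payee -> account -> number of times the firm coded it that way
      const counts = new Map<string, Map<string, number>>();
      for (const { payee, account } of history) {
        const perPayee = counts.get(payee) ?? new Map<string, number>();
        perPayee.set(account, (perPayee.get(account) ?? 0) + 1);
        counts.set(payee, perPayee);
      }
      return (payee: string) => {
        const perPayee = counts.get(payee);
        if (!perPayee) return { account: null, confidence: 0 };
        const total = [...perPayee.values()].reduce((a, b) => a + b, 0);
        const [account, n] = [...perPayee.entries()].sort((a, b) => b[1] - a[1])[0];
        return { account, confidence: n / total };
      };
    }

    const suggest = learnPatterns([
      { payee: "Costa Coffee", account: "Staff Welfare" },
      { payee: "Costa Coffee", account: "Staff Welfare" },
      { payee: "Costa Coffee", account: "Refreshments" },
    ]);
    suggest("Costa Coffee");
    // account "Staff Welfare", confidence ~0.67 - below a 0.9 threshold, so flag for review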
The point about grid size assumptions is interesting. I studied physics and one of the first things you learn is that your choice of coordinate system can make a problem trivially easy or impossibly hard. Same underlying reality, wildly different solution paths.
Reminded me of a pattern I keep seeing in business software - teams spend months optimizing the wrong abstraction. They'll build an incredibly efficient data pipeline that turns out to process information nobody actually needs, or an algorithm that minimizes compute time when the real bottleneck is waiting for a human approval that takes 3 days.
The simulated annealing approach wasn't wrong per se - it's just that "minimise distance walked" was never actually the objective function that mattered to the humans doing the walking.
The measurement problem here is real. "10x faster" compared to what exactly? Your best day or your average? First-time implementation or refactoring familiar code?
I've noticed my own results vary wildly depending on whether I'm working in a domain where the LLM has seen thousands of similar examples (standard CRUD stuff, common API patterns) versus anything slightly novel or domain-specific. In the former case, it genuinely saves time. In the latter, I spend more time debugging hallucinated approaches than I would have spent just writing it myself.
The atrophy point is interesting though. I wonder if it's less about losing skills and more about never developing them in the first place. Junior developers who lean heavily on these tools might never build the intuition that comes from debugging your own mistakes for years.
The quality variation from month to month has been my experience too. I've noticed the models seem to "forget" conventions they used to follow reliably - like proper error handling patterns or consistent variable naming.
What's strange is sometimes a fresh context window produces better results than one where you've been iterating. Like the conversation history is introducing noise rather than helpful context. Makes me wonder if there's an optimal prompt length beyond which you're actually degrading output quality.
Remember that the entire conversation is literally the query you're making, so the longer it gets, the more you're relying on the model to comprehend the whole thing and work out what's actually relevant.
The benchmark point is interesting but I think it undersells what the complexity buys you in practice. Yes, a minimal loop can score similarly on standardised tasks - but real development work has this annoying property of requiring you to hold context across many files, remember what you already tried, and recover gracefully when a path doesn't work out.
The TODO injection nyellin mentions is a good example. It's not sophisticated ML - it's bookkeeping. But without it, the agent will confidently declare victory three steps into a ten-step task. Same with subagents - they're not magic, they're just a way to keep working memory from getting polluted when you need to go investigate something.
The 200-line version captures the loop. The production version captures the paperwork around the loop. That paperwork is boring but turns out to be load-bearing.
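As a sketch of what that paperwork looks like (illustrative only - not nyellin's code, not the 200-line version, not anyone's production agent): keep the task list outside the model and re-inject the open items every turn, so the loop can't declare victory three steps in.

    type Todo = { description: string; done: boolean };

    async function runAgent(
      todos: Todo[],
      step: (openItems: string[]) => Promise<{ completed: string[] }>  // one model/tool turn
    ) {
      for (let turn = 0; turn < 50 && todos.some((t) => !t.done); turn++) {
        const open = todos.filter((t) => !t.done).map((t) => t.description);
        // The bookkeeping: remind the model of every open item, every turn.
        const { completed } = await step(open);
        for (const t of todos) {
          if (completed.includes(t.description)) t.done = true;
        }
      }
      return todos.every((t) => t.done) ? "done" : "gave up with items still open";
    }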
This site has gone full Tower of Babel. I've seen at least a thousand "AI comment" callouts on this site in the last month and at this point I'm pretty sure 99% of them are wrong.
In fact, can someone link me to a disputed comment that the consensus ends up being it's actually AI? I don't think I've seen one.
You know how the chicken sexers do their thing, but can't explain it? Like they can't write a list of things they check for. And when they want to train new people they have them watch (apprentice style) the current ones, and eventually they also become good at doing it themselves?
It's basically that. I can't explain it (I tried listing the tells in a comment below), but it's not just a list of things you notice. You notice the whole message, the cadence, the phrases that "add nothing". You play with enough models, you see enough generations and you start to "see it".
If you'd like to check for yourself, check that user's comment history. It will become apparent after a few messages. They all have these tells. I don't know how else to explain it, but it's there.
Yeah on a second look GP might actually be on to something here. Jackfranklyn only makes top-level comments, never replies to anyone, and I count at least 3 instances of "as someone who does this for a living" that are too separated in scope to be plausibly realistic.
You might notice I wasn't responding to your specific claim about a particular comment but to a later post by a different poster commenting on a wider phenomenon. Perhaps stop trying so hard to insert the idea you want to argue against into posts where it doesn't actually exist just so you can have something to argue about. (Especially given there are many direct responses to your post actually arguing with your claim that you could instead argue with.)
The tells are in the cadence. And the not x but y. And the last line that basically says nothing, while using big words. It's like "In conclusion", but worded differently. Enough tells for me to click on their history. They have the exact same cadence on every comment. It's a bit more sophisticated than "chatgpt write a reply", but it's still 100% aigen. Check it out, you'll see it after a few messages in their history.
No, it doesn't. The "I'm an expert at AI detection" crowd likes to cite things like "It's not X, it's Y" and other expression patterns without stopping to think that perhaps LLMs regurgitate those patterns because they are frequently used in written speech.
I assign a <5% probability that GP comment was AI written. It's easy to tell, because AI writing has no soul.
The message is 100% AI written. And if you click on their username and check their comment history you'll see that ALL their comments are "identical". Just do it, you'll see it by the 5th message. No one talks like that. No one talks like that on every message.
Exactly - if a comment just feels a little off but you're unsure, do a quick scan of the profile; it takes 15-30 seconds at most to get sufficient signal.
If it's actually AI, the pattern becomes extremely obvious reading them back-to-back. If no clear pattern, I'll happily give them the benefit of the doubt at that point. I don't particularly care if someone occasionally cleans up a post with an LLM as long as there is a real person driving it and it's not overused.
The other day on Reddit I saw a post in r/sysadmin that absolutely screamed karma farming AI and it was really depressing seeing a bunch of people defending them as the victim of an anti-AI mob without noticing the entire profile was variations of generic "Does anyone else dislike [Tool X], am I alone? [generic filler] What does everyone else think?" posts.
Looking at their profile I'm inclined to agree. But I think in isolation, this one post isn't setting off enough red flags for me. At the very least, they aren't just using default prompts.
I think at this point it's not easy to accurately detect whether or not something is AI written. A real person can definitely write like this. In fact, that's probably where the LLMs got their writing style from.
This resonates with something I noticed in client work. A surprising number of "urgent" requests resolve themselves if you wait a day - the person either figures it out, realises they asked the wrong question, or the underlying situation changes.
The tricky part is building enough trust that people don't feel ignored. I've started replying with "I'll look at this tomorrow" rather than going silent. Same delay, but it signals intentionality. People seem fine waiting when they know you've acknowledged the request.
Though I'll admit the line between strategic delay and just being slow is thin when you're managing multiple things at once.