
I just finished reading da Vinci's biography by Walter Isaacson, which left me with a different sense of what it means to finish a piece of work. He famously never "finished" anything and eventually abandoned most of the projects he started.

He worked on the Mona Lisa for 16 years, adding a brush stroke here and there until his death, never handing it to the wool merchant who commissioned it or to his wife, who was the subject of the painting.

His work is largely a collection of drafts and anti-*'s, but that hasn't taken away from his transformative role in the history of art, science, and engineering. There is beauty in unfinished work and in what we abandon. Finality is not necessary for greatness.


I built a very similar extension [1] a couple of months ago that supports a wide range of models, including Claude, and enables them to take control of a user's browser using tools for mouse and keyboard actions, observation, etc. It's a fun little project to look at to understand how this type of thing works.
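
To give a concrete sense of what those tools look like, here's a minimal sketch of a "click" tool in the Anthropic tool-use format (simplified and hypothetical; not BrowserBee's actual code, and the names are made up):

    // Hypothetical sketch of a browser-action tool exposed to the model
    // (Anthropic tool-use format: name / description / input_schema).
    const clickTool = {
      name: "browser_click",
      description: "Click the element matching a CSS selector on the current page.",
      input_schema: {
        type: "object",
        properties: {
          selector: { type: "string", description: "CSS selector of the element to click" },
        },
        required: ["selector"],
      },
    } as const;

    // Runs inside the page (e.g. via chrome.scripting.executeScript) when the
    // model calls the tool; the returned string goes back into the context.
    function executeClick(selector: string): string {
      const el = document.querySelector<HTMLElement>(selector);
      if (!el) return `No element matches selector: ${selector}`;
      el.click();
      return `Clicked ${selector}`;
    }

Observation tools (screenshot, read text, list interactive elements) follow the same pattern, and the agent loop just alternates between model calls and tool executions.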

It's clear to me that the tech just isn't there yet. The information density of a web page with standard representations (DOM, screenshot, etc) is an order of magnitude lower than that of, say, a document or piece of code, which is where LLMs shine. So we either need much better web page representations, or much more capable models, for this to work robustly. Having LLMs book flights by interacting with the DOM is sort of like having them code a web app using assembly. Dia, Comet, Browser Use, Gemini, etc are all attacking this and have big incentives to crack it, so we should expect decent progress here.

A funny observation was that some models have clearly been fine-tuned for web browsing tasks, as they have memorized specific selectors (e.g. "the selector for the search input in Google Search is `.gLFyf`").

[1] https://github.com/parsaghaffari/browserbee


It is kind of funny how these systems are set up: for a lot of these tasks there is already dense, queryable information out there, but it gets ignored in favor of the difficult challenge of brute-forcing the human-facing consumer UI instead of using an existing API that is already designed to be machine readable. E.g. booking flights: travel agents use software that queries all the airlines' ticket inventory to return flight information to you, the consumer. The problem of booking a flight is theoretically solved already by virtue of the APIs that exist to do just that. But for AI agents this is now a stumbling block, because it would presumably take a little bit of time to craft a rule to cover this edge case and return far more accurate information and results. Consumers with no alternative don't know what they are missing, so there is no incentive to improve this.


To add to this, it is even funnier that travel agents undergo training in order to be able to interface with and operate the "machine-readable" APIs for booking flight tickets.

A paradoxical situation emerges: human travel agents still need to train for the machine interface, while AI agents are being trained to take over those human jobs by using the consumer interfaces (aka booking websites) available to us.


This is exactly the conversation I had with a colleague of mine. They were excited about how LLMs can help people interact with data and visualize it nicely, but I just had to ask - with as little snark as possible - if this wasn't what a monitor and a UI were already doing? It seems like these LLMs are being used as the cliche "hammer that solves all the problems" where problems didn't even exist. Just because we are excited about how an LLM can chew through formatted API data (which is hard for humans to read) doesn't mean that we didn't already solve this with UIs displaying this data.

I don't know why people want to turn the internet into a turn-based text game. The UI is usually great.


I’ve been thinking about this a lot too, in terms of signal/noise. LLMs can extract signal from noise (“summarize this fluff-filled 2 page corporate email”) but they can also create a lot of noise around signal (“write me a 2 page email that announces our RTO policy”).

If you’re using LLMs to extract signal, then the information should have been denser/more queryable in the first place. Maybe the UI could have been better, or your boss could have had better communication skills.

If you’re using them to CREATE noise, you need to stop doing that lol.

Most of the uses of LLMs that I see are either extracting signal or making noise. The exception to these use cases is making decisions that you don't care about and don't want to make on your own.

I think this is why they’re so useful for programming. When you write a program, you have to specify every single thing about the program, at the level of abstraction of your language/framework. You have to make any decision that can’t be automated. Which ends up being a LOT of decisions. How to break up functions, what you name your variables, do you map/filter or reduce that list, which side of the API do you format the data on, etc. In any given project you might make 100 decisions, but only care about 5 of them. But because it’s a program, you still HAVE to decide on every single thing and write it down.

A lot of this has been automated (garbage collectors remove a whole class of decision making), but some of it can never be. Like maybe you want a landing page that looks vaguely like a skate brand. If you don’t specifically have colors/spacing/fonts all decided on, an LLM can make those decisions for you.


That's a nice way of explaining it. I also feel like some sort of LLM purist for being critical of features that serve only to pollute emails and comms with robotic text not written by an actual person. As societies, we will have to come up with a new metric for TL;DR or "this was a perfectly cohesive and concise text", since LLMs have obscured the line.


This was the Rabbit R1's conundrum. Uber/DoorDash/Spotify have APIs for external integration, but they require business deals and negotiations.

So how do you avoid talking to the service's business people? Provide a chain of Rube Goldberg machines to use these services more or less as if it were the user. That can then be touted as flexibility, and the state of technology blamed when it inevitably breaks, if it even worked in the first place.


This is definitely true, but there are more reasons why so many teams choose the seemingly irrational path. First, APIs are all designed differently, so even if you decide the business negotiation is worth it, you have development work ahead. Second, tons of vendors don't even have an API. So the thought of building a tool once is appealing.


Those are of course valid points. The counterpart is that a vendor might not have an API because they actively don't want one (Twitter/X, for instance), and when they do have one, clients trying to circumvent their system to basically scrape the user-facing UI won't be welcomed either.

So most of the time that path of "build a tool once" will be adversarial towards the service, which will be incentivized to actively kill your ad-hoc integration if they can without too much collateral damage.


This is a massive problem in healthcare, at least here in Canada. Most of the common EMRs doctors and other practitioners use either don’t have APIs, or if APIs exist they are closely guarded by the EMR vendors. And EMRs are just one of the many software tools clinics have to juggle.

I’d argue that lack of interoperability is one of the biggest problems in the healthcare system here, and getting access to data through the UI intended for humans might just end up being the only feasible solution.


I'm not sure how unique or how new a problem this is, first to me individually and then generally.

Technologies for UI automation have existed since long before LLMs and work quite fine.

Having intentionally imprecise, non-deterministic software try to behave in a deterministic manner, like all the software we're used to, is something else.


The people who use these UIs are already imprecise and non-deterministic, yet that hasn't stopped anyone from hiring them.

The potential advantage of using non-deterministic AI for this is that 1) “programming” it to do what needs to be done is a lot easier, and 2) it tends to handle exceptions more gracefully.

You’re right that the approach is nothing new, but it hasn’t taken off, arguably at least in part because it’s been too cumbersome to be practical. I have some hope that LLMs will help change this.


The cost to develop and maintain UI automation is prohibitive for most companies


It raises the question, though: if these vendors guard their APIs so closely to shake people down for an enterprise license, why would they suddenly be permissive towards an LLM subverting that payment flow? Chances are the fact that LLMs can interact with these systems is a blip: once they see appreciable adoption, the systems will be locked down to prevent the LLM from essentially pirating the service for you.


It's because of legacy systems and people who basically have a degenerate attitude toward user interface/user experience. They see job security in a friction-heavy process. Hence the "brute forcing": it's easier than appealing to human nature.


Dude, you do not understand how bad those "APIs" for booking flights are. Customers of Travelport often have screen-reading software that reads/writes to a green screen. There's also teletype, but, like, most of the GDS providers use old IBM TPF mainframes.

I spent the first two years of my career in that space; we joked that anything invented after Michael Jackson's "Thriller" wasn't present.


Somewhere in the world there is someone crying while using QIK…


And yet, they exist, and software has been built on top of them already.


Those APIs aren't generally available to the public, are they?


Not always, but Anthropic is not exactly the public either.


Just dumping the raw DOM into the LLM context is brutal on token usage. We've seen pages that eat up 60-70k tokens when you include the full DOM plus screenshots, which basically maxes out your context window before you even start doing anything useful.

We've been working on this exact problem at https://github.com/browseros-ai/BrowserOS. Instead of throwing the entire DOM at the model, we hook into Chromium's rendering engine to extract a cleaner representation of what's actually on the page. Our browser agents work with this cleaned-up data, which makes the whole interaction much more efficient.
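
Roughly the idea, as a simplified sketch (not our actual extraction code): keep only visible, interactive elements and emit a compact indexed listing instead of the raw DOM.

    // Simplified sketch of the idea (not the actual extraction code):
    // keep only visible, interactive elements and emit a compact indexed
    // listing for the model instead of serializing the raw DOM.
    function compactPage(): string {
      const interactive = document.querySelectorAll<HTMLElement>(
        "a, button, input, select, textarea, [role='button'], [role='link']"
      );
      const lines: string[] = [];
      interactive.forEach((el) => {
        const rect = el.getBoundingClientRect();
        if (rect.width === 0 || rect.height === 0) return; // skip invisible elements
        const label =
          el.getAttribute("aria-label") ||
          (el as HTMLInputElement).placeholder ||
          el.textContent?.trim().slice(0, 60) ||
          "";
        lines.push(`[${lines.length}] <${el.tagName.toLowerCase()}> ${label}`);
      });
      // Typically a few hundred lines of text instead of tens of thousands of DOM tokens.
      return lines.join("\n");
    }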


Maybe people will start making simpler/smaller websites in order to work better with AI tools. That would be nice.


You just need to capture the rendering and represent that.


Playwright's MCP had a strong idea in defaulting to the accessibility tree instead of the DOM. Unfortunately, even that is pretty chonky.


This is really interesting. We've been working on a smaller subset of this problem space. We've also found that in some cases you need to somehow pass the model the sequence of events that happened (like a video of a transition).

For instance, we were running a test case on an e-commerce website that had a random popup which would come up after the initial DOM was rendered but before an action could be taken. This would confuse the LLM about the next action it needed to take, because it didn't know the popup had come up.


It could work similarly to Claude Code, right? Where it won't ingest the entire codebase, but rather searches for certain strings, or starts at a directed location and follows references from there. Indeed it seems infeasible to ingest the whole thing.


The LLM should not be seeing the raw DOM in its context window, but a highly simplified and compact version of it.

In general LLMs perform worse both when the context is larger and also when the context is less information dense.

To achieve good performance, all input to the prompt must be made as compact and information dense as possible.

I built a similar tool as well, but for automating generation of E2E browser tests.

Further, you can have sub-LLMs help with compacting aspects of the context prior to handing it off to the main LLM. (Note: it's important that, by design, HTML selectors cannot be hallucinated)
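
For example, one simple way to guarantee selectors can't be hallucinated is to have the page scan assign numeric IDs and let the model reference only those IDs (a rough sketch, not my tool's actual code):

    // Rough sketch (not my tool's actual code): the page scan assigns numeric
    // IDs; the model can only reference IDs from this list, never raw selectors.
    const elementRegistry = new Map<number, HTMLElement>();

    function indexInteractiveElements(): string {
      elementRegistry.clear();
      const lines: string[] = [];
      document
        .querySelectorAll<HTMLElement>("a, button, input, select, textarea")
        .forEach((el) => {
          const id = elementRegistry.size;
          elementRegistry.set(id, el);
          lines.push(`[${id}] ${el.tagName.toLowerCase()}: ${el.textContent?.trim().slice(0, 40) ?? ""}`);
        });
      return lines.join("\n");
    }

    // The "click" tool takes an ID, not a selector, so a hallucinated ID is simply rejected.
    function clickById(id: number): string {
      const el = elementRegistry.get(id);
      if (!el) return `Unknown element id ${id}; re-run the page scan.`;
      el.click();
      return `Clicked element ${id}`;
    }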

Modern LLMs are absolutely capable of interpreting web pages proficiently if implemented well.

That being said, things like this Claude product seem to be fundamentally poorly designed from both a security and general approach perspective and I don't agree at all that prompt engineering is remotely the right way to remediate this.

There are so many companies pushing out junk products where the AI is just handling the wrong part of the loop and pulls in far too much context to perform well.


This is exactly it! We built a browser agent and got awesome results by designing the context in a simplified/compact form + using small/efficient LLMs - it's smooth.sh if you'd like to try it.


> The LLM should not be seeing the raw DOM in its context window, but a highly simplified and compact version of it.

Precisely! There is already something called the accessibility tree, which the Chromium rendering engine constructs and which is a semantically meaningful version of the DOM.

This is what we use at BrowserOS.com
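
For illustration only, here is the same idea using Puppeteer's public page.accessibility.snapshot() API (not how we hook into Chromium, but it shows what the tree looks like and how much smaller it is than the raw DOM):

    // Illustration only: dump the accessibility tree via Puppeteer's public API.
    import puppeteer from "puppeteer";

    async function dumpAccessibilityTree(url: string): Promise<void> {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(url);

      // The snapshot contains only nodes exposed to assistive technology:
      // roles, accessible names, values; far smaller than the raw DOM.
      const snapshot = await page.accessibility.snapshot();
      console.log(JSON.stringify(snapshot, null, 2));

      await browser.close();
    }

    dumpAccessibilityTree("https://example.com").catch(console.error);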


Is it just me, or do both of my sibling comments pitching competing AI projects read like they're written by (the same underlying) AI?


You're exactly right! I see the problem now.


It's not just an ad; it is a fundamental paradigm shift.


> Having LLMs book flights by interacting with the DOM is sort of like having them code a web app using assembly.

The DOM is merely the inexpensive option, but obviously the answer can't lie solely in the DOM; it has to involve the visual representation layer, because that's what is ultimately presented to the user.

Also, the DOM is already the subject of cat-and-mouse games; this will just add new scale and urgency to the problem. Now people will be putting fake content into the DOM and hiding content in the visual layer.


It also surely leaves more room for prompt injection that the user can’t see


I had the same thought that really an LLM should interact with a browser viewport and just leverage normal accessibility features like tabbing between form fields and links, etc.

Basically the LLM sees the viewport as a thumbnail image and goes "That looks like the central text, read that", and then some underlying skill implementation selects and returns the textual content from the viewport.


I'm trying to build an automatic form filler (not just web forms, any form) and I believe the secret lies in chaining a whole bunch of LLM, OCR, form-understanding and other APIs together to get there.

Just one LLM or agent is not going to cut it at the current state of the art. Just looking at the DOM/client-side source doesn't work, because you're basically asking the LLM to act like a browser and redo the website rendering that the browser already does better (good luck with newer forms written in Angular bypassing the DOM). IMO the way to go is to have the toolchain look at the forms/websites the same way humans do (purely visually, AFTER the rendering is done) and take it from there.

Source: I tried to feed web page source into LLMs and ask them to fill out forms (a Firefox add-on), but webdevs are just too creative in the millions of ways they can ask for a simple freaking address (for example).

Super tricky anyway, but there's no more annoying API than manually filling out forms, so worth the effort hopefully.
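
As a rough sketch of the "purely visual" direction (illustrative only; the model name, prompt, and flow are placeholders, not what my add-on actually does):

    // Illustrative sketch: screenshot the fully rendered page and ask a
    // vision-capable model to describe the visible form fields. Model name,
    // prompt, and endpoint usage are placeholders.
    import puppeteer from "puppeteer";

    async function describeFormFields(url: string, apiKey: string): Promise<string> {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: "networkidle0" }); // wait for rendering to settle
      const screenshot = await page.screenshot({ encoding: "base64", fullPage: true });
      await browser.close();

      const response = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
        body: JSON.stringify({
          model: "gpt-4o", // any vision-capable model
          messages: [{
            role: "user",
            content: [
              { type: "text", text: "List every visible form field in this screenshot with its label and expected data type." },
              { type: "image_url", image_url: { url: `data:image/png;base64,${screenshot}` } },
            ],
          }],
        }),
      });
      const data = await response.json();
      return data.choices[0].message.content;
    }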


> It's clear to me that the tech just isn't there yet.

Totally agree. This was the thesis behind MCP-B (now WebMCP https://github.com/MiguelsPizza/WebMCP)

HN Post: https://news.ycombinator.com/item?id=44515403

DOM and visual parsing are dead ends for browser automation. Not saying models are bad; they are great. The web is just not designed for them at all. It's designed for humans, and humans, dare I say, are pretty impressive creatures.

Providing an API contract between extensions and websites via MCP allows an AI to interact with a website as a first-class citizen. It just requires buy-in from website owners.

It's being proposed as a web standard: https://github.com/webmachinelearning/webmcp
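
To give a flavor of the concept (a purely hypothetical shape, NOT the actual WebMCP API; see the proposal for the real surface): the site declares a structured tool, and the agent calls it instead of clicking through the DOM.

    // Purely hypothetical illustration of the concept; NOT the actual WebMCP
    // API (see the proposal repo for the real surface). The /api/cart
    // endpoint is made up.
    type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

    const siteTools = new Map<string, { description: string; handler: ToolHandler }>();

    function registerSiteTool(name: string, description: string, handler: ToolHandler): void {
      siteTools.set(name, { description, handler });
    }

    // The website owner opts in by describing its own capabilities.
    registerSiteTool(
      "add_to_cart",
      "Add a product to the shopping cart by SKU.",
      async ({ sku, quantity }) => {
        const res = await fetch("/api/cart", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ sku, quantity }),
        });
        return res.json();
      }
    );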


I suspect this kind of framework will be adopted by websites with income streams that are not dependent on human attention (i.e. advertising revenue, mostly). They have no reason to resist LLM browser agents. But if they’re in the business of selling ads to human eyeballs, expect resistance.

Maybe the AI companies will find a way to resell the user’s attention to the website, e.g. “you let us browse your site with an LLM, and we’ll show your ad to the user.”


Even websites whose primary source of revenue is not ad impressions might be resistant to letting agents be the primary interface through which users interact with their service.

Instacart currently seems to be very happy to let ChatGPT Operator use its website to place an order (https://www.instacart.com/company/updates/ordering-groceries...) [1]. But what happens when the primary interface for shopping with Instacart is no longer their website or their mobile app? OpenAI could demand a huge take rate for orders placed via ChatGPT agents, and if they don't agree to it, ChatGPT can strike a deal with a rival company and push traffic to that service instead. I think Amazon is never going to agree to let other agents use its website for shopping for the same reason (they will restrict it to just Alexa).

[1] - the funny part is the Instacart CEO quit shortly after this and joined OpenAI as CEO of Applications :)


The side-panel browser agent is a good middle ground to this issue. The user is still there looking at the website via their own browser session, the AI just has access to the specific functionality which the website wants to expose to it. The human can take over or stop the AI if things are going south.


The primary client for WebMCP-enabled websites is a Chrome extension like Claude Chrome, so the human is still there in the loop, looking at the screen. MCP also supports things like elicitation, so the website could stop the model and request human input/attention.


> humans, dare I say, are pretty impressive creatures

Damn straight. Humanism in the age of tech obsession seems to be contrarian. But when it takes billions of dollars to match a 5 year-old’s common sense, maybe we should be impressed by the 5 year old. They are amazing.


Just took a quick glance at your extension and observed that it's currently using the "debugger" permission. What features necessitated using this API rather than leveraging content scripts and less invasive WebExtensions APIs?


How do screen readers work? I’ve used all the aria- attributes to make automation/scraping hopefully more robust, but don’t have experience beyond that. Could accessibility attributes also help condense the content into something more manageable?


Do we regret, yet, letting the Semantic Web wither on the vine?


It didn't really wither on the vine, it just moved to JSON REST APIs with React as the layer that maps the model to the view. What's missing is API discovery which MCP provides.

The problem with the concept is not really the tech. The problem is the incentives. Companies don't have much incentive to offer APIs, in most cases. It just risks adding a middleman who will try and cut them out. Not many businesses want to be reduced to being just an API provider, it's a dead end business and thus a dead end career/lifestyle for the founders or executives. The telcos went through this in the early 2000s where their CEOs were all railing against a future of becoming "dumb pipes". They weren't able to stop it in the end, despite trying hard. But in many other cases companies did successfully avoid that fate.

MCP+API might be different or it might not. It eliminates some of the downsides of classical API work, like needing to guarantee stability and commit to a feature set. But it still poses the risk of losing control of your own brand and user experience. The obvious move is for OpenAI to come along and demand a rev share if too many customers are interacting with your service via ChatGPT, just like Google effectively demands a rev share for sending traffic to your website because so many customers interact with the internet via web search.


You might get it when bots write pages.


/s No, because if it doesn't help people consume, it's NOT important.


I think this will fail for the same reason RSS failed - the business case just isn't there.


Super cool


Shocking no one has mentioned Jian Yang's hotdog app :) [1]

[1] https://www.youtube.com/watch?v=tWwCK95X6go&ab_channel=Felix

Love the simplicity of this.


I wonder if the same would also be true for immunosuppressants administered for autoimmune conditions. Given they mostly interact with the signaling pathways, I guess in theory they should also be more effective in the morning if there is more immune cell activity going on.


Good analysis. Would it make sense to look at cumulative capital raised in addition to whether the companies have raised a Series A, to account for large seed rounds, which don't seem uncommon with this cohort of companies? Series A as a milestone could obscure some details, e.g. a company has raised a small seed round previously so the next round is labelled as a Series A, or a company has raised a large seed round so doesn't need a Series A within 24 months.


I've been working on a browser use agent embedded within a Chrome extension: https://github.com/parsaghaffari/browserbee

You can use it to check and summarize news and social media, fill out forms, send messages, book holidays, do your online shopping, conduct research, and pretty much anything else that can be done within a browser.


Nice - would you be interested in making this into a SaaS service? We are starting to open up our "automation API" and could maybe work together on bringing your extension to something that doesn't use Playwright and a remote browser.


The raison d'être for BrowserBee is to control the user's browser in a private fashion so they can automate tasks that require them to be logged in. I'm unsure how it would work in a remote browser setup - tools like Browser Use and BrowserBase seem to cover that use case already.

The key differentiator is privacy and local control. When users need to automate tasks on sites where they're already authenticated (banking, personal accounts, work systems), they need their actual browser with their existing sessions and cookies, not a remote instance.


Looks powerful, at least for read-only use cases. Will have a look and compare token stats. Thanks.


Great suggestion, will add custom Ollama configurations to the next release


It can fill forms - the agent can invoke a large number of tools to both observe and interact with a page


How does it do so? Just DOM manipulation, viewport scanning or something of the sort?


Is that with 2.5 Flash? I got that error intermittently with that model, but the other Gemini models worked fine. I'll investigate.


Ah yeah, 2.0 Flash is working. 2.5 doesn't, and the OpenAI 4.0 and mini models don't work either. The error message should probably say to try other models, because I was pretty confused.

