I'm glad that, after spending all this time trying to increase power efficiency, people have come up with JavaScript that serves no purpose other than to increase power draw.
I feel sorry for people with budget phones who now have to battle with these PoW systems, and I think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.
This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.
Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.
If you're visiting loads of different websites, that does suck, but most people won't be affected all that much in practice.
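For a sense of what that one-time test costs, here is a minimal sketch of an Anubis-style proof-of-work check. This is not Anubis's actual code; the challenge format and difficulty value are assumptions for illustration. The client brute-forces a nonce whose hash has enough leading zero bits, and the server verifies the result with a single hash.

```python
import hashlib
import os

def has_leading_zero_bits(digest: bytes, bits: int) -> bool:
    """Check whether the hash starts with the required number of zero bits."""
    value = int.from_bytes(digest, "big")
    return value >> (len(digest) * 8 - bits) == 0

def solve(challenge: str, bits: int = 16) -> int:
    """Client side: brute-force a nonce (this is the part that burns CPU time)."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if has_leading_zero_bits(digest, bits):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, bits: int = 16) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return has_leading_zero_bits(digest, bits)

challenge = os.urandom(8).hex()   # issued per session, remembered via the cookie
nonce = solve(challenge)
assert verify(challenge, nonce)
```

A browser that keeps the resulting cookie pays this cost once; a scraper that discards its cookies pays it on every request.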
There are alternatives, of course. Several attempts have been made at standardising remote attestation; Apple added remote attestation to Safari years ago. Basically, Apple/Google/Cloudflare give each user a limited set of "tokens" after verifying that they're a real person on real hardware (using TPMs and whatnot), and you exchange those tokens for website visits. Every user gets a load of usable tokens, but bots quickly run out and get denied access. Making this approach work means locking out Linux users, people with secure boot disabled, and people with outdated or rooted phones, but in return you don't get PoW walls or Cloudflare CAPTCHAs.
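A heavily simplified sketch of the token idea, not the actual attestation protocols (the per-client token count and the plain HMAC signing here are assumptions): an attester hands each verified client a small batch of single-use tokens, and the website redeems one per visit, refusing reuse.

```python
import hashlib
import hmac
import os
import secrets

ISSUER_KEY = os.urandom(32)        # shared with sites that trust this issuer
TOKENS_PER_CLIENT = 100            # humans rarely exhaust this; bots do quickly

def issue_tokens(n: int = TOKENS_PER_CLIENT) -> list[tuple[str, str]]:
    """Attester: after verifying the device, sign n random single-use tokens."""
    tokens = []
    for _ in range(n):
        token = secrets.token_hex(16)
        tag = hmac.new(ISSUER_KEY, token.encode(), hashlib.sha256).hexdigest()
        tokens.append((token, tag))
    return tokens

redeemed: set[str] = set()         # site-side record of spent tokens

def redeem(token: str, tag: str) -> bool:
    """Site: accept a visit only for a validly signed, never-before-seen token."""
    expected = hmac.new(ISSUER_KEY, token.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag) or token in redeemed:
        return False
    redeemed.add(token)
    return True
```

The real schemes use blind signatures so the issuer can't link tokens to the sites where they're spent, but the economics are the same: humans rarely run out, bots do.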
In the end, LLM scrapers are why we can't have nice things. The web will only get worse now that these bots are on the loose.
> Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.
Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.
Ultimately, I don't believe this is an issue that can be solved by technical means; any such attempt will solely result in continuous UX degradation for humans in the long term. (Well, it is already happening.) But of course, expecting any sort of regulation on the manna of the 2020s is just as naive... if anything, this just fits the ideology that the WWW is obsolete, and that replacing it with synthetic garbage should be humanity's highest priority.
> Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.
The reason why Anubis was created was that the author's public Gitea instance was using a ton of compute because poorly written LLM scraper bots were scraping its web interface, making the server generate a ton of diffs, blames, etc. If the AI companies work around proof-of-work blocks by not constantly scraping the same pages over and over, or by detecting that a given site is a Git host and cloning the repo instead of scraping the web interface, I think that means proof-of-work has won. It provides an incentive for the AI companies to scrape more efficiently by raising their cost to load a given page.
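The "scrape more efficiently" incentive is concrete: for a Git host, cloning the repository once is vastly cheaper for both sides than crawling every generated diff and blame page. A hypothetical sketch of what a better-behaved scraper could do (the host list and fallback are just illustrations):

```python
import subprocess
from urllib.parse import urlparse

# Hypothetical: hosts known to expose repositories over plain `git clone`
KNOWN_GIT_HOSTS = {"git.example.org", "gitea.example.net"}

def fetch(url: str, dest: str = "repo") -> None:
    """Clone the repository once instead of crawling its web UI."""
    parsed = urlparse(url)
    if parsed.hostname in KNOWN_GIT_HOSTS:
        # One shallow clone replaces thousands of server-rendered diff/blame pages.
        subprocess.run(["git", "clone", "--depth=1", url, dest], check=True)
    else:
        ...  # fall back to ordinary, rate-limited page fetching
```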
> Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire.
AFAIK, Anubis does not work alone; it works together with traditional per-IP-address rate limiting, and its cookies are bound to the requesting IP address. If the scraper uses a new IP address for each request, it cannot reuse the cookies; if it uses the same IP address to be able to reuse the cookies, it will be restricted by the rate limiting.
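A minimal sketch of that combination, assuming the cookie is simply an HMAC over the client IP and an expiry plus a naive per-IP counter (Anubis's actual token format and rate limiter differ):

```python
import hashlib
import hmac
import os
import time
from collections import defaultdict

SECRET = os.urandom(32)

def issue_cookie(ip: str, ttl: int = 3600) -> str:
    """Issued only after the client has passed the PoW challenge."""
    expires = int(time.time()) + ttl
    tag = hmac.new(SECRET, f"{ip}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{expires}:{tag}"

def cookie_valid(cookie: str, ip: str) -> bool:
    """A cookie replayed from a different IP address fails the HMAC check."""
    expires, tag = cookie.split(":")
    if int(expires) < time.time():
        return False
    expected = hmac.new(SECRET, f"{ip}:{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)

requests_this_minute: dict[str, int] = defaultdict(int)  # reset by a timer elsewhere

def allowed(ip: str, cookie: str | None, limit: int = 60) -> bool:
    """Reusing one IP hits the rate limit; rotating IPs forces the PoW again."""
    requests_this_minute[ip] += 1
    if requests_this_minute[ip] > limit:
        return False
    return cookie is not None and cookie_valid(cookie, ip)
```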
As far as I know, the creator of Anubis didn't anticipate such widespread use, and the anime girl image is just the default. Some sites, like sourcehut, have personalized it.
Attestation is a compelling technical idea, but a terrible economic idea. It essentially creates an Internet that is only viewable via Google and Apple consumer products. Scamming and scraping would become more expensive, but wouldn't stop.
It pains me to say this, but I think that differentiating humans from bots on the web is a lost cause. Proof of work is just another way to burn more coal on every web request, and the LLM oligarchs will happily burn more coal if it reduces competition from upstart LLMs.
Sam Altman's goal is to turn the Internet into an unmitigated LLM training network, and to get humans to stop using traditional browsing altogether, interacting solely via the LLM device Jony Ive is making for him.
Based on the current trajectory, I think he might get his way, if only because the web is so enshittified that we eventually won't have another way to reach mainstream media other than via LLMs.
"It pains me to say this, but I think that differentiating humans from bots on the web is a lost cause."
Ah, but this isn't doing that. All this is doing is raising friction. Taking web pages from 0.00000001 cents per load to 0.001 cents at scale is a huge shift for people who just want to slurp up the world, yet for most human users the cost is lost in the noise.
All this really does is bring the costs into some sort of alignment. Right now it is too cheap to access web pages that may be expensive to generate. Maybe the page has a lot of nontrivial calculations to run. Maybe the server is just overwhelmed by the sheer size of the scraping swarm and the resulting asymmetry of a huge corporation on one side and a $5/month server on the other. A proof-of-work system doesn't change the server's costs much but now if you want to scrape the entire site you're going to have to pay. You may not have to pay the site owner, but you will have to pay.
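To put rough numbers on that, taking the illustrative per-page costs above at face value (they're not measurements): the shift is invisible to a person loading a hundred pages a day and painful for an operation loading a billion.

```python
cheap, pow_cost = 1e-8, 1e-3          # cents per page load, before and after PoW

human_pages_per_day = 100
scraper_pages_per_day = 1_000_000_000

print(f"human with PoW:   {human_pages_per_day * pow_cost:.3f} cents/day")          # 0.100
print(f"scraper with PoW: {scraper_pages_per_day * pow_cost / 100:,.0f} dollars/day")  # 10,000
print(f"scraper before:   {scraper_pages_per_day * cheap / 100:.2f} dollars/day")      # 0.10
```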
If you want to prevent bots from accessing a page that they really want to access, that's another problem. But, that really is a different problem. The problem this addresses is people using small amounts of resources to wholesale-scrape entire sites that take a lot of resources to provide, and if implemented at scale, it would pretty much solve that problem.
It's not a perfect solution, but no such thing is on the table anyhow. "Raising friction" doesn't mean that bots can't get past it. But it will mean they're going to have to be much more selective about what they do. Even the biggest server farms need to think twice about suddenly dedicating hundreds of times more resources to just doing proof-of-work.
It's an interesting economic problem... the web's relationship to search engines has been fraying slowly but surely for decades now. Widespread deployment of this sort of technology is potentially a doom scenario for them, as well as AI. Is AI the harbinger of the scrapers extracting so much from the web that the web finally finds it economically efficient to strike back and try to normalize the relationship?
> Taking web pages from 0.00000001 cents per load to 0.001 cents at scale is a huge shift for people who just want to slurp up the world, yet for most human users the cost is lost in the noise.
If you're going to needlessly waste my CPU cycles, please at least do some mining and donate it to charity.
Anubis author here. Tell me what I'm missing to implement protein folding without having to download gigabytes of scientific data to random people's browsers and I'll implement it today.
What if I turn off my computer? Does the client save its work (i.e. checkpoint)?
> Periodically, the core writes data to your hard disk so that if you stop the client, it can resume processing that WU from some point other than the very beginning. With the Tinker core, this happens at the end of every frame. With the Gromacs core, these checkpoints can happen almost anywhere and they are not tied to the data recorded in the results. Initially, this was set to every 1% of a WU (like 100 frames in Tinker) and then a timed checkpoint was added every 15 minutes, so that on a slow machine, you never lose more than 15 minutes' work.
> Starting in the 4.x version of the client, you can set the 15 minute default to another value (3-30 minutes).
caveat: I have no idea how much data "1 frame" is.
You can't do anything useful with checkpoints because of the same-origin restrictions. Unless you can get browser support for some sort of proof of work that does something useful, that whole line is a non-starter. No single origin involves a useful amount of work.
The problem is that this approach is going to be all overhead. If you sit down and calmly work out the real numbers for distributing computations to a whole bunch of consumer-grade devices, where you can probably only use one core for maybe two seconds at a time a few times an hour, you end up with it being cheaper to just run the computation yourself. My home gaming PC gets 16 CPU-hours per hour, or 57,600 CPU-seconds. (Maybe less if you want to deduct a hyperthreading penalty, but it doesn't change the numbers that much.) Call it 15,000 people each needing to run a couple of these 2-second computations per hour, plus coordination costs, plus serving whatever data goes with the computation, plus infrastructure for tracking all that and presumably serving results back, plus, if you're doing something non-trivial, a quite non-trivial portion of that "2 seconds" I'm carving out for useful work will be wasted setting the job up and then throwing it away. The math just doesn't work very well. Flat-out malware trying to do this on the web never really worked out all that well, and adding the constraint of doing it politely and in such small pieces doesn't work.
And that's ignoring things like needing to be able to prove the work was done for very small chunks. Basically not a practically solvable problem, barring a real stroke of genius somewhere.
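Writing out that back-of-the-envelope math (the 2-second slice and couple-of-runs-per-hour figures are the assumptions above, not measurements):

```python
# One 16-core desktop running flat out:
desktop_cpu_seconds_per_hour = 16 * 3600             # 57,600

# A polite in-browser contributor: one core, ~2 seconds at a time,
# a couple of times an hour.
per_visitor_cpu_seconds_per_hour = 2 * 2              # 4

visitors_needed = desktop_cpu_seconds_per_hour / per_visitor_cpu_seconds_per_hour
print(f"{visitors_needed:,.0f} simultaneous visitors to match one desktop")  # 14,400
# ...and that's before coordination, data transfer, and setup/teardown overhead
# eat into each 2-second slice.
```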
People are using LLMs because search results are terrible (due to SEO overload, Google's bad algorithm, etc.). Anubis makes these already bad search results even worse by trying to block indexing, meaning people will want to use LLMs even more.
So the existence of Anubis will mean even more incentive for scraping.
> This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.
Actually, I will get it zero times, because I refuse to enable javashit for sites that shouldn't need it, and I'll move on to something run by someone competent.
There are lots of ways to define "shouldn't" in this case:
- Shouldn't need it, but include it to track you
- Shouldn't need it, but include it to enhance the page
- Shouldn't need it, but include it to keep their costs down (for example, by loading parts of the page dynamically / per person and caching the rest of the page)
- Shouldn't need it, but include it because it helps stop the bots that are costing them more than the site could reasonably be expected to make
I get it: JS can be used in bad ways, and you don't like it. But the pillar of righteousness that you envision yourself standing on is not as profound as you seem to think it is.
Well, everything’s a tradeoff. I know a lot of small websites that had to shut down because LLM scraping was increasing their CPU and bandwidth load to the point where it was untenable to host the site.
> I feel sorry for people with budget phones who now have to battle with these PoW systems, and I think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.
I dunno. How much work do you really need in PoW systems to make the scrapers go after easier targets? My guess is not so much that you impair a human's UX. And if you do, then you have not fine-tuned your PoW algo, or you have very determined adversaries / scrapers.
As has been stated multiple times in this thread and basically any thread involving conversation on the topic, a PoW with a negligible cost (whether in time, money, or pain-in-the-ass factor) will not impact end users, but will affect LLM scrapers due to the scales involved.
The problem is trying to create a PoW that actually fits that model, is economical to implement, and can't easily be gamed.
But saying "any" seems to imply that it's a theoretical impossibility ("any machine that moves will encounter friction and lose energy to heat conversion, ergo perpetual motion machines are impossible"), when in fact it's a theoretical possibility, just not yet a practical reality.
My phone is a piece of junk from 8 years ago and I haven't noticed any degradation in browsing experience. A website takes like two extra seconds to load, not a big deal.