Agreed. This is a step in the right direction, but "Firefox works on Mac" is still the easiest, most straightforward answer for anyone asking me, "how can I get a decent ad blocker for the web?"
Having a shared API matters for extensions primarily because it makes it easier for a developer to say, "well, I might as well throw that on the Apple store as well." But if Safari literally can't do the things your extension needs it to do, then the API is kind of a secondary concern.
I haven't looked at AdGuard specifically, but fundamentally, AdGuard's Safari version can't be better than uBlock Origin on Firefox/Chrome, because Safari lacks the APIs to do what uBlock Origin does.
The immediate problem is that Safari has a hard limit of 50,000 rules per extension. Some developers do try to get around this with various hacky strategies: splitting their adblocker up into multiple extensions, moving the blocking to the OS level, or recompiling the list on the fly. But at the end of the day it's a really weird hoop to jump through, it ends up being error-prone, and there's no guarantee that Apple won't break those hacks in the future.
More fundamental is the list itself -- Safari doesn't allow contextual, dynamic blocking; you have to put everything into a static list. This makes it impossible to build the kind of detailed filters[0][1] that uBlock Origin allows -- for example, there's no way to strip HTML content from a GET request in Safari before it gets parsed by the browser.
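To make the "static list" point concrete, here's roughly what a Safari content-blocker rule set looks like, sketched from memory with made-up domains and selectors -- every decision has to be baked into entries like these ahead of time:

    // Sketched from memory: the static, declarative format Safari compiles
    // content blockers from. Domains and selectors below are made up.
    type SafariRule = {
      trigger: {
        "url-filter": string;
        "if-domain"?: string[];
        "resource-type"?: string[];
      };
      action: {
        type: "block" | "block-cookies" | "css-display-none";
        selector?: string;
      };
    };

    const rules: SafariRule[] = [
      // Block requests to a (made-up) tracker domain.
      {
        trigger: { "url-filter": "https?://tracker\\.example\\.com/.*" },
        action: { type: "block" },
      },
      // Hide a (made-up) ad container via a fixed CSS selector.
      {
        trigger: { "url-filter": ".*", "if-domain": ["example.com"] },
        action: { type: "css-display-none", selector: ".ad-banner" },
      },
    ];

    // The whole array is serialized to JSON and handed to Safari once,
    // capped at 50,000 entries -- there's no hook to look at a request
    // while it's actually happening.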
And so on. See the privacy API[2] and the dns API[3], the latter of which is used to block a newer bypass technique where third-party trackers are hidden behind first-party subdomains (CNAME cloaking). Safari also has worse CSS/JS injection, etc.
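For a sense of what the dns API buys you, here's a hedged sketch (not any real blocker's code; the hostnames and tracker list are made up) of how a Firefox extension can un-cloak one of those first-party-looking subdomains:

    declare const browser: any; // WebExtension global (Firefox)

    // Resolve the "first-party" subdomain and check whether its CNAME actually
    // points at a known tracker.
    async function isCnameCloaked(hostname: string): Promise<boolean> {
      const knownTrackers = ["tracker.example.net"];
      const record = await browser.dns.resolve(hostname, ["canonical_name"]);
      const cname: string = record.canonicalName ?? hostname;
      return knownTrackers.some((t) => cname === t || cname.endsWith("." + t));
    }

    // Safari has no equivalent, so a static list can't tell that
    // metrics.somesite.com is really a third-party tracker in disguise.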
I'm not saying everyone else is crap at adblockers, I'm saying static blocklists are fundamentally less powerful and less capable than Firefox's APIs. That's especially true if AdGuard is doing something like system-wide DNS blocking. No shade at those tools, they have a place, but a browser-based extension will always be better than a setup like PiHole, because the browser has more access to a given request's current context.
There are a couple of reasons, but the biggest one is that intercepting HTML at the network level allows you to filter inline script tags from documents[0]. If you want to stop an inline script from running (say to defuse an anti-adblocker popup/redirect), you can't wait until after it gets parsed to remove it. Once it gets added to the page, the browser will execute it.
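For the curious, the Firefox-only mechanism behind this is webRequest.filterResponseData, which lets you rewrite the HTML bytes before the parser ever sees them. A hedged sketch (nothing like uBO's actual implementation; the URL pattern and regex are stand-ins):

    declare const browser: any; // WebExtension global (Firefox)

    browser.webRequest.onBeforeRequest.addListener(
      (details: { requestId: string }) => {
        const filter = browser.webRequest.filterResponseData(details.requestId);
        const decoder = new TextDecoder("utf-8");
        const encoder = new TextEncoder();
        let html = "";

        filter.ondata = (event: { data: ArrayBuffer }) => {
          html += decoder.decode(event.data, { stream: true });
        };
        filter.onstop = () => {
          // Drop inline scripts that look like an anti-adblock popup,
          // *before* the document is parsed and the script can run.
          const cleaned = html.replace(/<script>[^]*?adblock[^]*?<\/script>/gi, "");
          filter.write(encoder.encode(cleaned));
          filter.close();
        };
      },
      { urls: ["*://example.com/*"], types: ["main_frame"] },
      ["blocking"]
    );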
Of course, note that uBlock Origin also supports scriptlets[1] that can be used to rein in scripts after they get added to the page, instead of through resource rewriting. You're not restricted to only doing your filtering at the network level; HTML filtering is just another tool that filter authors have at their disposal.
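A minimal sketch of the flavor of thing a scriptlet does -- this mirrors the spirit of uBO's popup-defusing scriptlets rather than their actual code, and the URL pattern is illustrative:

    // Runs in the page at document_start and defangs a call the page's own
    // scripts will make later.
    (() => {
      const realOpen = window.open;
      window.open = function (url?: string | URL, target?: string, features?: string) {
        // Swallow popups whose URL looks ad-like; let everything else through.
        if (typeof url === "string" && /\/ads?\//.test(url)) {
          return null;
        }
        return realOpen.call(window, url, target, features);
      };
    })();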
Helpfully, on the scriptlet wiki page I've linked, each entry also shows a practical example rule that relies on that scriptlet. You can also take a look at uBlock Origin's default filter lists[2] if you want more examples of people using both HTML filtering and scriptlets in the wild. Many of these block rules are impossible to implement in Safari.
Sharing a notion (dumb question) here, since you're smart about this stuff:
At what point will "ad blocking" flip to "content scraping"? All this cat-and-mouse arms race seems like a bad ROI.
I imagine an adaptive super reader-view mode. Meaning emphasis on content scraping rules vs ad blocking rules.
Take snapshots of top websites, do some rendered-page-aware content diffing, with some user-directed settings & curation, infer where the content is, and distill that down to "good enough" scraping rules.
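Something like this naive scoring pass is roughly what I have in mind for the "infer where the content is" step (purely illustrative -- real readability-style extractors are far more involved):

    // Score candidate containers by text density, pick the best one, and treat
    // its location as the "good enough" scraping rule.
    function guessContentContainer(doc: Document): Element | null {
      let best: Element | null = null;
      let bestScore = 0;
      for (const el of Array.from(doc.querySelectorAll("article, main, section, div"))) {
        const textLength = (el.textContent ?? "").length;
        const linkCount = el.querySelectorAll("a").length;
        // Reward text-heavy blocks, penalize link-heavy (nav/ad) blocks.
        const score = textLength / (1 + linkCount * 50);
        if (score > bestScore) {
          bestScore = score;
          best = el;
        }
      }
      return best;
    }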
Anyway. Thanks for answering some of my lingering questions.
I am not an expert on this stuff, I've just spent slightly more time reading some of the wikis than other people. Take what I say with a grain of salt, other people who are actually building ad blockers or managing blocklists would have more insight.
If you want to swap from a blacklist model to a whitelist model, there are a couple of problems to solve off the top of my head:
- You need a way to refer to content that supports re-hosting: some way to convert the Facebook/Twitter link someone shares into the scraped version without ever loading the original link. See IPFS, but also see Invidious, Nitter, and the Internet Archive for a lower-tech, more straightforward version of what that might look like (there's a small sketch of the link rewriting after this list).
- You need good enough copyright exemptions that it's OK to re-host or proxy the content somewhere else. This is kind of a gray area; people are rehosting web content without getting sued, but it's not clear to me whether that scales from sites like Youtube to the Internet as a whole. I guess nobody's called for Pocket to be taken down, so maybe the situation there is better than I think.
- You need the web to stay relatively semantic. There are a lot of sites that don't work with Pocket/reader mode, and a lot of the sites that do work only do so because there isn't a highly adversarial relationship between Pocket and site operators.
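For the first bullet, a hedged sketch of the link rewriting I mean -- the particular instances (nitter.net, an Invidious host, the Internet Archive) are just illustrative choices:

    // Rewrite a shared link to a re-hosted / proxied front-end without ever
    // loading the original page.
    function toProxiedUrl(original: string): string {
      const url = new URL(original);
      switch (url.hostname.replace(/^www\./, "")) {
        case "twitter.com":
          url.hostname = "nitter.net";
          return url.toString();
        case "youtube.com":
          url.hostname = "invidious.example"; // pick your Invidious instance
          return url.toString();
        default:
          // Fall back to an archived copy rather than the live page.
          return "https://web.archive.org/web/" + original;
      }
    }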
I'm not sure whether a cat-and-mouse game around content scraping would be better or worse than what we have. I can imagine it would be better in the sense that you'd only need to do it once per page, and then distribute the scraped version. But that's assuming that copyright would allow you to do that.
And I suspect that breaking a scraper is easier than breaking an ad blocker. But maybe someone could prove me wrong there.
I hadn't connected those dots. Thanks. Very interesting.
Tangentially, I've been chewing on an idea similar to quotebacks (recently on HN's front page). My naive take was to create a URL shortener to support my use cases. For example, my shortener would attempt to link to the OC, falling back to the Internet Archive (or whatever). I'll now learn about IPFS, Invidious, Nitter.
Also, I didn't do a good job explaining my half-baked notion for implementing a "whitelist"-based content scraper.
I imagined distributing the whitelist to clients' browsers, which would do the actual transformation locally. Your explanation of how uBlock can also rewrite the HTML/DOM client-side sparked this line of thinking. For the whitelist's curators, my notion of capturing and diffing was about providing tools, like a better debugger, which could also be used by front-end developers.
Back to your notion of server-side processing for some combo of caching, transformation, and rendering: I love it.
Opera did something similar with their mini mobile browser, ages ago. The server would render pages to GIF (?) with an image map for interaction, pushed out to the mobile device. Squint a bit and that architecture resembles MS Remote Desktop, PC-Anywhere, X-Windows, etc.
I keep hoping someone trains some advertising-hating AI that will scrape content for me.
re cat-and-mouse
No doubt.
I have written a few scrapers in anger, on mostly structured data, often mitigating compounding ambiguity from standards, specifications, tool stacks, and partners' implementations.
So I just tossed the traditional ETL stuff (mappings, schema compilers) and treated the problem space as scraping. Generally speaking, I used combos of "best match" and resilient XPath-like expressions (e.g. preferring relative over absolute paths) to find data.
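To illustrate what I mean by "resilient, relative" lookups (the label text and XPath are simplified, made-up examples):

    // Anchor on a nearby label and walk relative to it, instead of hard-coding
    // an absolute path that breaks the moment the page shell changes.
    function findValueByLabel(doc: Document, label: string): string | null {
      // Relative: "the element right after the one whose text equals the label",
      // rather than /html/body/div[3]/table[2]/tr[7]/td[2].
      const xpath = `//*[normalize-space(text())="${label}"]/following-sibling::*[1]`;
      const result = doc.evaluate(xpath, doc, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
      return result.singleNodeValue?.textContent?.trim() ?? null;
    }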
No, I know that. The issue is that the actual app extension itself (which is written in Swift and useful for things like picking elements from the page, etc.) will pretend like it cannot function without the Electron app running. Some more details: https://github.com/AdguardTeam/AdGuardForSafari/issues/84#is...
I was getting several sites not working properly with AdGuard. Yeah, you can add whitelists and such, but it's annoying. Some of it was Safari not working.
In the end I just use a browser with built in blocking (AdBlockBrowser or now Brave) and leave Safari alone.
If AdGuard's VPN ad-blocking solution is anything like NextDNS', then it doesn't block ads on YouTube, which is a pretty big source of ads for the lay user.
It's not Open Source, and if it's running in Safari it doesn't have access to any of the APIs I mention elsewhere[0].
I'm sure it's a great extension with a lot of work put into it, but there is no realistic way 1blocker can match the best ad blockers in Chrome/Firefox unless it's hacking Safari behind the scenes to inject new APIs.
Doesn't Apple's content blocking API limit the number of rules to 50,000 per extension? uBlock Origin uses 85,000 rules from EasyList alone, and that's without country-specific and privacy rules. I have about 180,000 active rules at the moment.
To elaborate further on your comment, you can programmatically enable/disable/change/recompile nested rules from your master extension.
I wonder if half the reason the 50k rule limit exists is so developers don't end up with an append-only infinite rule list that takes 20 seconds to recompile. If you're stuck turning 200k rules into four lists of 50k rules, you can just recompile each one on its own thread and all of it will take a couple seconds max.
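A hedged sketch of that split-and-recompile idea -- how AdGuard-style apps actually wire this up to Safari will differ, and the compile callback here is a stand-in for handing the JSON to the native side:

    // Chunk the master list into <=50k-rule content blockers and compile each
    // chunk independently, so a change only forces a recompile of the chunks
    // it touches.
    const SAFARI_RULE_LIMIT = 50_000;

    function chunkRules<T>(rules: T[], limit = SAFARI_RULE_LIMIT): T[][] {
      const chunks: T[][] = [];
      for (let i = 0; i < rules.length; i += limit) {
        chunks.push(rules.slice(i, i + limit));
      }
      return chunks;
    }

    async function recompileAll(rules: object[], compile: (json: string) => Promise<void>) {
      // e.g. 200k rules -> four extensions' worth of JSON, compiled in parallel.
      await Promise.all(
        chunkRules(rules).map((chunk) => compile(JSON.stringify(chunk)))
      );
    }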
That's certainly ideal, and for the most part, Safari content blockers work reasonably well. However, there are things you currently can't block using Safari content blockers that can be blocked with uBlock Origin. One of the most prominent examples is CSS classes, and the inability to block them becomes a real pain if you open the mobile Reddit website. Their mobile website disables scrolling if you block their pesky "use our mobile app" popups.
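For comparison, here's a sketch of what a content script (or a uBO scriptlet-style fix) can do in that situation that a static Safari rule can't -- the selectors are illustrative, since Reddit's real class names are obfuscated and change:

    // Remove the popup *and* undo the side effect it leaves behind.
    function defuseAppPromo(): void {
      // Kill the "use our mobile app" prompt.
      document.querySelectorAll(".xpromo-prompt, .app-banner").forEach((el) => el.remove());
      // The page also locks scrolling while the prompt is up; put it back.
      document.documentElement.style.overflow = "auto";
      document.body.style.overflow = "auto";
    }

    // Re-apply whenever the prompt gets re-inserted by the page's own scripts.
    new MutationObserver(defuseAppPromo).observe(document.documentElement, {
      childList: true,
      subtree: true,
    });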