The very first project I worked on at Cloudflare, back in 2012, was a delta compression-based service called Railgun. We installed software both on the customer's web server and on our end, and so were able to manage shared dictionaries automatically (in this case, versions of pages sent over Railgun were used as the dictionaries). You definitely get incredible compression results.
Delta compression is a huge win for many applications, but it takes a careful hand to make it work well, and inevitably it gets deprecated as the engineers move on and bandwidth stops being a focus-- just like Railgun has been deprecated! https://blog.cloudflare.com/deprecating-railgun
Maybe the basic problem is with how hard it is to find engineers passionate about performance AND compression?
I don't think your characterization of why Railgun was deprecated is accurate. From the blog post you link to:
“I use Railgun for performance improvements.”
Cloudflare has invested significantly in performance upgrades in the eight years since the last release of Railgun. This list is not comprehensive, but highlights some areas where performance can be significantly improved by adopting newer services relative to using Railgun.
Cloudflare Tunnel features Cloudflare’s Argo Smart Routing technology, a service that delivers both “middle mile” and last mile optimization, reducing round trip time by up to 40%. Web assets using Argo perform, on average, 30% faster overall.
Cloudflare Network Interconnect (CNI) gives customers the ability to directly connect to our network, either virtually or physically, to improve the reliability and performance of the connection between Cloudflare’s network and your infrastructure. CNI customers have a dedicated on-ramp to Cloudflare for their origins.
Right, but isn't that part of the general trend of bandwidth becoming far cheaper in the last decade along with dynamic HTML becoming a smaller fraction of total transit?
A 95%+ reduction in bandwidth usage for dynamic server-side-rendered HTML is much less important in 2023 than 2013.
Unless you're part of the large majority of people in the world on slower mobile networks. We keep designing and building for people with broadband / wifi, and missing out just how big the 3G / lousy latency markets are.
I think it's related to the size of the Cloudflare network and how good its connectivity is (and our own fibre backbone). But on the eyeball side bandwidth isn't the only game in town: latency is the silent killer.
No. What Railgun did was enable the two sides of the connection to agree on a shared dictionary (the most recent version of the page being transmitted) and use that to compress the new page. It required both sides to keep a cache of page versions to compare against.
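Roughly the idea, as a minimal sketch using zstd's raw-content dictionary support (Railgun had its own delta encoding, so this is just an analogy, and the function names are made up):

```python
import zstandard as zstd  # pip install zstandard

def compress_against_previous(new_page: bytes, previous_page: bytes) -> bytes:
    """Compress the new version of a page using the previous version as a
    shared dictionary. Both ends must hold the same cached previous copy."""
    d = zstd.ZstdCompressionDict(previous_page, dict_type=zstd.DICT_TYPE_RAWCONTENT)
    return zstd.ZstdCompressor(dict_data=d).compress(new_page)

def decompress_against_previous(delta: bytes, previous_page: bytes) -> bytes:
    """The receiving side reverses it using its own cached copy of the previous version."""
    d = zstd.ZstdCompressionDict(previous_page, dict_type=zstd.DICT_TYPE_RAWCONTENT)
    return zstd.ZstdDecompressor(dict_data=d).decompress(delta)
```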
Even putting aside CORS, because I don't even want to think about how this plays with requests to another (tracking?) domain, this still doesn't seem worth it. The explicit use case seems to be that it basically tells the server when you last visited the site based on which dictionary you have, and then it gives you the moral equivalent of a delta update. Except most browsers are working hard to expire data of this kind for privacy reasons. What's the lifetime of these dictionaries going to be? I can see it being OK if it's like 1 day, but if this outlives how long cookies are stored it's a significant privacy problem. The user visits the site again and essentially a cookie gets sent to the server? The page says “don't put user-specific data in the request”, but nobody is stopping a website from doing this.
I think fingerprinting using this is mostly like the more direct ways to fingerprint with the cache, and the defenses against one are the defenses against the other.
For the cross-site thing, cache partitioning is the defense. If the cache of facebook.com/file is independent for a.com and b.com, Facebook can't link the visits.
An attacker using the hash of a cached resource as a pseudo-cookie could previously use the content of the resource as the pseudo-cookie. The Use-As-Dictionary wildcard allows cleverer implementations, but it seems like you can fingerprint for the same time period/in the same circumstances as before. In both cases you might do your tracking by ignoring how you're supposed to be using the feature; as you note, no one's stopping you.
Before and after this compression feature, it's true that anti-tracking laws, etc. should address tracking via persistent storage in general, not only cookies, much as they need to handle localStorage and other hiding places for data. It's also true that for a browser to robustly defend against linking two visits to the same domain (or to limit the possibility of tracking to a certain time period, session, origin, etc.), caching is one of the things it has to limit.
I think if they get the expiry, partitioning, etc. right (or wrong) for stopping cache fingerprinting, they also get it right (or wrong) for this.
I was admittedly a fan of the original SDCH that didn't take off, figuring that inter-resource redundancy is a thing. It's a neat spin on it to use the compression algorithms' history windows instead of purpose-built diff tools, and to use the existing cache instead of a separate dictionary store off to the side. Seems easier to implement on both ends compared to the previous try. I could see this being helpful for quickly kicking off page load, maybe especially for non-SPAs and imperfectly optimized sites that repeat a not-tiny header across loads.
I think I’d feel better with a fixed set of dictionaries based on a corpus that gets updated every year to match new patterns of traffic and specifications. Even if it’s less efficient.
That wouldn't make sense, as it would be the user agent (a.k.a. your browser) that implements these shared dictionaries, and it wouldn't be able to add non-standard shared dictionaries for libs like react.
If they could do that then they might as well preload the cache with all common libs like react from well known cdn urls.
It's interesting that this is called out specifically for the metadata used by this feature: fingerprinting using this feature has similarities with other cache fingerprinting (I wrote a sibling comment about that).
It's not actively bad to have defense-in-depth measures at the level of the dictionary feature. But if your implementation of dictionaries using your browser's existing cache policies is a privacy problem, I'd consider changing the cache, not just the shared-dictionary implementation.
The dictionaries are partitioned by document and origin so a "tracking" domain will only be able to correlate requests within a given document origin and not across sites.
They are also cleared any time cookies are cleared and don't outlive what you can do today with cookies or Etags (and are using the most restrictive partitioning for that reason).
The original proposal for Zstd was to use a predefined, statistically generated dictionary. Mozilla rejected the proposal for that.
But there's a lot of great discussion there on what Zstd can do, which is astoundingly flexible & powerful. There's discussion of dynamically adjusting compression ratios, and discussion around shared dictionaries and their privacy implications. That Mozilla turned around, started supporting Zstd, and gave shared dictionaries a positive "worth prototyping" indicator is a good initial stamp of approval to see! https://github.com/mozilla/standards-positions/issues/771
One of my main questions after reading this promising update is: how do you pick what to include when generating custom dictionaries? Another comment mentions that brotli has a standard dictionary it uses, and that's one possible starting place. But it feels like tools to build one's own custom dictionary would be ideal.
I agree with other comments concerned with fingerprinting, and it was my second thought reading through the article. But my first thought was how beneficial this could be for return visitors of a web app, and how it could similarly benefit related concerns, such as managing local caches for offline service workers.
True, for documents (as is another comment’s focus) this is perhaps overkill. Although even there, a benefit could be imagined for a large body of documents—it’s unclear whether this case is addressed, but it certainly could be with appropriate support across say preload links[0]. But if “the web is for documents, not apps” isn’t the proverbial hill you’re prepared to die on, this is a very compelling story for web apps.
I don’t know if it’s so compelling that it outweighs privacy implications, but I expect the other browser engines will have some good insights on that.
Even in the "documents" case of the web there can be pretty significant savings if users tend to visit more than one page and they share some amount of structure.
On the first entry to the site you trigger the load of an external dictionary that contains the common parts of the HTML across the site and then future document loads can be delta-compressed against the dictionary, effectively delivering just the page-specific bits.
You need to amortize the cost of loading the dictionary across the other page loads but it's usually pretty compelling once users visit more than 2-3 pages.
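A minimal sketch of that flow, using zstd as a stand-in for the dictionary-aware content encodings the feature actually negotiates over HTTP; the file names are hypothetical:

```python
import zstandard as zstd

# Hypothetical "external dictionary" resource: the markup shared by most pages
# (header, nav, footer, boilerplate), concatenated into one file that the
# browser fetches once on the first visit.
common = open("common-fragments.html", "rb").read()
shared_dict = zstd.ZstdCompressionDict(common, dict_type=zstd.DICT_TYPE_RAWCONTENT)

# Later document loads are compressed against that dictionary, so the response
# body is effectively just the page-specific bits.
page = open("article.html", "rb").read()
delta = zstd.ZstdCompressor(dict_data=shared_dict).compress(page)
print(f"{len(page)} bytes -> {len(delta)} bytes compressed against the site dictionary")
```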
This seems so ludicrous to me when all we really need is a way to share a resource reference across sites. Like “I need react 18.1 on this page, and the SHA should be abcdefghi “. If you don’t have it, I can give it to you from my server, or you can follow this link to a CDN, but the resource itself can be deduplicated based on the hashed contents instead of the URI. Why isn’t this a thing when basically everything uses frameworks nowadays? This shared dictionary seems like a more obtuse and roundabout way to solve these. If there was caching by hashes, browsers could even preload the latest versions of new libraries before any sites even referenced them.
One potential issue is tracking. By sharing caches across websites it becomes possible to use timing attacks to track different users. This is why browsers are working to isolate caches per site: https://developer.chrome.com/blog/http-cache-partitioning
How would dictionaries built into the browser, pre-made with JS in mind, fare? I.e., instead of making a custom dictionary per resource I send to the user, I could say that "my scripts.js file uses the browser's built-in js-es2023-abc dictionary". So the browsers would have some dictionaries others could reuse.
What's the savings on that approach vs a gzipped file without any dictionary?
So Brotli already contains a dictionary that is trained on web traffic. I think the thing here is that Google wants to make sending YouTube 1.1 more efficient if you already have YouTube 1.0, but they can’t put YouTube 1.0 into the browser.
You determine how far back you want to build deltas for. If you build deltas for the last 3 versions then you can send diffs for those users as well (as long as the dictionary hasn't expired). Or, you could just send the full response just like if dictionaries weren't supported.
Each site can decide what a "good" number of releases to build against is, based on typical release cycles and user visitation patterns.
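A rough server-side sketch of that policy, again with zstd standing in for the real dictionary-aware encoding; how the client's dictionary hash arrives, and the file names, are assumptions:

```python
import hashlib
from typing import Optional

import zstandard as zstd

# Hypothetical prior releases we still build deltas for, keyed by their hash.
previous_versions = {
    hashlib.sha256(data).hexdigest(): data
    for data in (open(f, "rb").read() for f in ("app.v1.js", "app.v2.js", "app.v3.js"))
}
current = open("app.v4.js", "rb").read()

def respond(client_dictionary_hash: Optional[str]) -> bytes:
    """Serve a delta against the client's cached version if we still build
    deltas for it; otherwise fall back to the full, non-dictionary response."""
    old = previous_versions.get(client_dictionary_hash or "")
    if old is None:
        return current  # full response, as if dictionaries weren't supported
    d = zstd.ZstdCompressionDict(old, dict_type=zstd.DICT_TYPE_RAWCONTENT)
    return zstd.ZstdCompressor(dict_data=d).compress(current)
```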
The patch system I worked with generated signatures of each build. The signature had the hash of each block of the build. The client has the signature for their version (1.0) and they download the signature of the new version (1.2) and diff the two. Then they download each block that has changed.
I think it was the `electron-updater` for my electron app, but I don't quite remember now.
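For what it's worth, the block-signature scheme described above looks roughly like this (a toy sketch, not electron-updater's actual format):

```python
import hashlib

BLOCK = 64 * 1024  # fixed block size; real systems vary this and often add rolling hashes

def signature(build: bytes) -> list[str]:
    """Hash every fixed-size block of a build."""
    return [hashlib.sha256(build[i:i + BLOCK]).hexdigest()
            for i in range(0, len(build), BLOCK)]

def blocks_to_download(old_sig: list[str], new_sig: list[str]) -> list[int]:
    """Diff the two signatures: the client downloads only the blocks that
    changed or that exist only in the new version."""
    return [i for i, h in enumerate(new_sig) if i >= len(old_sig) or old_sig[i] != h]
```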
It was just a couple years ago that I learned that the Unix compression libraries have a flag to make them “rsync friendly”. They do something with compression blocks to make them more stable across changes. Normally a small change in the middle of a file could change the rest of the output, due to bit packing.
This seems like a possibly huge user/browser fingerprint. Yes, CORS has been taken into account, but for massive touch surface origins (Google, Facebook, doubleclick, etc) this certainly has concerning ramifications.
It’s also insanely complicated. All this effort, so many possible tuples of (shared dictionary, requested resource), none of which make sense to compress on-the-fly per-request, mean it’s specifically for the benefit of a select few sites.
When I saw the headline I thought that Chrome would ship with specific dictionaries (say one for js, one for css, etc) and advertise them and you could use the same server-side. But this is really convoluted.
Don't want to set session cookies? Just provide user-specific compression dictionaries and use them as your session ID! After all, how is the user supposed to notice they got a different dictionary than everyone else?
> I thought that Chrome would ship with specific dictionaries (say one for js, one for css, etc) and advertise them and you could use the same server-side. But this is really convoluted.
More convoluted, but I expect using an old version as the source for the dictionary will yield significantly better results than a generic dictionary for that type of file.
Of course it doesn't help the first load, which might be more noticeable than subsequent loads when not every object has been modified. Perhaps having a standard dictionary for each type for the first request, and using a specific one when an old version is available, would give noticeable extra benefit for those first requests for minimal extra implementation effort.
> [...] mean it’s specifically for the benefit of a select few sites.
It does seem like the ones who benefit from this are large web applications that often ship incremental changes. Which, to be fair, are the ones that can use the most help.
This has the potential of moving the needle from "the app takes 10 seconds to load" to "it loads instantly" for these scenarios. Say what you want about the fact that maybe they should optimize their stuff better; this does give them an easy out.
That being said, yeah this is really convoluted and does seem like a big fingerprinting surface.
Doesn't the fact that resources send different data mean that SRI (Subresource Integrity) checks cannot be performed? As for fingerprinting, it would not be a problem since it is the same as with ETags.
The savings are nice in the best case (like in TFA: switching from version 1.3.4 to 1.3.6 of a lib or whatever) but that Base64 encoded hash is not compressible and so this line basically adds 60+ bytes to the request.
Though from the client side, 60 bytes is likely not really noticeable[1] as a delay in sending the request. Perhaps the server side, which sees many, many client requests, will see an uptick in incoming bandwidth used, but in most cases servers responding to HTTP(S) requests see a lot more outgoing traffic (response sizes are much larger than request sizes, on average), so they have enough incoming bandwidth "spare" that it is not going to be saturated to the point where this has a significant effect. (Rough byte math is sketched below the footnotes.)
--
[1] If the link is slow enough that several lots of 60 bytes are going to have much effect[2], it likely also has such high latency that the difference is dwarfed by the existing delays.
[2] A spotty GPRS connection? Is anything slower than that in common use anywhere?
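The rough arithmetic behind the 60+ bytes, assuming a SHA-256 dictionary hash sent base64-encoded in a request header (the header name and exact formatting below are illustrative, not necessarily what the spec mandates):

```python
import base64, hashlib

digest = hashlib.sha256(b"previously cached version of the resource").digest()  # 32 bytes
value = base64.b64encode(digest).decode()                                       # 44 characters
header_line = f"Available-Dictionary: {value}\r\n"  # illustrative header name
print(len(value), len(header_line))  # 44 and ~68 bytes, and the hash itself won't compress
```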
Upload is usually slower, more latency-sensitive, and suffers from TCP slow start. Pages also make lots of small requests, so header overhead can add up.
HTTP/2 added header compression for these reasons.
The only reason the client is even asking is that the server sent them a header saying it might be beneficial to do so.
And the client definitely has the dictionary data. The only thing it needs is for the server to accommodate the request after leading it down that path in the first place.
I can picture how it could happen, though. If you didn't realize the cost, you might not try to prevent misses. Or you could have a configuration error like sending the header but forgetting to generate pre-compressed data in your build.
If this is a significant issue, a server could collect stats and generate warnings about situations where it's not pulling its weight. Or even automatically disable it if hit rates are terrible.
This plus native web-components is an incredible advance for "the web".
Fingerprinting concerns aside (compression == timing attacks in the general case), the fact that it's nearly network-transparent and framework/webserver compatible is incredible!
What I really want: dictionaries derived from the standards and standard libraries (perhaps once a year or somesuch), which I'd use independently of build system gunk, and while it wouldn't be the tightest squeeze you can get, it would make my non-built assets get very close to built asset size for small to medium sized deployments.
Ah damn I thought this was going to be available to JavaScript. Would be amazing for one use case I have (an HTML page containing inline logs from a load of commands, many of which are substantially similar).
Maybe eventually (as a different spec). We've talked about wanting to support it in the DecompressionStream API or something similar at some point.
If you need it to be able to do compression though then it might be a harder sell since the browser doesn't ship with the compression code for zstd or brotli and would have to justify adding it.
There are wasm modules that do something similar, but having it baked into the browser could allow for further optimization than what's possible with wasm. https://github.com/bokuweb/zstd-wasm
I have no idea if it's possible, but I wonder if a webgpu port could be made? Alternatively, for your use case, maybe you could try applying something like Basis Universal, a fast compression system for textures that seems to have some webgpu loaders... Maybe that could be bent to encoding/decoding text?
The part I'm missing is how these dictionaries are created. Can I use the homepage to create my dictionary, so all other pages that share HTML are more efficiently compressed? How?
For a delta update from one version of a resource to the next, the previous version of the resource itself (e.g. a JS file) is the dictionary.
For stand-alone dictionaries, the brotli code on github has a dictionary_generator that you can use to generate a dictionary. You give it a dictionary size and a bunch of input files and it will generate one. I have a version of it hosted on https://use-as-dictionary.com/ that you can pass up to 100 URLs to and it will generate a dictionary for you (using the brotli tool).
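Not the brotli tool itself, but a rough zstd analogue of that dictionary_generator workflow, just to show the shape of it (paths and sizes are made up, and training generally wants a reasonably large, varied sample set):

```python
import glob

import zstandard as zstd

# Feed a set of representative pages in, get a shared dictionary out.
samples = [open(p, "rb").read() for p in glob.glob("site/**/*.html", recursive=True)]
site_dict = zstd.train_dictionary(64 * 1024, samples)  # target ~64 KB dictionary

# Persist it so it can be served as the dictionary resource and reused
# server-side when compressing responses against it.
with open("site.dict", "wb") as f:
    f.write(site_dict.as_bytes())
```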
Yes, and technically so does x86, but there's a pretty big difference between formats where the data is normalized and expected to be correct and formats that are intended for users and need to do things like name resolution and error checking. Parsing a language made for machines is easy to do faster than you can read the data from RAM, while parsing a high-level language will often happen at <100mbps.
How so? SDCH had sidechannel issues which is part of why it was unshipped. I don't know that someone won't find a way to attack it but the CORS requirement already requires that the dictionary and compressed-resource be readable and the dictionary has to be same-origin as the resources that it compresses.
Combined they mitigate the known dictionary-specific attack vectors.
With shared dictionaries you can compress everything down to under a byte.
Just put the to-be-compressed item into the shared dictionary, somehow distribute that to everyone, and then the compressed artifact consists of a reference to that item.
If the shared dictionary contains nothing else, it can just be a one-bit message whose meaning is "extract the one and only item out of the dictionary".
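Spelled out as a toy (obviously not how any real dictionary scheme addresses entries):

```python
# If the dictionary already contains the whole message, the "compressed"
# artifact is just a reference to it.
dictionary = {0: b"<html>...the entire to-be-compressed item...</html>"}
compressed = bytes([0])                    # "extract item 0 from the dictionary"
decompressed = dictionary[compressed[0]]
assert decompressed == dictionary[0]
```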
What stands out to me is that this creates another 'key' that the browser sends on every request which can be fingerprinted or tracked by the server.
I do not want my browser sending anything that looks like it could be used to uniquely identify me. Ever.
I want every request my browser makes to look like any other request made by another user's browser. I understand that this is what Google doesn't want but why can't they just be honest about it? Why come up with these elaborate lies?
Now to limit tracking exposure, in addition to running the AutoCookieDelete extension I'll have to go find some AutoDictionaryDelete extension to go with it. Boy am I glad the internet is getting better every day.
You're making three assertions, none backed by any evidence. That this is a tracking vector, that it's primarily intended to be a tracking vector, and that they're lying about their motivations.
But your reasoning fails already at the first step, since you just assumed malice rather than do any research. This is not a useful tracking vector. The storage is partitioned by the top window, and it is cleared when cookies are cleared. It's also not really a new tracking vector, it's pretty much the same as ETags.
https://blog.cloudflare.com/cacheing-the-uncacheable-cloudfl...
I am glad to see that things have moved on from SDCH. Be interesting to see how this measures up in the real world.