The very first project I worked on at Cloudflare, back in 2012, was a delta compression-based service called Railgun. We installed software both on the customer's web server and on our end, and so were able to manage shared dictionaries automatically (in this case, versions of pages sent over Railgun were used as the dictionaries). You definitely get incredible compression results.
Delta compression is a huge win for many applications, but it takes a careful hand to make it work well, and inevitably it gets deprecated as the engineers move on and bandwidth stops being a focus-- just like Railgun has been deprecated! https://blog.cloudflare.com/deprecating-railgun
Maybe the basic problem is with how hard it is to find engineers passionate about performance AND compression?
I don't think your characterization of why Railgun was deprecated is accurate. From the blog post you link to:
“I use Railgun for performance improvements.”
Cloudflare has invested significantly in performance upgrades in the eight years since the last release of Railgun. This list is not comprehensive, but highlights some areas where performance can be significantly improved by adopting newer services relative to using Railgun.
Cloudflare Tunnel features Cloudflare’s Argo Smart Routing technology, a service that delivers both “middle mile” and last mile optimization, reducing round trip time by up to 40%. Web assets using Argo perform, on average, 30% faster overall.
Cloudflare Network Interconnect (CNI) gives customers the ability to directly connect to our network, either virtually or physically, to improve the reliability and performance of the connection between Cloudflare’s network and your infrastructure. CNI customers have a dedicated on-ramp to Cloudflare for their origins.
Right, but isn't that part of the general trend of bandwidth becoming far cheaper in the last decade along with dynamic HTML becoming a smaller fraction of total transit?
A 95%+ reduction in bandwidth usage for dynamic server-side-rendered HTML is much less important in 2023 than 2013.
Unless you're part of the large majority of people in the world on slower mobile networks. We keep designing and building for people with broadband / wifi, and missing out just how big the 3G / lousy latency markets are.
I think it's related to the size of the Cloudflare network and how good its connectivity is (and our own fibre backbone). But on the eyeball side bandwidth isn't the only game in town: latency is the silent killer.
No. What Railgun did was enable the two sides of the connection to agree on a shared dictionary (the most recent version of the page being transmitted) and use that to compress the new page. It required both sides to keep a cache of page versions to compare against.
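Roughly the idea, as a minimal sketch using zstd's raw-content dictionary support (Railgun had its own delta encoding, so this is just an analogy, and the function names are made up):

```python
import zstandard as zstd  # pip install zstandard

def compress_against_previous(new_page: bytes, previous_page: bytes) -> bytes:
    """Compress the new version of a page using the previous version as a
    shared dictionary. Both ends must hold the same cached previous copy."""
    d = zstd.ZstdCompressionDict(previous_page, dict_type=zstd.DICT_TYPE_RAWCONTENT)
    return zstd.ZstdCompressor(dict_data=d).compress(new_page)

def decompress_against_previous(delta: bytes, previous_page: bytes) -> bytes:
    """The receiving side reverses it using its own cached copy of the previous version."""
    d = zstd.ZstdCompressionDict(previous_page, dict_type=zstd.DICT_TYPE_RAWCONTENT)
    return zstd.ZstdDecompressor(dict_data=d).decompress(delta)
```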
Even putting aside CORS, because I don't even want to think about how this plays with requests to another (tracking?) domain, this still doesn't seem worth it. The explicit use case seems to be that it basically tells the server when you last visited the site based on which dictionary you have, and then it gives you the moral equivalent of a delta update. Except most browsers are working hard to expire data of this kind for privacy reasons. What's the lifetime of these dictionaries going to be? I can see it being OK if it's like 1 day, but if this outlives how long cookies are stored it's a significant privacy problem. The user visits the site again and essentially a cookie gets sent to the server? The page says “don't put user-specific data in the request”, but nobody is stopping a website from doing this.
I think fingerprinting using this is mostly like the more direct ways to fingerprint with the cache, and the defenses against one are the defenses against the other.
For the cross-site thing, cache partitioning is the defense. If the cache of facebook.com/file is independent for a.com and b.com, Facebook can't link the visits.
An attacker using the hash of a cached resource as a pseudo-cookie could previously use the content of the resource as the pseudo-cookie. The Use-As-Dictionary wildcard allows cleverer implementations, but it seems like you can fingerprint for the same time period/in the same circumstances as before. In both cases you might do your tracking by ignoring how you're supposed to be using the feature; as you note, no one's stopping you.
Before and after this compression feature, it's true that anti-tracking laws, etc. should address tracking via persistent storage in general, not only cookies, much as they need to handle localStorage and other hiding places for data. It's also true that for a browser to robustly defend against linking two visits to the same domain (or to limit the possibility of tracking to a certain time period, session, origin, etc.), caching is one of the things it has to limit.
I think if they get the expiry, partitioning, etc. right (or wrong) for stopping cache fingerprinting, they also get it right (or wrong) for this.
I was admittedly a fan of the original SDCH that didn't take off, figuring that inter-resource redundancy is a thing. It's a neat spin on it to use the compression algorithms' history windows instead of purpose-built diff tools, and to use the existing cache instead of a separate dictionary store off to the side. Seems easier to implement on both ends compared to the previous try. I could see this being helpful for quickly kicking off page load, maybe especially for non-SPAs and imperfectly optimized sites that repeat a not-tiny header across loads.
I think I’d feel better with a fixed set of dictionaries based on a corpus that gets updated every year to match new patterns of traffic and specifications. Even if it’s less efficient.
That wouldn't make sense, as it would be the user agent (a.k.a. your browser) that implements these shared dictionaries, and it wouldn't be able to add non-standard shared dictionaries for libs like react.
If they could do that then they might as well preload the cache with all common libs like react from well known cdn urls.
It's interesting that this is called out specifically for the metadata used by this feature: fingerprinting using this feature has similarities with other cache fingerprinting (I wrote a sibling comment about that).
It's not actively bad to have defense-in-depth measures at the level of the dictionary feature. But if your implementation of dictionaries using your browser's existing cache policies is a privacy problem, I'd consider changing the cache, not just the shared-dictionary implementation.
The dictionaries are partitioned by document and origin so a "tracking" domain will only be able to correlate requests within a given document origin and not across sites.
They are also cleared any time cookies are cleared and don't outlive what you can do today with cookies or Etags (and are using the most restrictive partitioning for that reason).
The original proposal for Zstd was to use a predefined, statistically generated dictionary. Mozilla rejected the proposal for that.
But there's a lot of great discussion there on what Zstd can do, which is astoundingly flexible & powerful. There's discussion of dynamically adjusting compression ratios, and discussion around shared dictionaries and their privacy implications. That Mozilla turned around, started supporting Zstd, and gave shared dictionaries a positive "worth prototyping" indicator is a good initial stamp of approval to see! https://github.com/mozilla/standards-positions/issues/771
One of my main questions after reading this promising update is: how do you pick what to include when generating custom dictionaries? Another comment mentions that brotli has a standard dictionary it uses, and that's one possible starting place. But it feels like tools to build one's own custom dictionary would be ideal.
I agree with other comments concerned with fingerprinting, and it was my second thought reading through the article. But my first thought was how beneficial this could be for return visitors of a web app, and how it could similarly benefit related concerns, such as managing local caches for offline service workers.
True, for documents (as is another comment’s focus) this is perhaps overkill. Although even there, a benefit could be imagined for a large body of documents—it’s unclear whether this case is addressed, but it certainly could be with appropriate support across say preload links[0]. But if “the web is for documents, not apps” isn’t the proverbial hill you’re prepared to die on, this is a very compelling story for web apps.
I don’t know if it’s so compelling that it outweighs privacy implications, but I expect the other browser engines will have some good insights on that.
Even in the "documents" case of the web there can be pretty significant savings if users tend to visit more than one page and they share some amount of structure.
On the first entry to the site you trigger the load of an external dictionary that contains the common parts of the HTML across the site and then future document loads can be delta-compressed against the dictionary, effectively delivering just the page-specific bits.
You need to amortize the cost of loading the dictionary across the other page loads but it's usually pretty compelling once users visit more than 2-3 pages.
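A minimal sketch of that flow, using zstd as a stand-in for the dictionary-aware content encodings the feature actually negotiates over HTTP; the file names are hypothetical:

```python
import zstandard as zstd

# Hypothetical "external dictionary" resource: the markup shared by most pages
# (header, nav, footer, boilerplate), concatenated into one file that the
# browser fetches once on the first visit.
common = open("common-fragments.html", "rb").read()
shared_dict = zstd.ZstdCompressionDict(common, dict_type=zstd.DICT_TYPE_RAWCONTENT)

# Later document loads are compressed against that dictionary, so the response
# body is effectively just the page-specific bits.
page = open("article.html", "rb").read()
delta = zstd.ZstdCompressor(dict_data=shared_dict).compress(page)
print(f"{len(page)} bytes -> {len(delta)} bytes compressed against the site dictionary")
```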
This seems so ludicrous to me when all we really need is a way to share a resource reference across sites. Like “I need react 18.1 on this page, and the SHA should be abcdefghi “. If you don’t have it, I can give it to you from my server, or you can follow this link to a CDN, but the resource itself can be deduplicated based on the hashed contents instead of the URI. Why isn’t this a thing when basically everything uses frameworks nowadays? This shared dictionary seems like a more obtuse and roundabout way to solve these. If there was caching by hashes, browsers could even preload the latest versions of new libraries before any sites even referenced them.
One potential issue is tracking. By sharing caches across websites it becomes possible to use timing attacks to track different users. This is why browsers are working to isolate caches per site: https://developer.chrome.com/blog/http-cache-partitioning
How would dictionaries built into the browser, pre-made with JS in mind, fare? I.e., instead of making a custom dictionary per resource I send to the user, I could say that "my scripts.js file uses the browser's built-in js-es2023-abc dictionary". So the browsers would have some dictionaries others could reuse.
What's the savings on that approach vs a gzipped file without any dictionary?
So Brotli already contains a dictionary that is trained on web traffic. I think the thing here is that Google wants to make sending YouTube 1.1 more efficient if you already have YouTube 1.0, but they can’t put YouTube 1.0 into the browser.
You determine how far back you want to build deltas for. If you build deltas for the last 3 versions then you can send diffs for those users as well (as long as the dictionary hasn't expired). Or, you could just send the full response just like if dictionaries weren't supported.
Each site can decide what a "good" number of releases to build against is, based on typical release cycles and user visitation patterns.
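A rough server-side sketch of that policy, again with zstd standing in for the real dictionary-aware encoding; how the client's dictionary hash arrives, and the file names, are assumptions:

```python
import hashlib
from typing import Optional

import zstandard as zstd

# Hypothetical prior releases we still build deltas for, keyed by their hash.
previous_versions = {
    hashlib.sha256(data).hexdigest(): data
    for data in (open(f, "rb").read() for f in ("app.v1.js", "app.v2.js", "app.v3.js"))
}
current = open("app.v4.js", "rb").read()

def respond(client_dictionary_hash: Optional[str]) -> bytes:
    """Serve a delta against the client's cached version if we still build
    deltas for it; otherwise fall back to the full, non-dictionary response."""
    old = previous_versions.get(client_dictionary_hash or "")
    if old is None:
        return current  # full response, as if dictionaries weren't supported
    d = zstd.ZstdCompressionDict(old, dict_type=zstd.DICT_TYPE_RAWCONTENT)
    return zstd.ZstdCompressor(dict_data=d).compress(current)
```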
The patch system I worked with generated signatures of each build. The signature had the hash of each block of the build. The client has the signature for their version (1.0) and they download the signature of the new version (1.2) and diff the two. Then they download each block that has changed.
I think it was the `electron-updater` for my electron app, but I don't quite remember now.
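For what it's worth, the block-signature scheme described above looks roughly like this (a toy sketch, not electron-updater's actual format):

```python
import hashlib

BLOCK = 64 * 1024  # fixed block size; real systems vary this and often add rolling hashes

def signature(build: bytes) -> list[str]:
    """Hash every fixed-size block of a build."""
    return [hashlib.sha256(build[i:i + BLOCK]).hexdigest()
            for i in range(0, len(build), BLOCK)]

def blocks_to_download(old_sig: list[str], new_sig: list[str]) -> list[int]:
    """Diff the two signatures: the client downloads only the blocks that
    changed or that exist only in the new version."""
    return [i for i, h in enumerate(new_sig) if i >= len(old_sig) or old_sig[i] != h]
```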
It was just a couple years ago that I learned that the Unix compression libraries have a flag to make them “rsync friendly”. They do something with compression blocks to make them more stable across changes. Normally a small change in the middle of a file could change the rest of the output, due to bit packing.
This seems like a possibly huge user/browser fingerprint. Yes, CORS has been taken into account, but for massive touch surface origins (Google, Facebook, doubleclick, etc) this certainly has concerning ramifications.
It’s also insanely complicated. All this effort, so many possible tuples of (shared dictionary, requested resource), none of which make sense to compress on-the-fly per-request, mean it’s specifically for the benefit of a select few sites.
When I saw the headline I thought that Chrome would ship with specific dictionaries (say one for js, one for css, etc) and advertise them and you could use the same server-side. But this is really convoluted.
Don't want to set session cookies? Just provide user-specific compression dictionaries and use them as your session ID! After all, how is the user supposed to notice they got a different dictionary than everyone else?
> I thought that Chrome would ship with specific dictionaries (say one for js, one for css, etc) and advertise them and you could use the same server-side. But this is really convoluted.
More convoluted, but I expect using an old version as the source for the dictionary will yield significantly better results than a generic dictionary for that type of file.
Of course it doesn't help the first load, which might be more noticeable than subsequent loads when not every object has been modified. Perhaps having a standard dictionary for each type for the first request, and using a specific one when an old version is available, would give noticeable extra benefit for those first requests for minimal extra implementation effort.
> [...] mean it’s specifically for the benefit of a select few sites.
It does seem like the ones who benefit from this are large web applications that often ship incremental changes. Which, to be fair, are the ones that can use the most help.
This has the potential of moving the needle from "the app takes 10 seconds to load" to "it loads instantly" for these scenarios. Say what you want about the fact that maybe they should optimize their stuff better; this does give them an easy out.
That being said, yeah this is really convoluted and does seem like a big fingerprinting surface.
Doesn't the fact that resources send different data mean that SRI (Subresource Integrity) checks cannot be performed? As for fingerprinting, it would not be a problem since it is the same as with ETags.
The savings are nice in the best case (like in TFA: switching from version 1.3.4 to 1.3.6 of a lib or whatever) but that Base64 encoded hash is not compressible and so this line basically adds 60+ bytes to the request.
Though from the client side, 60 bytes is likely not really noticeable[1] as a delay in sending the request. Perhaps the server side, which sees many, many client requests, will see an uptick in incoming bandwidth used, but in most cases servers responding to HTTP(S) requests see a lot more outgoing traffic (response sizes are much larger than request sizes, on average), so they have enough incoming bandwidth "spare" that it is not going to be saturated to the point where this has a significant effect. (Rough byte math is sketched below the footnotes.)
--
[1] If the link is slow enough that several lots of 60 bytes are going to have much effect[2], it likely also has such high latency that the difference is dwarfed by the existing delays.
[2] A spotty GPRS connection? Is anything slower than that in common use anywhere?
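The rough arithmetic behind the 60+ bytes, assuming a SHA-256 dictionary hash sent base64-encoded in a request header (the header name and exact formatting below are illustrative, not necessarily what the spec mandates):

```python
import base64, hashlib

digest = hashlib.sha256(b"previously cached version of the resource").digest()  # 32 bytes
value = base64.b64encode(digest).decode()                                       # 44 characters
header_line = f"Available-Dictionary: {value}\r\n"  # illustrative header name
print(len(value), len(header_line))  # 44 and ~68 bytes, and the hash itself won't compress
```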
Upload is usually slower, more latency-sensitive, and suffers from TCP slow start. Pages also make lots of small requests, so header overhead can add up.
HTTP/2 added header compression for these reasons.
The only reason the client is even asking is that the server sent them a header saying it might be beneficial to do so.
And the client definitely has the dictionary data. The only thing it needs is for the server to accommodate the request after leading it down that path in the first place.
I can picture how it could happen, though. If you didn't realize the cost, you might not try to prevent misses. Or you could have a configuration error like sending the header but forgetting to generate pre-compressed data in your build.
If this is a significant issue, a server could collect stats and generate warnings about situations where it's not pulling its weight. Or even automatically disable it if hit rates are terrible.
This plus native web-components is an incredible advance for "the web".
Fingerprinting concerns aside (compression == timing attacks in the general case), the fact that it's nearly network-transparent and framework/webserver compatible is incredible!
What I really want: dictionaries derived from the standards and standard libraries (perhaps once a year or somesuch), which I'd use independently of build system gunk, and while it wouldn't be the tightest squeeze you can get, it would make my non-built assets get very close to built asset size for small to medium sized deployments.
Ah damn I thought this was going to be available to JavaScript. Would be amazing for one use case I have (an HTML page containing inline logs from a load of commands, many of which are substantially similar).
Maybe eventually (as a different spec). We've talked about wanting to support it in the DecompressionStream API or something similar at some point.
If you need it to be able to do compression though then it might be a harder sell since the browser doesn't ship with the compression code for zstd or brotli and would have to justify adding it.
There are wasm modules that do something similar, but having it baked into the browser could allow for further optimization than what's possible with wasm. https://github.com/bokuweb/zstd-wasm
I have no idea if it's possible, but I wonder if a webgpu port could be made? Alternatively, for your use case, maybe you could try applying something like Basis Universal, a fast compression system for textures that seems to have some webgpu loaders... Maybe that could be bent to encoding/decoding text?
The part I'm missing is how these dictionaries are created. Can I use the homepage to create my dictionary, so all other pages that share HTML are more efficiently compressed? How?
For a delta update from one version of a resource to the next, the previous version of the resource itself (e.g. a JS file) is the dictionary.
For stand-alone dictionaries, the brotli code on github has a dictionary_generator that you can use to generate a dictionary. You give it a dictionary size and a bunch of input files and it will generate one. I have a version of it hosted on https://use-as-dictionary.com/ that you can pass up to 100 URLs to and it will generate a dictionary for you (using the brotli tool).
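Not the brotli tool itself, but a rough zstd analogue of that dictionary_generator workflow, just to show the shape of it (paths and sizes are made up, and training generally wants a reasonably large, varied sample set):

```python
import glob

import zstandard as zstd

# Feed a set of representative pages in, get a shared dictionary out.
samples = [open(p, "rb").read() for p in glob.glob("site/**/*.html", recursive=True)]
site_dict = zstd.train_dictionary(64 * 1024, samples)  # target ~64 KB dictionary

# Persist it so it can be served as the dictionary resource and reused
# server-side when compressing responses against it.
with open("site.dict", "wb") as f:
    f.write(site_dict.as_bytes())
```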
Yes, and technically so does x86, but there's a pretty big difference between formats where the data is normalized and expected to be correct and formats that are intended for users and need to do things like name resolution and error checking. Parsing a language made for machines is easy to do faster than you can read the data from RAM, while parsing a high-level language will often happen at <100mbps.
How so? SDCH had sidechannel issues which is part of why it was unshipped. I don't know that someone won't find a way to attack it but the CORS requirement already requires that the dictionary and compressed-resource be readable and the dictionary has to be same-origin as the resources that it compresses.
Combined they mitigate the known dictionary-specific attack vectors.
With shared dictionaries you can compress everything down to under a byte.
Just put the to-be-compressed item into the shared dictionary, somehow distribute that to everyone, and then the compressed artifact consists of a reference to that item.
If the shared dictionary contains nothing else, it can just be a one-bit message whose meaning is "extract the one and only item out of the dictionary".
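Spelled out as a toy (obviously not how any real dictionary scheme addresses entries):

```python
# If the dictionary already contains the whole message, the "compressed"
# artifact is just a reference to it.
dictionary = {0: b"<html>...the entire to-be-compressed item...</html>"}
compressed = bytes([0])                    # "extract item 0 from the dictionary"
decompressed = dictionary[compressed[0]]
assert decompressed == dictionary[0]
```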
What stands out to me is that this creates another 'key' that the browser sends on every request which can be fingerprinted or tracked by the server.
I do not want my browser sending anything that looks like it could be used to uniquely identify me. Ever.
I want every request my browser makes to look like any other request made by another user's browser. I understand that this is what Google doesn't want but why can't they just be honest about it? Why come up with these elaborate lies?
Now to limit tracking exposure, in addition to running the AutoCookieDelete extension I'll have to go find some AutoDictionaryDelete extension to go with it. Boy am I glad the internet is getting better every day.
You're making three assertions, none backed by any evidence. That this is a tracking vector, that it's primarily intended to be a tracking vector, and that they're lying about their motivations.
But your reasoning fails already at the first step, since you just assumed malice rather than do any research. This is not a useful tracking vector. The storage is partitioned by the top window, and it is cleared when cookies are cleared. It's also not really a new tracking vector, it's pretty much the same as ETags.
https://blog.cloudflare.com/cacheing-the-uncacheable-cloudfl...
I am glad to see that things have moved on from SDCH. Be interesting to see how this measures up in the real world.