Personally I prefer GitLab. Just wanted to look at a bug I noted in their product this week (European weeks start on Monday :) ). Gave up because I couldn't find my way around to check whether it had already been reported. But at least with unlimited time I could probably do it.
Github raw seems like the simplest system for solving cache invalidation: invalidate the cache of a changed file when it’s pushed.
They have access to both GitHub and the raw service. I know there are usually all sorts of layers between that make interconnectivity logistically complicated, but am I wrong that at the top-level it’s that simple?
That is too simple for the feature they are using. The client itself has its own cache, and the only way to fully prevent traffic from a client is to tell it that the content it caches will remain valid for some amount of time into the future.
For URLs that return the latest entry there is no valid amount of time known in advance by GitHub, unless they want to introduce mandatory publication delays. For URLs of specific change sets, the content should never change again, so an effectively infinite cache lifetime is valid unless a user overrides good git practices.
I think GitHub frequently misidentifies which scenario they are in: when they return a TTL of 1 day for a current-state URL, users notice, while when they return 5 minutes for a permanent change set that gets a lot of traffic, they lose network capacity.
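A rough sketch of that distinction, with made-up header values (GitHub's actual policy isn't public, so treat the numbers as placeholders):

    import re

    # Full 40-char hex SHAs identify immutable content; branch names do not.
    COMMIT_SHA = re.compile(r"^[0-9a-f]{40}$")

    def cache_control_for(ref: str) -> str:
        """Pick a Cache-Control value for a raw-file URL based on its ref."""
        if COMMIT_SHA.match(ref):
            # Content at a specific commit never changes (short of history
            # rewrites), so it can be cached essentially forever.
            return "public, max-age=31536000, immutable"
        # Branch ("latest state") URLs: no safe TTL is known in advance.
        return "public, max-age=300, must-revalidate"

    print(cache_control_for("4f0c2b8e9d3a7c1f5b6a8d9e0f1a2b3c4d5e6f70"))  # immutable
    print(cache_control_for("main"))                                      # short TTL

Anything keyed to a branch only ever gets a heuristic TTL; anything keyed to a commit can be cached as long as you like.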
Tag it with etag = hash, done. The client side isn't the hard part.
The server side would require pushing any invalidation out to (I imagine) a whole tree of caches, which isn't exactly that hard if you plan for it from the start and have some way for upstream to tell downstream about file changes. But, well, they probably don't, as I'd imagine they didn't expect people to pin their infrastructure to some binary blob on GitHub that mutates.
Etag doesn't let you "fully prevent traffic from a client" (GP's exact words). They'll still send a request to which you need to reply with a 304 after checking the resource.
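To make that concrete, a minimal sketch of the conditional-request dance (names and values are made up). The client still makes the round trip every time; the validator only saves the server from resending the body:

    from http import HTTPStatus

    def handle_conditional_get(if_none_match, current_etag, body):
        """Answer a GET that may carry If-None-Match. A matching validator lets
        the server reply with an empty 304, but the request itself still arrives
        and still has to be answered."""
        if if_none_match is not None and if_none_match == current_etag:
            # (a real server would also handle weak validators and lists)
            return HTTPStatus.NOT_MODIFIED, b""
        return HTTPStatus.OK, body

    status, payload = handle_conditional_get('"abc123"', '"abc123"', b"file contents")
    print(status)  # HTTPStatus.NOT_MODIFIED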
They wouldn’t even need to check the resource. If the hash (which you get for free since it’s part of the git commit) was the etag then they can quickly reply to that request from an edge.
You get it "for free" if you load up the repo and check some files. That's not free at all.
In fact, loading a file by name from a Git repo is rather expensive, and is definitely not the way their CDN should be keeping things in cache (gotta load ref, uncompress and parse commit object, uncompress and read tree object(s), just to get the blob's hash. Every one of those objects is both deflated and delta-encoded.)
Nono: don't do it at lookup, set the etag when the repo is pushed.
I'm open to the idea that it's less computation to simply hash the files instead of inflating and delta-decoding the stored objects, but my point was that the hash is already calculated on the client when the changed file is added to the repo.
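For what it's worth, git's blob id really is just a hash over the raw contents plus a short header, so the value the client computed at `git add` time could in principle double as a validator:

    import hashlib

    def git_blob_sha1(content: bytes) -> str:
        """Reproduce git's blob object id: SHA-1 over "blob <size>\\0" plus the
        raw file bytes. This is what `git hash-object <file>` prints."""
        header = b"blob %d\x00" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    print(git_blob_sha1(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a

The expensive part described above is looking that id up by path through the commit and tree objects, not computing it.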
Once you are using one style of caching, it's usually a mistake to introduce another, even if the second style is only a marginal addition. They very clearly have an issue with max-age between clients/CDNs in the second half of the thread, and probably have similar problems on internal transparent proxies, etc.
You could also just commit a file with the intended commit hash, make that the indicator for changes and use the commit in other requests. Has the added benefit that clients only need to fetch a tiny file if nothing changed.
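Something like this is what I have in mind (repo name, file paths, and URL layout are all hypothetical):

    import urllib.request

    # Hypothetical layout: a tiny CURRENT_COMMIT file in the repo records which
    # commit the real payload should be fetched from.
    RAW = "https://raw.githubusercontent.com/example-owner/example-repo"

    def fetch(url: str) -> bytes:
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def get_payload(last_seen_commit):
        """Fetch the pointer file first; only fetch the payload if it moved."""
        commit = fetch(f"{RAW}/main/CURRENT_COMMIT").decode().strip()
        if commit == last_seen_commit:
            return commit, None              # nothing changed: one tiny request
        # Commit-pinned URL: its content never changes, so stale caches can't hurt.
        payload = fetch(f"{RAW}/{commit}/data/patches.json")
        return commit, payload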
Bounds checking as separate from off-by-one just means you stop using C arrays. That's not hard. And why point at callbacks specifically? And scope creep is not a computer science problem; it's easy to avoid if people decide to avoid it.
So this list is too bloated for the joke to work well, I think. Even before we talk about how off-by-one gets ruined this way.
Content addressing causes the name to change only when the content changes, which also means the name doesn't change if the content doesn't change, so by definition you don't have spurious cache invalidations.
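Roughly what asset pipelines do, as a sketch (the digest length and naming scheme are arbitrary choices here):

    import hashlib
    from pathlib import Path

    def content_addressed_name(path: Path) -> str:
        """Embed a digest of the contents in the filename, e.g. app.3f9d2c1a.css.
        The name only changes when the bytes change, so a cached copy under an
        old name is never wrong -- it is simply no longer referenced."""
        digest = hashlib.sha256(path.read_bytes()).hexdigest()[:8]
        return f"{path.stem}.{digest}{path.suffix}"

    # e.g. content_addressed_name(Path("app.css")) -> "app.1a2b3c4d.css"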
There are many problems with that though. For example, if your CSS changes you change the filename... but now you need to change the HTML file that references it. You can't change that easily.
Or what if your CSS change just deletes some unused classes... it'd be fine for users to keep the old version until it expires. If you rename the resource you'll be causing a lot of users to wait unnecessarily. Not a huge problem, unless you're Meta or Google.
And so on.
When people say cache invalidation is hard it's best to believe them, because it is.
But where does the client get the new name it should download?
As long as the name the client asks for (even if it is just a reference) stays constant, it can have cache problems. Of course it makes things simpler, since you can opt for a cache lifetime so short that invalidation is less of an issue, but that's again working around cache invalidation.
I agree this would provide for increased coherency, but the additional indirection is hardly ideal. And really it's just kicking the caching problem up to a super-object that will probably be thrashed much harder than any individual file, and may be larger than each individual file.
All in all it's a "certified hard problem", there's a bunch of domain specific nuance an HN thread couldn't hope to capture.
Yes, naming things is intimately tied to cache invalidation. That's why the two things are together in that maxim. Not sure why people think it has to do with naming variables or functions...
Sounds a little bit like the standard recipe of any functional programming enthusiast: Just use pure functions!
Except that they're colliding all the time with that tiny problem that the world isn't pure... People expect the same page at the same URL, with different content, tomorrow, so good luck choosing a different one than the one they have bookmarked and that ranks on Google.
Caching doesn't get any less hard by trying to define the problem away.
> Reading the comment threads on GitHub, some files get a TTL of 300, some get a TTL of 86400. The "why" is certainly an interesting question.
I would guess GitHub is slowly cutting down on people (ab)using it for free file hosting. Files that are hit a lot probably get significantly longer cache timeouts.
If the TTL starts at 86400 and then declines to 0 before resetting: this is a fairly common caching strategy. It ensures the cache will expire for all clients at around the same time, for example if you want every client's cache to expire at midnight every day.
Source: Implemented global TTL in our own caching DNS in front of kube-dns (which is horrible if you, among other things, have Node containers with no DNS caching; I still have a pcap with 20000+ queries for A in s3.amazonaws.com in a 0.2s span) before CoreDNS was a thing.
The CPU spikes were huge, but remained hidden for a long time due to metrics resolution. But eventually it got bad enough that clients ended up not getting responses.
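The fixed-expiry version described above amounts to serving a max-age that counts down toward a shared wall-clock deadline, something like this sketch:

    from datetime import datetime, timedelta, timezone

    def ttl_until_midnight(now=None):
        """Return a max-age (seconds) so every client's copy expires at the same
        wall-clock moment (next UTC midnight), rather than a fixed duration after
        whenever each client happened to fetch it."""
        now = now or datetime.now(timezone.utc)
        next_midnight = (now + timedelta(days=1)).replace(
            hour=0, minute=0, second=0, microsecond=0)
        return int((next_midnight - now).total_seconds())

    print(f"Cache-Control: max-age={ttl_until_midnight()}")

The synchronized expiry is also exactly what turns into a stampede once every client comes back at the same moment.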
There are circumstances where that’s the right strategy. For example, GitHub may be using it to ensure two requests for two different files in a repo receive the same version of the repo.
Not saying they’re doing that.. just explaining the cache strategy.
An explanation isn’t a recommendation for you to go out and apply it to everything.
Source: founded and operated a cdn for 5 years of my life.
As long as you key the expiration to something, be that the client IP or the repository, you're probably fine. Having your entire customer base do their cache expiration request within the same small time span is not amazing though.
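A sketch of that keying idea: derive a stable offset from the repository (or client IP) so expirations spread out instead of landing in the same instant (the window size is arbitrary):

    import hashlib

    BASE_TTL = 86400       # nominal one-day cache
    JITTER_WINDOW = 3600   # spread expirations over an hour (arbitrary)

    def keyed_ttl(key: str) -> int:
        """Derive a stable per-key offset (e.g. per repository or client IP)
        so different keys expire at different times instead of all at once."""
        offset = int(hashlib.sha256(key.encode()).hexdigest(), 16) % JITTER_WINDOW
        return BASE_TTL + offset

    print(keyed_ttl("example-owner/example-repo"))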
You can already get consistent views of raw.github by looking up the HEAD commit and requesting your files from that commit directly, though.
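For example (the API endpoint and raw URL pattern here are my understanding of GitHub's public interfaces; verify before relying on them, and the repo names are made up):

    import json
    import urllib.request

    OWNER, REPO, BRANCH = "example-owner", "example-repo", "main"

    def head_sha() -> str:
        # Resolve the branch to a concrete commit once...
        url = f"https://api.github.com/repos/{OWNER}/{REPO}/commits/{BRANCH}"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["sha"]

    def raw_at(sha: str, path: str) -> bytes:
        # ...then pin every raw request to that commit, so all fetched files
        # come from the same snapshot no matter what any cache layer does.
        url = f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{sha}/{path}"
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    sha = head_sha()
    config = raw_at(sha, "config.json")
    script = raw_at(sha, "setup.sh")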
> Hi folks, GitHub engineer here. This was a bug with our caching in some cases - sorry about this! We're working on resolving this and the cache should be back to normal soon.
This seems reasonable to me. Caches should last longer; if you want to be sure you get the latest version, rename the file. A good trick is to include the hash of the file in the name to get content-addressing à la IPFS.
I get the issue, but I do strongly feel that you should expect HTTP resources to be cached. Caching is so common that designing around the expectation that it never happens is unreasonable.
I would expect the headers to make this clear, however.
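The headers are easy enough to check (hypothetical URL; the point is just to look at the caching contract the response declares):

    import urllib.request

    url = "https://raw.githubusercontent.com/example-owner/example-repo/main/README.md"
    with urllib.request.urlopen(url) as resp:
        for name in ("Cache-Control", "ETag", "Age", "Expires"):
            print(name, ":", resp.headers.get(name))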
You can use git, or use the commit SHA; you only need to version files by name if you refuse to use the provided versioning and tooling and want GitHub to act as a plain file host.
> if you want to be sure you get the latest version, rename the file.
This is a beyond ridiculous statement. It is a BUG that you do not get the latest version of the file when viewing raw, not an error you made that you should address by having a filename driven versioning system.
If I'm using Git and GitHub, it's specifically to NOT have to deal with v1, v1.1, final, final_for_real, final_of_the_finalest, this_time_its_really_final.
Your suggestion to work around this GitHub bug is to essentially not use Git. Ridiculous.
Ridiculous is expecting a service you're not paying for to serve files in a way that fits your use case when you've never entered into a contract that guarantees the behaviour you're relying on.
Well, I think it’s probably fair for users to expect a site advertising itself as a great web host for git to not serve stale files over web protocols. It’s kinda the entire point of the website.
Maybe it’s a bit questionable to use the raw feature as a content host, but GitHub has intentionally moved pretty far from plain old git (I believe people call this a “moat”)
I’m sure there are plenty of other players competing for users that would be happy to solve the problem for free.
Who says I'm not paying for Github? But a Git porcelain that fails to show the version of a file it claims it's showing is a Git porcelain with a bug in it, regardless.
But the fact is you can face the problem with literally any GitHub use case. I will often open a file in raw view just to copy-paste a few lines, for example, which is not any specific use case.
I hit this problem for my project. On every launch of my lib, it would grab a patches file from the main github repo. I was seeing people having to wait 5 minutes+ and multiple restarts to get the latest pushed file. The solution was to run a custom BunnyCDN instance where I can easily invalidate the caches via a basic API request which happens on a github push of that file.
I was surprised to see that sometimes that file was being requested millions of times a day due to attempts to load it too early. With some optimizations I was able to reduce that by 99%.
Having seen all those requests, I understand why github aggressively caches these files.
I agree that repo owners should be allowed to invalidate those caches upon an API request - or it should happen automatically on every change.
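For anyone wanting to replicate the purge-on-push setup, a sketch of a webhook receiver that asks the CDN to drop the cached file when a push touches it. The environment variable names, purge endpoint, and watched filename are placeholders; the real purge call depends on your CDN's own API (Bunny has its own), so check its docs.

    import hashlib
    import hmac
    import json
    import os
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # All names below (env vars, purge endpoint, watched file) are placeholders.
    WEBHOOK_SECRET = os.environ["GITHUB_WEBHOOK_SECRET"].encode()
    PURGE_ENDPOINT = os.environ["CDN_PURGE_ENDPOINT"]
    WATCHED_FILE = "patches.json"

    class PushHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = self.rfile.read(int(self.headers["Content-Length"]))
            # Verify GitHub's HMAC signature before trusting the payload.
            expected = "sha256=" + hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
            if not hmac.compare_digest(expected, self.headers.get("X-Hub-Signature-256", "")):
                self.send_response(401); self.end_headers(); return
            payload = json.loads(body)
            touched = any(WATCHED_FILE in c.get("modified", [])
                          for c in payload.get("commits", []))
            if touched:
                # Ask the CDN to drop its cached copy (auth headers omitted here).
                urllib.request.urlopen(urllib.request.Request(PURGE_ENDPOINT, method="POST"))
            self.send_response(204); self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), PushHandler).serve_forever()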
raw.github has seen some extensive (ab)use for bandwidth heavy applications and/or pointing a few thousand clients at it concurrently (including entire IoT fleets), without ever being communicated as your globally consistent free CDN.
I'm actually more amazed GitHub will give you a response in the 2xx range at all for these use-cases.
It doesn't matter what it was supposed to be used for. It's bad data in any context, even ones where you can't think of a way to blame the victim for holding it wrong.
It's not bad data though, it's just old and cached, which seems fine IMO even if the stale period is longer than you might expect.
It's not like it's serving the wrong content for a specific commit ref; the issue is when you request the file by branch, and that branch has recently changed.
Usually I am using a permalink or a version-pinned link for my use cases, so the content shouldn't change anyway, and I verify it with a checksum. I think my use cases are not affected then.
GitHub is among many tech companies currently shedding fat from hiring during the pandemic boom. The morality of that is another discussion, but I'd hardly call this a death rattle.
Twitter fired many more engineers, and the reason for those firings was Elon's takeover and temper tantrums. Many more companies had firings this winter and things aren't on fire anywhere but Elon's Twitter.
If only GitHub's underlying technology Git had a way to trigger actions when a file is updated and they could use that to invalidate cache! One day maybe?
(Well, if we believe the statement "github engineer here". Of course every clown could write that, too)