GitHub is aggressively caching raw.github, breaking many use cases (github.com/orgs)
168 points by kristofferR on Feb 12, 2023 | 136 comments


A bug: See https://github.com/orgs/community/discussions/46691#discussi...

(Well, if we believe the statement "GitHub engineer here". Of course, any clown could write that, too.)


If you look at their profile[0], you can see they are a member of the GitHub organization and that they are marked as GitHub staff.

[0]: https://github.com/antn


How do you see that? At least in my mobile browser, nothing stands out. This user has contributions to @github, but very few.

Edit: Found it. Need to click on one of the achievements. Then the layout changes and it appears in the lower left corner under Organizations.


It doesn't appear to be visible on mobile, only desktop.


Ah, good catch. We'll try to see if we can fix that in the future so the mobile site shows our staff badges.


But how do we know this poster is a GitHub engineer? /ponders


The GitHub Mobile app has it too.




Personally I prefer GitLab. I just wanted to look at a bug I noticed in their product this week (European weeks start on Monday :) ). I gave up because I couldn't find my way around to check whether it had already been reported. But at least with unlimited time I could probably do it.


There are two hard problems in IT: cache invalidation, naming things and off-by-one errors.


Github raw seems like the simplest system for solving cache invalidation: invalidate the cache of a changed file when it’s pushed.

They have access to both GitHub and the raw service. I know there are usually all sorts of layers between that make interconnectivity logistically complicated, but am I wrong that at the top-level it’s that simple?


That is too simple for the feature they are using. The client itself has its own cache and the only way to fully prevent traffic from a client is to tell it content it caches will remain valid for some amount of time into the future.

For URLs that return the latest entry there is no valid amount of time known in advance by GitHub unless they want to introduce mandatory publication delays. For URLs of specific change sets, they should never be corrected again and an infinite cache is pretty much valid unless a user overrides good git practices.

I think GitHub frequently misidentifies which scenario they are in: when they return 1 day for a current-state URL, users notice, while when they return 5 minutes for a permanent change set that gets a lot of traffic, they lose network capacity.
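
As a rough illustration of that split (a hypothetical helper, not GitHub's actual logic), the only input you really need is whether the ref segment of the raw URL is a full commit SHA:

    import re

    def cache_control_for(ref: str) -> str:
        """Pick a Cache-Control policy based on whether the ref can still change."""
        if re.fullmatch(r"[0-9a-f]{40}", ref):
            # Pinned to a commit SHA: that content can never change, so a
            # long-lived, immutable cache entry is safe.
            return "public, max-age=31536000, immutable"
        # Branch or tag name: the content may change on the next push,
        # so keep the validity window short.
        return "public, max-age=300"

    print(cache_control_for("main"))
    print(cache_control_for("a94a8fe5ccb19ba61c4c0873d391e987982fbbd3"))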


Isn’t this where stuff like ETAGs are supposed to help? Not completely solve, but at least help reduce the problem a bit more?


Yes, but cache invalidation is a hard problem, so most services side-step etags (+ if-none-match), only to hit caches elsewhere.


Tag it with etag = hash, done. The client side isn't the hard part.

The server side would require pushing any invalidation to (I imagine) the whole tree of caches, which isn't exactly that hard if you plan for it from the start and have some way for upstream to tell downstream about file changes. But, well, they probably don't, as I'd imagine they didn't expect people to pin their infrastructure to some binary blob on GitHub that mutates.


Etag doesn't let you "fully prevent traffic from a client" (GP's exact words). They'll still send a request to which you need to reply with a 304 after checking the resource.
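
For illustration, the round trip looks roughly like this with the requests library (raw.githubusercontent.com does appear to send an ETag; whether every cache layer honours If-None-Match is exactly what's being debated here):

    import requests

    url = "https://raw.githubusercontent.com/git/git/master/README.md"

    # First request: full body plus an ETag identifying this version.
    first = requests.get(url)
    etag = first.headers.get("ETag")

    # Revalidation: send the ETag back. An unchanged resource should come
    # back as 304 Not Modified with an empty body; a changed one as 200.
    second = requests.get(url, headers={"If-None-Match": etag} if etag else {})
    print(second.status_code)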


They wouldn’t even need to check the resource. If the hash (which you get for free since it’s part of the git commit) was the etag then they can quickly reply to that request from an edge.


You get it "for free" if you load up the repo and check some files. That's not free at all.

In fact, loading a file by name from a Git repo is rather expensive, and is definitely not the way their CDN should be keeping things in cache (gotta load ref, uncompress and parse commit object, uncompress and read tree object(s), just to get the blob's hash. Every one of those objects is both deflated and delta-encoded.)


Nono: don't do it at lookup, set the etag when the repo is pushed.

I'm open to the idea that it's less computation to simply hash the files instead of inflating and delta-decoding them, but my point was the hash is already calculated on the client when the changed file is added to the repo.
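
For what it's worth, that per-file hash is Git's blob ID, which is just a SHA-1 over a short header plus the contents, so it is cheap to reproduce anywhere. A sketch:

    import hashlib

    def git_blob_sha1(data: bytes) -> str:
        """Reproduce `git hash-object`: SHA-1 over b'blob <len>\\0' + contents."""
        header = b"blob %d\x00" % len(data)
        return hashlib.sha1(header + data).hexdigest()

    # Matches `echo hello | git hash-object --stdin` (note the trailing newline)
    print(git_blob_sha1(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a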


> The client itself has its own cache

Let’s leave this aside since it both has known client-side mitigations, and is not the cause of the issue that was posted.


Once you are using one style of caching, it's usually a mistake to introduce another, even if the other style is marginal. They very clearly have an issue with max-age with clients/CDNs in the second half of the thread, and probably have similar problems on internal transparent proxies, etc.


You could also just commit a file with the intended commit hash, make that the indicator for changes and use the commit in other requests. Has the added benefit that clients only need to fetch a tiny file if nothing changed.
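
A sketch of that pattern with made-up repo and file names: the only thing fetched via a branch URL is a tiny pointer file, and everything else is requested via the immutable SHA it names:

    import requests

    OWNER, REPO, BRANCH = "example-org", "example-repo", "main"  # hypothetical repo

    # Tiny indirection file committed to the repo, holding a commit SHA.
    pointer_url = f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{BRANCH}/VERSION"
    pinned_sha = requests.get(pointer_url).text.strip()

    # Everything else is fetched at that SHA, so those URLs are immutable
    # and any amount of caching on them is harmless.
    data_url = f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{pinned_sha}/data/payload.json"
    payload = requests.get(data_url).json()
    print(pinned_sha)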


GitHub responded that it was simply a bug in that Cache-Control was set to a day instead of 5 minutes. It's already been fixed.


There are actually only two hard problems in computer science:

0) Cache invalidation

1) Naming things

5) Asynchronous callbacks

2) Off-by-one errors

3) Scope creep

6) Bounds checking


You forgot the 'Segmentation Fault' at the last line


It is on the list, but was never written to stdout because of a NULL pointer exception.


Bounds checking as separate from off-by-one just means you stop using C arrays. That's not hard. And why point at callbacks specifically? And scope creep is not a computer science problem; it's easy to avoid if people decide to avoid it.

So this list is too bloated for the joke to work well, I think. Even before we talk about how off-by-one gets ruined this way.


Haven't seen that variant before! Thanks.


asynchronous callbacks are more of "wrong solution" than hard problem...


7) Yoda logic


And the oxford comma as well, apparently


This is not actually a situation where the Oxford comma disambiguates, though I also had the gut feeling one should be there


That's clearly a type of off by one error.


You multithreading have forgotten.


Cache invalidation is a large class of multithreading problems; it's just as much an issue between threads as between servers.


You can eliminate the cache invalidation problem by not reusing old names for new things.


That doesn't solve cache invalidation; that just means you're always invalidating the cache even in cases where you don't actually want to.


Content addressing causes the name to change only when the content changes, which also means the name doesn't change if the content doesn't change; thus, by definition, you don't have spurious cache invalidations.


There are many problems with that though. For example, if your CSS changes you change the filename... but now you need to change the HTML file that references it. You can't change that easily.

Or what if your CSS change just deletes some unused classes... it'd be fine for users to keep the old version until it expires. If you rename the resource you'll be causing a lot of users to wait unnecessarily. Not a huge problem, unless you're Meta or Google.

And so on.

When people say cache invalidation is hard it's best to believe them, because it is.


But where does the client get the new name it should download?

As long as the name the client asks for (even if it is a reference) is constant, it can have cache problems. Of course it makes things simpler, as you can opt to cache it for so short a time that invalidation is less of an issue, but that's again working around cache invalidation.


And how do you check what the new name of the content you haven't seen is?


Serious question? There will be a root object that has a stable name through updates. When you write that, you embed current content hashes in it.

Someone could get a cached copy of it, of course. But since it content-addresses what it links to, it should avoid incoherent groups of cached data.
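
In other words (hypothetical names, sketching the shape rather than anyone's real setup): only the manifest keeps a stable name, and everything it points to is content-addressed:

    import hashlib
    import json

    def content_name(data: bytes, suffix: str) -> str:
        """Derive the published name from the content itself."""
        return hashlib.sha256(data).hexdigest()[:16] + suffix

    css = b".button { color: rebeccapurple; }"
    manifest = {
        # The one mutable object ("index.html" above): stable name, short TTL.
        "styles": content_name(css, ".css"),
    }
    print(json.dumps(manifest, indent=2))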


I agree this would provide for increased coherency, but the additional indirection is hardly ideal. And really it's just kicking the caching problem up to a super-object that will probably be thrashed much harder than any individual file, and may be larger than each individual file.

All in all it's a "certified hard problem", there's a bunch of domain specific nuance an HN thread couldn't hope to capture.


The extra indirection may or may not be a problem, of course. For a lot of places this is used, it is "index.html," after all.

That said, I do find folks bend way too many things into fitting this pattern; such that I do not mean to be dismissive of the criticism.


You just solve that by using another level of indirection. Duh.


Doesn't that just move the problems to more "naming things" though?


Yes, naming things is intimately tied to cache invalidation. That's why the two things are together in that maxim. Not sure why people think it has to do with naming variables or functions...


> Not sure why people think it has to do with naming variables or functions...

I think it’s open to interpretation. And I think it is valid to apply the saying to naming of variables, functions, and many other things.

See

https://skeptics.stackexchange.com/questions/19836/has-phil-...

Links to

https://www.karlton.org/2017/12/naming-things-hard/

Links to a bunch of places.

None of the ones I looked at seem to confidently say what was originally meant by the saying.

Either way, regardless of what the guy meant when he originally used to say it, it’s allowed to apply a saying to new situations.


> That's why the two things are together in that maxim.

I think you missed the joke.


Did they?


> There are two hard problems in IT: cache invalidation, naming things and off-by-one errors.

That's three things with an off-by-one error.


I _think_ it became a joke, accidentally.


Sounds a little bit like the standard recipe of any functional programming enthusiast: Just use pure functions!

Except that they're colliding all the time with that tiny problem that the world isn't pure... People expect the same page at the same URL, with different content, tomorrow, so good luck choosing a different one than the one they have bookmarked and that ranks on Google.

Caching doesn't get any less hard by trying to define the problem away.


newester_new_final_v4_new.txt


Not using old names for new things in this case means abandoning the concept of Git branch tags.


The Oxford comma would be useful here


The edit adding ", concurrency," and updating "two" to "three" arrived before the message, so it wasn't applied.


You pretty much never have a list where "things" is somewhere other than the end.


Reading the comment threads on GitHub, some files get a TTL of 300, some get a TTL of 86400. The "why" is certainly an interesting question.


> Reading the comment threads on GitHub, some files get a TTL of 300, some get a TTL of 86400. The "why" is certainly an interesting question.

I would guess GitHub is slowly cutting down on people (ab)using it for free file hosting. Files that are hit a lot probably get significantly longer cache timeouts.


I don't think this would impact free file hosting too much unless there's a use case where people are rewriting the same file very often.

Which sounds more like a standard git use case than file hosting abuse.


If the TTL starts at 86400 and then declines to 0 before resetting.. this is a fairly common caching strategy... it ensures the cache will expire for all clients at around the same time. For example, if you want the client's cache to expire at midnight every day.
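
A sketch of that strategy: instead of a constant, max-age counts down to a fixed wall-clock expiry (midnight UTC here, purely as an example):

    from datetime import datetime, timedelta, timezone

    def seconds_until_midnight_utc() -> int:
        """Max-age that makes every client's copy expire at the same moment."""
        now = datetime.now(timezone.utc)
        midnight = (now + timedelta(days=1)).replace(hour=0, minute=0,
                                                     second=0, microsecond=0)
        return int((midnight - now).total_seconds())

    print(f"Cache-Control: max-age={seconds_until_midnight_utc()}")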


That's a terrible idea.

Source: Implemented global TTL in our own caching DNS in front of kube-dns (which is horrible if you, among other things, have node containers with no DNS caching; I still have a pcap with 20000+ queries for A in s3.amazonaws.com in a 0.2s span) before coredns was a thing.

The CPU spikes were huge, but remained hidden for a long time due to metrics resolution. But eventually it got bad enough that clients ended up not getting responses.


There are circumstances where that’s the right strategy. For example, GitHub may be using it to ensure two requests for two different files in a repo receive the same version of the repo.

Not saying they’re doing that.. just explaining the cache strategy.

An explanation isn’t a recommendation for you to go out and apply it to everything.

Source: founded and operated a cdn for 5 years of my life.


As long as you key the expiration to something, be that the client IP or the repository, you're probably fine. Having your entire customer base do their cache expiration request within the same small time span is not amazing though.

You can already get consistent views of raw.github by looking up the HEAD commit and requesting your files from that commit directly, though.
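
For example, a sketch of that approach using the public REST API (git/git is just a stand-in repo here):

    import requests

    OWNER, REPO, BRANCH, PATH = "git", "git", "master", "README.md"

    # Resolve the branch to a concrete commit SHA once...
    commit = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/commits/{BRANCH}"
    ).json()
    sha = commit["sha"]

    # ...then fetch files at that SHA, so every raw URL is immutable and any
    # cached copy is, by construction, a consistent snapshot.
    raw = requests.get(f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{sha}/{PATH}")
    print(sha, len(raw.text))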


Is that really common? That sounds like a recipe for disaster.


Yeah, sounds like a self-inflicted variant of the Thundering Herd Problem

https://en.wikipedia.org/wiki/Thundering_herd_problem


You could use the same mechanism to spread out the load
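
i.e. something along these lines: keep a nominal lifetime but randomize each response's max-age a little so expirations don't all land at the same instant (numbers are arbitrary):

    import random

    BASE_TTL = 86400   # nominal one-day lifetime
    JITTER = 0.10      # spread expirations over +/-10% of the TTL

    def jittered_max_age() -> int:
        """Randomize max-age per response so clients don't all expire together."""
        return int(BASE_TTL * random.uniform(1 - JITTER, 1 + JITTER))

    print(f"Cache-Control: max-age={jittered_max_age()}")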


> Hi folks, GitHub engineer here. This was a bug with our caching in some cases - sorry about this! We're working on resolving this and the cache should be back to normal soon.


This seems reasonable to me. Caches should last longer; if you want to be sure you get the latest version, rename the file. A good trick is to include the hash of the file in the name to get content addressing à la IPFS.


I thought the content was more important than getting whatever quickly...

Rename the file to get what you want; am I the only one who finds that a very strange approach?


The commit ID seems to be part of the URL you can use

https://raw.githubusercontent.com/burekasKodi/repository.bur...

I'm fine with them caching "latest" tbh.


I’m not fine with latest not actually being latest. That defeats the point of the URL.


I get the issue but I do strongly feel that you should expect http resources to be cached. It's so common of a thing expecting it to never be cached is unreasonable as a design.

I would expect the headers to make this clear, however.


I actually don’t. Git already has a file on the filesystem. Serving a plain file is as close to caching as you are going to get.


There's a hard limit on that - the speed of light.

You can't know what the actual latest is, only some cached value. The actual value may have changed while the message is in flight


Guess we may as well throw our hands up in the air then and set a 14-year cache TTL.


Seeing as it’s git, can’t you just do content based hashing through the commit hash?

And I’d assume the GitHub api has a way to get the hash for the head of a branch?

I don’t know, I guess the entire point of GitHub is to be able to obtain up-to-date files, so maybe they should just improve the caching.


So we are back to file_v1, file_v1.1, file_v1.2 etc?


No, we are where we should have always been:

    > GET /file/latest HTTP/1.1
    < HTTP/1.1 302 Found
    < Location: /path/to/real/file/hash.tar.gz


Sure... but their CDN is caching the results for /file/latest, so they'd either have to handle cache expiry differently or the same bug would happen.


Per RFC 2616:

> This response is only cacheable if indicated by a Cache-Control or Expires header field.


Like OpenVMS does it at the filesystem level, where the highest version is the "real" one:

Blupblub;1 Blupblub;2 --> Blupblub;2 = Blupblub

Actually a good idea ;)


Nah, file_v1_final_final_REALLY_final


So we need to use file_dec_2, file_nov_2, the sort of hacks version control systems were meant to replace?


You can use git or use the commit sha; you only need to name the files if you refuse to use the provided versioning and tooling and want GitHub to act as a file host.


> if you want to be sure you get the latest version, rename the file.

This is a beyond ridiculous statement. It is a BUG that you do not get the latest version of the file when viewing raw, not an error you made that you should address by having a filename driven versioning system.

If I'm using Git and GitHub, it's specifically to NOT have to deal with v1, v1.1, final, final_for_real, final_of_the_finalest, this_time_its_really_final.

Your suggestion to work around this GitHub bug is to essentially not use Git. Ridiculous.


Ridiculous is expecting a service you're not paying for to serve files in a way that fits your use case when you've never entered into a contract that guarantees the behaviour you're relying on.

You can use Git just fine without GitHub.


Well, I think it’s probably fair for users to expect a site advertising itself as a great web host for git to not serve stale files over web protocols. It’s kinda the entire point of the website.

Maybe it’s a bit questionable to use the raw feature as a content host, but GitHub has intentionally moved pretty far from plain old git (I believe people call this a “moat”)

I’m sure there are plenty of other players competing for users that would be happy to solve the problem for free.


Who says I'm not paying for Github? But a Git porcelain that fails to show the version of a file it claims it's showing is a Git porcelain with a bug in it, regardless.

You don't have to take my word for it! https://github.com/orgs/community/discussions/46691#discussi...


So expecting that a raw file isn't outdated is a special use case now?

And where do you see that I'm not a paying user? This issue affects every repository, including ones where the user pays.


The CAP theorem certainly isn't new, and it's unreasonable to think GitHub has solved it


Welp, is it in your SLA?


If you're using GitHub for this reason, you also have a protocol that specifically works for that use case.


But the fact is you can face the problem with literally any GitHub use case. I will often open a file in raw just to copy-paste a few lines, for example, which is not any specific use case.


No, using git works just fine, as does specifically referencing the version you want.


People should stop using GitHub as a CDN and just clone their ad whitelist.


GitHub says the problem will soon be fixed.

https://github.com/orgs/community/discussions/46691#discussi...


The old Intel vs. Motorola joke comes to mind from the era of the Pentium floating-point bug.

  Intel and Motorola processors compete:
  How much is 2 + 2? - asks the Motorola
  5!
  Wrong answer.
  But I was quick wasn't I?!


I hit this problem for my project. On every launch of my lib, it would grab a patches file from the main github repo. I was seeing people having to wait 5 minutes+ and multiple restarts to get the latest pushed file. The solution was to run a custom BunnyCDN instance where I can easily invalidate the caches via a basic API request which happens on a github push of that file.

I was surprised to see that sometimes that file was being requested millions of times a day due to attempts to load it too early. With some optimizations I was able to reduce that by 99%.

Having seen all those requests, I understand why github aggressively caches these files.

I agree that repo owners should be allowed to invalidate those caches upon an API request - or it should happen automatically on every change.
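
A heavily hedged sketch of that shape: a minimal webhook receiver that purges one CDN URL whenever GitHub delivers a push event. The bunny.net purge endpoint and AccessKey header are my reading of their public API, and the URLs are placeholders, so treat the details as assumptions rather than a recipe:

    import os

    import requests
    from flask import Flask, request

    app = Flask(__name__)

    # Placeholder values -- substitute your own CDN URL and API key.
    CDN_FILE_URL = "https://example.b-cdn.net/patches.json"
    BUNNY_API_KEY = os.environ.get("BUNNY_API_KEY", "")

    @app.post("/github-webhook")
    def on_push():
        """Purge the cached copy whenever GitHub reports a push."""
        if request.headers.get("X-GitHub-Event") == "push":
            requests.post(
                "https://api.bunny.net/purge",
                params={"url": CDN_FILE_URL},
                headers={"AccessKey": BUNNY_API_KEY},
                timeout=10,
            )
        return "", 204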


> On every launch of my lib, it would grab a patches file from the main github repo.

Was this a project that interacts with github otherwise? Because this seems very .. brittle.


Yes of course it is wholly based in GitHub.


What level of service should we expect from github? Many of its services are free, can we ask for more when not paying?


raw.github has seen some extensive (ab)use for bandwidth-heavy applications and/or pointing a few thousand clients at it concurrently (including entire IoT fleets), without ever being advertised as your globally consistent free CDN.

I'm actually more amazed GitHub will give you a response in the 2xx range at all for these use-cases.


This. I know I've used it as a static file hosting service before. The cache problem was real then too. Can't complain though.


What about having their own servers instead of misusing Github?



I don't think raw was supposed to be used as your CDN


It doesn't matter what it was supposed to be used for. It's bad data in any context, even ones where you can't think of a way to blame the victim for holding it wrong.


It's not bad data though, it's just old and cached, which seems fine IMO even if the stale period is longer than you might expect.

It's not like it's serving the wrong content for a specific commit ref; it's when you request the file as it exists on a branch, and that branch has recently changed.


I'm guessing that a lot of people embed these files in their website in one way or another, making aggressive caching necessary.


GitHub recommends using GH Pages for that. Raw links cause a higher server load and aren't even being served with the correct mime type.


I discovered the issue when I reported an adblock filter list that broke a site.

The issue was fixed within minutes, but the broken filter list is still being served.

Thankfully, it turned out to be a bug.


Usually I use a permalink or a version-pinned link for my use cases, so they shouldn't change anyway, and I verify with a checksum. I think my use cases are not affected then.


GitHub caching can be a pain.

It has, for example, been impossible to make files completely disappear from an open source repo without deleting the repo or contacting their support.


That's Git garbage collection, not the caching discussed in the post.

https://github.blog/2022-09-13-scaling-gits-garbage-collecti...


Could this not be resolved by using the file version from a specific git tag instead of using the main branch?


Not asking this to distract from the problem but can you link to the raw version of a specific commit/ref?



Wouldn't an aggressive cache mean it has the most recent content, not the other way around?


You mean aggressive cache eviction/invalidation? I think they might be talking about aggressive caching. Subtle difference in words.


This sounds like a bug, not something they deliberately do.


Why wouldn't it be cached? After a push/merge shouldn't it be static until the next one?


There's no way to forecast when the "next one" is going to be and tell the client to cache for that long.


Isn't the Cache-Control header the solution here? 304 if the ETag matches, otherwise return the new content with a 200.


Here I thought I was experiencing this because I was doing something wrong.


Can you just add a random query param value to bust the cache?




Has github fired most people capable of fixing problems?


Github did layoffs, yes.

We'll see how much longer the site can stay up. This is a death rattle.


GitHub is among many tech companies currently shedding fat from hiring during the pandemic boom. The morality of that is another discussion, but I'd hardly call this a death rattle.


Let's see you apply that same logic in this thread:

https://news.ycombinator.com/item?id=34715890


Twitter fired many more engineers, and the reason for those firings was Elon's takeover and temper tantrums. Many more companies had firings this winter and things aren't on fire anywhere but Elon's Twitter.


Lol, what a hilariously unfounded conclusion.


I read that with a /s


If only GitHub's underlying technology Git had a way to trigger actions when a file is updated and they could use that to invalidate cache! One day maybe?


Invalidate browser's cache?


This is not the problem, as adding query parameters to bypass it doesn't work.



