GitHub is aggressively caching raw.github, breaking many use cases (github.com/orgs)
168 points by kristofferR on Feb 12, 2023 | 136 comments


A bug: See https://github.com/orgs/community/discussions/46691#discussi...

(Well, if we believe the statement "GitHub engineer here". Of course, any clown could write that, too.)


If you look at their profile[0], you can see they are a member of the GitHub organization and that they are marked as GitHub staff.

[0]: https://github.com/antn


How do you see that? At least in my mobile browser, nothing stands out. This user has contributions to @github, but very few.

Edit: Found it. Need to click on one of the achievements. Then the layout changes and it appears in the lower left corner under Organizations.


It doesn't appear to be visible on mobile, only desktop.


Ah, good catch. We'll try to see if we can fix that in the future so the mobile site shows our staff badges.


But how do we know this poster is a GitHub engineer? /ponders


The GitHub Mobile app has it too.




Personally I prefer GitLab. I just wanted to look at a bug I noticed in their product this week (European weeks start on Monday :) ). I gave up because I couldn't find my way around to check whether it had already been reported. But at least with unlimited time I could probably do it.


There are two hard problems in IT: cache invalidation, naming things and off-by-one errors.


Github raw seems like the simplest system for solving cache invalidation: invalidate the cache of a changed file when it’s pushed.

They have access to both GitHub and the raw service. I know there are usually all sorts of layers between that make interconnectivity logistically complicated, but am I wrong that at the top-level it’s that simple?


That is too simple for the feature they are using. The client itself has its own cache and the only way to fully prevent traffic from a client is to tell it content it caches will remain valid for some amount of time into the future.

For URLs that return the latest entry there is no valid amount of time known in advance by GitHub unless they want to introduce mandatory publication delays. For URLs of specific change sets, they should never be corrected again and an infinite cache is pretty much valid unless a user overrides good git practices.

I think GitHub frequently misidentifies which scenario they are in: when they return 1 day for a current-state URL, users notice, while when they return 5 minutes for a permanent change set that gets a lot of traffic, they lose network capacity.
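
As a rough illustration of that split (a hypothetical helper, not GitHub's actual logic), the only input you really need is whether the ref segment of the raw URL is a full commit SHA:

    import re

    def cache_control_for(ref: str) -> str:
        """Pick a Cache-Control policy based on whether the ref can still change."""
        if re.fullmatch(r"[0-9a-f]{40}", ref):
            # Pinned to a commit SHA: that content can never change, so a
            # long-lived, immutable cache entry is safe.
            return "public, max-age=31536000, immutable"
        # Branch or tag name: the content may change on the next push,
        # so keep the validity window short.
        return "public, max-age=300"

    print(cache_control_for("main"))
    print(cache_control_for("a94a8fe5ccb19ba61c4c0873d391e987982fbbd3"))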


Isn’t this where stuff like ETAGs are supposed to help? Not completely solve, but at least help reduce the problem a bit more?


Yes, but cache invalidation is a hard problem, so most services side-step etags (+ if-none-match), only to hit caches elsewhere.


Tag it with etag = hash, done. The client side isn't the hard part.

The server side would require pushing any invalidation to (I imagine) the whole tree of caches, which isn't exactly that hard if you plan for it from the start and have some way for upstream to tell downstream about file changes. But, well, they probably don't, as I'd imagine they didn't expect people to pin their infrastructure to some binary blob on GitHub that mutates.


Etag doesn't let you "fully prevent traffic from a client" (GP's exact words). They'll still send a request to which you need to reply with a 304 after checking the resource.
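
For illustration, the round trip looks roughly like this with the requests library (raw.githubusercontent.com does appear to send an ETag; whether every cache layer honours If-None-Match is exactly what's being debated here):

    import requests

    url = "https://raw.githubusercontent.com/git/git/master/README.md"

    # First request: full body plus an ETag identifying this version.
    first = requests.get(url)
    etag = first.headers.get("ETag")

    # Revalidation: send the ETag back. An unchanged resource should come
    # back as 304 Not Modified with an empty body; a changed one as 200.
    second = requests.get(url, headers={"If-None-Match": etag} if etag else {})
    print(second.status_code)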


They wouldn’t even need to check the resource. If the hash (which you get for free since it’s part of the git commit) was the etag then they can quickly reply to that request from an edge.


You get it "for free" if you load up the repo and check some files. That's not free at all.

In fact, loading a file by name from a Git repo is rather expensive, and is definitely not the way their CDN should be keeping things in cache (gotta load ref, uncompress and parse commit object, uncompress and read tree object(s), just to get the blob's hash. Every one of those objects is both deflated and delta-encoded.)


Nono: don't do it at lookup, set the etag when the repo is pushed.

I'm open to the idea that it's less computation to simply hash the files instead of inflating and delta-decoding them, but my point was the hash is already calculated on the client when the changed file is added to the repo.
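
For what it's worth, that per-file hash is Git's blob ID, which is just a SHA-1 over a short header plus the contents, so it is cheap to reproduce anywhere. A sketch:

    import hashlib

    def git_blob_sha1(data: bytes) -> str:
        """Reproduce `git hash-object`: SHA-1 over b'blob <len>\\0' + contents."""
        header = b"blob %d\x00" % len(data)
        return hashlib.sha1(header + data).hexdigest()

    # Matches `echo hello | git hash-object --stdin` (note the trailing newline)
    print(git_blob_sha1(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a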


> The client itself has its own cache

Let’s leave this aside since it both has known client-side mitigations, and is not the cause of the issue that was posted.


Once you are using one style of caching, it's usually a mistake to introduce another, even if the other style is marginal. They very clearly have an issue with max-age with clients/CDNs in the second half of the thread, and probably have similar problems on internal transparent proxies, etc.


You could also just commit a file with the intended commit hash, make that the indicator for changes and use the commit in other requests. Has the added benefit that clients only need to fetch a tiny file if nothing changed.
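
A sketch of that pattern with made-up repo and file names: the only thing fetched via a branch URL is a tiny pointer file, and everything else is requested via the immutable SHA it names:

    import requests

    OWNER, REPO, BRANCH = "example-org", "example-repo", "main"  # hypothetical repo

    # Tiny indirection file committed to the repo, holding a commit SHA.
    pointer_url = f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{BRANCH}/VERSION"
    pinned_sha = requests.get(pointer_url).text.strip()

    # Everything else is fetched at that SHA, so those URLs are immutable
    # and any amount of caching on them is harmless.
    data_url = f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{pinned_sha}/data/payload.json"
    payload = requests.get(data_url).json()
    print(pinned_sha)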


GitHub responded that it was simply a bug in that Cache-Control was set to a day instead of 5 minutes. It's already been fixed.


There are actually only two hard problems in computer science:

0) Cache invalidation

1) Naming things

5) Asynchronous callbacks

2) Off-by-one errors

3) Scope creep

6) Bounds checking


You forgot the 'Segmentation Fault' at the last line


It is on the list, but was never written to stdout because of a NULL pointer exception.


Bounds checking as separate from off-by-one just means you stop using C arrays. That's not hard. And why point at callbacks specifically? And scope creep is not a computer science problem; it's easy to avoid if people decide to avoid it.

So this list is too bloated for the joke to work well, I think. Even before we talk about how off-by-one gets ruined this way.


Haven't seen that variant before! Thanks.


asynchronous callbacks are more of "wrong solution" than hard problem...


7) Yoda logic


And the oxford comma as well, apparently


This is not actually a situation where the Oxford comma disambiguates, though I also had the gut feeling one should be there


That's clearly a type of off by one error.


You multithreading have forgotten.


Cache invalidation is a large class of multithreading problems; it's just as much an issue between threads as between servers.


You can eliminate the cache invalidation problem by not reusing old names for new things.


That doesn't solve cache invalidation; that just means you're always invalidating the cache even in cases where you don't actually want to.


Content addressing causes the name to change only when the content changes, which also means the name doesn't change if the content doesn't change; thus, by definition, you don't have spurious cache invalidations.


There are many problems with that though. For example, if your CSS changes you change the filename... but now you need to change the HTML file that references it. You can't change that easily.

Or what if your CSS change just deletes some unused classes... it'd be fine for users to keep the old version until it expires. If you rename the resource you'll be causing a lot of users to wait unnecessarily. Not a huge problem, unless you're Meta or Google.

And so on.

When people say cache invalidation is hard it's best to believe them, because it is.


But where does the client get the new name it should download?

As long as the name the client asks for (even if it is a reference) is constant, it can have cache problems. Of course it makes things simpler, as you can opt to cache it for so short a time that invalidation is less of an issue, but that's again working around cache invalidation.


And how do you check what the new name of the content you haven't seen is?


Serious question? There will be a root object that has a stable name through updates. When you write that, you embed current content hashes in it.

Someone could get a cached copy of it, of course. But since it content-addresses what it links to, it should avoid incoherent groups of cached data.
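
In other words (hypothetical names, sketching the shape rather than anyone's real setup): only the manifest keeps a stable name, and everything it points to is content-addressed:

    import hashlib
    import json

    def content_name(data: bytes, suffix: str) -> str:
        """Derive the published name from the content itself."""
        return hashlib.sha256(data).hexdigest()[:16] + suffix

    css = b".button { color: rebeccapurple; }"
    manifest = {
        # The one mutable object ("index.html" above): stable name, short TTL.
        "styles": content_name(css, ".css"),
    }
    print(json.dumps(manifest, indent=2))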


I agree this would provide for increased coherency, but the additional indirection is hardly ideal. And really it's just kicking the caching problem up to a super-object that will probably be thrashed much harder than any individual file, and may be larger than each individual file.

All in all it's a "certified hard problem", there's a bunch of domain specific nuance an HN thread couldn't hope to capture.


The extra indirection may or may not be a problem, of course. For a lot of places this is used, it is "index.html," after all.

That said, I do find folks bend way too many things into fitting this pattern; such that I do not mean to be dismissive of the criticism.


You just solve that by using another level of indirection. Duh.


Doesn't that just move the problems to more "naming things" though?


Yes, naming things is intimately tied to cache invalidation. That's why the two things are together in that maxim. Not sure why people think it has to do with naming variables or functions...


> Not sure why people think it has to do with naming variables or functions...

I think it’s open to interpretation. And I think it is valid to apply the saying to naming of variables, functions, and many other things.

See

https://skeptics.stackexchange.com/questions/19836/has-phil-...

Links to

https://www.karlton.org/2017/12/naming-things-hard/

Links to a bunch of places.

None of the ones I looked at seem to confidently say what was originally meant by the saying.

Either way, regardless of what the guy meant when he originally used to say it, it’s allowed to apply a saying to new situations.


> That's why the two things are together in that maxim.

I think you missed the joke.


Did they?


> There are two hard problems in IT: cache invalidation, naming things and off-by-one errors.

That's three things with an off-by-one error.


I _think_ it became a joke, accidentally.


Sounds a little bit like the standard recipe of any functional programming enthusiast: Just use pure functions!

Except that they're colliding all the time with that tiny problem that the world isn't pure... People expect the same page at the same URL, with different content, tomorrow, so good luck choosing a different one than the one they have bookmarked and that ranks on Google.

Caching doesn't get any less hard by trying to define the problem away.


newester_new_final_v4_new.txt


Not using old names for new things in this case means abandoning the concept of Git branch tags.


The Oxford comma would be useful here


The edit adding ", concurrency," and updating "two" to "three" arrived before the message, so it wasn't applied.


You pretty much never have a list where "things" is somewhere other than the end.


Reading the comment threads on GitHub, some files get a TTL of 300, some get a TTL of 86400. The "why" is certainly an interesting question.


> Reading the comment threads on GitHub, some files get a TTL of 300, some get a TTL of 86400. The "why" is certainly an interesting question.

I would guess GitHub is slowly cutting down on people (ab)using it for free file hosting. Files that are hit a lot probably get significantly longer cache timeouts.


I don't think this would impact free file hosting too much unless there's a use case where people are rewriting the same file very often.

Which sounds more like a standard git use case than file hosting abuse.


If the TTL starts at 86400 and then declines to 0 before resetting.. this is a fairly common caching strategy... it ensures the cache will expire for all clients at around the same time. For example, if you want the client's cache to expire at midnight every day.
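
A sketch of that strategy: instead of a constant, max-age counts down to a fixed wall-clock expiry (midnight UTC here, purely as an example):

    from datetime import datetime, timedelta, timezone

    def seconds_until_midnight_utc() -> int:
        """Max-age that makes every client's copy expire at the same moment."""
        now = datetime.now(timezone.utc)
        midnight = (now + timedelta(days=1)).replace(hour=0, minute=0,
                                                     second=0, microsecond=0)
        return int((midnight - now).total_seconds())

    print(f"Cache-Control: max-age={seconds_until_midnight_utc()}")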


That's a terrible idea.

Source: Implemented global TTL in our own caching DNS in front of kube-dns (which is horrible if you, among other things, have node containers with no DNS caching; I still have a pcap with 20000+ queries for A in s3.amazonaws.com in a 0.2s span) before coredns was a thing.

The CPU spikes were huge, but remained hidden for a long time due to metrics resolution. But eventually it got bad enough that clients ended up not getting responses.


There are circumstances where that’s the right strategy. For example, GitHub may be using it to ensure two requests for two different files in a repo receive the same version of the repo.

Not saying they’re doing that.. just explaining the cache strategy.

An explanation isn’t a recommendation for you to go out and apply it to everything.

Source: founded and operated a cdn for 5 years of my life.


As long as you key the expiration to something, be that the client IP or the repository, you're probably fine. Having your entire customer base do their cache expiration request within the same small time span is not amazing though.

You can already get consistent views of raw.github by looking up the HEAD commit and requesting your files from that commit directly, though.
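
For example, a sketch of that approach using the public REST API (git/git is just a stand-in repo here):

    import requests

    OWNER, REPO, BRANCH, PATH = "git", "git", "master", "README.md"

    # Resolve the branch to a concrete commit SHA once...
    commit = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/commits/{BRANCH}"
    ).json()
    sha = commit["sha"]

    # ...then fetch files at that SHA, so every raw URL is immutable and any
    # cached copy is, by construction, a consistent snapshot.
    raw = requests.get(f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{sha}/{PATH}")
    print(sha, len(raw.text))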


Is that really common? That sounds like a recipe for disaster.


Yeah, sounds like a self-inflicted variant of the Thundering Herd Problem

https://en.wikipedia.org/wiki/Thundering_herd_problem


You could use the same mechanism to spread out the load
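
i.e. something along these lines: keep a nominal lifetime but randomize each response's max-age a little so expirations don't all land at the same instant (numbers are arbitrary):

    import random

    BASE_TTL = 86400   # nominal one-day lifetime
    JITTER = 0.10      # spread expirations over +/-10% of the TTL

    def jittered_max_age() -> int:
        """Randomize max-age per response so clients don't all expire together."""
        return int(BASE_TTL * random.uniform(1 - JITTER, 1 + JITTER))

    print(f"Cache-Control: max-age={jittered_max_age()}")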


> Hi folks, GitHub engineer here. This was a bug with our caching in some cases - sorry about this! We're working on resolving this and the cache should be back to normal soon.


This seems reasonable to me. Caches should last longer; if you want to be sure you get the latest version, rename the file. A good trick is to include the hash of the file in the name to get content addressing à la IPFS.


I thought the content was more important than getting whatever quickly...

Rename the file to get what you want; am I the only one who finds that a very strange approach?


The commit ID seems to be part of the URL you can use

https://raw.githubusercontent.com/burekasKodi/repository.bur...

I'm fine with them caching "latest" tbh.


I’m not fine with latest not actually being latest. That defeats the point of the URL.


I get the issue but I do strongly feel that you should expect http resources to be cached. It's so common of a thing expecting it to never be cached is unreasonable as a design.

I would expect the headers to make this clear, however.


I actually don’t. Git already has a file on the filesystem. Serving a plain file is as close to caching as you are going to get.


There's a hard limit on that - the speed of light.

You can't know what the actual latest is, only some cached value. The actual value may have changed while the message is in flight


Guess we may as well throw our hands up in the air then and set a 14-year cache TTL.


Seeing as it’s git, can’t you just do content based hashing through the commit hash?

And I’d assume the GitHub api has a way to get the hash for the head of a branch?

I don’t know, I guess the entire point of GitHub is to be able to obtain up-to-date files, so maybe they should just improve the caching.


So we are back to file_v1, file_v1.1, file_v1.2 etc?


No, we are where we should have always been:

    > GET /file/latest HTTP/1.1
    < HTTP/1.1 302 Found
    < Location: /path/to/real/file/hash.tar.gz


Sure... but their CDN is caching the results for /file/latest, so they'd either have to handle cache expiry differently or the same bug would happen.


Per RFC 2616:

> This response is only cacheable if indicated by a Cache-Control or Expires header field.


Like OpenVMS does it at the filesystem level, where the highest version is the "real" one:

Blupblub;1 Blupblub;2 --> Blupblub;2 = Blupblub

Actually a good idea ;)


Nah, file_v1_final_final_REALLY_final


So we need to use file_dec_2, file_nov_2, the sort of hacks version control systems were meant to replace?


You can use git or use the commit sha; you only need to name the files if you refuse to use the provided versioning and tooling and want GitHub to act as a file host.


> if you want to be sure you get the latest version, rename the file.

This is a beyond ridiculous statement. It is a BUG that you do not get the latest version of the file when viewing raw, not an error you made that you should address by having a filename driven versioning system.

If I'm using Git and GitHub, it's specifically to NOT have to deal with v1, v1.1, final, final_for_real, final_of_the_finalest, this_time_its_really_final.

Your suggestion to work around this GitHub bug is to essentially not use Git. Ridiculous.


Ridiculous is expecting a service you're not paying for to serve files in a way that fits your use case when you've never entered into a contract that guarantees the behaviour you're relying on.

You can use Git just fine without GitHub.


Well, I think it’s probably fair for users to expect a site advertising itself as a great web host for git to not serve stale files over web protocols. It’s kinda the entire point of the website.

Maybe it’s a bit questionable to use the raw feature as a content host, but GitHub has intentionally moved pretty far from plain old git (I believe people call this a “moat”)

I’m sure there are plenty of other players competing for users that would be happy to solve the problem for free.


Who says I'm not paying for Github? But a Git porcelain that fails to show the version of a file it claims it's showing is a Git porcelain with a bug in it, regardless.

You don't have to take my word for it! https://github.com/orgs/community/discussions/46691#discussi...


So expecting that a raw file isn't outdated is a special use case now?

And where do you see that I'm not a paying user? This issue affects every repository, including ones where the user pays.


The CAP theorem certainly isn't new, and it's unreasonable to think GitHub has solved it


Welp, is it in your SLA?


If you're using GitHub for this reason, you also have a protocol that specifically works for that use case.


But the fact is you can face the problem with literally any GitHub use case. I will often open a file in raw just to copy-paste a few lines, for example, which is not any specific use case.


No, using git works just fine, as does specifically referencing the version you want.


People should stop using GitHub as a CDN and just clone their ad whitelist.


GitHub says the problem will soon be fixed.

https://github.com/orgs/community/discussions/46691#discussi...


The old Intel vs. Motorola joke comes to mind from the era of the Pentium floating-point bug.

  Intel and Motorola processors compete:
  How much is 2 + 2? - asks the Motorola
  5!
  Wrong answer.
  But I was quick wasn't I?!


I hit this problem for my project. On every launch of my lib, it would grab a patches file from the main github repo. I was seeing people having to wait 5 minutes+ and multiple restarts to get the latest pushed file. The solution was to run a custom BunnyCDN instance where I can easily invalidate the caches via a basic API request which happens on a github push of that file.

I was surprised to see that sometimes that file was being requested millions of times a day due to attempts to load it too early. With some optimizations I was able to reduce that by 99%.

Having seen all those requests, I understand why github aggressively caches these files.

I agree that repo owners should be allowed to invalidate those caches upon an API request - or it should happen automatically on every change.
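
A heavily hedged sketch of that shape: a minimal webhook receiver that purges one CDN URL whenever GitHub delivers a push event. The bunny.net purge endpoint and AccessKey header are my reading of their public API, and the URLs are placeholders, so treat the details as assumptions rather than a recipe:

    import os

    import requests
    from flask import Flask, request

    app = Flask(__name__)

    # Placeholder values -- substitute your own CDN URL and API key.
    CDN_FILE_URL = "https://example.b-cdn.net/patches.json"
    BUNNY_API_KEY = os.environ.get("BUNNY_API_KEY", "")

    @app.post("/github-webhook")
    def on_push():
        """Purge the cached copy whenever GitHub reports a push."""
        if request.headers.get("X-GitHub-Event") == "push":
            requests.post(
                "https://api.bunny.net/purge",
                params={"url": CDN_FILE_URL},
                headers={"AccessKey": BUNNY_API_KEY},
                timeout=10,
            )
        return "", 204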


> On every launch of my lib, it would grab a patches file from the main github repo.

Was this a project that interacts with github otherwise? Because this seems very .. brittle.


Yes of course it is wholly based in GitHub.


What level of service should we expect from github? Many of its services are free, can we ask for more when not paying?


raw.github has seen some extensive (ab)use for bandwidth-heavy applications and/or pointing a few thousand clients at it concurrently (including entire IoT fleets), without ever being advertised as your globally consistent free CDN.

I'm actually more amazed GitHub will give you a response in the 2xx range at all for these use-cases.


This. I know I've used it as a static file hosting service before. The cache problem was real then too. Can't complain though.


What about having their own servers instead of misusing Github?



I don't think raw was supposed to be used as your CDN


It doesn't matter what it was supposed to be used for. It's bad data in any context, even ones where you can't think of a way to blame the victim for holding it wrong.


It's not bad data though, it's just old and cached, which seems fine IMO even if the stale period is longer than you might expect.

It's not like it's serving the wrong content for a specific commit ref; it's when you request the file as it exists on a branch, and that branch has recently changed.


I'm guessing that a lot of people embed these files in their website in one way or another, making aggressive caching necessary.


GitHub recommends using GH Pages for that. Raw links cause a higher server load and aren't even being served with the correct mime type.


I discovered the issue when I reported an adblock filter list that broke a site.

The issue was fixed within minutes, but the broken filter list is still being served.

Thankfully, it turned out to be a bug.


Usually I use a permalink or a version-pinned link for my use cases, so they shouldn't change anyway, and I verify with a checksum. I think my use cases are not affected then.


GitHub caching can be a pain.

It has, for example, been impossible to make files completely disappear from an open source repo without deleting the repo or contacting their support.


That's Git garbage collection, not the caching discussed in the post.

https://github.blog/2022-09-13-scaling-gits-garbage-collecti...


Could this not be resolved by using the file version from a specific git tag instead of using the main branch?


Not asking this to distract from the problem but can you link to the raw version of a specific commit/ref?



Wouldn't an aggressive cache mean it has the most recent content, not the other way around?


You mean aggressive cache eviction/invalidation? I think they might be talking about aggressive caching. Subtle difference in words.


This sounds like a bug, not something they deliberately do.


Why wouldn't it be cached? After a push/merge shouldn't it be static until the next one?


There's no way to forecast when the "next one" is going to be and tell the client to cache for that long.


Isn't the Cache-Control header the solution here? 304 if the ETag matches, otherwise return the new content with a 200.


Here I thought I was experiencing this because I was doing something wrong.


Can you just add a random query param value to bust the cache?




Has github fired most people capable of fixing problems?


Github did layoffs, yes.

We'll see how much longer the site can stay up. This is a death rattle.


GitHub is among many tech companies currently shedding fat from hiring during the pandemic boom. The morality of that is another discussion, but I'd hardly call this a death rattle.


Let's see you apply that same logic in this thread:

https://news.ycombinator.com/item?id=34715890


Twitter fired many more engineers, and the reason for those firings was Elon's takeover and temper tantrums. Many more companies had firings this winter and things aren't on fire anywhere but Elon's Twitter.


Lol, what a hilariously unfounded conclusion.


I read that with a /s


If only GitHub's underlying technology Git had a way to trigger actions when a file is updated and they could use that to invalidate cache! One day maybe?


Invalidate browser's cache?


This is not the problem, as adding query parameters to bypass it doesn't work.



