I will add another suggestion: if you use S3 at all, one of your largest costs is likely bandwidth. Have you considered just placing caches in front of it?
I took a $100 per month S3 bill down to $5 per month simply by enabling a file cache on existing Nginx servers.
It does help that I never need to purge the cache (each version of a file is saved in S3 under its own URL), but it was super trivial to wipe out $95 per month of cost for zero extra spend.
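For anyone wanting to try the same thing, here is a minimal sketch of that kind of Nginx proxy cache; the hostname, bucket name, paths and sizes are invented for illustration and are not my actual config:

    # Inside the http {} block: disk cache for S3 responses (sizes are illustrative).
    proxy_cache_path /var/cache/nginx/s3 levels=1:2 keys_zone=s3cache:50m
                     max_size=30g inactive=30d use_temp_path=off;

    server {
        listen 80;
        server_name media.example.com;    # hypothetical edge hostname

        location / {
            proxy_pass https://example-bucket.s3.amazonaws.com;    # hypothetical bucket
            proxy_cache s3cache;
            proxy_cache_valid 200 30d;     # versioned URLs never change, so long TTLs are safe
            proxy_cache_use_stale error timeout updating;
            proxy_ignore_headers Set-Cookie Cache-Control Expires;
            add_header X-Cache-Status $upstream_cache_status;
        }
    }

Because every object version gets its own URL there is nothing to purge; entries simply age out of the cache via the inactive= setting once nobody requests them any more.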
My current setup is:
- S3 contains the user photos.
- A web app (not on AWS) handles POST/GET for S3 (and stores local knowledge).
- Nginx at my edge has a cache that is currently around 28GB of files.
- CloudFlare sits in front of all of this and saves me around 2TB of bandwidth per month.
The real gotcha for me was that I had been relying on CDNs as my cache, but once those CDNs reached 50+ PoPs I started to see multiple requests for the same file as a result of people in different cities requesting it. So the Nginx cache I've added mostly deals with that scenario and prevents additional S3 costs from being incurred.
> The real gotcha for me was that I had been relying on CDNs as my cache, but once those CDNs reached 50+ PoPs I started to see multiple requests for the same file as a result of people in different cities requesting it.
Having never used a CDN, I find this weird. It means the caches are not synchronised between PoPs, even though they're supposed to sit at the same level of an HTTP request. Is this normal behaviour for a CDN? I'd expect one PoP to check with the other PoPs before hitting upstream.
The theory being that the PoP closest to the origin is the one responsible for going to the origin, and thus that the other PoPs fetch cached items from the PoP closest to the origin.
Nearly all of the very large CDNs support some degree of hierarchical caching, and the ones becoming large are gaining the capability.
At CloudFlare (where I work) the need for a hierarchical cache was low priority until we started to rapidly increase our global capacity... once you reach a certain scale, the need for some way to have PoPs not visit the origin more than once for an item becomes very important. You can be sure we're working on that (if you are an Enterprise customer you could contact sales to enquire about beta testing it).
But right now, for us and many other providers, just enabling an nginx cache in front of any expensive resource will help. By "expensive", I generally mean anything that will trigger extra expenditure or processing when it could be cached.
Edit: Additionally, nearly every CDN operates an LRU cache. You haven't bought storage, so not everything you've ever stored is being held in the cache. It only takes a few bots (GoogleBot, Yandex, Baidu, etc) constantly spidering to pull the long tail of files from your backend S3 if you haven't got your own cache in front of it. Hierarchical caching isn't a silver bullet that takes care of all potential costs incurred, but having your own cache is.
Thanks for the detailed answer. The only technical understanding of a CDN I have comes from CoralCDN (http://www.coralcdn.org/), a p2p CDN operated by researchers on PlanetLab (with servers all around the world) that was created specifically to mitigate flash-crowd scenarios, and from their documentation of the experiment (http://www.coralcdn.org/pubs/, particularly http://www.coralcdn.org/docs/coral-nsdi04.pdf). I was impressed by how the nodes automagically coordinate themselves so that as few of them as possible query upstream, and then the content is gradually propagated, in the form of a multicast tree, to the other nodes that requested that exact content, all with zero administration and thanks to a smart DHT and a smart DNS implementation. Really cool stuff. I was under the impression that any commercial CDN did at least the same.
It depends on where your origin is, where your users are (near one PoP? two PoPs? spread evenly globally?), how frequently the files that can be cached are requested, the capacity of each PoP, the contention of each PoP, etc.
A few years ago, when the customer and traffic growth rate exceeded the network expansion rate, the answer was probably "not big enough", but we've since upgraded almost every PoP and added a huge number of new ones: https://www.cloudflare.com/network-map/
The answer now is "more than big enough".
We cache as much as possible, for as long as possible. The more frequently a file is requested, the more likely it is to be in the cache, even if you're on the Free plan. Lots of logic is applied to this, more than could fit in this reply.
But importantly: there's no difference between the plans in how much you can cache. Wherever possible, we give the Free plan as much capability as the other plans.
Most CDNs by default make a request to the origin from each PoP. You can usually enable an "origin shield", which causes the CDN to make a single request to the origin and distribute the file to its PoPs internally.
I use KeyCDN and they offer an "Origin Shield", so there is only one request to the origin for each file and the PoPs fetch it from the shield server. Other CDNs probably offer that as well.
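If your CDN doesn't offer that option, you can approximate an origin shield with the same kind of Nginx cache discussed upthread: a single caching tier that is the only thing allowed to talk to S3, with the CDN configured to use it as the origin. A rough sketch, with the hostname, bucket name and sizes invented for illustration:

    # Inside the http {} block. "Shield" tier: the only layer that talks to S3 directly.
    # Point the CDN's origin at shield.example.com so S3 sees at most one fetch per
    # object, no matter how many PoPs ask for it.
    proxy_cache_path /var/cache/nginx/shield keys_zone=shield:100m max_size=50g inactive=60d;

    server {
        listen 80;
        server_name shield.example.com;                            # hypothetical shield host

        location / {
            proxy_pass https://example-bucket.s3.amazonaws.com;    # hypothetical bucket
            proxy_cache shield;
            proxy_cache_valid 200 60d;
            proxy_cache_lock on;      # collapse concurrent misses into one upstream request
            proxy_cache_use_stale error timeout updating;
        }
    }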
I use https://www.netlify.com over S3; they're a static website host, but they let you remap URLs to proxy to whatever backend you want. A big chunk of bandwidth is included, and it mitigates AWS per-request billing too.
> the purpose of CloudFlare’s Service is to proxy web content, not store data. Using an account primarily as an online storage space, including the storage or caching of a disproportionate percentage of pictures, movies, audio files, or other non-HTML content, is prohibited
The site in question is also on CloudFlare, and this clause does not apply: the cached assets are not disproportionate, nor is CloudFlare being used as a storage space (S3 is still that).
The clause is there to provide a tool for dealing with some of the interesting things that people will try to use a CDN with no bandwidth charges for.
e.g. if whole movies were given a different content type and file extension, would CloudFlare suddenly become the world's largest CDN for pirated movies?
Most such things have already been tried; the clause allows us to answer "only for an exceptionally short amount of time".