I will add another suggestion: if you use S3 at all, one of your largest costs is likely bandwidth. Have you considered just placing caches in front of it?
I took a $100 per month S3 bill down to $5 per month simply by enabling a file cache on existing Nginx servers.
It does help that I never need to purge the cache (each version of a file is saved in S3 under its own URL), but it was super trivial to wipe out $95 per month of cost for zero extra spend.
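For anyone wanting to try the same thing, here is a minimal sketch of that kind of Nginx proxy cache; the hostname, bucket name, paths and sizes are invented for illustration and are not my actual config:

    # Inside the http {} block: disk cache for S3 responses (sizes are illustrative).
    proxy_cache_path /var/cache/nginx/s3 levels=1:2 keys_zone=s3cache:50m
                     max_size=30g inactive=30d use_temp_path=off;

    server {
        listen 80;
        server_name media.example.com;    # hypothetical edge hostname

        location / {
            proxy_pass https://example-bucket.s3.amazonaws.com;    # hypothetical bucket
            proxy_cache s3cache;
            proxy_cache_valid 200 30d;     # versioned URLs never change, so long TTLs are safe
            proxy_cache_use_stale error timeout updating;
            proxy_ignore_headers Set-Cookie Cache-Control Expires;
            add_header X-Cache-Status $upstream_cache_status;
        }
    }

Because every object version gets its own URL there is nothing to purge; entries simply age out of the cache via the inactive= setting once nobody requests them any more.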
My current setup is:
- S3 contains the user photos.
- A web app (not on AWS) handles POST/GET for S3 (and stores local knowledge).
- Nginx at my edge has a cache that is currently around 28GB of files.
- CloudFlare sits in front of all of this and saves me around 2TB of bandwidth per month.
The real gotcha for me was that I had been relying on CDNs as my cache, but once those CDNs reached 50+ PoPs I started to see multiple requests for the same file as a result of people in different cities requesting it. So the Nginx cache I've added mostly deals with that scenario and prevents additional S3 costs from being incurred.
> The real gotcha for me was that I had been relying on CDNs as my cache, but once those CDNs reached 50+ PoPs I started to see multiple requests for the same file as a result of people in different cities requesting it.
Having never used a CDN, I find this weird. It means the caches are not synchronised between PoPs, even though they're supposed to sit at the same level of an HTTP request. Is this normal behaviour for a CDN? I'd expect one PoP to check with the other PoPs before hitting upstream.
The theory being that the PoP closest to the origin is the one responsible for going to the origin, and thus that the other PoPs fetch cached items from the PoP closest to the origin.
Nearly all of the very large CDNs support some degree of hierarchical caching, and the ones becoming large are gaining the capability.
At CloudFlare (where I work) the need for a hierarchical cache was low priority until we started to rapidly increase our global capacity... once you reach a certain scale, the need for some way to have PoPs not visit the origin more than once for an item becomes very important. You can be sure we're working on that (if you are an Enterprise customer you could contact sales to enquire about beta testing it).
But right now, for us and many other providers, just enabling an nginx cache in front of any expensive resource will help. By "expensive", I generally mean anything that will trigger extra expenditure or processing when it could be cached.
Edit: Additionally, nearly every CDN operates an LRU cache. You haven't bought storage, so not everything you've ever stored is being held in the cache. It only takes a few bots (GoogleBot, Yandex, Baidu, etc) constantly spidering to pull the long tail of files from your backend S3 if you haven't got your own cache in front of it. Hierarchical caching isn't a silver bullet that takes care of all potential costs incurred, but having your own cache is.
Thanks for the detailed answer. The only technical understanding of a CDN I have comes from CoralCDN (http://www.coralcdn.org/), a p2p CDN operated by researchers on PlanetLab (with servers all around the world) that was created specifically to mitigate flash-crowd scenarios, and from their documentation of the experiment (http://www.coralcdn.org/pubs/, particularly http://www.coralcdn.org/docs/coral-nsdi04.pdf). I was impressed by how the nodes automagically coordinate themselves so that as few of them as possible query upstream, and then the content is gradually propagated, in the form of a multicast tree, to the other nodes that requested that exact content, all with zero administration and thanks to a smart DHT and a smart DNS implementation. Really cool stuff. I was under the impression that any commercial CDN did at least the same.
It depends on where your origin is, where your users are (near one PoP? two PoPs? spread evenly globally?), how frequently the files that can be cached are requested, the capacity of each PoP, the contention of each PoP, etc.
A few years ago, when the customer and traffic growth rate exceeded the network expansion rate, the answer was probably "not big enough", but we've since upgraded almost every PoP and added a huge number of new ones: https://www.cloudflare.com/network-map/
The answer now is "more than big enough".
We cache as much as possible, for as long as possible. The more frequently a file is requested, the more likely it is to be in the cache, even if you're on the Free plan. Lots of logic is applied to this, more than could fit in this reply.
But importantly: there's no difference between the plans in how much you can cache. Wherever possible, we give the Free plan as much capability as the other plans.
Most CDNs by default make a request to the origin from each PoP. You can usually enable an "origin shield", which causes the CDN to make a single request to the origin and distribute the file to its PoPs internally.
I use KeyCDN and they offer an "Origin Shield", so there is only one request to the origin for each file and the PoPs fetch it from the shield server. Other CDNs probably offer that as well.
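If your CDN doesn't offer that option, you can approximate an origin shield with the same kind of Nginx cache discussed upthread: a single caching tier that is the only thing allowed to talk to S3, with the CDN configured to use it as the origin. A rough sketch, with the hostname, bucket name and sizes invented for illustration:

    # Inside the http {} block. "Shield" tier: the only layer that talks to S3 directly.
    # Point the CDN's origin at shield.example.com so S3 sees at most one fetch per
    # object, no matter how many PoPs ask for it.
    proxy_cache_path /var/cache/nginx/shield keys_zone=shield:100m max_size=50g inactive=60d;

    server {
        listen 80;
        server_name shield.example.com;                            # hypothetical shield host

        location / {
            proxy_pass https://example-bucket.s3.amazonaws.com;    # hypothetical bucket
            proxy_cache shield;
            proxy_cache_valid 200 60d;
            proxy_cache_lock on;      # collapse concurrent misses into one upstream request
            proxy_cache_use_stale error timeout updating;
        }
    }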
I use https://www.netlify.com over S3; they're a static website host, but they let you remap URLs to proxy to whatever backend you want. A big chunk of bandwidth is included, and it mitigates AWS per-request billing too.
> the purpose of CloudFlare’s Service is to proxy web content, not store data. Using an account primarily as an online storage space, including the storage or caching of a disproportionate percentage of pictures, movies, audio files, or other non-HTML content, is prohibited
The site in question is also on CloudFlare, and this clause does not apply: the cached assets are not disproportionate, nor is CloudFlare being used as a storage space (S3 is still that).
The clause is there to provide a tool for dealing with some of the interesting things that people will try to use a CDN with no bandwidth charges for.
e.g. if whole movies were given a different content type and file extension, would CloudFlare suddenly become the world's largest CDN for pirated movies?
Most such things have already been tried; the clause allows us to answer "only for an exceptionally short amount of time".