Sometimes I feel like the articles posted here are elaborate trolls. Like they wanted to know what the purpose of a cache is and how/when to implement one, but they didn't want to do research, so they came up with the worst idea they could imagine and blogged about it, hoping someone on HN would tell them the right way (or just laughing at people trying to correct them).
It gives me anxiety how much truly awful advice is upvoted here. It feels like children playing in a sand pit, and then one of the children dumps a box of rat poison on the ground, and some kid says how rat poison is actually good for you because it contains minerals or something. So they all start quickly scarfing it down. You try to tell them rat poison is bad for them, and then several of the kids start defending rat poison, with one lecturing you for being so negative about rat poison. I guess I should just let the kids poison themselves, but it's a terrible thing to watch.
There are way more terrible ideas than good ones, and that means treating all of them equally is a one-way road to HN losing the one thing it has going: new ideas.
Feedback like the root comment is important to keep pushing ideas forward, instead of going around in a loop.
I would honestly love to educate people here. But when the rabble come together to downvote you and dismiss what you're saying, the general perception from any reader is going to be that I'm wrong because so many people are disagreeing with me. Despite the fact that I'm literally an expert and they're not. So it's useless. (And just hurts my soul.)
Do you have specific criticisms? There are myriad and varied different use cases for a cache. I’ve used Postgres in a pinch. It’s fine, for certain things. I’ve used a lot of other caches as well that would be highly inappropriate for the situations I’ve used Postgres in.
In my experience there are far more highly-overwrought bloated architectures deployed than there are overly-clever minimalist ones. Everyone wants to put the cool tools on their resume.
So, for one: we cache to take load off the DBMS, and then it supposedly makes a lot of sense to take load off the DBMS by giving it the cache's load too? Another thing is that separate services can be optimized for different tasks: Redis is built to be a first-class in-memory cache service, while Postgres is built with a different first-class usage in mind. Another one I can think of is resources: I can scale Redis and the DBMS independently as their specific needs change.
Overall, this cache-in-Postgres idea is a bad one for me too.
I see this as an issue of scale. HN loves to think about infinite scales where there are millions of simultaneous users of systems.
This is rarely the case.
Caching in Postgres is for smaller scale applications where the extra complexity of a dedicated memory cache isn’t worth the extra overhead.
But at least you bring up some actual limits that should be considered.
DB load: part of the load on a DB is the network connection for the query, but a large part is also finding the data. If you have a complex query, storing a de-normalized form in a cache table would be a lighter load. (Provided that you aren’t limited by the network or number of concurrent connections)
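As a hypothetical sketch of that idea (the table, columns, and underlying query are all made up for illustration, not from the article):

    -- Cache the result of an expensive aggregate so reads become a single
    -- indexed lookup instead of re-running the join (names are illustrative).
    CREATE TABLE report_cache (
        customer_id bigint PRIMARY KEY,
        report      jsonb NOT NULL,
        computed_at timestamptz NOT NULL DEFAULT now()
    );

    INSERT INTO report_cache (customer_id, report)
    SELECT c.id, jsonb_build_object('orders', count(o.id), 'total', sum(o.amount))
    FROM customers c JOIN orders o ON o.customer_id = c.id
    WHERE c.id = 42
    GROUP BY c.id
    ON CONFLICT (customer_id)
    DO UPDATE SET report = EXCLUDED.report, computed_at = now();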
I think that whether or not this is a good idea depends highly upon what scale you’re operating at. For high traffic applications, it’s not a good plan, but if you already have a DB setup and you have a lighter load, I could think of worse solutions.
I mean, if you have a light application then maybe you don't even need the cache; below a certain threshold of small you can go ahead with SQLite, or Access, or Excel. Of course HN has an issue with scale, but if we're talking about caching, let's talk about situations where talking about caching makes sense.
Caching can make sense in more situations than "too many users running the same query at once". For basically anything slow that you'd like to make faster, you can weigh the downsides of caching against the downsides of optimizing its performance in other ways. In a startup context, often some (judicious) caching is preferable to spending a bunch of time carefully optimizing some code that's going to change or be thrown out in a month anyway.
Many of the technical blog posts that are most up-voted here on HN would be best described as content marketing. That the posts describe something technical is almost incidental to the point of making you aware that the company writing the post exists.
HN eats up all sorts of these posts from companies selling backend-as-a-service, low-code-web-UI-as-a-service, unreliable-replacement-for-Heroku's-free-tier-as-a-service, and so on...
Yes, and honestly his "why" section really reads like "A hammer can do everything a screwdriver can, and I really like my hammer so I'll use it to push screws into planks".
I was looking around on the internet, found the article, thought it was interesting and posted it.
I think that's being constructive. You on the other hand haven't added anything to the discussion (caching/postgres), but rather are yelling at the clouds.
If you think the article is such a bad idea, provide some alternative links. If you're not in the mood to do that, just go to the next thing you might find interesting on the endless scroll that is HN.
Being a curmudgeon is frankly worse than having suboptimal articles.
The problem is that it's so hard to get on the HN front page that stuff that does get there tends to be seen as having value, but fools looking at a foolish idea don't know any better.
The OP's post was constructive. It indicates there might be a need for heavier-handed moderation and technical vetting of content before it reaches the front page.
I was working at a company where we used stupid things from the tech world as loading messages for Slack, like “parsing html with regex”. I think “caching in postgres” might deserve a spot there too.
For projects where I know the team will remain small (fewer than, let's say, 15 developers), I usually push to keep the architecture as simple as possible.
I've used something similar in the past, but kept the expiration code in the app code (Python) instead of using "fancy" Postgres features like stored procedures. It's much easier to maintain, since most developers will know how to read and maintain the Python code, which is also committed to the git repository.
Also, instead of using basic INSERT statements, you can "upsert".
    INSERT INTO cache_items (key, created, updated, expires, value)
    VALUES (...)
    ON CONFLICT ON CONSTRAINT pk_cache_items
    DO UPDATE SET updated = EXCLUDED.updated,
                  expires = EXCLUDED.expires,
                  value   = EXCLUDED.value;
And since you have control over the table, you can customize it however you want. Like adding categories of cache that you can invalidate all at once, etc.
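A minimal sketch of what that can look like, assuming an added category column (names are illustrative; the app runs these queries, no stored procedures involved):

    -- Reads check expiry themselves:
    SELECT value FROM cache_items WHERE key = $1 AND expires > now();

    -- Periodic cleanup the app can run on a timer:
    DELETE FROM cache_items WHERE expires <= now();

    -- Bulk invalidation by category (assumes a 'category' column was added):
    DELETE FROM cache_items WHERE category = 'user_profiles';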
Postgres is also pretty good at key/values.
In other words, I agree that using Postgres for things like caching, key/values, and maybe even a message queue can make sense, until it doesn't. When it doesn't make sense anymore, it's usually easy to migrate that one thing off of Postgres and keep the rest there.
Also, one benefit that's not often talked about is the complexity of distributed transactions when you have many systems.
Let's say you compute a value inside a transaction, cache it in Redis, and then the transaction fails. The cached value is wrong. If everything is inside of Postgres, the cached value will also not be committed. One less thing to worry about.
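A minimal sketch of that property (the accounts table and the cache table's column set are illustrative):

    -- Because the cache lives in the same database, the cached value commits
    -- or rolls back atomically with the data it was derived from.
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    INSERT INTO cache_items (key, value, expires)
    VALUES ('account:1:balance',
            (SELECT to_jsonb(balance) FROM accounts WHERE id = 1),
            now() + interval '5 minutes')
    ON CONFLICT (key) DO UPDATE
    SET value = EXCLUDED.value, expires = EXCLUDED.expires;
    COMMIT;  -- if anything above fails, the cache entry is rolled back too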
As mentioned this is a distributed system though, and realistically most micro services etc aren’t being fully rigorous about multi-phase transactionality and proper rollbacks etc.
Realistically it is the norm to YOLO updates at a service, and if one fails then the whole thing 500s and things are left in an unexpected state. Often it is not even possible to guarantee a successful rollback: if your update back to the original state fails, then what is the application state now? Undefined and potentially invalid, pretty much. Most people just replay the request and hope it succeeds.
Obviously the right answer is “don’t do that” or “offload that complexity into graphQL or something” but in the real world… people don’t.
Transactions can fail because they conflict with other transactions happening at the same time.
It's not an application bug. It's real life transactions happening on a production system. It's normal for that to happen all the time. The app can retry, etc., but it should be expected to happen.
Having to deal with distributed transactions is not easy, especially when many different systems are involved.
For example, you'd have to wait until the transaction commits successfully before setting the value in the cache, which makes the code harder to write and follow.
Also, life in general happens. Compute a value, cache it, save things to the database, make API calls, and then a network error happens cancelling everything that you've just done. Having code that handles this kind of possibility is relatively hard to write/read.
Right but postgres isn't going to help with this if the application developer isn't doing safe and proper transaction management in the first place. What you described is a bug in the application logic for when and how to update the cache.
It's super hard to get this right. E.g. if you only update the cache after the transaction commits, you might commit without updating the cache, or if 2 writers interleave, the first one to commit might make the final update to the cache with a stale value.
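One common mitigation, sketched here as SQL against a cache table (the version column is my assumption, not something from the parent): tag each write with a monotonically increasing version so a writer that lost the race can't overwrite a newer value with a stale one.

    -- The guard makes stale writes a no-op instead of a silent clobber.
    INSERT INTO cache_items (key, value, version)
    VALUES ($1, $2, $3)
    ON CONFLICT (key) DO UPDATE
    SET value = EXCLUDED.value, version = EXCLUDED.version
    WHERE cache_items.version < EXCLUDED.version;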
Why, exactly, do we need to put a memory cache such as Redis in front of Postgres?
Postgres has its own in-memory cache that it updates on reads and writes, right? What makes Postgres' cache so much worse than a dedicated Redis?
Postgres can develop problematic behavior if you have high churn tables - tables with lots of deletes on them.
If you have many inserts and deletes on a table, the table will build up dead tuples and Postgres will eventually be forced to vacuum it. This doesn't block normal operation, but autovacuum on large tables can be resource-intensive, especially on the storage/IO side. At worst this turns into resource contention, so you either end up with a never-ending autovacuum (because the vacuum can't keep up), or a severe performance impact on all queries on the system (and since this is your Postgres-as-Redis, there is a good chance all of the hot paths rely on the cache and get slowed down significantly).
Both of these result in different kinds of fun - either your applications just stop working because postgres is busy cleaning up, or you end up with some horrible table bloat in the future, which will take hours and hours of application downtime to fix, because your drives are fast, but not that fast.
There are ways to work around this, naturally. You could have an expiration key with an index on it, and do "select * from cache order by expiration_key desc limit 1", and throw pg_partman at it to partition the table based on the expiration key, and drop old values by dropping partitions and such... but at some point you start wondering if using a system meant for this kinda workload is easier.
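The rough shape of that workaround, using plain declarative partitioning (pg_partman automates creating and dropping the partitions; names here are illustrative):

    -- Expired entries are dropped a whole partition at a time, so vacuum
    -- never has to chew through millions of dead tuples.
    CREATE TABLE cache (
        key        text NOT NULL,
        value      jsonb,
        expires_at timestamptz NOT NULL
    ) PARTITION BY RANGE (expires_at);

    CREATE TABLE cache_2024_03_01 PARTITION OF cache
        FOR VALUES FROM ('2024-03-01') TO ('2024-03-02');

    -- Dropping old values is a cheap metadata operation, not a mass DELETE:
    DROP TABLE cache_2024_03_01;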
The buffer pool in an RDBMS ends up intimately connected with the concurrency control and durability protocols. There's also a variety of tradeoffs in how to handle conflicts between transactions (steal vs. no-steal, force vs. no-force, etc.). You need deadlock detection or prevention. That creates a necessary minimum of complexity and overhead.
By comparison an in memory kv cache is much more streamlined. They basically just need to move bytes from a hash table to a network socket as fast as possible, with no transactional concerns.
The semantics matter as well. PostgreSQL has to assume all data needs to be retained. Memcached can always just throw something away. Redis persistence is best effort with an explicit loss window. That has enormous practical implications on their internals.
So in practical terms this means they're in different universes performance wise. If your workload is compatible with a kv cache semantically, adding memcached to your infrastructure will probably result in a savings overall.
Because Redis is almost infinitely scalable while Postgres is not. You have a limited vertical scalability budget for your database. The more things you put into your database, the more of that budget you're spending on things that could be done elsewhere.
Sometimes it makes sense, when your workload is not going to hit the limits of your available hardware.
But generally you should be prepared to move everything you can out of the database, so the database doesn't spend any CPU on things that could be computed on another computer. And cache is one of those things. If you can avoid hitting the database by hitting another server, it's a great thing to do.
Of course you should not prematurely optimize. Start simple, hit your database limits, then introduce a cache.
How would the architecture in the OP mesh with master-slave postgres setups? If I write a cache item how can I be certain the freshest entry is read back from the read-only slave? Can/do I pay a performance penalty on writes waiting for it to be synchronized? Is it better, when it comes to caching, to ignore the slave and send all read/write cache related queries to the master?
All of these questions go away or are greatly simplified with redis.
They don't really go away, because if you need read-only replicas with PostgreSQL, there is a good chance that you will also need read-only replicas with Redis.
Similarly to Postgres, Redis replication is also async, which means that replicas can be out-of-sync for a brief period of time.
I wasn't sure whether to comment this: you can mark Postgres replicas as sync replicas. Writes on the leader only commit fully once they are fully replicated to all sync replicas. This way Postgres can ensure consistency across several replicas.
This however can come with a lot of issues if you start using it to ensure consistency across many replicas. Writes are only as fast as the slowest replica, and any hiccup on any replica can stall all writes.
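In concrete terms (these are real Postgres settings; the standby name is a placeholder):

    -- Commits on the leader wait for the named standby before returning.
    ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica_a)';
    SELECT pg_reload_conf();

    -- Per session/transaction, wait until the standby has *applied* the
    -- change, so a follow-up read on the replica sees it:
    SET synchronous_commit = remote_apply;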
What I wasn't sure about: IMO in such a situation, you should rather fix the application to deal with (briefly) stale information, and then you can throw either async Postgres replicas at it, or Redis replication, or something based on Memcached.
IME -- and I've just replaced a Postgres-only unlogged cache table with Redis -- it's not about the storage or caching, but about the locking. Postgres needs to acquire (or at least check) a lock for reading as well as writing. Although some optimizations have been done for mostly-reading-workloads (search for postgres fast-path locking), you'll still run into lock contention problems relatively quickly.
Machines used to have limited memory, so distributed caching could utilize many machines to form the overall cache. Nowadays machines have plenty of memory, numerous cores, and fast bandwidth. The need for a large network of cache servers has waned.
Even though PG caches, it is still doing all the work to run the query. It is like asking why a 3D render takes so long when the same image saved as a PNG opens so much faster.
The article talks about using unlogged tables, which double write speed by forgoing the durability and safety of the WAL. It doesn't mention query speed because that is completely unaffected by the change.
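For reference, that's the whole trick; the table definition below is a sketch along the lines of what the article describes, not its exact schema:

    -- An unlogged table skips the WAL: writes get cheaper, but the table is
    -- truncated after a crash and is not replicated to standbys. Reads go
    -- through the same planner/executor either way.
    CREATE UNLOGGED TABLE cache_items (
        key     text PRIMARY KEY,
        value   jsonb,
        expires timestamptz
    );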
As far as I know there is no way to tell Postgres to keep a particular index or table in memory, which is one reason to be wary of using one PG instance for many varied workloads. You might solve this by earmarking workload-specific replicas, though.
If you can keep your entire working set in memory, though, then it probably doesn't matter that much.
Redis is entirely in memory, so all of its data is in memory. Postgres, on the other hand, does have a cache of its own, but it doesn't give you fine-grained control over what stays cached; that depends on data access patterns. E.g. I cannot pin an entire table of my choosing in the cache.
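The closest Postgres gets, as far as I know, is pg_prewarm, which loads a relation into the buffer cache once but doesn't pin it there:

    -- Warms the cache once; nothing stops later eviction under pressure.
    CREATE EXTENSION IF NOT EXISTS pg_prewarm;
    SELECT pg_prewarm('cache_items');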
If you have a second machine, why not just put a Postgres read replica on it? Letting the WAL deal with replica consistency is much simpler than making the client responsible for keeping an external cache in sync, and you get the benefit of keeping everything in Postgres.
at a previous gig, we put a Redis cache in front of a slow monolithic Microsoft SQL Server db
that gave us breathing room to migrate the data we were interested in into a much smaller dedicated db for our purposes (the user data, used on every login and other operation in the system)
in a correctly specced RDS, MySQL can be set up to store all the data both in memory and on disk (note, I'm not referring to the ephemeral disk only db engine, I'm referring to giving the RDS the correct memory settings and parameters to store all the data both in memory and on disk)
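the main knob being something like this (placeholder value; on RDS you'd set it via a parameter group rather than SET GLOBAL):

    -- MySQL: size the buffer pool so the whole working set fits in memory.
    SET GLOBAL innodb_buffer_pool_size = 64 * 1024 * 1024 * 1024;  -- 64GB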
once we did that, there was no difference in read speed between the Redis and just hitting the db, so we got rid of the Redis, which vastly simplified the architecture of the user service
in short: the article title is true. caching is often understood to be a default requirement, but it kind of isn't if you architect things correctly
The reason I use Redis as a cache is raw speed. I would consider using this Postgres-based solution if the speed was comparable. The article leaves benchmarking up to the reader.
All "Postgres as an X" are valid to a point. If your use case doesn't require anything else and you prefer to build / test / maintain dedicated (a little snowflakey probably) logic, fine. Unfortunately, few such articles discuss the edges at which these solutions no longer work.
I like the idea of "defining the edges of suitability." Just because you can use it, does not mean you should.
I lean towards fewer tools. Postgres is an awesome swiss army knife. In some cases, using it as a key-value store, a cache, or a messaging queue are fine. In other cases, they're not.
Examine how suitable this particular tool is for your particular use case and your particular requirements.
Maybe you have zero DevOps skills, a background in DB administration, and your cache backs an external API. Perhaps using postgres is the right solution.
Maybe you know enough about databases to be dangerous, but feel comfortable managing virtual clouds environments. Perhaps DynamoDB or a Redis instance is the way to go.
There is no one answer. However, often there is a good default answer.
P.S. The default answer changes over time. There's always a flavor-of-the-year, and the technology landscape is constantly shifting. My heuristic is to lean towards the default answer that has been around the longest.
I can say, using pg as a time series database works completely fine for < 1B rows. If you have significantly more than that, it is a good idea to start thinking about alternatives.
You almost always don't need a specialized or dedicated (blank). You can usually get away with using some boring off the shelf tech that can do the job, even if it isn't doing the job as perfectly as a dedicated (blank). Any additional cost in resources is probably offset by the savings in cognitive load and admin hours.
If you do, you're probably operating at a scale at which you have a whole big devops team to figure out, deploy, and manage your dedicated (blank). Or if you're really huge, you might be creating your dedicated (blank) in a way that's custom engineered around your problem.
Systems that are used as cache such as redis evict keys with some randomness to avoid scanning all of them. They also evict keys you read that expired without you having to check for that. And they don't store both a key and an id.
Given the value of Redis for caching and beyond, plus the ease and low cost of setting it up, I would say Postgres as a cache seems niche.
This appears to be riding the [Solid Cache](https://dev.37signals.com/solid-cache/) idea that DHH has started espousing. At the core of this idea is that SSDs are sufficiently fast for caching. Sure, this may work for 37signals, but that is because they manage their own hardware. That is the exception and not the rule. And, per [this discussion two weeks ago](https://news.ycombinator.com/item?id=39443679), SSDs aren't as fast for most of us who leverage "the cloud".
Sure, you can do it, but why? I am all for removing complexities in my infrastructure. That said, Memcached and/or Redis are both rock solid and dead simple.
This just seems to be a waste of time in 99.999% of use cases.
Re: Solid Cache, they effectively answered this in the post. The cost is lower, meaning they can cache more, so p95 response time goes down. If you're not constrained by money, you can of course just buy a bigger Memcached or Redis.
If you're talking about moving things to Postgres /in general/, then cost and/or complexity are compelling reasons to do so. Small engineering teams, small budgets, simplifying and reducing cost can really help. Obviously it's not suitable for everyone but it's nice to have the option.
In general, caches need to be fast. However, the Solid Cache article you linked says:
> On Basecamp, compared to our old Redis cache, reads are now about 40% slower. But the cache is 6 times larger, and running on storage that’s 80% cheaper.
It seems that their caching needs don't require the speed that Redis provides. Instead, they can get an overall performance improvement in their application if they cache more things -- that's why they want a bigger cache.
Based on this, it seems that Redis was too fast and too expensive. Being self-aware of your needs is key. This way, Basecamp could tailor a solution that suits their needs.
You don't always need the fastest, the biggest, the most scalable, or the cheapest. If you know the thresholds, you can come up with a better tradeoff.
So presumably the reason why you'd do something like this is if you have expensive queries or calculations, but in that case you could use materialized views instead, depending on your data and queries obviously.
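For the in-database case, that might look like this (view name and query are illustrative):

    -- A materialized view caches an expensive query's result inside Postgres.
    CREATE MATERIALIZED VIEW daily_sales AS
    SELECT date_trunc('day', ordered_at) AS day, sum(amount) AS total
    FROM orders
    GROUP BY 1;

    -- CONCURRENTLY requires a unique index, but lets readers keep working
    -- while the view rebuilds:
    CREATE UNIQUE INDEX ON daily_sales (day);
    REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;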
If it's to serve as a cache on an application level, memcache seems like it would be simpler and faster. While I do like Redis, I can see why you'd be careful introducing it. Redis can blur the boundaries between caching and data storage a bit, but it's just so handy.
Why wouldn't you simply use SQLite (or some other in-memory flavor of SQL) instead of hacking the main Postgres db and adding load to the primary instance?
The author makes a valid point that there's something nice about using familiar tooling (including the SQL interface) for a cache, but it feels like there are better solutions.
Because SQLite is in-process. Usually, when you start thinking about a cache, you have more than one application server. Each application server running its own cache makes cache invalidation a nightmare (I worked at a company where one genius did that and it caused a lot of trouble). Don't show me any SQLite replication things, because that's not how you want your cache to work.
My issue with running PostgreSQL as a cache would be its process-per-connection model and the downsides of MVCC for cache workloads.
> hacking the main Postgres db and adding load to the primary instance
nobody said this had to be on the primary database server, and how is this hacking?
Is every app server going to have its own local "sqlite cache"? Or is it going to use one of the sqlite server/replication things? So why not just use PG?
That's a bit of a strawman argument. Per the post, you can't leverage this on a read replica, it has to be run on primary. So you're going to stand up and manage a full new Postgres instance for this?
I'm sure there are many cases when that makes sense, but there are many cases when that's also overkill. An in-memory cache inside your server will give you better performance, and a lot of less infrastructure maintenance complexity.
The explanation is not convincing enough. Good-to-know info, but I can't see this being useful for a large-scale production system. There is no description of cache invalidation strategies.
But it's highly voted on HN, so maybe I'm missing something.
I was checking out used rack servers on ebay and servers with 768GB RAM are not very expensive, they are under $3k USD... seems that if you could cache 256GB to 512GB of Postgres into RAM you could do pretty well...
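If you go that route, the main knobs are shared_buffers and effective_cache_size; the values below are placeholders for a box that size, and shared_buffers needs a restart to take effect:

    ALTER SYSTEM SET shared_buffers = '256GB';        -- Postgres' own cache
    ALTER SYSTEM SET effective_cache_size = '640GB';  -- planner hint: OS cache counts too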
If you own the hardware, pricing becomes an entirely different question. Cloud will make you pay heavily for RAM where this strategy of using disk backed cache suddenly shines.
It’s a weird article. It claims that one only needs Postgres but then presents a half-baked solution: the takeaway I got is “it’s not HA or horizontally scalable, and it’s probably slow” so it sounds like I _do_ need a dedicated cache service?
They just handwave benchmarking against cache databases as “out of scope”, but that is very much in scope when the claim is that I can use PG rather than Redis.
Also, such a cache would impact the rest of the application (reading/writing to the cache consumes CPU and I/O).
Also, there's a missing index on inserted_at, and no uniqueness constraint on the key.
Overall, it reads like someone read about UNLOGGED and thought it could be useful for caching use cases. I don't think anyone really doubted that PG can be used as a cache; the trade-offs would be a more interesting read than the PoC. See the sketch below for what those fixes might look like.
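Concretely, something like this (table and column names assumed from the post):

    -- Without these, duplicate keys can accumulate and expiry scans are slow.
    ALTER TABLE cache_items ADD CONSTRAINT cache_items_key_unique UNIQUE (key);
    CREATE INDEX ON cache_items (inserted_at);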
The reason you build a cache, for most purposes, is fast data retrieval. A major component of "fast" is the latency of a disk read vs. a memory read. Redis and Memcached are in-memory by design, so the entirety of your cached table will be in memory. When you design a cache using Postgres (assuming an RDS instance), although data will be cached in memory based on access patterns, I don't believe there is a direct way to control what data gets cached, which makes this implementation slower for a lot of real-world use cases.
This gets even slower when you have a distributed DB like AWS Aurora Postgres because your data on disk can be on different EBS volumes, so bringing it in memory can be slower than even RDS Postgres.