KubeDB – Run production-grade databases easily on Kubernetes

stevenacreman · on Dec 30, 2018

It's good to see a project focussing on production-grade databases on Kubernetes. Particularly the production grade part.

There are 33 open source operators for managing databases on Kubernetes. Out of that list only 3 claim to be production ready.

Out of 126 Operators that I've looked into the vast majority are abandoned and unfinished. Most state the project status as Alpha in the readme.

Kubedb itself has a version number of 0.8.0 for the operator and very low version numbers for the databases. For example version 0.2.0 for Redis.

Version numbers can mean anything but they are usually a good indicator of what the project owner thinks the status is.

It would be cool to see a break-down of status and expected dates for milestones for Kubedb.

For anyone interested in browsing other Operators I keep a table updated half way down this blog post.

https://kubedex.com/operators/

The project statuses come directly from what the authors have stated. Many beta status projects are being used in production.

manigandham · on Dec 30, 2018

Part of the problem is that Kubernetes itself is still changing rapidly and already has design-by-committee cracks in the API.

It would help if the community took a break from new features and worked on stability first so that Operators and other extensions can finally take off. Some of the things being developed now are so esoteric that it seems to be more about finding the next exciting thing to add than usability.

shaklee3 · on Dec 30, 2018

You're using that term in a derogatory sense. Would you rather have Google decide how everything is designed, and everyone else has to deal with it? I think you'd see a ton of GCP-specific stuff if that were the case.

I used to think how you did about kubernetes because I saw just how long it took for features I really wanted to get in. Then I attended some of the SIGs, and realized that there are so many use cases out there unlike mine, and that doing what I want may break what others want. So instead of making a decision that screws over everyone but one cloud provider, what I've seen is very methodical and careful decision making from many companies working together. This usually means that you get something that may not do exactly what you want out of the box, but there are hooks to do it if you'd like. I'd much prefer this over nothing at all.

It would be worth sitting in on a SIG you're interested in, and see how @smarterclayton and @thockin handle these kinds of decisions. I see so much negativity on HN about k8s, and it really seems like people just don't appreciate the amount of attention that goes into each decision. I think if you spend the time to trace the history of a feature and understand why things are done, it may change your mind about how complex k8s is.

smarterclayton · on Dec 30, 2018

What are some of the design by committee cracks that you think should be addressed?

derefr · on Dec 30, 2018

> Some of the things being developed now are so esoteric that it seems to be more about finding the next exciting thing to add than usability.

Or perhaps it's real ops people with particular arcane needs, each scratching their own itches?

K8s is a large FOSS project; and like most large FOSS projects, most PRs are from corporate contributors that wrote the code for their own purposes and then wanted to upstream it to avoid having to maintain a fork.

bogomipz · on Dec 31, 2018

>"Part of the problem is that Kubernetes itself is still changing rapidly and already has design-by-committee cracks in the API."

Could you elaborate a bit on what those "cracks" are?

shaklee3 · on Dec 30, 2018

Stability of what?

sitkack · on Dec 30, 2018

What would a production-grade conformance test suite look like for K8s to get these operators to 1.0?

I am mostly a bystander, but in the k8s issues I see, it is too easy to either destroy all the pods or their volumes. Maybe this should be fixed at the k8s level.

ryukafalz · on Dec 30, 2018

>too easy to either destroy ... their volumes.

As someone who's started running services in Kubernetes (albeit mostly as a hobby thus far) I would recommend setting the ReclaimPolicy to Retain for any PersistentVolumes that are particularly important. The default behavior is to delete the underlying volume when the resource representing it is deleted, but if you're worried that might happen accidentally that may not be what you want; this behavior is configurable.

mdaniel · on Dec 31, 2018

> Maybe this should be fixed at the k8s level.

FWIW, it has been: RBAC allows you to strip -- or I guess pragmatically speaking, not assign -- rights at whatever level of granularity you have the patience to maintain. It is also bright enough to do that per Namespace, so going light on the ClusterRoleBindings and keeping things out of the "production-db" Namespace would likely go a long way toward addressing the risk you are describing

markbnj · on Dec 30, 2018

I'm wary of the operator model in general, and we haven't had great success using operators to deploy complex stateful services in our clusters. But to be honest we also haven't had great success deploying them using OTS charts from helm stable either. One of our k8s stateful services is a large elasticsearch cluster indexing about 150m events per day, and the chart was forked and heavily modified by us to get it right. I feel that complex stateful services often have enough devils in the details that trying to implement them through an abstraction gets you into trouble. Operators aspire to be a "smart agent" that can translate a CRD resource declaration into a functioning thing, allowing you to implement your data store at an even higher level of abstraction than a helm chart provides. Since in my experience charts are themselves too abstract for this purpose (you either end up forking/modifying or, if the chart actually provides full coverage of the configuration options, creating a whole new hard to comprehend API to the k8s resources you're trying to deploy), I'm not that excited about having a back-end clippie that can do it for us. It's probably fine for simple use cases, and especially those where you often need to create and destroy simple dbs, but imo not yet for large production use cases.

marcc · on Dec 31, 2018

The Operator/CRD pattern is promising for autonomously operating simple use cases of existing software and for operating really complex software that needs very specific, rare knowledge to operate.

Unfortunately, we aren’t there yet for most software. Let’s take Postgres as an example. Even though you have to manage your pg database manually (or use a service that manages it for you), that’s just because the right automation software hasn’t been built yet. Someday, a Kubernetes Operator (or equivalent implementation) will exist that can manage a large Postgres cluster better than a team of DBAs. It’s crazy that there are hundreds (thousands?) of configuration parameters in Postgres, and these are coupled to the operating system settings in weird and unexpected ways that most people don’t know. We should be building this knowledge into a K8s Operator and letting that control our pg.conf and os configuration, instead of giving that control over to a team of humans who might be able to put in some sane defaults, but will always be working to get the optimal performance out of Postgres as the usage share changes.

This exists in some places already. For example, Rook is a K8s operator that provisions and manages Ceph in a Kubernetes cluster. As a small startup, if I need this functionality, I don't want to hire a full time Ceph admin to figure it out, and I don’t have the expertise to take on operating Ceph myself. Rook productized operating Ceph for us, and “baked in” all of the needed knowledge to manage block and object store and even set up concurrent, shared file systems. I trust Rook to manage Ceph, and I don’t think that I could do a better job with human intervention.

We have a long way to go. Operators are a tool that might help get us there but Operators are just a pattern that exists that we can use. One thing for sure is that we shouldn’t assume that human control over complex software is required to achieve optimal performance.

andrenth · on Dec 31, 2018

What's your strategy for handling possible filesystem failure/corruption scenarios without a team that understands the underlying technology?

marcc · on Dec 31, 2018

That's a great point. I do have a team that understands the underlying technologies and has been successful in troubleshooting several production problems with Rook/Ceph, one recent one including file system corruption. My original post is just trying to state that our engineering team does not maintain a deep operational knowledge of the best way to configure, manage, monitor, scale, etc (operate) ceph in production. We rely on the Rook operator for this.

Troubleshooting acute outages caused by hardware or software failures requires a different skill than properly configuring the system to scale and minimize the chances of a corruption or outages. Rook solves the later, but we do understand the architecture and what Rook (and Ceph) are doing. We've just removed the expert level, craftsman, speciality knowledge required to operator Ceph because we decided, after a thorough evaluation, that the software in this case is the most capable solution.

andrenth · on Dec 31, 2018

I find this unusual because usually the knowledge require to troubleshoot a complex piece of software is much more complex than that required to set it up in the first place. In other words, how can you troubleshoot it if you don’t know how it’s built?

It’s a bit like debugging software you didn’t write.

ec109685 · on Dec 30, 2018

I agree you do not want to be running half baked operators on your cluster, but clouds like Amazon’s and Google’s have shown it is possible to harness open source software like ElasticSearch and successfully run it on behalf of customers.

It will take time for these to harden, but it eventually will since the primitives are all there.

That said, a very significant issue that the public clouds don’t face is the team creating the operator in aws / gcp’s case is the one running it and they fully understand exactly the configuration the operator will be deployed to.

With helm and operators designed for public consumption by other companies, the amount of generality needed will be higher, at least until the conformance tests for a kubernetes implantation get more detailed and there is ci/cd between the operators and charts running on clusters that better mirror various Kubernetes deployment choices.

cyphar · on Dec 30, 2018

The thing that's always bothered me about the operator model is that you have to reinvent all of the things that Kubernetes does for its own objects (failure handling, redundancy, declarative configuration). At least that's my understanding, from what I've seen.

smarterclayton · on Dec 30, 2018

Arguably all kube is just a pattern library, and the patterns fit reasonably well together. But patterns are situational - you’re always going to have to fill the 20% oc uncovered space with something unique to your use case. Operators are more like frameworks using those patterns. They can reduce time to value, but once you hit the areas where you are fighting the framework you might want to implement those patterns yourself.

When we designed stateful sets, it was to make the minimum bar easier to get unique network identity (which is necessary, but not sufficient for most cluster software). And in practice I’ve seen people using sets directly with a thin layer of scripting or helm on top, but I’ve also seen people implementing their own stateful sets because the logic isn’t that hard once pods had the necessary shims.

I would probably say that kube is best thought of as an extensible compute pattern framework (you can leverage the lowest level atoms or build layers on top). Kube is probably only successful as long as we keep the 80% easy, reduce the cost to add new patterns (libraries and tools), and decompose the bits that should be replaceable.

Most of the “reinventing” problems with operators are problems with kube having weak libraries - each controller and operator is somewhat bespoke. That’s something I expect to see improved this year via the various tooling libraries. But it’s still a work in progress and I regret it took so long.

keypusher · on Dec 30, 2018

While I have completely embraced running stateless services in Docker, I have been hesitant to migrate the database layer to containers. While I have not tested it personally, I have seen numerous reports of performance issues when using volumes. Is this no longer an issue, or was it limited to bind mounts? Do volumes not use the storage driver? Also, I have run into permission issues when using volumes with Docker, which I'm sure was just my own ignorance but it does seem like a cause for confusion and potential error. I have read through the documentation on the linked page, and the quickstart guides for KubeDB seems great for getting up and running, but I do worry about situations like if an automated PG database failover can't reconcile a timeline, there isn't much documentation on failover at all and this could add significant complexity to something that is already a potential nightmare. Anyone care to share their experiences running production databases in k8s?

user5994461 · on Dec 30, 2018

Running production databases in docker last year: https://thehftguy.com/2016/11/01/docker-in-production-an-his...

Performance issues should be the least of your concern. The docker deamon and container simply hanged because of filesystem issues on CentOS 6.

I worked at a company that was dockerizing their stateless services, then planning to dockerize their cassandra databases. Multiple contractors involved.

Stateless services failed periodically because of the above issue. Load balancers can failover automatically, broken nodes are rebooted from time to time, limited impact. Noone cared, just a daily deployment routine.

I fear the day the cassandra dockerization would happen. They'd lose their entire customer data (hundreds of millions of customers) once two nodes would fail simultaneously, which happened a lot on the stateless services.

Thankfully the project never started and the company didn't go bankrupt. Pretty sure employees moved around and plans got canceled.

Expect a lot of instability in docker around filesystem, performance issues and race conditions. Low volume stateless web servers don't get to trigger issues much, but databases do.

gnufied · on Dec 30, 2018

I can't possibly hope to change your mind but stability issues with union filesystem driver in docker(part of it was not even docker's problem) and persistent volumes of kubernetes are two very different things. Cassandra running standalone on host(and crashing) is no different from cassandra crashing when running using a PV inside a container.

Moreover - all/most Linux distros have switched to using Overlay2 as default driver. If you are running latest version of RHEL/CentOS/Fedora/Ubuntu that is the driver you will be most likely using.

user5994461 · on Dec 31, 2018

Don't get me wrong, I know it's not a bug in kubernetes, it's a bug in the filesystem. Kubernetes is as stable as the weakest part and the weakest part is the container engine (docker and underneath).

Containers require volumes/filesystems to run and some implementations are buggy as fuck.

Docker abandoned CentOS 6 many years ago, whether they stated officially or not, the last docker package and kernel/drivers are unstable. Similar story on some other distributions.

It wasn't production-ready at all back then and it's still not a good idea to containerize databases now. Besides bugs that come and go, there are other challenges around lifecycle, performance and permissions that are not trivial to deal with.

bogomipz · on Dec 31, 2018

>"I can't possibly hope to change your mind but stability issues with union filesystem driver in docker(part of it was not even docker's problem)"

Can you outline what those stability issues are/were? Was the non-Docker part of the problem kernel related? Genuinely curious.

user5994461 · on Dec 31, 2018

See RHEL and Debian sections: https://thehftguy.com/2017/02/23/docker-in-production-an-upd...

The filesystem drivers are buggy as fuck. You would experience kernel panics on Debian Jessie (overlayFS), or containers + docker daemon hanging on CentOS 6 (devicemapper). The fix in both cases is a reboot.

You might not notice it if you barely used docker, but it can be very outstanding at scale. I've been consulting briefly at a major web company that was deploying their web services to 5-20 nodes, daily. On every service deployment there would be up to 3 nodes dying.

ex3ndr · on Dec 30, 2018

For sure it is a very different thing. Local SSD or remote drive? That means a lot for Cassandra.

atombender · on Dec 30, 2018

Kubernetes supports local volumes. With GKE you get local SSDs.

ex3ndr · on Dec 31, 2018

That doesn't make sense to use GKE for this. Eventually you will just have bunch of VMS that run only your DB (since you need to avoid interference of other workloads) and there are no support for multi DC mode... And what benefits? Restarting SQL or Cassandra is not very cheap operation and can cause large data migrations.

geggo98 · on Dec 31, 2018

In the Cassandra case, you would not write the persistent data in the Docker image (that's the part of the file system mounted as a layered file system, using AUFS or OverlayFS). Instead, you would write it in a volume. For a local volume, that's just a part of the "normal" file system (Ext4, XFS, ...) exposed to the Docker container through a bind mount.

Volumes are quite stable and reliable when based on a stable file system.

So while you could lose the container due to the but describe, you would not lose the persistent data.

It's best practice not to write to the Docker image at all during runtime (no log files, no PID file, etc), but to write only to volumes or tmpfs mounts. I'm a little bit suspicious about the crashes you described: are you sure you followed that best practice?

ofrzeta · on Dec 30, 2018

Local volumes don't use the storage driver. Therefore there should be no performance penalty as far as I know.

EDIT: There's probably some overhead with the implementation of namespaces and cgroups but I have not found any reliable sources about the quantities. As a side not if you are using memory limits you will have a performance penalty: "Memory and swap accounting incur an overhead of about 1% of the total available memory and a 10% overall performance degradation, even if Docker is not running." (from the Docker docs). This will probably affect databases running in containers as well.

gnufied · on Dec 30, 2018

Yeah this can't be stressed enough. Persistent volumes in Kubernetes are simply bind mounts that exist in your container's namespace. They don't go through docker's storage driver at all. There should be zero penalty for using persistent volumes this way within containers.

cpuguy83 · on Dec 30, 2018

Sorry, this seems to insinuate that Docker is slowing down (or using slow) storage... this is completely false.

Volumes in docker are just host bind mounts.

Now depending on your driver/opts (similar to K8s PV backends) this storage can come from anywhere and performance of the volume is totally dependent on the type of storage being used.

Now, the container fs that docker sets up is (usually) using a CoW filesystem, and as there is overhead there... but volumes are specifically designed to bypass the container fs.

gnufied · on Dec 30, 2018

Hmm, sorry I did not mean to imply that. It is indeed true that both docker volumes and k8s PVs are just host bind mounts.

However I did mean to clarify that - k8s does not uses docker volumes. Whatever performance issues one notices could either be associated with storage provider or container writable layer.

cookiecaper · on Dec 30, 2018

There is no good reason to run non-test database workloads in Kubernetes or Docker. Databases are designed to sit close to the hardware and have a stable, dedicated chunk of resources for a long time, whereas Kubernetes pods are subject to vaporization at any moment. Databases traditionally have fought the operating system to try and maintain enough control to remain performant. Introducing additional layers into this would be dubious at the best of times, but when it's something fundamentally contrary to the application's nature like stateless orchestration, it's pure farce.

There could not be an application worse-suited to running in Kubernetes et al than a traditional database. Anyone claiming something that rams this square peg into that round hole is "production ready" is showing that they're an empty husk and shouldn't be trusted near anything important.

Note the downvotes already rolling in less than two minutes after I posted this. This subject is a major third rail here. It goes against the agenda of very powerful people and my account has been censured in the past specifically for making this particular argument, that database workloads and Kubernetes don't mix. Keep that in mind when you're asking HN for their experience on this (or any other topic that YC considers critical to the interest of their investments -- they've shown that they're willing to taint the discussion if it gets too dicey).

derefr · on Dec 30, 2018

Using Kubernetes doesn't imply using Docker, even. K8s is 99% an orchestration system, like Terraform or CloudFormation. One resource among many that it orchestrates is containers. It can also orchestrate regular VMs.

That being said, I also disagree that Docker isn't suited to running a DBMS, assuming you actually have a large enterprise (or cloud) datacenter backing your Docker daemon. In such cases:

• You'll probably have a large enough pool of Docker machines (k8s or not) that you're going to be deploying your DBMS container in a way that reserves an entire instance just for it (or it + its accessory containers);

• You'll probably have a SAN, and you'll have many enterprise-y reasons (e.g. live VM migration) to prefer backing your DBMS with said SAN, rather than with local instance storage.

If both of those are true, then Docker has no disadvantages compared to deploying your DBMS as a raw VM.

0xbadcafebee · on Dec 30, 2018

As an "enterprisey" person, I disagree. I've seen a lot of enterprise infrastructure that looks like toddlers built it out of lincoln logs. And I've seen SANs lose connectivity much more often than a pool of independent local disks all going bad at once. On top of that, databases run on VMs that aren't on hypervisors dedicated for running databases results in shitty adminning and overcrowded VM pools destroying database performance+reliability.

Cloud-ish infrastructure is often good for running distributed decentralized databases, but try running Oracle in a bunch of Docker containers on a crappy OpenStack cluster and soon you'll be crying into your scotch.

These efforts to make people think it's a good idea to run databases on K8s are misleading people, and god help those poor teams that waste years trying to stabilize something that a fancy web page and a youtube tutorial said was a great idea.

ownagefool · on Dec 30, 2018

I assume you also consider databases on cloud with mounted block storage as not production ready too?

cookiecaper · on Dec 30, 2018

For the record, I have to use my reply allocation sparingly, since usually when I start talking about this I'm mysteriously throttled for long periods.

That said -- no, that's not the same thing at all. Barring anomalous conditions, VMs run as long as you keep them running. They won't be reaped and rescheduled onto some other node in the cluster, whether by automated rebalancing processes or by manual `kubectl delete po...` or `kubectl drain`. You can easily set up a VM that will behave more-or-less like conventional hardware if we ignore the perf hit.

This is a pretty simple thing. The reason people say you need to make your apps "12 factor" when you go to k8s is because it doesn't work well if your app cares about state. Databases care deeply about state. You can't just kill a DB server and spin up a new one to pick up where it left off. You can't parallelize a DB workload by spinning up 8 little DB nodes. It's not a web server and it just doesn't work like that. Things like CockroachDB exist specifically because normal databases don't work like that.

This is where people usually bring up things like annotations, labels, StatefulSets, etc. First, note that the facilities that accommodate stateful workloads are not priorities for Kubernetes and are generally not well-tested or consistent. This wouldn't be a news story or an independent project if they were.

Second, please realize you're doing all of that work to try and make Kubernetes do something it's not really designed to do, with potential negative impact on the availability and scheduling processes for the applications that do work well on Kubernetes, when you could just spin a VM and avoid all of these issues entirely. There's no reason to put a production DB on k8s other than cargo culting.

smarterclayton · on Dec 30, 2018

As someone who designed kubernetes, I completely disagree. We designed it to run stateful workloads. From V1 we set very strong safety guarantees around how pods work and are scheduled. However, like any software infrastructure, you are vulnerable to a lot of possible failure modes. The kernel can hang on NFS mount disk operations. The SAN can go into a gray failure. A cleaning guy can pull a power cord out. Bad code in the kubelet can result in volumes failing to detach. Random people on the internet can open PRs that remove pod safety protections (happens about once every few months).

Just like any other tool that makes some things easier, Kubernetes also makes it easier to shoot yourself in the foot. Just like any solution, you have to know the system well enough to reason about it. There is still a lot that can be done to improve how we explain, document, and describe the system. But people run stateful workloads on Kube all the time, and they do it because it makes their lives easier on the balance.

ofrzeta · on Dec 30, 2018

If Kubernetes was designed with stateful workloads in mind why did it take until version 1.3 to introduce petSets as an alpha feature?

smarterclayton · on Dec 31, 2018

Because we prioritized stabilizing the core and having something shipped. Pod safety guarantees were part of 1.0 and ensure “at-most-one” pod with a given name at a given time on any node, which allows us to build a compute model that can be used in higher level primitives. Persistent volumes, reserving space in the DNS schema for services to have subnames, and headless services were all designed in specifically so we could do stateful sets.

cpuguy83 · on Dec 30, 2018

A database is an application like any other. Containers are about managing the lifecycle of the process and container managers assist in getting the right state to a container. Wether or not a container has state or not doesn't make it easier or harder to run in a container.

If you aren't managing your state, then yeah you will run into a nightmare when trying to containerize stateful apps... or running them at all. You will literally have the same problems with a VM or physical hardware.

It's important to separate state management from process management. A stateful application is absolutely not harder to contaknerize than a stateless one. Rather it is simply just harder to run stateful applications in any regard.

I would personally argue that it is easier to run a stateful app with a container manager. I know it sounds crazy but... keep in mind container tools are cenetered around what each individual application requires and the tooling tends to make it easier to express and assist in managing the state requirements of that application.

For that matter you can even prevent the scheduler from scheduling your stateful app on a new node, which seems to be the answer for the crux of the argument against containerizing a stateful app.

cookiecaper · on Dec 30, 2018

> Wether or not a container has state or not doesn't make it easier or harder to run in a container.

I agree, which is why I specifically avoided that language. Containers don't have to be implemented without regard for state -- but if you're talking about Docker or k8s, they are. Docker throws away anything not explicitly cemented in the image or designated as an external volume.

LXC, zones, and jails are containerization techniques that respect state. It's fine to run a database in these if desired. They behave just like real VMs; they have an init process, they get real IPs, they don't automatically destroy the data written to them, and they generally don't mysteriously shut down or get rescheduled. You can't be confident about any of that with Docker or k8s.

Statefulness is not a primary use case for Kubernetes. It took two years for StatefulSets to leave beta and there was a substantial false start in PetSets. As recently as April, which is the last time I seriously looked, there were still competing APIs for defining access to local volumes.

If you want to run a production database workload in a jail or a zone, that sounds fine to me. It's not about containerization in the abstract. It's about the way that Kubernetes and Docker do it.

(I mention Docker and k8s together because for most of k8s history Docker was the only supported runtime. It supposedly can use other runtimes now, but they're not widely used afaik, and behave similarly re: state anyway)

smarterclayton · on Dec 30, 2018

StatefulSets are PetSets. We renamed them. The core API designed hasn’t materially changed since the alpha.

cpuguy83 · on Dec 30, 2018

No. That's the point. Docker and k8s provide a means to express your state requirements and splits state management from process management.

The trick is to express your state requirements. And yeah, you will be burned badly if you don't do this... and maybe docs and such should call this out better to make sure people don't set themselves on fire just because they didn't dig in deeply enough.

But docker and k8s do provide a means to assist in managing this state for you (swarm... not so well just b/c the work hasn't been done).

ownagefool · on Dec 31, 2018

So actually kube and docker throwing away your state (that you haven't specifically persisted) is basically a good thing, because it makes you very aware of where your state is.

cookiecaper · on Jan 2, 2019

I'm on board insofar as bosses, regulators, and customers find "heightened awareness of state" an acceptable substitute for the production data that was sacrificed to the cause.

wbl · on Dec 30, 2018

Module the development cycle of push and run a new container being incompatible with state. Sure google containerizes everything but they invested effort.

cpuguy83 · on Dec 30, 2018

Again, separate the app layer from the storage layer.

Why would an app dev be pushing changes to the db deployment (outside of data manipulation itself)?

Just because the app dev wants to spin up a db in dev to shove their data into doesn't mean that's how it should be deployed in prod.

DasIch · on Dec 30, 2018

EC2 instances do go down, EBS volumes fail, hardware fails. Maybe not as frequently as pods in Kubernetes get evicted but at sufficient scale it does occur frequently enough overall that you do need to find ways to handle this automatically without human intervention.

Once you’ve achieved that whether your database runs on a VM or in Kubernetes doesn’t make a difference really.

Granted, if your not at that scale, running a database in Kubernetes is probably not the best of ideas. That has nothing to do with Kubernetes though, that’s because running a stateful service with decent working backup, recovery and automated failover is difficult in any case. If that’s not your job, you’re probably better off using RDS or something equivalent.

ownagefool · on Dec 31, 2018

At the end of the day, when you can give pods in the form of statefulsets static IPs, static names, static labels, indexs, and consistent storage, and give them strong guarantees of running, then I'm not really sure you have a strong argument that it's vastly different from the IaaS layer.

Honestly, it sounds like you're arguing that since the kubernetes API is easier and more accessible to use, then it's more dangerous to run state on that layer. That, and a community attitude of being more willing to accept failure, which some would argue is a good thing, others not so much, but I prefer to subscribe to the thought process discussed in the SRE book that failure is inevitable, and that putting your databases inside their kube equivlent saved toil time and harderns your setup.

That said, I would argue most folks being on cloud anyways should just use a managed postgres, but we're not always on cloud and I don't think claiming putting state in kube it's inherently wrong is fair.

derefr · on Dec 30, 2018

> They won't be reaped and rescheduled onto some other node in the cluster, whether by automated rebalancing processes or by manual `kubectl delete po...` or `kubectl drain`.

I take it you've never managed a large VM hypervisor (e.g. vSphere) cluster. If your VMs aren't being pinned to particular hypervisor nodes by persistent claims on local instance storage or the like, they end up "floating around" on each restart in pretty much the same way k8s containers do. Especially so if you have live VM migration enabled, in which case you're probably doing the equivalent of `kubectl drain` all the time to deprovision and repair hardware.

cookiecaper · on Jan 1, 2019

Funny you should mention that. I run a large vSphere cluster (that I inherited) now -- large meaning several hundred VMs. Live VM migration is different because it happens totally transparently; from the guest's perspective, there is no disruption at all. On k8s, pods are recycled all the time and afaik there is no "live" migration of pods that doesn't involve killing and restarting the process. k8s's "vaporize the pod first" culture is basically the opposite of enterprise-grade hypervisors, which exist in large part to minimize incidents that would require the destruction of state, even in the face of hardware failure.

derefr · on Jan 2, 2019

True enough, though I would posit that k8s’s strategy (no live migration) makes sense if you assume that you’re running k8s on top of a VM cluster that has its own live migration, such that you’ll never need to talk to issue an API call to the k8s manager for hardware-related reasons. In such cases, the only time you’re doing a `kubectl apply` is for release management reasons—and it’s nearly impossible, in the general case, to automatically compute a “live migration” between e.g. two different versions of a deployment where the architecture is shaped differently.

(It’s not impossible in specific cases, mind you. I’m still waiting on tenterhooks for the moment someone introduces an Erlang-node operator where you can apply hot-migration relups through k8s itself.)

manigandham · on Dec 30, 2018

FYI: there is no "reply allocation". HN adds an increasing wait time before you can reply to replies on your posts in a thread to prevent deeply nested rapid fire arguments.

user5994461 · on Dec 30, 2018

EBS support in Kubernetes was considered experimental as of last year. Not sure about now.

I recall a few nasty issues in the GitHub with data loss or unmountable volumes for the early adopters, with the official answer along the lines of "implementation is in progress".

gnufied · on Dec 30, 2018

I do not think EBS support in Kubernetes was experimental in 2017. I am one of the maintainers of in-tree EBS driver and we have tried our best to iron out any bugs reported.

There are still bugs, I do not disagree. Data loss bugs are considered top priority and I am not aware of any open such bugs against EBS driver.

user5994461 · on Dec 31, 2018

Pretty sure the bugs in question were closed.

You'll excuse me but no time to go through the history and dig up the tickets for reference.

slashdev · on Dec 30, 2018

I downvoted you because of that whole conspiracy theory you tacked onto the end of your post.

But I fully agree that kubernetes and containers are not well suited to running production databases. In theory they could achieve parity with a dedicated machine or VM, but they're still a long ways from that - and it makes it very easy to lose your data. I was recovering a database where the persistent volume wasn't setup right and the container got killed and restarted. It was just before the holidays and it was a nightmare because everyone was on vacation.

Yeah you could get into that kind of problem with a VM or dedicated machine, but the bar is a lot higher, you'd need some kind of hardware failure. Kubernetes makes it really easy to shoot yourself in the foot when running databases.

alexandre_m · on Dec 31, 2018

"I was recovering a database where the persistent volume wasn't setup right and the container got killed and restarted."

In other words, your database application was using scratch storage instead of persistent volumes?

What this anecdote shows is that the developers or admins responsible to setup the database didn't do it properly.

Also, testing failures and data recovery should be your priority before going to production.

I don't see how you could blame that on software.

slashdev · on Dec 31, 2018

Not 100% sure that it was using scratch, but something went wrong with the persistence.

The point is not to say it wasn't human error - clearly it was, but it's an error that wouldn't have been as easy to make without kubernetes. There's a cost to running a database on k8s that largely people ignore. That's before you start talking about backups and recovery which also get harder and require more manual work with more potential for error.

markbnj · on Dec 30, 2018

The same good reasons for running any workload on k8s apply to databases as well, it's just that they are more complicated than stateless services due to the (no surprise) state and the clustering/control protocol things that often accompany HA data stores. Kubernetes has the tools available to manage state now, and many (maybe most) databases now offer some support for its native discovery model. So all in all my current preferred strategy for databases is to prefer hosted if its available (cloudsql, elastic db), k8s if it isn't, and vms if it won't work on k8s.

softwaredoug · on Dec 30, 2018

For those terrified of an AWS dominated future, projects like this are crucial. The closer we can get to OSS based push button open source DB cluster in any cloud, the less we need fear AWS will host everything and lock us in to a walled garden of closed source AWS systems.

derefr · on Dec 30, 2018

I feel like "OSS" is a bit of a misnomer in this case. Your DB cluster, to the degree that it's "production-grade", is partially managed by things like automated upgrade migrations, automatic backups, etc. Essentially, some (centralized!) team associated with the "OSS project" is acting as a devops team for the associated deployments of their project. It's almost as if this team had SSH access to each on-site cluster to ensure their continued smooth operation—but since they don't, they have to do all such maintenance in the form of pre-specifying repair/maintenence strategies, and then building expert-knowledge of when to apply those strategies into the DBMS software itself. But it's still a devops team sitting around doing this—not random contributors.

It's a similar thing with e.g. Ubuntu LTS releases. The core distro might be FOSS, but those branches are uniquely the result of a centralized, corporate devops maintainership ensuring that the silent, automatic security and kernel package upgrades go off without a hitch.

To be clear, I’m not saying you can’t join that maintainership; what I’m saying is that, unlike with a regular FOSS library or framework, or even a regular piece of FOSS daemon software like Apache, in the case of a DBMS, the software will only continue to run smoothly for as long as that maintainership is around to keep it running smoothly. There’s no such thing as a useful unmaintained DBMS, FOSS or not.

And, because of that, the “calculus of TCO” for DBMS projects changes a bit. Unlike regular software, where “proprietary” translates to “higher potential TCO” because of switching costs, in the DBMS case, the “proprietary” vs “open” distinction is nothing next to the “big, healthy maintainership” vs “small, ailing maintainership” distinction. Because, if the DBMS loses all its maintainers? Now you’re stuck maintaining it—at the core level—yourself (and learning how to do so in the process) until such time as you can migrate your data away from it.

Personally, for a production-grade DBMS, I’d trust a corporate-backed (or at least sponsored) product over one which is purely a volunteer effort any day.

twic · on Dec 30, 2018

This is one place where Cloud Foundry genuinely shines. Part of the architecture of CF is that you have stateful data services provisioned using BOSH, CF's orchestration tool. BOSH can talk to a range of infrastructure providers (AWS, Azure, GCP, VMware). You tell BOSH what to provision using a 'release', and there are releases for, amongst other things, MySQL [1]:

https://github.com/cloudfoundry/cf-mysql-release

These releases are used in production by Pivotal, and are actively developed to that end, so they are genuinely production-grade. People have thought carefully about resilience, backups, security, etc. BOSH is a bit awkward, and these releases are tightly coupled to CF, but there's some great work in there.

deboflo · on Dec 30, 2018

You are more likely to get locked in with kubernetes than with AWS. It’s easier to migrate out of highly decoupled, well documented systems piece by piece (AWS) than out of monolithic frameworks like k8s.

derefr · on Dec 30, 2018

I don't see what the problem is with being locked into a "monolithic framework" as long as you can run your own copy of it.

You can take as much time as you like to migrate yourself away from k8s if you don't like it any more (physically migrate your system to your own site; pin the k8s version to prevent API changes; then start changing your code to be less coupled to k8s.)

Whereas, if AWS changes and deprecates a feature, you're on their schedule as to how long you have before your service will break.

shaklee3 · on Dec 30, 2018

That makes no sense. Kubernetes leveraged aws primitives (elb) if needed, and at its core, it deploys containers. As long as your application runs in a container, you aren't locked in.

deboflo · on Dec 30, 2018

I think we can agree that Kubernetes does far more than schedule containers, even if “at its core” that’s what it does. How many lines of the 2e6 lines of code k8s project are directly related to scheduling containers? Very few. If a scheduler is all that is needed and you want to use any of the 3 different types of load balancers provided by AWS, a simpler architecture might be just to use AWS ECS. 500 lines of declarative Cloudformation or Terraform will do the job.

shaklee3 · on Dec 30, 2018

What features are you referring to specifically that lock you in? Sure, it's a large project. But most LOC are around being modular and pluggable, and adhering to standards (OCI, CNI, CSI). I can't think of anything that would be particularly difficult to move out of if needed.

deboflo · on Dec 30, 2018

There isn’t sufficient separation between components within Kubernetes for ease of migrating piece by piece away from kubernetes. Documentation also plays an important role in migrations. I once counted the pages of documentation for Kubernetes vs AWS for equivalent functionality (VPC, ECS, Route53, etc) and AWS had 20 pages for every page of Kubernetes.

alexandre_m · on Dec 31, 2018

One of the main value proposition of Kubernetes is that you're not locked in to any vendor or infrastructure providers.

If your application runs on K8s, then it's portable and it doesn't have to be aware of the environment or other system integration (e.g storage).

softwaredoug · on Dec 30, 2018

By “lock in” I should have clarified I meant locked in to a proprietary ecosystem. Certainly being dependent on open source can be problematic if you’re not an active member of the community (or if the “community” is really just one company)

lukeqsee · on Dec 30, 2018

Earlier discussion: https://news.ycombinator.com/item?id=18698759

an-allen · on Dec 30, 2018

I’ve always been troubled by production-grade handling of state in containers - specifically as it pertains to data backup.

This module takes that into account - and defines a “backup k8s object” that will trigger a db dump. But there is still no way to get point in time data recovery/backup that you get from current production-grade managed state providers. Im going to say its production grade if we are using the standards of 10 years ago. Production-grade today, I feel, is a bit more robust.

DasIch · on Dec 30, 2018

https://github.com/zalando-incubator/postgres-operator supports point in time data recovery just fine and is used in production for 100s of databases at Zalando.

pritambarhate · on Dec 30, 2018

It would be good to know the size and scale of these databases.

DasIch · on Dec 30, 2018

I don’t have actual numbers but I did a quick search and most are a few GiB to tens of GiB, although there are a few hundreds of GiB large. In practice size is not the limiting factor, IOPS are because they all use gp2 EBS volumes. Databases that have huge IOPS requirements are still deployed outside of Kubernetes and run in i3 instances. In that case they still use spilo though, so basically the same system for backups and automatic failover as on Kubernetes.

That being said we also have an ElasticSearch operator that is used to deploy ElasticSearch on Kubernetes, there nodes running on i3 instances and the corresponding instance storage is used. Although used in production that’s still very new and sadly not open source.

bogomipz · on Dec 31, 2018

>"In that case they still use spilo though, so basically the same system for backups and automatic failover as on Kubernetes."

What is "spilo"? I am not familiar with this term. Thanks.

DasIch · on Dec 31, 2018

Spilo[1] is a Docker image that provides postgres bundled with Patroni[2].

The postgres-operator I linked earlier but also our setup on AWS (with one image per EC2 instance) uses that to actually run Postgres.

  [1]: https://github.com/zalando/spilo 
  [2]: https://github.com/zalando/patroni

SoylentBob · on Dec 30, 2018

Interesting project! Thanks for sharing.

How does this compare to other community efforts, e.g. Zalandos Patroni project, aside from supporting more databases than just postgres?

mosselman · on Dec 30, 2018

Does anyone know of a docker alternative like this? So something like KubeDB that lets me deploy a production-ready postgres db on docker swarm for example?

cpuguy83 · on Dec 30, 2018

I would not run a database on swarm. It simply does not have the right api's at the cluster level to properly express state requirements.

The original swarm design had some of this but it was pulled just before release for more design work... which was never completed.

I wrote the only storage support currently in swarm, which is the "mounts" api in your service spec...

So, technically you could use swarm to do it, but it will be painful and I don't think any amount of tooling will help until docker includes some support for cluster-aware storage.

I would be happy to hear if people have successfully done this, though!

mosselman · on Dec 30, 2018

Thank you for your reply. Do I understand correctly that the biggest issue is the fact that containers won't run on the same node and you'd thus have storage issues? Would these issues be (partially) mitigated if you'd run postgres on a single node?

keypusher · on Dec 31, 2018

If you are running multiple copies of postgres on a single node, then you have not significantly improved the resiliency of your database to failure, and it still does not solve the state transition problem. What happens when the primary database fails (or the node dies)? Whether it is on this node or another node, you need to have a replica (sync or async) that you can fail over to, preferably in an automated way. Docker swarm is not equipped to handle these transitions for you, at which point you are just running your database in Docker, with no real benefit over running it on actual hardware or a VM, and with significant added complexity.

cpuguy83 · on Dec 30, 2018

If you can express your storage requirements via the existing mounts api, then yeah it's all possible.

The thing to remember is mounts are implemented only at the node level, so there is no cluster-aware storage controller.

bearjaws · on Dec 30, 2018

Funny because I was just baffled by the pricing of HA MongoDB (from formerly mlab), it gets way too pricey way too fast.

When looking at the hardware being provisioned I realized it wasn't even anything too crazy and could be had for 1/4 the price at Linode.

I will definitely be using this in the future.

rmoriz · on Dec 30, 2018

How does it handle PG updates like for example from PG 9 to PG 10?

Volundr · on Dec 30, 2018

Based on my reading of the documentation, it doesn't. So you'd be responsible for taking a backup via pg_dumpall, and restoring it post-upgrade.

rmoriz · on Dec 30, 2018

Thanks for the confirmation. I was not able to find it, either. Strange what use cases are called „production-grade“ nowadays...

elsonrodriguez · on Dec 31, 2018

Anyone that says "Stateful" and "Production Ready" and "Kubernetes" in the same sentence is likely to disappoint you.

geggam · on Dec 30, 2018

Performance tests please ?