I wish I could like Traefik, but it really isn't easy.
The use case in our Hackerspace was to route our wildcard subdomains to different Docker containers. Traefik is also supposed to create TLS certificates automatically, but I had numerous problems with the Let's Encrypt functionality.
The debugging information is quite cryptic and the documentation seems all over the place to me, which is even more problematic given the number of breaking changes between the 1.x and 2.x versions. Because configuration happens automatically through Docker labels, a simple typo can cause your configuration to be silently ignored.
Also, plugging Traefik into complex docker-compose projects such as Sentry or GitLab is next to impossible because of networking: whatever I tried, Traefik just couldn't pick up containers and forward traffic to them unless I changed the definition of every single container in the docker-compose file to join an extra network. I don't feel it should be this complex.
Sometimes I feel we should just go back to using Nginx and writing our rules manually. The concept of Traefik is awesome, but the way one uses it is extremely cumbersome.
I worked on a project last year where we tried using Traefik on Kubernetes together with Let's Encrypt certs. It worked... sometimes.
We had significant issues with Traefik not allocating or renewing certs, resulting in some painful outages. The worst part was that there was no workaround; when adding a new domain to an ingress, it was completely incomprehensible why Traefik wasn't requesting a cert, or indeed why it wasn't renewing older ones that were close to expiration. We filed GitHub issues with concrete errors, but they were never addressed. At the time, I tried to debug Traefik to understand how it worked and maybe chase down some of those bugs. I don't like to speak ill of other people's code — let's just say that peeking under the covers made me realize perfectly why Traefik was so brittle and buggy.
We eventually ditched Traefik in favour of Google Load Balancer ingresses, combined with Cert-Manager for Let's Encrypt, and this combination worked flawlessly out of the box despite not being a 1.0 release at the time. The beauty of this setup is that the control plane (cert and ingress configuration) is kept separate from the data plane (web server), so the two can be maintained and upgraded/replaced separately.
I second this. It's incredibly complex to debug how Traefik understands its configuration, and the documentation and examples around the internet are very confusing because of the 1.x vs 2.x changes.
Yep. I believe part of the wonkiness comes from the way the configuration is stored. They have this weird design where the config is mapped to key/value stores using an abstraction. You can use a TOML file, YAML file, Etcd, Redis, etc. If you use Let's Encrypt, it also uses this mechanism (e.g. Etcd) to store the state.
It ends up being confusing and brittle, and exposes the underlying store as an API (you can modify Etcd directly and the changes are picked up). There's no intermediate layer that validates or controls the lifecycle of the config or state. You can end up in a situation where you break Traefik by pushing an invalid configuration, for example.
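To make that concrete: nothing stops any process with etcd access from writing straight into the config tree. A minimal sketch of the failure mode (the key path is illustrative, not Traefik's exact schema):

    // Any process with etcd access can push a key straight into
    // Traefik's config tree; no validation layer sits in between.
    package main

    import (
        "context"
        "log"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"localhost:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()

        // Nothing checks this value: etcd happily stores a malformed URL,
        // and the breakage only surfaces when Traefik tries to use it.
        _, err = cli.Put(ctx, "traefik/backends/web/servers/s1/url", "ht!tp://broken")
        if err != nil {
            log.Fatal(err)
        }
    }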
Can't they just take the text-based format and create a config tool that reads the TOML/YAML and then writes that configuration to etcd, Redis, or whatever else they support?
Ouch. We're currently using nginx but recently switched one service over to Traefik, and I'm afraid that what you describe is what will bite us in the end. I wrote "treafik" instead of "traefik" in one of the labels and only noticed it after hours of debugging. When it works, it works great. But getting it into that state...
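A cheap guard against that class of mistake (purely hypothetical, using the Docker Go SDK) would be a lint pass that flags container labels whose prefix is an anagram of "traefik":

    // List running containers via the Docker API and flag labels that
    // look like a misspelled "traefik." prefix ("treafik." would match).
    // The anagram check is deliberately crude.
    package main

    import (
        "context"
        "fmt"
        "log"
        "strings"

        "github.com/docker/docker/api/types"
        "github.com/docker/docker/client"
    )

    func main() {
        cli, err := client.NewClientWithOpts(client.FromEnv)
        if err != nil {
            log.Fatal(err)
        }
        containers, err := cli.ContainerList(context.Background(), types.ContainerListOptions{})
        if err != nil {
            log.Fatal(err)
        }
        for _, c := range containers {
            for label := range c.Labels {
                prefix := strings.SplitN(label, ".", 2)[0]
                // Same letters as "traefik" but not spelled "traefik": probably a typo.
                if prefix != "traefik" && sortedLetters(prefix) == sortedLetters("traefik") {
                    fmt.Printf("%s: suspicious label %q\n", c.ID[:12], label)
                }
            }
        }
    }

    // sortedLetters returns the bytes of s in ascending order.
    func sortedLetters(s string) string {
        b := []byte(s)
        for i := range b {
            for j := i + 1; j < len(b); j++ {
                if b[j] < b[i] {
                    b[i], b[j] = b[j], b[i]
                }
            }
        }
        return string(b)
    }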
I see where the OP is coming from, but I found debugging quite easy in practice: if something doesn't work, go to the Traefik dashboard and look for the element in question. If it's not there, the reason is normally fairly obvious.
I actually have the same setup and it's working perfectly fine, even with my config that binds only to specific IPv4+IPv6 addresses, plus lots of file-based configuration. I absolutely recommend using the TLS challenge with Let's Encrypt.
No problems with Docker (Compose) networks either, but I'm not using it with GitLab because I have enough IPs.
The biggest problem I see is the accumulation of certificates that will all be kept up-to-date, whether in use or not.
I also have a working system that I found very easy (for me) to setup.
Recently it all came crashing down when an old domain of mine expired and I was no longer able to update its DNS in DigitalOcean. That one (unused) domain failing stopped Traefik from renewing all my certificates. But I'm also still on 1.7 and really should update to 2.x.
> Rather than being pre-compiled and linked, however, plugins are executed on the fly by Yaegi, an embedded Go interpreter.
Woof, no thank you.
Go is basically incompatible with any kind of plugin-like dynamic linking. There are two reasonable models for doing something like plugins: the HashiCorp model, where plugins are separate processes that do some kind of inter-process communication with the host; or the Caddy model, where you select which plugins you want when downloading the binary, and they're built in at compile time.
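The Caddy model is easy to sketch: the host exposes a registry, each plugin package registers itself from init(), and you choose plugins by choosing what to blank-import before building. A minimal sketch (names are illustrative):

    // Compile-time plugin registry, "Caddy style".
    package plugin

    import "net/http"

    // Middleware wraps an http.Handler, like a proxy middleware would.
    type Middleware func(next http.Handler) http.Handler

    var registry = map[string]Middleware{}

    // Register is called from each plugin package's init().
    func Register(name string, m Middleware) {
        registry[name] = m
    }

    // Get looks up a compiled-in plugin by name at config time.
    func Get(name string) (Middleware, bool) {
        m, ok := registry[name]
        return m, ok
    }

A build then selects plugins with blank imports (e.g. _ "example.com/plugins/gzip"), which is exactly why the plugin set has to be fixed at compile time.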
Correct me if I'm wrong, but the Caddy model requires curation, doesn't it?
Plugins and scripting languages flourish when they democratize the process of adding features to a piece of code. To have prebuilt binaries you need a build matrix, and the complexity of the build matrix is somewhere between exponential and factorial.
This is a perverse incentive for the curators. The cost has to be justified, and as the friction grows you can only justify the things that you have a strong affinity for. Anything you don't understand or don't like gets voted off the island.
In the best addon ecosystems, the core maintainers put some safety rails on the system so the addons can't do anything too crazy. Then they watch the cream of the crop and start trying to include them in the base functionality (limiting the number of optional features the majority of their users have to manually pick). The hard part here is how to reward the people whose ideas you just co-opted, and I don't have a great answer for that (although money and/or a free license for life is a good start).
> Plugins and scripting languages flourish when they democratize the process of adding features to a piece of code . . . In the best addon ecosystems, the core maintainers put some safety rails on the system so the addons can't do anything too crazy. Then they watch the cream of the crop and start trying to include them in the base functionality (limiting the number of optional features the majority of their users have to manually pick).
Well, it's a cost/benefit judgment call, not a single valuation. And I think for situations like this, if you have to pick a side, it's generally better to pick the exclusionary walled garden over the bazaar -- I think the value of democratization is usually overstated, and the drawbacks underemphasized.
There is the "plugin" package, which seems really cool and fits the simplistic style of Go (tbh I haven't tried this module myself, only glanced at the documentation), but it does not work on Windows, which I think is the reason it is not used. The ticket about adding Windows support to the plugin package is one of the highest-rated issues on Go's GitHub, yet it is still open.
Any significant downside to the process-based model? I haven't benchmarked the memory consumption of a minimal Go process, but it should be well below what e.g. a minimal JVM application uses. With the right serialization format, IPC can be reasonably efficient.
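To illustrate what I mean by reasonably efficient IPC, a toy sketch of the process model: the host execs the plugin and they exchange newline-delimited JSON over stdin/stdout (./auth-plugin is a hypothetical plugin binary; hashicorp/go-plugin uses gRPC instead, but the shape is the same):

    package main

    import (
        "bufio"
        "encoding/json"
        "fmt"
        "log"
        "os/exec"
    )

    type request struct {
        Path string `json:"path"`
    }

    type response struct {
        Allow bool `json:"allow"`
    }

    func main() {
        // Hypothetical plugin binary: reads one JSON request per line on
        // stdin, writes one JSON response per line on stdout.
        cmd := exec.Command("./auth-plugin")
        stdin, err := cmd.StdinPipe()
        if err != nil {
            log.Fatal(err)
        }
        stdout, err := cmd.StdoutPipe()
        if err != nil {
            log.Fatal(err)
        }
        if err := cmd.Start(); err != nil {
            log.Fatal(err)
        }

        enc := json.NewEncoder(stdin)
        dec := json.NewDecoder(bufio.NewReader(stdout))

        // One round trip: host asks, plugin answers.
        if err := enc.Encode(request{Path: "/admin"}); err != nil {
            log.Fatal(err)
        }
        var resp response
        if err := dec.Decode(&resp); err != nil {
            log.Fatal(err)
        }
        fmt.Println("allowed:", resp.Allow)
    }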
I really wish Go plugins got some more love from the Go team. It looks like this is using Yaegi, a Go interpreter, which is probably the only reasonable choice: Go's plugin package requires that the plugin be compiled with exactly the same compiler version as the main binary, so you need to recompile every plugin for every new release, at least if you upgrade the compiler between releases, which you often do. It also doesn't work on Windows.
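For reference, the workflow that package gives you looks like this (the symbol name is illustrative); the Open call is what fails with a version-mismatch error whenever the toolchains differ:

    package main

    import (
        "log"
        "net/http"
        "plugin"
    )

    func main() {
        // The .so is built with: go build -buildmode=plugin
        // and must use the exact same toolchain as this binary.
        p, err := plugin.Open("middleware.so")
        if err != nil {
            // Typically: "plugin was built with a different version of package ..."
            log.Fatal(err)
        }
        sym, err := p.Lookup("Middleware")
        if err != nil {
            log.Fatal(err)
        }
        mw, ok := sym.(func(http.Handler) http.Handler)
        if !ok {
            log.Fatal("unexpected symbol type")
        }
        _ = mw // wire into the handler chain here
    }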
Indeed, Go plugins were our initial choice (https://github.com/traefik/traefik/pull/1865), but you've said it all: the workflow would have been bad to impossible for users.
Building a Go interpreter from scratch was not the easiest path, but it was the best solution in terms of UX.
There was a public design doc about the Go linker that addressed this issue at the end; my comment from the time and the post can be found here [0]. I guess there's some hope, but I haven't looked into it again, so I don't know whether anything is moving forward.
I've been wanting to use Traefik for a long time but there's this security issue[0] that's almost two(!) years old now that's been keeping me from deploying it in production. As far as I can tell, there's still no out-of-the-box solution that's not overly complicated and won't come back to haunt me a year or two from now.
You don't have to deploy traefik with docker. If you want traefik to monitor new docker containers to add routes for them, of course traefik needs to talk to the docker api to do so.
The docker api has no way to control access such that it's not equivalent to root access.
However, there's no real vulnerability. I'm happy to provide you a url hosted by traefik with the docker integration enabled, no docker socket proxy, etc, and if you can manage to actually escalate permissions, I'll give you 500 bucks. But, of course, you can't. That security issue is just a "defense in depth" issue, and it's an issue for docker, not traefik.
This would be like saying "traefik uses the linux kernel api to open files, but the linux kernel requires traefik validate what goes into that api or else it could allow file path traversal"... But traefik does validate filepaths and so no one makes that complaint.
Similarly, traefik does validate that only safe docker api calls are made and works hard to prevent any sort of remote code execution, so the issue is not a security issue, but a defense in depth proposal that is really a feature request for the docker project.
> If you want traefik to monitor new docker containers to add routes for them, of course traefik needs to talk to the docker api to do so.
Yes, but it wouldn't be necessary for the network-facing part of Traefik to talk to the Docker API. There could be a second container (without network access) whose only task is to talk to the Docker socket, generate a config, and write that config to a shared volume.
> However, there's no real vulnerability.
In the present situation Traefik (with Docker integration) is effectively running as root. I don't think it's up for debate that this is much worse than just running Traefik as a normal user (outside Docker). Besides, most users expect applications running in Docker containers to be more secure – not less secure – than running them on the bare system.
> This would be like saying "traefik uses the linux kernel api to open files, but the linux kernel requires traefik validate what goes into that api or else it could allow file path traversal"... But traefik does validate filepaths and so no one makes that complaint.
No. This would be like saying "Traefik has full access to the kernel and the entire OS and the only thing preventing a hacker from exploiting this is Traefik validating incoming network requests."
Do you also run your other web servers as root?
> Similarly, traefik does validate that only safe docker api calls are made
This is completely irrelevant. Once a hacker is inside the Traefik process (i.e. can execute code under Traefik's PID), they can access the Docker socket and therefore the entire system as they please.
> I don't think it's up for debate that this is much worse than just running Traefik as a normal user
I'm not arguing that there's not a better option. In fact, when I say this is a "defense in depth" issue, that's exactly what I mean. It would improve security. It would be better. But there is no active vulnerability to be exploited. If the system works as intended, this doesn't cause any issues, it's only an issue if there are other real vulnerabilities.
> Do you also run your other web servers as root?
Yes. Because capsh --cap-add NET_ADMIN is poorly understood and poorly used, I run many other servers as root.
Admittedly, many of them are written in other languages such as C and fork workers that drop privileges, because the C stdlib (libc) supports that easily.
Traefik is written in Go, where forking workers that drop privileges is much harder. Go doesn't use libc, and setuid doesn't actually work correctly [0], so of course it doesn't drop privileges like other software written in better languages.
> This is completely irrelevant. Once a hacker is inside the Traefik process (i.e. can execute code under Traefik's PID)
That's the entire point. It requires an attacker to exploit a real security issue, therefore this hardening you're talking about is a defense in depth.
The way you're talking about it, you make it sound like there's an active vulnerability, not just a defense in depth improvement. They're vastly different.
The developers are not ignoring a real vulnerability, and your all-or-nothing stance on this issue is un-nuanced to the point of harming your communication about it.
My previous offer is still on: 500 bucks says you can't exploit this if I link you to a Traefik configured in the very way you refuse to run.
I actually think we're completely on the same page. :)
> The way you're talking about it, you make it sound like there's an active vulnerability, not just a defense in depth improvement. They're vastly different.
Please note that I never termed it a vulnerability (whether active or not). I merely called it a security issue – which it was/is because it gives users a false sense of security. I also made that point very clear in my post on Github.
> My previous offer is still on: 500 bucks says you can't exploit this if I link you to a Traefik configured in the very way you refuse to run.
Being able to recognize and avoid potential security holes and being able to exploit them in practice are two different things. I think I'm fairly good at the former (in the sense that I'm able to avoid the most common pitfalls) but I have only limited experience with the latter. And while I am aware that in-depth hacking expertise would be very valuable even when examining certain security practices (it's definitely on my to-do list), I don't think it is required to point out basic flaws and potential attack vectors. So as much as I appreciate your offer, I don't think me being or not being able to hack your system implies anything with regards to how secure Traefik is.
>The docker api has no way to control access such that it's not equivalent to root access.
I thought this was a mistake many years ago already. The fact that the Docker daemon runs with root privileges is also something they should have solved a long time ago. Docker is pretty pathetic when it comes to security.
Create a private network that only connects Traefik and the proxy, and limit Traefik's access to only the GET requests it needs to operate. Now the socket is only exposed to a local container.
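In other words, something like a docker-socket-proxy. A minimal sketch of the idea, assuming the standard socket path:

    // Tiny reverse proxy that forwards only GET requests to the Docker
    // socket and rejects everything else. Run it on a private network
    // shared with Traefik; only this container mounts the socket.
    package main

    import (
        "context"
        "log"
        "net"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func main() {
        // The Host value is a dummy; the transport below always dials the socket.
        proxy := httputil.NewSingleHostReverseProxy(&url.URL{Scheme: "http", Host: "docker"})
        proxy.Transport = &http.Transport{
            DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
            },
        }

        log.Fatal(http.ListenAndServe(":2375", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if r.Method != http.MethodGet {
                // Anything that could mutate state (create containers,
                // exec, etc.) is refused outright.
                http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
                return
            }
            proxy.ServeHTTP(w, r)
        })))
    }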
This just adds another layer of indirection. While it improves security, it is not the same as fixing the issue in the first place and making sure that no network-facing part of the system runs as root.
As you probably know, this security issue is not that simple to manage. It's mainly due to the fact that there is no way to have authorization on the Docker API. This is not the case on Kubernetes, for example, where you have RBAC to prevent this kind of issue. We have described this in detail in our documentation, and you have many solutions/workarounds to address it: https://doc.traefik.io/traefik/providers/docker/#docker-api-...
Yeah, I'm surprised that this is such a sticking point. There's nothing that anyone who isn't Docker Inc. can do to fix the problem that, by default, Docker is all or nothing. It would be nice if Docker could expose a read-only endpoint but c'est la vie.
The only solution I've seen/used that wasn't convoluted or brittle is running a little daemon to just shovel container metadata into Consul and going from there.
> As you probably know, this security issue is not that simple to manage.
I do think it's simple to manage: as I already mentioned elsewhere, it wouldn't be necessary for the network-facing part of Traefik to talk to the Docker API. There could be a second Traefik container (without network access) running a binary called, say, traefik-config-generator, whose only task is to talk to the Docker socket, generate a config, and write that config to a shared volume.
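A very rough sketch of what I have in mind (using the Docker Go SDK; the label key and the rendered output are schematic, not Traefik's real dynamic-config schema):

    // Sidecar with no listening socket at all: it polls the Docker API
    // and renders a file-provider config onto a volume shared with the
    // actual Traefik container, which watches it via the file provider.
    package main

    import (
        "context"
        "fmt"
        "log"
        "os"
        "strings"
        "time"

        "github.com/docker/docker/api/types"
        "github.com/docker/docker/client"
    )

    func main() {
        cli, err := client.NewClientWithOpts(client.FromEnv)
        if err != nil {
            log.Fatal(err)
        }
        for {
            containers, err := cli.ContainerList(context.Background(), types.ContainerListOptions{})
            if err != nil {
                log.Print(err)
                time.Sleep(5 * time.Second)
                continue
            }
            var b strings.Builder
            for _, c := range containers {
                // Label key simplified; real labels look like
                // traefik.http.routers.<name>.rule.
                if rule, ok := c.Labels["traefik.http.routers.rule"]; ok {
                    fmt.Fprintf(&b, "# %s\n# rule: %s\n", c.ID[:12], rule)
                }
            }
            if err := os.WriteFile("/shared/dynamic.yml", []byte(b.String()), 0o644); err != nil {
                log.Print(err)
            }
            time.Sleep(5 * time.Second)
        }
    }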
EDIT: Oh, I just realized you're the founder of Traefik! Thank you so much for your work! I would really appreciate your opinion on my suggestion – even if you think it's complete BS. :)
You've probably discounted this for some reason already, but why not use something more built for service discovery - e.g. Consul Catalog / k8s / etcd?
I am really happy to finally see them adding some functionality for custom middlewares in Traefik. However, it leaves a bad taste in my mouth that in order to use it, you have to sign up for their new SaaS, especially keeping in mind that this is the "most requested feature" of the community.
Using an interpreter locks you into the interpreter's language (e.g., no NodeJS, Python, etc plugins), but you are able to pass around memory and call procedures from the host program directly instead of having to write an IPC shim layer.
I tried, really tried, to use Traefik for a year. It worked sometimes, but the setup was complicated and the community support was very poor.
I eventually moved to caddy (https://caddyserver.com/) and it is fantastic. Works seamlessly and I got all my obvious and not so obvious questions answered.
So the Go middleware is interpreted? Curious how that works with a lazy/large request body reader and a lazy/large response body writer. What kind of overhead is added to each invocation of read/write there?
From what we measured on a gzip compression plugin, only a few percent of overhead, because the gzip part is compiled; only the plugin glue is interpreted.
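Roughly, a plugin exports CreateConfig and New; Yaegi interprets this glue, while everything it calls into (the actual gzip writer, net/http, etc.) runs compiled. A simplified sketch (names and config fields are illustrative):

    package headerplugin

    import (
        "context"
        "net/http"
    )

    // Config is populated from the dynamic configuration.
    type Config struct {
        HeaderName  string `json:"headerName"`
        HeaderValue string `json:"headerValue"`
    }

    // CreateConfig returns the default config.
    func CreateConfig() *Config {
        return &Config{HeaderName: "X-Demo"}
    }

    type headerPlugin struct {
        next   http.Handler
        config *Config
    }

    // New builds the middleware; the returned handler wraps the next one in the chain.
    func New(ctx context.Context, next http.Handler, config *Config, name string) (http.Handler, error) {
        return &headerPlugin{next: next, config: config}, nil
    }

    func (p *headerPlugin) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        w.Header().Set(p.config.HeaderName, p.config.HeaderValue)
        p.next.ServeHTTP(w, r)
    }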
I've never built Traefik successfully. The same goes for tag v2.3.0:
    $ go build ./...
    go: finding module for package github.com/traefik/traefik/v2/autogen/genstatic
    cmd/traefik/traefik.go:18:2: module github.com/traefik/traefik@latest found (v1.7.26), but does not contain package github.com/traefik/traefik/v2/autogen/genstatic
The only thing keeping me from switching is the removal of "distributed Let's Encrypt" in 2.0. I get that it's a non-issue for k8s setups with cert-manager, but people aren't always using k8s, and it's still a feature in the enterprise edition.
Could you elaborate on what you mean? While it'd be ideal to store certs in Vault, I have it running fine in orchestrated containerization with the cert storage on a distributed filesystem.