On an intuitive level I love message queues. I like how they loosen the coupling between components and how much flexibility they provide. A few years ago I even built quite a large production system with RabbitMQ (a scheduling, monitoring and reporting system for distributed tests).
However, I find it extremely difficult to reason about them. What do I mean by that? I mean all the nitty-gritty details like performance, bottlenecks, crashing participants, network issues, queues filling up, protocol issues that lead to message cascades and DDoS, etc.
I hear you. I've owned a backend service using Azure Service Bus for several years, and it recently ran into problems after a new deployment. It turns out that the default settings on the bus cause messages to be retried ten times if they take longer than 30s to process. That makes small issues escalate into big issues very, very quickly.
Yeah, I've learnt these numbers the hard way over many weekend midnights. This particular problem is especially prone to exponential explosions: something needs to be pretty overloaded or computationally expensive to take over 30s (or whatever the queue cutoff is), and retrying it adds more of the same load, which makes other messages slower and therefore causes them to be retried as well. Not fun.
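For anyone who wants to poke at that feedback loop, here is a rough toy model in Python. It is only a sketch under made-up assumptions, not how Service Bus or RabbitMQ actually behaves: a single consumer splits its capacity evenly across everything in flight, an attempt that is still unfinished after `timeout_ticks` is thrown away and redelivered from scratch, and a message is dead-lettered after `max_deliveries` attempts. The 30 and 10 are just the numbers mentioned above; `simulate` and all of its parameters are hypothetical names.

```python
def simulate(work_per_msg, capacity=4.0, arrivals_per_tick=1,
             timeout_ticks=30, max_deliveries=10, ticks=200):
    """Toy processor-sharing consumer with redelivery on timeout.

    Returns (delivery_attempts, completed, dead_lettered).
    """
    in_flight = []  # each entry: [remaining_work, age_in_ticks, delivery_count]
    attempts = completed = dead_lettered = 0

    for _ in range(ticks):
        # New messages arrive and are picked up immediately.
        for _ in range(arrivals_per_tick):
            in_flight.append([work_per_msg, 0, 1])
            attempts += 1

        # All in-flight attempts share the consumer's capacity equally,
        # so every extra attempt slows down all the others.
        share = capacity / len(in_flight) if in_flight else 0.0

        survivors = []
        for remaining, age, deliveries in in_flight:
            remaining -= share
            age += 1
            if remaining <= 0:
                completed += 1
            elif age >= timeout_ticks:
                # Timed out: the partial work is discarded.
                if deliveries < max_deliveries:
                    # Redelivered as a fresh attempt that starts from zero.
                    survivors.append([work_per_msg, 0, deliveries + 1])
                    attempts += 1
                else:
                    dead_lettered += 1
            else:
                survivors.append([remaining, age, deliveries])
        in_flight = survivors

    return attempts, completed, dead_lettered


if __name__ == "__main__":
    for work in (2.0, 5.0):
        attempts, completed, dead = simulate(work_per_msg=work)
        print(f"work={work}: attempts={attempts} completed={completed} dead_lettered={dead}")
```

With `work_per_msg=2.0` the consumer keeps up and every message is delivered exactly once. With `work_per_msg=5.0` it is only slightly over capacity, yet the backlog eventually pushes attempts past the timeout, and from then on the consumer spends part of its capacity on work that gets discarded and redone. Setting `max_deliveries=1` in the same model is a quick way to compare against a no-retry setup.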