
I was curious about this claim, so I tracked it down: https://blog.stenmans.org/theBeamBook/#_distribution_in_a_la...

> Even though Erlang’s asynchronous message-passing model allows it to handle network latency effectively (a process does not need to wait for a response after sending a message, allowing it to continue executing other tasks), it is still discouraged to use Erlang distribution in a geographically distributed system. The Erlang distribution was designed for communication within a data center or preferably within the same rack in a data center. For geographically distributed systems other asynchronous communication patterns are suggested.

Not clear why they make this claim, but I think it refers to how Erlang/OTP handles distribution out of the box. Tools like Partisan seem to provide better defaults: https://github.com/lasp-lang/partisan



I've run dist across datacenters. Dist works, but you need excellent networking or you will have exciting times.

It's pretty clear, IMHO, that dist was designed for local networking scenarios. Mnesia in particular was designed for a cluster of two nodes that live in the same chassis. The use case was a telephone switch that could recover from failures and have its software updated while in use.
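
For a sense of scale, that original design target is roughly what a minimal replicated Mnesia setup still looks like. This is only a sketch; the node names and the subscriber table are made up, and you'd run it from one of the two connected nodes:

    %% Run from one of the two nodes, with both already connected
    %% (net_adm:ping/1 returns pong). Names/table are placeholders.
    setup(Nodes) ->                        % e.g. Nodes = ['a@host', 'b@host']
        ok = mnesia:create_schema(Nodes),
        {_, []} = rpc:multicall(Nodes, application, start, [mnesia]),
        {atomic, ok} = mnesia:create_table(subscriber,
            [{disc_copies, Nodes},
             {attributes, [id, state]}]),
        ok.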

That said, although OTP was designed for a small use case, it still works in use cases way outside of that. I've run dist clusters with thousands of nodes, spread across the US, with nodes on the east coast, the west coast, and in Texas. I've had net_adm:ping() response times measured in minutes ... not because the underlying latency was that high, but because there was congestion between data centers and the mnesia replication backlog was very long (but not beyond the dist and socket buffers) ... everything still worked, but it was pretty weird.

Re Partisan, I don't know that I'd trust a tool that says things like this in their README:

> Due to this heartbeating and other issues in the way Erlang handles certain internal data structures, Erlang systems present a limit to the number of connected nodes that depending on the application goes between 60 and 200 nodes.

The amount of traffic used by heartbeats is small. If managing connections and heartbeats to 200 other nodes isn't a small cost for your nodes, your nodes must be very small ... you might ease your operations burden by running fewer, larger nodes.
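
For reference, the heartbeats in question are the net tick messages, and their interval is a kernel setting. The 120 below is just an illustrative value, not a recommendation:

    %% Default is one liveness check per 60 seconds (sent as four sub-ticks).
    %% At startup:  erl -kernel net_ticktime 120
    %% Or at runtime, from any node:
    1> net_kernel:get_net_ticktime().
    60
    2> net_kernel:set_net_ticktime(120).
    change_initiated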

I had thought I favorited a comment, but I can't find it again; someone had linked to a presentation from WhatsApp after I left, and they have some absurd number of nodes in clusters now. I want to say on the order of hundreds of thousands. While I was at WhatsApp, we were having issues with things like pg2 that used the global module to do cluster-wide locking. If those locks weren't acquired very carefully, it was easy to get into livelock when a large cluster started up and every node was racing to take the same lock to do something. That sort of thing is dangerous, but after you hit it once, if you hit it again you know what to hammer on, and it doesn't take too long to fix it.
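
To make the failure mode concrete, here's a hedged sketch of the pattern: every node racing for the same global lock at startup. The lock id and the init function are made up; the point is that global:trans/4 takes a retry limit, so you can bound the contention instead of letting every node spin on the default (infinity):

    claim_startup_work() ->
        LockId = {startup_init, self()},
        Nodes = [node() | nodes()],
        case global:trans(LockId,
                          fun() -> do_one_time_init() end,  % placeholder
                          Nodes,
                          5) of       % bounded retries instead of infinity
            aborted -> ok;            % another node got the lock; don't spin
            Result  -> Result
        end.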

Either way, someone who says you can't run a 200 node dist cluster is parroting old wives' tales, and I don't trust them to tell you about scalability. Head-of-line blocking can be an issue in dist, but if you work around it by processing messages out of order, you have to be very careful not to break causality. Personally, I would focus on making your TCP networking rock solid, and then you don't have to worry about head-of-line blocking very often.
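
One knob worth knowing when dist backlogs like the one above build up is the distribution buffer busy limit. The value here is illustrative only:

    %% Per-connection dist send buffer size, in KB (default 1024), before
    %% senders are suspended (visible as busy_dist_port in the system monitor).
    erl +zdbbl 131072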

That said, to answer this from earlier in the thread:

> I have read the erlang/OTP doesn’t work well in high latency environments (for example on a mobile device), is that true? Are there special considerations for running OTP across a WAN?

OTP dist is built upon the expectation that a TCP connection between two nodes can be maintained as long as both nodes are running. If that expectation isn't realistic for your network, you'll probably need to use something else, whether that's a custom dist transport, or some other application protocol.
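
If you want to see that assumption at work, you can watch the connection state yourself. A sketch, with placeholder logging:

    %% nodeup/nodedown messages arrive when the underlying dist
    %% connection to a peer is established or lost.
    watch_dist() ->
        ok = net_kernel:monitor_nodes(true),
        watch_loop().

    watch_loop() ->
        receive
            {nodeup, Node}   -> io:format("dist to ~p up~n", [Node]);
            {nodedown, Node} -> io:format("dist to ~p lost~n", [Node])
        end,
        watch_loop().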

For mobile ... I've seen TCP connections from mobile devices stay connected upwards of 60 days, but it's not very common; iOS and Android aren't built for it. That's not the real blocker though, because the bigger issue is that dist has no security barriers. If someone is on your dist, they control all of the nodes in your cluster. There is no way that's a good idea for a phone to be connected into, especially if it's a phone you don't control, running an app you wrote to connect to your service --- there's no way to prevent someone from taking your app, injecting dist messages, and spawning whatever they want on your server... that's what you're inviting if you use dist.
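
To make that concrete: the only thing standing between a connected peer and full control of your cluster is the shared cookie. A sketch from a hypothetical attacker's shell (node names and cookie are placeholders):

    $ erl -name evil@attacker -setcookie THE_SHARED_COOKIE
    1> net_adm:ping('app@your-server').
    pong
    2> rpc:call('app@your-server', os, cmd, ["id"]).
    %% runs an arbitrary shell command on your node and returns its output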

This application is running dist between BEAM on the phone and Swift on the phone, so the lack of a security barrier is not a big issue, and there shouldn't be any connectivity issues between the two sides (other than if it's hard to arrange for dist to run on a Unix socket or something).

That said, I think Erlang is great, and if you wanted to run OTP on your phone, it could make sense. You'd need to tune runtime/startup, you'd need to figure out some way to do UX, and you'd need to be OK with figuring out everything yourself, because I don't think there are a lot of people with experience running BEAM on Android. And you'd need to be OK with hiring people and training them on your stack.


> I had thought I favorited a comment, but I can't find it again; someone had linked to a presentation from WhatsApp after I left, and they have some absurd number of nodes in clusters now.

Maybe one of these links:

https://x.com/colrack/status/1192408832623947776

https://www.youtube.com/watch?v=216NV-odxnE

https://elixirforum.com/t/scaling-federating-cluster-s-at-pl...


Thanks, these are good! PS, always nice to see a comment from you. :)


Your posts are always some of my favorites, seriously. Always super informative & friendly :)


Thanks for getting the quote



