> Or I just use a monotonically increasing ID and track the highest ID I've received in order.
This assumes you can generate monotonically increasing numbers. If you have many clients, now they all need to share a data source and may be performance bound by generating those numbers.
> Actually I don't even have to. It's probably a good thing to do for efficiency, but in principle I can just drop out-of-order messages and wait until they are redelivered, hopefully in the correct order
True (modulo the first problem), but efficiency may be necessary here. With many clients, you may end up in a state where only a small fraction of messages get through and most traffic is wasted on rejected sends. This also makes performance commitments hard to maintain, as it's perhaps just luck when a client manages to get a message through. Clients also now need more buffering, more state, etc.
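To make the trade-off concrete, here is a minimal sketch of the "track the highest ID and drop anything out of order" receiver. The class and method names are made up for illustration, and it assumes IDs start at 1 and that the sender redelivers anything that was not accepted.

```python
class InOrderReceiver:
    """Accepts a message only if its ID is exactly one past the highest seen."""

    def __init__(self) -> None:
        self.highest_id = 0  # assumption: IDs start at 1
        self.dropped = 0     # counts wasted deliveries

    def receive(self, msg_id: int) -> str:
        if msg_id <= self.highest_id:
            return "duplicate"  # already processed, safe to ack again
        if msg_id == self.highest_id + 1:
            self.highest_id = msg_id
            return "accepted"
        self.dropped += 1
        return "dropped"        # out of order, wait for redelivery


receiver = InOrderReceiver()
results = [receiver.receive(i) for i in [1, 3, 2, 2, 3]]
```

Every "dropped" result is traffic that has to be sent again, which is where the efficiency concern comes from once many senders are involved.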
>> But none of these solve the product level, or the user experience level, or other higher levels where these issues still crop up.
> I don't understand this point. Do you have some examples?
Let's assume a simple client->server instant messaging app. As a user, if I send a message, I expect that to arrive exactly once. It's going over TCP which is "reliable", but that doesn't stop the HTTP request from failing and needing to be retried. It's using HTTP/3, but that doesn't stop the server generating a 503 and the client needing to retry the POST request (or whatever). Maybe the server puts the message in a message queue, but if that connection fails just after sending a transaction commit, did it get committed?
Idempotency tokens or an equivalent mechanism do solve this, but there isn't one magic trick that solves it in some base layer technology like TCP. It needs to be solved again and again wherever you have distributed systems.
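As a rough illustration of the idempotency-token idea, here is a minimal in-memory sketch. `MessageStore` and `post` are hypothetical names, and a real server would have to persist the token and the message atomically, which is exactly the hard part.

```python
import uuid


class MessageStore:
    """Stores each message at most once per idempotency token."""

    def __init__(self) -> None:
        self._index_by_token: dict[str, int] = {}
        self.messages: list[str] = []

    def post(self, token: str, body: str) -> int:
        # A retry with a known token returns the original result
        # instead of storing the message a second time.
        if token in self._index_by_token:
            return self._index_by_token[token]
        self.messages.append(body)
        index = len(self.messages) - 1
        self._index_by_token[token] = index
        return index


store = MessageStore()
token = str(uuid.uuid4())           # client picks one token per logical message
first = store.post(token, "hello")  # original request
retry = store.post(token, "hello")  # retry after a lost response or a 503
```

The client reuses the same token for every retry of the same message, so the user sees it arrive once no matter how many times the request was sent.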
Also, this isn't just networking. Two processes on a single machine communicating via IPC may be effectively a distributed system! I've got some experience doing this on Android, and it's still hard.
> This assumes you can generate monotonically increasing numbers. If you have many clients, now they all need to share a data source and may be performance bound by generating those numbers.
You can assign an ID to your nodes and let them generate increasing numbers on their own. Node ID decides on a tie, and if one node sees a larger counter value appear, it adjusts its own counter so that it doesn't stay behind.
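The scheme described here resembles a Lamport logical clock with the node ID as tie-breaker. A minimal sketch, with hypothetical names (`Stamp`, `LamportClock`, `tick`, `observe`) and assuming node IDs are unique integers:

```python
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class Stamp:
    # Field order matters: compare by counter first, then node_id on a tie.
    counter: int
    node_id: int


class LamportClock:
    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.counter = 0

    def tick(self) -> Stamp:
        """Generate the next locally increasing stamp."""
        self.counter += 1
        return Stamp(self.counter, self.node_id)

    def observe(self, other: Stamp) -> None:
        """Catch up when a larger counter from another node is seen."""
        self.counter = max(self.counter, other.counter)


a, b = LamportClock(1), LamportClock(2)
s1 = a.tick()
s2 = a.tick()
b.observe(s2)   # b sees a's larger counter and catches up
s3 = b.tick()   # ordered after everything b has observed
```

Stamps are unique and totally ordered, but a node only catches up when it actually observes someone else's stamp.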
This is a logical solution but not a full practical solution.
With this approach you'd still need to communicate the current clock number back to clients, as otherwise one will get ahead and have all its traffic accepted, while the others fall behind and are unable to get traffic accepted. Even if an error causes a client to bump its counter forwards and retry, by the time it has done that the number it is about to retry with may already have been used.
Additionally, the aim is still to get exactly-once delivery, so clients need to be able to differentiate between an error caused by them reusing an ID that was rejected to enforce exactly-once delivery, and an error caused by another client getting that ID.
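One way to let clients tell those two errors apart is for the server to remember who claimed each ID. This is only a sketch under strong assumptions: clients have stable identities, the server's record survives restarts, and the names (`Outcome`, `Server.submit`) are invented for illustration.

```python
from enum import Enum


class Outcome(Enum):
    ACCEPTED = "accepted"
    ALREADY_ACCEPTED = "already_accepted"  # same client retried its own ID
    ID_TAKEN = "id_taken"                  # another client got this ID first


class Server:
    def __init__(self) -> None:
        self._owner_by_id: dict[int, str] = {}

    def submit(self, msg_id: int, client_id: str) -> Outcome:
        owner = self._owner_by_id.get(msg_id)
        if owner is None:
            self._owner_by_id[msg_id] = client_id
            return Outcome.ACCEPTED
        if owner == client_id:
            return Outcome.ALREADY_ACCEPTED  # message is delivered, stop retrying
        return Outcome.ID_TAKEN              # pick a new ID and send again


server = Server()
outcomes = [
    server.submit(7, "alice"),  # first delivery
    server.submit(7, "alice"),  # alice retries after a lost response
    server.submit(7, "bob"),    # bob collides on the same ID
]
```

On `ALREADY_ACCEPTED` the client can treat the message as delivered, while on `ID_TAKEN` it has to choose a new ID and resend.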
Basically, this issue is easy to solve with low traffic and reliable persistent storage everywhere, but hard to solve with high traffic, or once you accept that all persistent storage brings its own reliability challenges.