I have worked on lots of software that involves event-driven actions, and apply this concept throughout.
"Need to send a notification email when x condition becomes true". Naive way: during processing, check the condition and call the SendEmail() function. Idempotent way: Run a query that finds all x conditions, join to a list of notifications based on email+id+time, and only if there's no entry, send the notification and save it to the list.
"Send a report at midnight". Naive way: have a cron entry that runs at midnight or run a loop that checks the time, and if midnight, sends the report. Idempotent way: cross-check against a list of reports that should have been sent, and if missing, send it. Run this check every few minutes.
The nice thing about this is that it doesn't stop you from also doing the event-based stuff (for "real-time" notifications, for example), so long as it's the secondary approach from a design point of view. Ironically, when everything is working fine, the event-driven stuff will actually be doing all the work -- but as soon as there's a failure (like the system happening to reboot at 23:59) you'll be grateful you built this way.
I could write paragraphs about the nuance, different approaches, failure modes and the trade-offs with all this -- suffice to say treating events as optional in event-based systems has served me well.
One of the experiences I had was applying it to provisioning with idempotent shell scripts, combined with a trivial tool like https://github.com/lloeki/apply
Creating a VM in DO would add the basic SSH keys, then you'd just run the thing against the new IP. Boom, VM ready. Want to bring existing VMs' config up to date? Apply. Boom.
The whole thing scaled up ridiculously well, in a totally unexpected way, saving hundreds of hours of headaches and mistakes.
Why not ansible/puppet? Check out the rationale in the readme.
This is basically like going from imperative to declarative.
Describe the final end/desired state and let the machine figure out how to get there, progressing through the work until it's caught up even across restarts or other boundaries.
The downside of this approach bears talking about. If you compute the full desired state every few minutes it can be quite expensive depending on how/what/why you are doing it. And it can also be an O(n^2) problem that is fast enough to make it into production and slow enough down the line, with more data, to wreak havoc.
This was common when I was writing puppet. It was easy to have your entire puppet run take minutes - and want it to run every few minutes - while consuming a LOT of CPU time, including on a centralised server and not just the end host.
There are of course ways and strategies to mitigate this, and not every situation falls afoul of it. But it is something you very much need to be aware of.
> If you compute the full desired state every few minutes it can be quite expensive depending on how/what/why you are doing it.
Definitely! And on top of that, you probably want to apply some business logic to what you're doing.
To expand on the example of "Report should run at midnight":
If you run the check regularly, you can constrain the date range you're checking to just since the last time you checked, and store this date either in memory or persisted depending on your use case, technology, and how you want system failures/restarts handled.
Similarly, if the system was down for several days, it might not be relevant to send all missing reports, but instead just the last one.
On the other hand, if "daily reports" are required for auditing purposes, it probably does make sense to send all of them. But you have to be careful that on the very first run it doesn't go "Hey, there are 18,726 missing reports since Jan 1, 1970!"
It's also important to write the report itself to be idempotent. You can't write it so it reports up to the "current" time on the assumption that it runs at exactly midnight. Instead you have to actually constrain the dates to exactly what you want. Even aside from the idempotent invocation I'm talking about, as your system gets big you'll get thousands of reports that all need to run at "midnight", and that just isn't practical: maybe they all run sequentially, and some will not actually execute until several minutes after midnight.
Suppose you serve clients who want their reports to be generated at 02:30 local time. You could try to run a cron container for every time zone :bleh:, but with daylight saving time, the 02:30 event can fire twice, or not at all!
Instead you simply see if there's a report for yesterday, and if there isn't, make one.
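In code, that check might look something like this (schema invented; zoneinfo is Python 3.9+):

    from datetime import datetime, timedelta
    from zoneinfo import ZoneInfo

    # Sketch of the "is there a report for yesterday?" check. Run it every
    # few minutes; DST stops mattering because we only ask whether
    # yesterday's report exists in the client's local calendar.
    def ensure_yesterdays_report(db, client_id, tz, generate_report):
        yesterday = (datetime.now(ZoneInfo(tz)) - timedelta(days=1)).date().isoformat()
        exists = db.execute(
            "SELECT 1 FROM reports WHERE client_id = ? AND report_date = ?",
            (client_id, yesterday),
        ).fetchone()
        if not exists:
            generate_report(client_id, yesterday)
            db.execute(
                "INSERT INTO reports (client_id, report_date) VALUES (?, ?)",
                (client_id, yesterday),
            )
            db.commit()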
Is what you are describing not what they call "event sourcing"?
Just keep a log of all the events that have already happened (crucial - I have seen "event logs" that were more like "request logs"), and find out the current state of the system - whether the notification or report has been sent - by examining this log.
The lists you are describing kind of sound like an event log of sorts.
It seems to me that this distinction between primary and secondary approach is not really necessary.
The more structured way to do this is a CQRS framework with Event Sourcing as the underlying data store model. With that you create specialized readmodels that react to events and you can query them for the exact thing you need to know. Once you do something based on that query it will result in new events that update the readmodel. It's a powerful way of designing a system.
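As a toy illustration of the read-model shape (nothing like a full CQRS framework, just the idea):

    # Events are appended to a log, and the read model folds them into
    # exactly the query you need ("who has been charged this month?").
    events = []  # append-only event log

    def rebuild_readmodel():
        charged = set()
        for e in events:
            if e["type"] == "CustomerCharged":
                charged.add(e["customer_id"])
        return charged

    events.append({"type": "CustomerCharged", "customer_id": 42})
    assert 42 in rebuild_readmodel()  # acting on this query emits new events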
Apparently if you want spare cash, little books of this stuff can sell for $5-10. I just became aware of this myself, as I'm setting up to write again.
That's to say, this comment probably just saved or made me money. I've some hands-on experience, but clearly not as much as you exhibit. Thanks.
Target audience: a senior SE, reporting to the next level up and perhaps offering trade-offs, or advising peers or reports working on those systems who need a refresher to give a complete answer to a question.
If this worked as a business model then everyone would be doing it. Everyone also includes people who don’t actually know this stuff, but want to make a quick buck, giving out bad advice. And if you don’t already have the experience, then you don’t have the knowledge to weed out bad advice.
The idempotent way is definitely better, but I'd argue/add that it would help to have the incoming business requirement phrased appropriately, such that the idempotent approach is the first one that comes to mind.
I am not sure how to perform that rewording/transformation in general though or even for the examples above.
Stub your toe a few times until enlightenment forces its way into your field of view.
Seriously though, if you want to learn a different way of building applications where loads of these things pop up, look into the Actor Model. Even if you are not going to use the Actor Model, reading up about it will teach you a lot about "realtime" applications... My main epiphany was that realtime applications do not exist. Rather, they cannot exist. Because our reality is not realtime (or our perception of it is always delayed, because our brains/awareness are slow) - we can never truly capture the present moment. Data we see is always in the past. Reality is eventually consistent (from our perspective). All things in our reality run concurrently, stateless and in their own little time bubble. You can also read up on the concepts behind Event Sourcing.
Anyway, it is a huge rabbit hole, but reading up on the ideas behind the Actor Model might just blow your mind. I'm convinced it will make a comeback because it is the natural way our reality works and the Actor Model mimics it the closest, and it will also work better for multi-core CPUs. For non-tech reading, try The Power of Now by Eckhart Tolle. Then try to reflect on (relational databases) + (actor model) + (event sourcing) + (power of now) + (brain in a vat philosophy) + (try to figure out at what speed the universe "renders", how fast our brains can think, how much delay there is, what limits our body's sensors impose) + (try to figure out how closely Object Oriented programming can mimic reality (hint: all programs are tiny simulations that try to mimic some tiny part of our reality)) + (read up on the history of timekeeping and the huge rabbit hole of calendars, and how most programming languages attempt to handle time (badly for the most part; it's a legit rabbit hole on its own)).
After a year or two of hard reading, reflection and teachings (and wrecking your head), you may have good insight into the nature of programming and the fallacies of trying to build realtime systems, or of trying to mimic our reality in a computer in general. It is insanely complex; we as programmers merely touch a grain of sand on the beach of complexity. We have no idea how simple our most convoluted systems are in the face of nature, yet we convince ourselves that we KNOW how things work and that we are in control at all times.
To add to this part: "My main epiphany was that realtime applications do not exist. Rather they cannot exist."
Most of us think our applications are realtime (never mind the OS scheduling CPU time for your code, so inherently not realtime), but this is only because they behave as expected while there is not too much load nor too much data. The moment you start having significant computational load or too much data (aka the CPU, memory or storage starts choking), or when you start to split data apart (aka distributed applications), the problems with "realtime" applications start to become obvious.
If something goes wrong, the fix for the immediate problems would be to retry/rerun the simulation, but that might result in duplicating your intents, which renders the simulation and its state invalid. Then you either (hopefully) have some mechanism to de-duplicate whatever happened more than once, or you have to start your simulation from scratch after fixing the data manually.
Or just design the core of the system with idempotency in mind from the start so you have a good way to handle such scenarios. Embrace that most things can/should be handled in an eventually-consistent way. Very few things in life have to happen RIGHT NOW (good luck with that).
An interesting spin on realtime systems is computer games, where we refresh a view of the simulation every few milliseconds (aka 60fps). Every frame gets rendered based on the inputs (keyboard/mouse) from just after the previous frame. So in effect, the current frame being drawn is just a reaction to previous inputs. So let me ask you this: in a game engine like this, where would you assign the value of "now"? Right after the previous frame? When capturing user input? When rendering of the new frame starts? When showing the new frame but before capturing the new inputs? Where is reality's "now" moment in a computer game? Let's ignore the delays between keyboard/mouse input and drawing from CPU/GPU to the physical screen, and let's assume this is a single-player FPS and not a multiplayer turn-based game. And let's ignore sound processing too - just think of the simplest of game loops - should we consider them realtime? Ponder that.
> An interesting spin on realtime systems is computer games (...) Ponder that.
My younger self did that, quite a lot, thank you. I never felt comfortable with the various approaches to intermixing input handling, physics/game logic and rendering - whether to run them sequentially, decouple some or all of them, whether the physics should be "look-ahead" or "look-behind"... I ended up picking the pattern that leads to the most stable and deterministic behavior (look-behind physics in lockstep with input, on a fixed update rate; rendering independent and done at a variable rate), and stopped thinking about it.
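For anyone curious, that loop shape (fixed-rate lockstep simulation, variable-rate rendering) looks roughly like this sketch; all names are illustrative, and a real engine would also interpolate render state between ticks:

    import time

    TICK = 1 / 60  # fixed simulation step, in seconds

    def game_loop(poll_input, step_physics, render, running=lambda: True):
        accumulator = 0.0
        previous = time.perf_counter()
        while running():
            now = time.perf_counter()
            accumulator += now - previous
            previous = now
            while accumulator >= TICK:
                step_physics(poll_input(), TICK)  # deterministic, look-behind
                accumulator -= TICK
            render()  # as fast (or as slow) as the host allows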
So thanks for mentioning it in this context, it makes me more comfortable to hear from someone else that "now" in games is ill-defined on philosophical level, and I shouldn't have worried about it as much as I did.
Going a little bit further, there's the time between the frame displaying on the screen and it reaching your retina, then the time between that and it reaching your brain in the form of electro-chemical signals, then the time between that and your brain decoding it.
By the time you perceive anything, and even more so by the time you react to it, it's already in the past.
> "Send a report at midnight". Naive way: have a cron entry that runs at midnight or run a loop that checks the time, and if midnight, sends the report. Idempotent way: cross-check against a list of reports that should have been sent, and if missing, send it. Run this check every few minutes.
Ha! You are biased toward the label "idempotent way". Above, you have totally different steps under each "idempotent way", so your process is different for each of them. There's no common technical differentiator, and there should be - as that's how you'd implement the "idempotent way".
I've set up a bunch of Windows batch scripts similarly, but didn't know the name for it. It's beautiful because failures, crashes, bugs, restarts, etc. barely matter. The next time the system is able to do what it was supposed to do, it does.
Great article. This is really about a mental shift. When given a problem like "dormant customers should be charged", it's natural to model the problem as verbs: "find the dormant customers and charge them".
However, that description relies on implicit state: it assumes a world where all dormant customers have not been charged. As we all know, relying on hidden implicit state is the source of all evil. When you describe your problem using verbs, you are essentially modeling it in terms of a diff between how the world is and how you want it to be. But that diff is only correct if you have a perfectly correct description of how the world currently is.
It's better to take a step back and think of your problem purely in terms of the world you want to end up in as a result: "all dormant customers have been charged". Framing it that way makes questions like "which dormant customers have already been charged?" more obvious to consider.
A declarative programming language should indeed allow you to think mostly about the result you want and not much about the how, but I'd say GP's comment is more about the design phase. The actual implementation could still be highly imperative.
Idempotency is a pretty critical concept in system design, and I think most developers have run into issues related to it even if they aren't directly familiar with the term.
To give another simple example like the OP's: suppose you have a product that relies on time-series data. For demo purposes you might create a curated data set to present to clients, but the presenter doesn't want to show data from 2019 as the "most recent".
Naturally, you decide to write a script. Do you:
A) Write a script that moves the data forward by 1 week explicitly, and simply run this once per week, or
B) Write a script that compares the current date to the data and moves it forward as much as it needs?
At first glance, these two approaches work the same, but what if (A) triggers twice? What if it runs once every 6 days by mistake? (B) is idempotent however - subsequent executions won't change the state. It's usually impossible to predict all of the ways that software breaks, but designing with idempotency in mind eliminates a lot of them.
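A sketch of option (B), assuming the data is a list of dates; note that a second run computes a shift of zero and does nothing:

    from datetime import date, timedelta

    # Compute how far the demo data lags today and shift by exactly that
    # much, in whole weeks so weekday patterns in the data stay intact.
    def realign_demo_data(timestamps, today=None):
        today = today or date.today()
        weeks_behind = (today - max(timestamps)).days // 7
        if weeks_behind <= 0:
            return timestamps  # already current: nothing to change
        return [t + timedelta(weeks=weeks_behind) for t in timestamps]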
I don't think B is technically idempotent either. Change still occurs but with minimal difference. You cannot cache the results and use them again next week.
An idempotent change would be to pass in the current time instead of checking system time. In this case, as long as the input is the same, the result is the same. You could use cached results, but most likely you want to use new inputs.
Yes, but this doesn't really matter. What's important is that running the script twice will still give you the right values. From a practical point of view it is the same as idempotency, which is what matters.
If you design for it from the start it makes your system much less complex. Consider all the errors, special cases, and ultimately data cleanup you need to handle if your transactions are not idempotent. Idempotency is table stakes for any production app.
Years ago when I was a green programmer filled with p*ss and vinegar - I had to write a system to send a few million emails every day.
That’s when I came across Qmail and studied its design. Qmail was designed so well that it literally, absolutely would not send an email twice.
But my scheduler did.
To send a few hundred thousand emails more than once is not good for your reputation lol.
It taught me a valuable lesson about idempotency at an impressionable stage of my career, and now, when creating workers running under Sidekiq, Oban and other job schedulers that run very frequently, this has become quite useful.
At one point I used to lean on Redis to “help” me be more idempotent. But with the high throughput of some systems I was working on, even Redis, with its atomic puts and gets and handy-dandy sets and lists, would have race conditions.
Now I hardly touch redis because my skills in making sure things get processed just once have improved greatly. And I can just use PostgreSQL and my code.
Being Idempotent from the beginning is now the center of my life.
I feel that all of the protections let us chase the problems into smaller areas. Rust's unsafe doesn't eliminate unsafe code, it just means you put it in a small auditable area.
Similarly, some part of the system remains imperative. The network card, at least, will always resend a packet. The goal is to pass around idempotent messages everywhere except the very leaves.
For instance, instead of an endpoint 'email(customer, data)' you might have 'email_or_report_on_send_status(customer, data)', and the latter endpoint would check the cache for (customer, data) and merely report the previous results if it found them.
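A tiny sketch of that endpoint (names invented; an in-memory dict stands in for a real cache):

    _send_results = {}

    def actually_send(customer, data):
        return "sent"  # stand-in for the real transport call

    def email_or_report_on_send_status(customer, data):
        key = (customer, data)
        if key in _send_results:
            return _send_results[key]  # duplicate call: just report the old result
        _send_results[key] = actually_send(customer, data)
        return _send_results[key]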
I agree though, this stuff used to keep me up at night and eventually I've grown more natural about not mutating things unless I mean it. (This phrase sounds like a comic-book villain.)
This is good but not enough. You also need to be sure that you can’t charge twice if the job runs twice. When you do that same query twice, you will get the same list of users. This could be done by exploiting database consistency rules, like using strongly isolated transactions. One simple more general approach is to use an idempotence token. You could, say, have a table with a uniqueness constraint, and generate IDs that will match for the same user in the same month. Then add that in the same transaction that subtracts the money. The table could be cleaned up periodically.
If you’re making or using an API where repeating would be bad, consider using idempotency tokens for those too. I believe Stripe supports them. The basic idea is the same: if you pass a token in, they guarantee that within a certain time frame, no other request with that token will be executed as a duplicate. This is useful when the network flakes during the response. Is it safe to retry?
Things get trickier when you combine network and database consistency measures; that’s when you get into locks and multi stage commit and etc. and it helps to know your database’s consistency model, since it’s often not as solid as you think! (In the past, even PostgreSQL had issues with providing serializable isolation.)
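A minimal sketch of the idempotence-token table, in SQLite syntax and assuming a fee_tokens table with a UNIQUE token column:

    import sqlite3

    # A row whose unique key encodes (user, month) is inserted in the same
    # transaction that moves the money, so a second run violates the
    # constraint and changes nothing.
    def charge_monthly_fee(db, user_id, month, fee):
        try:
            with db:  # one transaction: token + debit commit or roll back together
                db.execute("INSERT INTO fee_tokens (token) VALUES (?)",
                           (f"{user_id}-{month}",))  # e.g. "1234-2021-07"
                db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                           (fee, user_id))
        except sqlite3.IntegrityError:
            pass  # token already present: this user was charged this month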
Have you used an iterator as the token? I had data that could be accessed and mutated from multiple different sources at the same time. With just a token and locking, the data suffered race conditions (cross-server race conditions/deadlocks are just the worst). I solved this by handing out an iterator with every read; a write required the same iterator back with the changes. If 2 servers try to write at the same time, the first one processed will go through and the other will get rejected. The rejected server will re-read the data with a new iterator, apply its mutation, then attempt to write. It worked really well for me and I never had any race/deadlock issues.
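What's being described is essentially optimistic locking with a version counter; a sketch (SQLite syntax, with the data table assumed to carry a version column):

    # A stale write matches zero rows, so the loser of the race re-reads
    # and retries, exactly as described above.
    def write_with_version(db, row_id, new_value, version):
        cur = db.execute(
            "UPDATE data SET value = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_value, row_id, version),
        )
        db.commit()
        return cur.rowcount == 1  # False: lost the race; re-read, retry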
> This is good but not enough. You also need to be sure that you can’t charge twice if the job runs twice. When you do that same query twice, you will get the same list of users.
That's handled by the extra condition in step 1 of the altered rules:
OP is talking about concurrent execution of the task without serialization on the database (serializable isolation is almost never the default). Which results in 2 concurrent queries getting the same list of customers to charge.
It's a pity the author doesn't get into implementation details as it's quite fun for that specific example because it involves payments.
A couple of key aspects are:
1. The payments API must be idempotent for the Cron to be idempotent. Otherwise they can't safely retry timeout failures.
2. The operations <charge-the-customer> and <mark-the-customer-as-charged> have to be (to loosely use the term) atomic.
For #1, a payment processor should explicitly state idempotency semantics. Look at Stripe [1] for example. They require you to pass an Idempotency-Key header in the charge request. Clients must carefully choose this key to fit the needs of their business flow and use case. For the author, since they charge monthly, the idempotency key must be constant for a given customer and given month. So something like "customerid-YYYY-MM" makes sense. Now, no matter which day of the month the cron runs, the customer is guaranteed to be charged at most once.
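For concreteness, a sketch assuming the stripe-python client (the key format is the one suggested above; amount and currency are placeholders):

    import stripe

    stripe.api_key = "sk_test_..."  # placeholder

    # However many times the cron fires in a month, Stripe collapses
    # requests carrying the same idempotency key into a single charge.
    def charge_dormancy_fee(customer_id, year, month, amount_cents):
        return stripe.Charge.create(
            amount=amount_cents,
            currency="usd",
            customer=customer_id,
            description="Dormant account fee",
            idempotency_key=f"{customer_id}-{year}-{month:02d}",
        )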
As for #2, most data stores nowadays support atomic operations, so it should be reasonably straightforward: get-and-set in Redis, start/end transaction with "for update" in MySQL, etc.
Note that #2 is optional if you don't mind spurious retries to Stripe.
Stepping back from the details, notice that for a system/operation to be idempotent, all its downstream dependencies should support idempotent operations. In this example, if Stripe weren't idempotent then you would have to resort to manual fixes whenever a charge operation timed out.
Another observation: if a downstream system can assure idempotency, then you don't need extra state to make your own operation idempotent. Failed payment? Retry. No need to store whether a payment was attempted or not. Just make sure the idempotency-key generation logic is robust.
PS: Source - I've been in the payments industry for a while, so I spend a lot of time thinking about and implementing idempotency. The latest instance was when my team worked on the payments API at Uber. We included idempotency in the glossary [2] section ;-)
Googler opinions are my own. I work on payments stuff.
Payments at Google defined some specs for payment companies to build to, which allows easy onboarding as a form of payment on google's platform. We have a page talking about idempotency and expected behavior.
I don't think it's that different from other payment companies, like adyen or worldpay, but we attempted to explain our view of idempotency on a single page.
Sorry if it's a bit off topic, but are you required by your employer to tell us you work for them and whether your opinion represents them or not? I read such disclaimers a lot from people working at your company, but extremely rarely from any other company.
Another tech giant engineer here. All too often random opinionated comments get taken way out of context and misconstrued as something pertaining to the company they work for. Separate from any NDA concerns, no one wants to have some reporter quoting them as "Google insiders reveal targeting of Republicans". I can't speak for Google employees, but I certainly don't have to preface anything with "this is my own opinion", but I certainly won't say anything that implies that I'm voicing some stance or opinion of my employer.
Google has some wording in our contract that says we are required to put a disclaimer when we're posting publicly. I don't remember the exact wording, but I always include it on any topic that is even tangentially related to the company.
This is actually a lot better than previous companies. My employee contract when I was at Cisco basically said I couldn't talk about anything Cisco related on the internet without sign-off from upper management or PR or something.
I feel like immutability and a 100% leak-proof layer of domain methods which in turn manage domain mutations would ultimately bring more value than explicitly adding idempotence throughout.
If I have an idempotent method like "CreateCustomerRecord", this can cause a lot of pain for audit features and other aspects of the domain model if it is internally making determinations about whether to actually create or silently skip creation. For me, I would much rather that the method throw an exception if there is a duplicate business key than have it silently complete without taking any actual action. Exceptions indicating attempts at invalid state transitions can be extremely valuable if you have the discipline to create & use them properly.
Generally, seeking idempotence in otherwise mutable methods is a band-aid for when you have broken immutability rules and allowed things to leak out of the sacred garden of unit-tested state machines and other provably-correct code items.
If you should only conditionally execute some method, perhaps the solution is to investigate the caller(s) of the method, rather than attempt to infer the intent of all possible callers within the method itself.
This can create false negatives when a request must be retried due to network failure but actually succeeded because the failure was during the response.
Idempotency is great for "debouncing" requests. If you want to tell difference between identical requests that are different transactions, add a unique transaction id of some kind.
There's a place for idempotency tokens. They're relatively easy to retrofit onto old systems, and occasionally they are the best way to go about making changes idempotent, but they should be a red flag – an indication that you should step back and see if maybe you can redesign an API to make idempotency a natural guarantee rather than something you artificially strap on with a token. As a rule of thumb, I would always mention the idea of adding an idempotency token, and prompt for alternatives, with all stakeholders present.
(In this particular example, I agree that an idempotency token is probably the way to go, as otherwise callers would need to somehow distinguish between their callers trying to create a duplicate account, versus something going wrong with them resulting in duplicate requests. I just have too often seen developers conflate "idempotency" with "use an idempotency token".)
The state machine is great in a vacuum, but it falls apart when someone trips over the power cord at exactly the wrong time. That's the benefit of idempotent design: it is resilient to real-world failure modes that don't typically get modeled into the systems you describe.
I can honestly say that Eric is 100% right with his approach. It always leads to fewer headaches and more flexibility (oh, trust me, someone is always gonna have a "but... there's like a special thing that I sometimes have to do" and it breaks some assumptions).
In any case... yeah... let's just say any time you have to worry "did we already schedule this?", really think: "can this be made not to care whether it was? It should always be safe to schedule it again."
I handle the deployment pipeline at my office and I've got a hard rule that every step in the CI chain has to be idempotent. Considering how many points of failure there can be in CI, being able to retry without undoing the mess of the previous deployment is a lifesaver.
I'm assuming a lot of people click on it to see what the word Idempotence means. From the article:
"Idempotence is the property of a software that when run 1 or more times, it only has the effect of being run once."
And the example is, instead of a cron job just running a process once a month or on some other schedule, it runs more frequently but checks if the change has already been made.
(From the Latin idem, which means "same", and potence is of course power/potent, so it has the same power/effect however many times you run it.)
One of my first jobs was at an investment bank. They had a lot of programs that ran overnight, in a batch fashion. Everything had to be done before the markets opened the next morning. The term they used for idempotency was "free rerun." Being able to rerun any program with no special setup work was a high priority.
The value in programs being a "free rerun" was that every so often the program would barf on a bad bit of data in a record.
The programming environment was interpreted BASIC, so if an error occurred the program would print a message on the console and drop to an interactive prompt.
The operators running the batch schedule would see this and call the programmer on call for that night. You'd log in (over dial up at this time) and attach to the process, look at the error, figure out what went wrong, either correct the data or (more likely) skip the record and deal with it the next day. It was more important to have the programs finish on time; individual issues could be dealt with later.
Often you could just start up the program from where it left off, but if things were more screwed up it was important to be able to re-run it without any negative consequence.
Edit: this was ~30 years ago, so my point is that it's not any kind of new idea or something that wasn't recognized long ago.
I hope this example makes it evident that one of the primary innovations of the last 30 years is defaulting to Latin terms so that they are taken more seriously in business and technology circles to acquire ... you know... gravitas.
I used to be an operator on the night shift in my twenties, and the job was exactly as you said. Good memories. Lots of sleeping at work and some days of panic when shit broke.
And a lot of "secret" scripts that automated a big part of our job.
> And the example is, instead of a cron job just running a process once a month or on some other schedule, it runs more frequently but checks if the change has already been made.
As a property, I think it's even nicer if a script can literally fully run twice and for the outcome to be the same as if it only ran once (so skipping the 'did I run before?' check).
Even though this check is useful in general, if you can define your data in such a way that, if the script did somehow run twice, this is not destructive and doesn't create incorrect data, it makes the system more robust.
Of course this is not always possible though. For example, if the process results in an email being sent, you need an explicit check to not do that twice.
In situations like these, it's a legitimate goal to implement an idempotent, or "functional" core.
So the goal of your functional core is to fully construct the email, and return it to the caller, who then has the choice to send the email, print it, write it to disk, etc.
The program you deploy looks like this:

    EmailSender().send_email(construct_email(args))
You can test by implementing a "safe" EmailSender interface, so that you're executing the same code that's in prod.
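Expanding that one-liner into a sketch (all names illustrative):

    from dataclasses import dataclass

    @dataclass
    class Email:
        to: str
        subject: str
        body: str

    # Functional core: pure and trivially testable, no side effects.
    def construct_email(customer, balance):
        return Email(to=customer, subject="Monthly statement",
                     body=f"Your balance is {balance}.")

    # Imperative shell: the only place a side effect happens.
    class EmailSender:
        def send_email(self, email):
            print(f"sending to {email.to}: {email.subject}")  # stand-in for SMTP

    class DryRunSender(EmailSender):
        def send_email(self, email):
            pass  # "safe" sender for tests: same code path, no delivery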
In general, if a job/function is mutating state deep in the syntax tree (i.e. sending emails in the middle of a batch job), I personally see that as a violation of the Single Responsibility Principle.
Mathematically, it's x^2 = x, which implies x^n = x for all positive integers n.
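Spelled out, the induction step is just:

    x^2 = x \;\Longrightarrow\; x^n = x^{n-2}\,x^2 = x^{n-2}\,x = x^{n-1},
    \quad\text{hence by induction } x^n = x \text{ for all } n \ge 1.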
Nilpotence (x^2 = 0) is also very helpful sometimes: it's a process which is self-reversing. Like the discrete Fourier transform (if you set up the constants properly).
In mathematics, a self-reversing function is called an involution, and it's f^2 (or f(f) ) = Id, the identity function.
Nilpotence is very different. It says that if you apply your function a certain number of times, you end up with zero no matter what the input is. For example, projection on x axis + 90 ° rotation of a vector is nilpotent.
Depends on if you think of functional notation or matrix operators and linear algebra first. There's also dirac bra-ket notation from QM. I tend to reach for the latter two more than the former.
Particularly in this case O^2 = O makes more sense to me than f(f(x)) = f(x). And I just naturally think about A B - B A = 0 rather than f(g(x)) - g(f(x)) = 0 for commutativity.
It's not that weird. People often write iterated composition as f^k, and this is especially true with matrices, where composition and multiplication mean the same thing.
Careful, for nilpotence the power doesn't have to be 2.
Also you may be confusing it with x^n = 1 (which I'm not sure how to name, 'root of unity' perhaps). This would be the case for the Fourier transform (with n=4).
If x^2 = 0 then applying the Fourier transform twice would null your function, which isn't the case.
I really don't want to go on a totally off-topic tangent, but I feel like this needs to be brought up.
> We need to charge dormant customers a monthly fee so that we don't have to keep their money on our accounting books forever.
That's absolutely and objectively not the reason this happens. Keeping their money "on [your] books" has zero cost associated with it outside of the escheatment[0] process, which can be completely automated. I worked at a company that had to do a decent amount of escheatment for almost every US state, Canada, and Mexico, and for a total company size of under ten people the entire organization probably spent 45 minutes working on it in any given month. It's a trivial amount of time.
This happens so that the company can turn its customers' money into its own money. I don't know enough about the escheatment process to know for sure, but I'd imagine regular debits like this would also get around the need to return the money to the estate after a given period of time. So someone puts $500 in their account, gets hit by a bus the next day, and the company gets to take a little bit every month/quarter/year until it's all theirs.
It's a pretty scummy thing to do and I'm surprised the author just takes it at face value that they "have to" do it.
> 1. Query the database to find all dormant accounts with a balance, which haven't been charged the fee this month.
> 2. Charge each of these accounts a fee
> 3. Setup a cron job to run this every hour
Note that if this job ever runs successfully but takes more than an hour, you will double-count. This can easily happen if the box running these crons is overloaded. One fix is to automatically halt the job after 55 minutes; another would be to make the middle step idempotent: for each user you're doing the process on, ensure (ideally in a threadsafe manner) that they still need the operation to be done.
The article didn't really go into the details of the implementation but my mental image was basically your second proposal: the deduction of the fee should happen in the same database transaction as the updating of a "fee-last-charged" timestamp, relying on the database for thread safety and allowing the cron job itself to be stateless (and inherently protected against simultaneous execution of multiple instances of itself).
If some rogue deploy script creates fifty copies of the cron job - and weirder things happen every day - one of those could be reading stale state during another one's transaction. A classic data race.
Eliminating this problem is left as an exercise for the interested reader...
The way to fix that is to remove state altogether.
1. Balance is not a simple variable, but the sum of all credits and debits to an account
2. A fee is a charge record in your database
3. This fee has a database constraint that you can have only one record per month
Now you can run the script that charges dormant fees as often as you want.
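A sketch of this in SQLite (a partial unique index enforces "at most one dormant fee per account per month"; the schema is invented for illustration):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE ledger (
            account_id INTEGER NOT NULL,
            amount     INTEGER NOT NULL,  -- credits positive, debits negative
            kind       TEXT    NOT NULL,  -- 'deposit', 'dormant_fee', ...
            period     TEXT               -- 'YYYY-MM' for fees
        );
        CREATE UNIQUE INDEX one_fee_per_month
            ON ledger (account_id, period) WHERE kind = 'dormant_fee';
    """)

    def charge_dormant_fee(account_id, period, fee):
        # OR IGNORE: a rerun hits the unique index and inserts nothing.
        db.execute("INSERT OR IGNORE INTO ledger (account_id, amount, kind, period)"
                   " VALUES (?, ?, 'dormant_fee', ?)", (account_id, -fee, period))
        db.commit()

    def balance(account_id):
        # Balance is derived, never stored: the sum of all ledger entries.
        return db.execute("SELECT COALESCE(SUM(amount), 0) FROM ledger"
                          " WHERE account_id = ?", (account_id,)).fetchone()[0]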
This is exactly the right approach, and the easiest way to implement idempotency in many realistic systems. Instead of thinking about idempotency in verbs - "perform action iff action has not yet been performed" - think about it in nouns - "create a piece of data whose key is the tuple of its inputs".
Practically speaking, this narrows your "transaction window" significantly - instead of:
1. Begin transaction
2. Check to see if work has already been done
3. Do work
4. Persist work
5. Commit
That has a potentially long transaction spanning steps 1 through 5. Instead, you do this:
1. Do work
2. Persist work to key / table with uniqueness constraint
3. On conflict, do nothing (looks like you already did the work before)
Of course if "Do work" is very expensive, you can bring back in "Check to see if work has already been done" as an optimization, but for many simple CRUD examples, it's actually /cheaper/ to learn that the work has already been done via the conflict check failing than via an explicit pre-flight check.
> 1. Query the database to find all dormant accounts with a balance, which haven't been charged the fee this month.
> 2. Charge each of these accounts a fee.
There's still a race condition between 1. and 2. if you don't atomically check the account hasn't been charged yet as you're charging it.
With long running processes like these it's also pretty much guaranteed to happen if you accidentally run two instances of this process at the same time.
I think an even higher level of thinking is to think thoroughly about the ways the system can fail. Idempotence just happens to be _one_ way of dealing with certain failure modes.
As an engineer, the more exposure and experience you get, the more insight you have about the ways things can fail. Identifying the ways something can fail is the really important step here. You can’t know what failsafe to implement if you don’t actually know how something can fail. But once you do know how something can fail, implementing a proper solution is easy a lot of the time, even if you are not explicitly aware of the concept of idempotence, for instance. Only in some tricky Byzantine edge cases do you need very specific, well-established, track-proven algorithmic solutions.
If you can build a system with ACID 2.0 life gets really easy. You can reason about your system without worrying about ordering, time, 'exactly once' semantics, etc.
Idempotency is usually one of the simplest pieces to implement, and you definitely get a ton of benefit right off the bat - it's worth designing systems from scratch with it in mind.
I had to solve this for charges and interest and I solved it in a more elegant way than described (I think).
Using a mutation iterator, the last update time, and the known rate of change over time, you can back-apply any calculations that have been missed, regardless of how long it's been since your cron job ran.
I set up a server to run the job on startup and once a day at midnight. This also allows multiple different charges and interest rates to be applied to the same account. I also set it up to apply this process every time an account was accessed, to make sure the accounts were up to date.
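The catch-up calculation itself can be nearly a one-liner; a sketch with daily compounding (names invented):

    from datetime import date

    # Given the rate and the last-applied date, any number of missed days
    # collapses into one computation, so it doesn't matter how long ago
    # the job last ran.
    def catch_up_interest(balance, daily_rate, last_applied, today=None):
        days_missed = ((today or date.today()) - last_applied).days
        return balance * (1 + daily_rate) ** max(days_missed, 0)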
This is my ideal type of article on HN: a namable concept, applicable, and without trying to be subversive. It defined the term immediately, had a real-world example, and was short and well written.
It's 2021 and we still don't have scalable computer systems with continuations support - that's why his cron job does not work reliably in the first manner: if the script crashes or the machine reboots, it will run again from the beginning.
The idempotence is nice and I use it as much as possible, but his case is very simplistic. Imagine you have a complex data migration process which crashes in the middle - how do you safely continue from where you crashed, or safely roll back?
Well, I did not mean a sequential one only involving the database but, let's say, one with forks (i.e. parallel executions) and merges (waiting for some dependent tasks to complete), etc. What if execution halts in the middle of a step and you don't really have transaction support outside of the DB? I typically would use a workflow engine, but rollback is not something workflow engines support either.
PUT requests are intended to be idempotent, which is one of the things that distinguishes them from POST requests. This (in my experience) is most non-CS-backgrounded software developers' exposure to idempotence, but it actually has tons of value pretty much wherever you can apply it. The ability to (sometimes accidentally) do a thing twice and have it leave no unintended consequence is huge.
UPSERTs can be idempotent as well. "If this doesn't exist, create it, and if it does, update it to match this state", implies that running it twice will leave no unintended side effects.
In the last couple years I've come to realize that nearly all POSTs should actually be PUTs, sometimes with a client-provided uniqueness token in the URL (like a UUID) that becomes part of the newly-created resource's ID.
Now that I think about it, all the situations where I'm fine using POST are also idempotent, such as changing an object's metadata (technically PUT or PATCH), or sending a batch of HTTP requests to an endpoint bundled as a single request.
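A sketch of the PUT-with-client-minted-ID pattern, using the requests library against a hypothetical endpoint:

    import uuid
    import requests

    def put_order(order_id, payload):
        # The caller mints order_id once and reuses it across retries, so
        # a network-level retry of this PUT cannot create a second order.
        return requests.put(f"https://api.example.com/orders/{order_id}",
                            json=payload, timeout=10)

    order_id = uuid.uuid4()  # minted once per logical create
    put_order(order_id, {"sku": "abc", "qty": 1})
    put_order(order_id, {"sku": "abc", "qty": 1})  # safe duplicate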
It's a basic property of HTTP "GET". Or it's supposed to be. "GET" is not supposed to change server site state. That's what "POST" is for. This matters if there's a cache in the middle, since caches tend to assume that GET requests are idempotent and can be served from cache. Cloudflare assumes that. POST requests have to go through to the real server.
Totally agree. I wish UPSERT had been invented earlier, or was a more canonical part of SQL than INSERT and UPDATE for this reason.
Also, in web front end programming, there are lots of cases where X needs to cause Y to happen, but Z also needs to cause Y to happen. It's much much simpler if Y is idempotent, rather than X checking if Z has already happened etc.
I'm going to make an absolutist statement for which I will probably be corrected shortly. For any given table, interactions should either be done entirely with INSERT or be done entirely with UPSERT. The two should never mix on a single table, and UPDATE should never be used.
My reasoning is that if a table represents the current state of the world, then any state changes should be made with UPSERT in order to bring the table up to date with the world. If a table represents the history of changes, then that history should be appended to with INSERT, but not modified.
I'm sure that there are other cases that I'm not currently considering, but I'm also a newbie at SQL and would love to be told of them.
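A sketch of that two-table discipline in SQLite (>= 3.24 for ON CONFLICT ... DO UPDATE; schema invented): state tables only ever see UPSERTs, history tables only ever see INSERTs.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customer_state   (customer_id INTEGER PRIMARY KEY, surname TEXT);
        CREATE TABLE customer_history (customer_id INTEGER, surname TEXT,
                                       changed_at TEXT DEFAULT CURRENT_TIMESTAMP);
    """)

    def set_surname(customer_id, surname):
        with db:
            db.execute("INSERT INTO customer_state (customer_id, surname) VALUES (?, ?)"
                       " ON CONFLICT (customer_id) DO UPDATE SET surname = excluded.surname",
                       (customer_id, surname))
            db.execute("INSERT INTO customer_history (customer_id, surname) VALUES (?, ?)",
                       (customer_id, surname))

    set_surname(1234, "Smith")
    set_surname(1234, "Smith")  # state unchanged; history gains a second row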
I'm probably being stupid here but the majority of updates only involve a limited number of columns. How would you translate a statement like 'UPDATE customers SET surname='Smith' WHERE customer_id = 1234' into an UPSERT?
I would argue that your SQL statement should match the information that a user put in. It would be pretty rare for a user to directly input a single field change and their id. More likely, there would be some "personal information" submission, where all fields are available to edit. I would have a separate "customer_personal_info" table that gets UPSERTed with the entire contents on the submission.
One impact of this is that it forces the changes to be atomic. If an object-oriented interface updates a database whenever properties are changed, then it results in many single-column changes. In the example below, if there is a network failure between the two commands, then changing the name from "Jane Doe" to "John Smith" could result in an unintended intermediate state of "John Doe".
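Presumably the example was two separate single-column updates along these lines (reconstructed; cur is any DB-API cursor):

    cur.execute("UPDATE customers SET first_name = 'John' WHERE customer_id = 1234")
    # -- a network failure here strands the row as "John Doe" --
    cur.execute("UPDATE customers SET surname = 'Smith' WHERE customer_id = 1234")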
>I would argue that your SQL statement should match the information that a user put in.
I'm skeptical this is desirable even just from a security perspective.
I think you're considering the role of the database in a narrower context than they're used in. Certainly in the databases I work with, it's a minority of updates (actually a very small minority) that are performed as a result of a user inputting something into a form in the sense you're talking about (i.e. updating an entity). Think of how many updates are of the form "change this order's status to shipped" or "set product stock available to X". You wouldn't want to be obligated to pass the entire row of something to an application so it could pass back a single changed column.
This is not to even get into the scenario where an update is applied to a view (which might be a limited result from one or more tables). In that sort of scenario it's not even clear what an UPSERT would even do.
Possibly way OT, but this reminds me of a question I had. Let's say I have a task that creates a daily invoice but failed to run for X days; when it's run again it needs to go back and create X invoices from when it was last run. Is there a name / concept for that?
In general, you should be thinking about the delivery semantics of the systems calling your code. Many very useful callers offer "at least once" delivery guarantees, implying that your system should behave idempotently to their calls.
Great idea if you control all the inputs (I'd also suggest passing in a time instead of relying on time.Now or whatever your language gives you). Sometimes you don't and it's very hard to do.
Every system I write uses some sort of time object/function that can be set to whatever time input is desired. In a lot of cases it’s just the system time with an offset from some synchronization service. It makes testing easier, makes replaying inputs possible, and can make time related features based off local or server time with only changing inputs.
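A sketch of such a clock in Python (the offset logic is illustrative): production code asks the Clock; tests and replays substitute a fixed instant.

    import time
    from datetime import datetime, timezone

    class Clock:
        def __init__(self, offset_seconds=0.0):
            self.offset = offset_seconds  # e.g. correction from a sync service
        def now(self):
            return datetime.fromtimestamp(time.time() + self.offset, tz=timezone.utc)

    class FixedClock(Clock):
        def __init__(self, instant):
            super().__init__()
            self.instant = instant
        def now(self):
            return self.instant  # deterministic: ideal for tests and replays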
Even better would be to have the hourly cron job add the accounts to deduct the balance from to a queue and have worker services consume those queue messages and do the deduction and database write.
I try to write idempotent software whenever I can. It's usually not much more difficult to make it work and affords so much more flexibility and less worry when it's done.
This is closely related to commutativity, which is another useful property of a system, especially a distributed one, where events commonly arrive out of order.
We had one "special" team member who insisted on everything being idempotent.
This was his only guiding principle. Result: absolute chaos - the code aspired to be idempotent, but because of idempotency he avoided thinking problems through and just created a mess of individual functions - each being idempotent, aside from the unavoidable bugs - which didn't form a coherent flow at all.
We did a major refactoring, threw out about all that code, rewrote everything in a logical manner. Now everything is still idempotent, but comprehensible.
TLDR: idempotency is the same snake oil as the majority of guiding principles: alone, it doesn't help at all. There are lots of other factors to consider, which is what makes the developer/architect role demanding (and fun).
Craftsmanship at least, a sense for architecture (better), or understanding the whole picture of the requirements as a team of developers (best) is still required.
I don't see this as any reflection on idempotency as principle (or other principles in general). Building systems poorly, without a plan, and no testing, will result in a bug-riddled mess, regardless of what pattern is being used.
That is a general problem with any principle used as rigid ideology. Almost every principle becomes a problem if applied too dogmatically. This applies to software dev but also to other fields like politics or economics.
Sure, one little hobby horse, e.g. "inversion of control" can run amok to negative effect (looking at you, Java projects with object traces 75 layers deep) but that doesn't make idempotency or inversion of control into bad ideas.
A bit of pragmatism goes a long way, like Python's odd
"Need to send a notification email when x condition becomes true". Naive way: during processing, check the condition and call the SendEmail() function. Idempotent way: Run a query that finds all x conditions, join to a list of notifications based on email+id+time, and only if there's no entry, send the notification and save it to the list.
"Send a report at midnight". Naive way: have a cron entry that runs at midnight or run a loop that checks the time, and if midnight, sends the report. Idempotent way: cross-check against a list of reports that should have been sent, and if missing, send it. Run this check every few minutes.
The nice thing about this is it doesn't stop you from also doing the event-based stuff (for "real-time" notifications, for example), so long as it's the secondary approach from design point of view. Ironically when everything is working fine, the event-driven stuff will actually be doing all the work -- but as soon as there's a failure (like, the system happens to reboot at 23:59) you'll be grateful to build this way.
I could write paragraphs about the nuance, different approaches, failure modes and the trade-offs with all this -- suffice to say treating events as optional in event-based systems has served me well.