Not mentioned in TFA, but I'm utterly convinced that the only compelling reason ...

lmm · on Sept 10, 2020

> avoid the per-thread stack memory allocation of 8MB per thread or whatever it is, in order to be able to scale to an extremely large number of concurrent threads/coroutines. You can't do this with threads.

It's 8 K B per thread, so you can scale a thousand times further than you thought. One dark secret of the async movement is that if your goal is C10K (10,000 concurrent clients) then actually bog standard threading will handle that fine these days.

> The better async code and interfaces become, the more it looks like regular old multithreaded code, except there aren't actual threads underlying it. You still need to make sure you serialise access to shared resources, don't share data that shouldn't be shared, etc. All the same considerations as multithreaded code.

Depends what approach you're using. I prefer making an explicit distinction between sync and async functions ( https://glyph.twistedmatrix.com/2014/02/unyielding.html ), so you effectively invert the notion of a "critical section" - instead of marking which sections can't yield, you mark which sections can yield, so your code is safe by default and you can introduce concurrency explicitly as and when you need it for performance, rather than your code being fast-but-unsafe by default and you're expected to fix a bunch of rare nondeterministic bugs with minimal support from your tools, which is how it works in a multithreading world.

mlyle · on Sept 10, 2020

> It's 8 K B per thread, so you can scale a thousand times further than you thought. One dark secret of the async movement is that if your goal is C10K (10,000 concurrent clients) then actually bog standard threading will handle that fine these days.

Default virtual memory allocation for threads on Linux distributions tends to be 8 megabytes. Actual memory used is the peak stack depth used, rounded up a bit. It'd be pretty unusual to only use as little as 8 kilobytes per thread; just the standard per-thread libc context information for concurrency is a few kilobytes, plus at least one page of stack, plus the kernel's information about the thread (which isn't counted against the process)...

Yes, you can spawn thousands of threads on relatively modest hardware; I was spawning thousands of threads a decade ago.

Spawning 5000 bare-minimal python threads that do nil seems to use about 300 megs of ram on my system; real threads that do anything substantial will use a whole lot more, even if their use of the stack depth is intermittent.

Not to mention allocators that cache part of freed heap per-thread, etc.

gpderetta · on Sept 10, 2020

I believe 8k is the size of the per thread stack on the kernel side. This is non-pageable memory, so it will consume physical memory whether it is needed or not, while of course the 8 megabytes is paged in on demand.

mlyle · on Sept 10, 2020

Yah, I'm ignoring the kernel stack and all kernel data structures. User space memory used will be at least a page of stack (reaching up to the maximum amount used in the thread), plus the libc reentrancy data structures, plus per-thread heap caches, etc. It's can all get paged out, but we hardly want that these days.

The key distinction is that the 8MB of stack VM doesn't have a backing until the memory is used in the thread, but afterwards it does forever.

lmm · on Sept 11, 2020

Well, it's forever if you assume that the threads live forever; for a web server it's perfectly practical to have threads that only live for a single request, or to reuse them for multiple requests but not allow a single thread to live longer than say 10 minutes.

mlyle · on Sept 11, 2020

Decent nits, but...

Short-lived threads (for one request) are a performance and scalability disaster; tens of microseconds or worse to spawn and join, contention on important locks, bad for caches, etc. There's not much concurrency when it comes to spawning threads, too.

If you have long-lived threads in a pool, yes, they may not live forever, but you generally have to assume that each thread will end up with a resident stack size equal to the largest stack use: each will get a turn to run the stack-intensive functions.

hombre_fatal · on Sept 10, 2020

I'd argue that the main benefits of day-to-day async programming isn't performance but actually the concurrency patterns that help you sequence your code and resource access in ways that `thread { work() }` could not.

For example, future/result combinators and `await [task1, task2.then(task3)]`.

gpderetta · on Sept 10, 2020

you could (and can in many languages) use futures with threads just fine though.

lmm · on Sept 10, 2020

Not really - it gets very messy because every time you transform a future you have to figure out where you're getting the thread for that transformation to run on.

gpderetta · on Sept 10, 2020

sorry, what transformation? A future is simply a placeholder for something being computed asynchronously. On a threadful design you would simply spawn a thread (or pick one from a thread pool) to handle the computation. Normally your future runtime would handle it for you.

Basically you end up with something similar to the fork-join model.

lmm · on Sept 10, 2020

> sorry, what transformation?

Whenever you want to transform a result that's in a future, e.g. you have a future for a number and want to add 2 to it.

> On a threadful design you would simply spawn a thread (or pick one from a thread pool) to handle the computation.

If you allow yourself to spawn threads everywhere you'll quickly run out of resources. So you have to manage which thread pool you're using where and ensure you're not bringing in priority inversions etc.. It's really not that easy.

> Basically you end up with something similar to the fork-join model.

The fork-join model isn't really a purely thread-based model - the work-stealing technique is pretty much trying to reimplement what async-style code would do naturally.

mlyle · on Sept 10, 2020

> ensure you're not bringing in priority inversions etc..

Remember we're comparing to async/futures, which are not really guaranteed to not starve either. At least with thread pools you can, in theory, manage this well.

lmm · on Sept 11, 2020

> Remember we're comparing to async/futures, which are not really guaranteed to not starve either. At least with thread pools you can, in theory, manage this well.

With async/futures you're giving the runtime control over these decisions, whereas with threads you're managing them yourself, which can be an advantage but only if you don't make errors with that manual control. An async/future runtime can know which tasks are waiting for which other tasks, letting it avoid deadlocks and a lot of possible priority inversions, and the async style naturally lends itself to writing code that's logically end-to-end (on a single "fiber" even as that fiber moves between threads), which means there's less need to balance resources across multiple thread pools.

gpderetta · on Sept 10, 2020

If you want to add 2 to a future, you block and then extract the content and add to it.

Re spawning threads, depending on the semantics of your framework you cana always execute async operations sychronously.

lmm · on Sept 11, 2020

> If you want to add 2 to a future, you block and then extract the content and add to it.

That makes using futures completely pointless. Why not just write blocking code if you're going to block anyway?

gpderetta · on Sept 11, 2020

Pipelining and opportunity for parallelization:

  A = computeA()
  B = computeB()
  C = A+B

Vs

  futA = asyncComputeA()
  futB = asyncComputeB()
  A, B = waitAll(futA, futB)
  C = A+B

lmm · on Sept 11, 2020

That's almost certainly a bad idea in a web server context like this article is talking about. You improve best-case latency when the server's not loaded, but now you're using 3 threads per request to get a less than 2x speedup (and in a bigger example it would be worse), so your scaling behaviour will get worse.

gpderetta · on Sept 11, 2020

always spawning a thread is of course the naive implementation. You can put an upper bound on the number of threads and fallback to synchronous execution of async operations in the worst case (for example inside the wait call).

If your threads are a bit more than dumb os threads (say, an hybrid M:N scheduler) you can do smarter scheduling, including work stealing of course.

lmm · on Sept 14, 2020

Well, as your threads become less like threads and more like a future/async runtime you come closer to the advantages and disadvantages of a future/async runtime, yes.

gpderetta · on Sept 14, 2020

The underlying thread model have always been 'async' in some form under the hood, i.e. at some point there is always a multiplexer/scheduler that schedules continuations. Normally this is inside the kernel, but M:N or purely superspace based thread models have been used for decades.

Really the only difference between the modern async model and other 'threaded' model is its 'stacklessness' nature. This is both a major problem (due to the green/red function issue and not being able to abstract away asynchronicity) and an advantage (due to the guaranteed fixed stack size, and, IMHO overrated, ability to identify yield points).

At the end of the day is always continuations all the way down.

ynik · on Sept 10, 2020

In a multi-threaded context yes, reduced memory usage is the main benefit.

But async/await can also be used for other things! In a C# Windows GUI application, it's normal to use async/await on the UI thread. Your UI can await multiple tasks at the same time, yet you don't need any locks when accessing the UI state; because all your code runs on the UI thread.

This is a really useful programming model made possible by cooperative task-switching via `await` on a single thread.

Here the "await" being explicit is a crucial feature, it allows the programmer to reason about when the shared state might be mutated by other tasks (or maybe by the user clicking cancel while the current task is waiting).

Any pre-emptive task switching adds a lot of additional complexity and isn't really suitable for UI code.

spaetzleesser · on Sept 10, 2020

“Your UI can await multiple tasks at the same time, yet you don't need any locks when accessing the UI state; because all your code runs on the UI thread.”

Is that true? I thought the code after each await runs on a different thread from the code before the await. At least that’s what I have observed when debugging things.

harikb · on Sept 10, 2020

There are also other language implementation styles that get roughly the same benefit without writing async and await all over the codebase. If the implied yield at defined synchronization points coupled with a decent scheduler like in Go would make this all a non-issue [1].

[1] https://journal.stuffwithstuff.com/2015/02/01/what-color-is-...

matheusmoreira · on Sept 10, 2020

> I'm utterly convinced that the only compelling reason for async is to avoid the per-thread stack memory allocation of 8MB per thread

Yes. The point is to use a single thread with one stack to process several tasks. This consumes less memory.

For example, Go currently has a minimum stack size of 2 KiB so a machine with 4 GiB of memory will be able to process less than 2 million goroutines. An event loop uses a single thread with a single stack, reducing memory usage at the cost of complexity.

Asynchronous functions are just like coroutines. The difference is they return to the awaiting caller instead of yielding to another function. The order of execution is determined by the underlying loop.

laurencerowe · on Sept 10, 2020

> I'm utterly convinced that the only compelling reason for async is to avoid the per-thread stack memory allocation of 8MB per thread or whatever it is, in order to be able to scale to an extremely large number of concurrent threads/coroutines.

If it's possible to avoid shared state then I tend to prefer threads, but in the presence of shared state there are good reasons to think that explicit coroutines are easier to reason about than threads or green-threads: https://glyph.twistedmatrix.com/2014/02/unyielding.html

gpderetta · on Sept 10, 2020

but then you need to forfeit parallelism. If you add parallel execution to (shared memory) async, the advantage disappear.

laurencerowe · on Sept 10, 2020

With Python you forfeit parallelism either way due to the GIL.

jnwatson · on Sept 10, 2020

In general, it is far easier to reason about locking with async code, as the number of preemption points is far far lower. As a result, you get very low overhead inter-task communication.

otabdeveloper4 · on Sept 10, 2020

The "8 megabytes" is virtual memory. You're only incrementing a counter in a table, nothing is actually allocated until you actually start using that stack.

dorfsmay · on Sept 10, 2020

Also there is a point where managing many threads become a load on the CPU, which my inderstanding is why Rust is not provinding green threads anymore.

calpaterson · on Sept 10, 2020

Why are you "utterly convinced" of that though? If I run 20 threads in Python, unix top reports resident memory usage to me as 14mb. Why do I see that number instead of 160mb?

Code sample:

https://gist.github.com/calpaterson/ab35377da9275ca3af7072db...

Scarbutt · on Sept 10, 2020

You mean 14MB?

calpaterson · on Sept 10, 2020

Htop shows me "14016" as resident memory and I don't think it is using kilobytes as the unit.

anonymoushn · on Sept 10, 2020

It is using kilobytes as the unit.

calpaterson · on Sept 10, 2020

Yes you're right. What a silly mistake. Normal Python interpreter = 9mb, with 20 threads = 14mb.

Edited the original post.

mlyle · on Sept 10, 2020

Answer to your question was already discussed right here: https://news.ycombinator.com/item?id=24429221

mlyle · on Sept 10, 2020

That's embarrassing.

calpaterson · on Sept 10, 2020

Only for me as the GP was right. :)

mlyle · on Sept 10, 2020

Yah I replied to you. :P

nurettin · on Sept 10, 2020

This explanation works great as long as the underlying implementation of threads in your particular python implementation is ultimately concurrent, but not parallel.

justsomeuser · on Sept 10, 2020

It’s not the only reason.

For me, async code is easier to debug and understand because it is composed of regular functions (tagged with async).

This means if you have a function that is 10 layers deep, you can pause the debugger and see the stack context. Same with your IDE, you can jump to each function.

With threads that 10 function stack would be 10 threads, each requiring some sort of tooling at both software write time and runtime to get the context.

In summary composing a system of just “functions (sync + async)” is easier than vs “functions + threads”

arethuza · on Sept 10, 2020

"With threads that 10 function stack would be 10 threads"

Presumably only if you have one-thread per function - which I guess you could do but not sure if anyone actually writes multi-threaded code that way.

justsomeuser · on Sept 10, 2020

If you wanted each function to wait for IO or an event without blocking a thread you need an event loop, or one thread per function that needs to block to wait on incoming events.

arethuza · on Sept 10, 2020

Well, I guess that is possible but I've never seen multithreaded server side code (in a thread per request type environment) bother - database calls or other IO just block the current thread.

justsomeuser · on Sept 10, 2020

If you down vote tell me how I’m wrong - appreciated!

anonymoushn · on Sept 10, 2020

If you had a thread-per-request, you would see the whole call stack for the request in your debugger, like you do now with async/await and an event loop.

If you had a greenlet-per-request and an event loop, you would see the whole call stack for the request in your debugger, like you do now with async/await and an event loop.

justsomeuser · on Sept 10, 2020

This is true, but:

1. If you are building the web server, the async stack would have the functions of the web server code too. The thread only sees its history from when it was spawned for that request.

2. As an example, if you needed to start 5 async tasks and await them, the async code would keep caller context if you break inside the task. For the thread-request model, you start new threads for each 5 tasks, if you debug-break in those threads you would not get the caller stack context. Or do you?

anonymoushn · on Sept 10, 2020

To be honest, most people writing threaded code just let the 5 tasks run serially. Otherwise they would have the problem you mentioned.

arethuza · on Sept 10, 2020

If there are dependencies between those tasks (i.e. task 2 depends on the result of task 1, task 3 depends on the result of task 2...) what option do you have?

And if there aren't dependencies better to drop messages on a queue and have a completely different process handle those tasks.

anonymoushn · on Sept 10, 2020

Ideally, the tasks would be serialized when dependent on each other and not when not dependent on each other. Dropping them on a queue discards information about the caller, which is what this thread of the discussion is about. Using greenlet or async/await, you can serialize only when necessary and retain information about the caller.

gpderetta · on Sept 10, 2020

hum, I'm missing something. The same call stack you get with async would map to exactly one thread call stack. Sure each thread will get its own call stack, but all related continuations that participate in an async call stack would be owned by the same thread (Assuming the same application design).

justsomeuser · on Sept 10, 2020

My point is that with async both the runtime and the dev tools try to make async call stacks look exactly like sync ones.

This keeps the context of how your code gets into certain states. If you have 10 threads, it’s like you have 10 different processes (without the context of how they are related - which you get with composing functions) which is harder to understand.

Also assuming that each thread does not have an event loop.

heeen2 · on Sept 11, 2020

> I'm utterly convinced that the only compelling reason for async is to avoid the per-thread stack memory allocation of 8MB per thread

I needed to write an interface to a SSE API (tldr: long running socket connection with occasional message arriving as content)

There simply was no way to poll a http connection (e.g. requests) if it had new data. It would block no matter what. So I would have to start writing threaded code, or just use asyncio which seemed much more ergonomic.