The Downsides of C++ Coroutines (reductor.dev)
97 points by msz-g-w on Aug 12, 2023 | hide | past | favorite | 49 comments


> Just like a normal function arguments are passed using registers and the stack, coroutines are using the same ABI as previously specified, however the code different is vastly different.

> Finally at the end of the function the stack space initially reserved get’s reset to where it was initially when the function first call happens then returns to the caller.

This post could use some editing. I'm having to reread each paragraph several times to figure out its intended meaning. Most sentences are their own paragraphs, with careless mistakes that make it feel like the author was being chased while writing and couldn't take a breath.


It is due a bit of proofreading; there are some readability issues.

That being said, it is a great article. C++ and coroutines is a story that has been going on for a long time, and the result surprised me. In a bad way.

One bit me right from the start. I copied out an example and it crashed, and it turned out (after hours of searching, reading - the compiler and sanitisers sure weren’t any help) that the problem was that I’d inadvertently made a parameter const& (force of habit) and bound a temporary to it.

My answer to this is simply that I choose not to use coroutines. If I can’t force a compilation failure when I do something dumb, that spooks me.

For a feature released in 2020 it has far too many footguns. Ranges was similar when it came to lifetime footguns. It’s just something that makes it hard to take seriously the claims that it is legacy code that is the reason C++ has a bad rap for safety. Coroutines and ranges are modern features that can shoot your foot off if you don’t know the implementation, which is kind of contrary to the point of making a friendly wrapper over it all.


The compiler won't stop you storing a pointer to a local variable and dereferencing it later, either. That's just the nature of programming without managed memory. Calling a coroutine is essentially the same as temporarily "returning" from the current stack frame, so all of the usual practices around taking pointers apply.

I agree with your conclusion of not using C++ coroutines, though. It seems like the design falls somewhere in the "worst of both worlds". I would rather either use a library that implements coroutines with the minimal amount of C and inline assembly if performance is critical, or some higher level abstraction that works well with all other language features.


Having seen a number of footguns with references and lambda captures as well where the compiler won't warn/catch I don't think it's unique to coroutines.

We were using coroutines about ~12 years ago in embedded contexts, this was with Lua which has very good support and being a managed language avoids all the footguns here while still allowing very fast interop with native code(at least in the case of LuaJIT).

I hate to drag Rust into every C/C++ conversation, but this is one area where the language really keeps you within the guardrails. Callbacks are hard to use correctly in Rust, less because of the language and more due to their messy lifecycle. You can side-step that with shared_ptr/Arc, but then you're stepping into memory-leak territory when you have a circular reference (and the atomic ref counting isn't cheap either).


That is why no sane person should use C or C++ without static analysis, at the very least in the CI/CD pipeline. Lint wasn't created in 1979 only because Stephen Johnson was bored at Bell Labs.


In most languages I'm familiar with, static analysis is something the compiler does at every build. It's not something left to a separate CI/CD step. Code that isn't statically sound shouldn't be hitting the repo in a way that it gets to the CI/CD step.

C++ has always been a bit of an outlier to me for that reason.


Yeah, though that’s also kind of my point. You know and I know (having learned the hard way) that the mechanism involves storing parameters and intermediate values in an object that is referred to later. It’s obvious now I know, but the design hides that from the user - they aren’t supposed to care about that. However, the footguns are still present. There should have been language features to prevent temporaries binding to const& for coroutines, but (according to a friend) the language doesn’t distinguish coroutines and subroutines at that level (… or something?)

Same problem with ranges. The footguns are remnants of abstracting something complex with a friendly interface and failing to secure it. It’s great for people who know the implementation, and it’s obvious where memory issues appear - but if you don’t know the implementation then you end up with holes in your feet.


Many other things in C++ extend the lifetime of a temporary bound to a const& (or else operator overloading wouldn't work), so it can be confusing.


Seconded. It's quite a nice article, held back by the lack of editing.

I live, eat, and breathe a few deeply technical things (including C++), but am hampered by a lack of proficiency at clear communication and expression. I'm becoming increasingly self-aware of this, and seeing a fine article like this, about a topic I understand, "in the wild" only underscores the importance. :)

I can feel myself automatically rewriting it, just like how typos jump out.

I wonder if that would be actually useful (a rewrite) to anyone.


This article hits on a number of interesting points. There is a lot of complexity to be aware of when using C++ coroutines. And a number of “normal” practices become dangerous in them, such as pass by reference.

That said, I think they are still very much worth it. Older asynchronous programming libraries in C++ are so verbose and so much worse than coroutines that it’s an obvious choice to use coroutines.

Also, there is another hazard that the author does not mention in this article: RAII lock wrappers. Holding a lock across suspension points is super dangerous. At best it wastes performance to leave it locked when blocked. At worst it can create deadlocks or corrupt the lock if it is released on a different thread than it was acquired.


>RAII lock wrappers

You don't even need coroutines for this to be dangerous. Holding locks over callback invocations is a pet peeve of mine in PR reviews. Callback invocations, like suspension points, can inject arbitrary operations into our code, which can easily break prior invariants, yet look innocuous for the casual reader.


I often just add a task to a runtime queue which gets called once the stack fully unwinds to avoid these sorts of issues. You may not be aware of what locks were acquired prior to your current function being called. Reentrant safe code is considerably more challenging. This might have some overhead as callback parameters have to be placed in the heap, but it's usually worthwhile.


As someone who writes in C++ and uses coroutines everyday for work, I find for our use case this is actually helpful.

We use seastar.io, a thread-per-core framework, and its locks are "async" friendly in that they yield for access instead of blocking. Also, fully embracing async message passing between threads simplifies the programming model a ton.


> it’s an obvious choice to use coroutines.

I agree, but the other choice is to have traditional threads of execution that block. This simple strategy has delivered more successful projects than any other.


I don't understand how it is any more verbose.


Because you have to keep explicitly passing state between each callback, rather than just using the same context (which still has the ability to delete things if needed).


State capture with lambdas is implicit, and only explicit if you want it to be.


A lot of these seem to be downsides of manual memory management in C++, not of stackless coroutines. The same kind of coroutines work fine in Python, JavaScript, and C#.

Some parts like the one about lazy coroutines seem to only be an issue because coroutines in C++ can theoretically resume on any thread. If they were restricted to the current thread by default (like in Python+Twisted I think?) then you would still be able to use them for many use cases but with less cognitive overhead.

The author seems to prefer stackful coroutines, aka green threads, which are essentially just user-mode threads. Handoff is implicit, deep inside 'blocking' functions. I rarely see the downsides of them discussed, but they have their own problems: they are still threads, so you often need locks. You don't know where a function will hand off, and you could accidentally call a really blocking function or run a long computation and ruin responsiveness. And the type of the function no longer reflects whether it is blocking or not (the famous colored functions).


They don't work just fine in C#; there is a reason one of the ASP.NET architects has written a guide of best practices.

https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/b...

Ironically, C++'s design is heavily related to C#, as the initial proposal was done by Microsoft and shares many of the same ideas, including how to create runtime aware awaitable types.


I've never used ASP.NET, I mean they work fine in GUI apps where you click a button, start downloading a file, and then show a message when the file has been downloaded.

Making code nonblocking without threads and without callbacks = happy case for async. Writing multithreaded servers focussed on throughput is a whole other can of worms, which is basically my point.


Isn't this wrong (the "green threads are just threads" part)? The green-thread / stack-switching implementations I've seen so far all used cooperative multitasking, e.g. you know exactly where control is handed back to the scheduler and don't need synchronization between green threads (assuming the scheduler keeps all green threads on the same OS thread, of course).

Blocking vs non-blocking can be solved with naming conventions, like Sync vs Async suffix (works well in node.js for instance)

(also getting rid of colored functions is a good thing!)


I don't know, I've always thought green threads refers to the coding style of Go or Java's Project Loom - you write code that looks like multithreading and call blocking methods like `socket.receive()`. And then deep down in each IO call, there is some magic that suspends the green thread, and resumes it when data is available.

I think the colored function thing is often thoroughly misunderstood. There is a real difference between a function that returns `string` vs. `Future<string>`. It's not arbitrary but just a matter of typing. Languages could have more syntactic sugar to bridge both worlds of course. And you can get rid of the distinction as goroutines etc. show.

But actually I wonder if it would be useful to keep some colors. Maybe you could have an effect system and mark functions as "computationally expensive" / "blocking" vs. "computationally trivial". The compiler would prevent you from calling the blocking functions from the GUI thread, but you could `async` or `go` them to another thread and resume when finished.


You have to be careful. Stackless coroutines will only schedule if you call co_await, but a stackful coroutine can be scheduled in the depths of any call.

This can trick 2 parts of your program into thinking they have exclusive access to the same thing at the same time. E.g. they could both grab the same thread local or both enter the critical section of a recursive mutex.


Coroutines have a performance cost associated with allocating a stack. C++ solved this by shipping something that is not coroutines and branding it coroutines anyway. Sometimes it still needs a heap allocation, but a constant-size and smaller one.

A coroutine is a thread of execution that can yield to another. The scheduler is thus under userspace control.

The C++ thing is syntax sugar over a control flow transform which looks kind of similar, except you have to annotate all the functions in the call tree and can't do anything that can't be desugared to the same runtime system that was there anyway.

The primary downside of C++ coroutines is then clear. They aren't coroutines, and by squatting on the name, make it borderline impossible that C++ will ever have coroutines. This is annoying as they're one of the things which really needs compiler support to do well.


C++ solved this by shipping something that is not coroutines and branding it coroutines anyway. … They aren't coroutines, and by squatting on the name, make it borderline impossible that C++ will ever have coroutines.

I am not a C++ person, but I protest this characterization. You obviously know the categories and I suppose you know the history, but I will recount them so that everyone else understands my objection.

O.G. coroutines emerged in a world where subroutines were, by and large, not reëntrant. Function parameters, local variables, and return addresses were static, and there was no call stack in the modern sense. This is why the old timers said that coroutines were a generalization of subroutines; only the jump instruction of the call/return really needed to change.

Once call stacks became common, coroutines fit awkwardly, and people tried to adapt them in various ways.¹ The world has settled on two designs for reconciling coroutines to the call stack: thick and thin coroutines. Thin coroutines allow suspension within the body of the coroutine but not within subroutine calls; this way the size of the coroutine’s state can be known at compile time and its interaction with the stack is relatively clear. Thick coroutines (i.e., green threads) can be suspended within a subroutine call, and thus require their own slices of stack — either separate from the main stack or copied from and to it as the coroutine is suspended and resumed.

Thin coroutines are absolutely coroutines. They are truer to the original definition than thick coroutines are! They are more limited than thick coroutines, true; whether that makes them better or worse is a matter of design trade-offs. But they certainly deserve the name.

[1] Simula 67, to my understanding, treated objects as a kind of coroutine instance where function definitions in the coroutine body became methods that closed over its local variables.


C++ had two competing designs for "coroutines". One is syntax sugar over a control flow graph rewrite with implicit state for keeping track of where to branch to. The other is syntax sugar over swapping stacks of execution.

The version that shipped can be done in the compiler front end. On the happy path it compiles to zero cost relative to writing the branches by hand. Machine architecture independent.

The version that didn't ship requires language runtime support. It involves allocating memory for the new stack and storing the live registers to it on yield. It's per-platform machine code, with varying overhead depending on how much control the compiler gives over calling conventions. Yield then looks a lot like a function call (and sometimes upsets branch predictors).

The full/stackful/green/thick/etc version works very like a posix thread without the pre-emptive scheduler, and needs language runtime support for exactly the same reasons that pthread_create does. They're zero cost if not used - they don't change the calling convention of other functions - but the yield usually can't be optimised out at compile time if they are used.

Naming things is indeed difficult and definitions do tend to shift over time. However the "C++ has coroutines now" feature box check doesn't bear up under scrutiny if one expects said coroutine to support the same operations that coroutines support in other languages.


Thanks for laying out the C++ situation. Your concern makes practical sense.


Experience teaches me that the worst time to use a new design pattern or technique is _right after you learn about it_. The problem in your code base you thought about while learning the pattern was a useful proxy for where it could be applied, but that doesn't mean it's the right fit.

Do it in a scratch refactoring, and wait a week or two before you consider merging it. And make sure you are emotionally as ready to discard as you are to land it.


I agree that you should always be ready and happy to discard or refactor code as needed. Requirements change, your assumptions may be wrong.

But in practice I've more often seen the opposite problem, where organizations end up stuck on C++11 for a decade for no technical reason. It's good to explore the new stuff and eventually adopt what you can use.


Better than c++98 which a lot of organizations are still on.


All true, but another trick is to do extensive web searches to see what kind of problems people have had with the new approach.


One problem I have with C++ coroutines is the implicit capture of the "this" pointer in member functions. One can also accidentally pass arguments by reference resulting in lifetime issues, but at least the parameter types are explicitly stated.

The post mentions that it might be the caller's or the callee's responsibility to keep the object alive until the end of the coroutine. This is purely based on conventions however, and different libraries might have different conventions. If the caller is responsible for it, extra care needs to be taken whenever the function is called -- the code shown in the post seems fairly complex to me and easy to get wrong. Also, I am not sure how the callee could safely implement keeping the object alive if lazy coroutines are used: Even if the first statement in the coroutine is retaining a strong reference on the object, there might still be a time between the call and the initial resume of the coroutine where the object is destroyed. I think it would have been great to provide explicit capture lists for coroutines, similar to lambdas.

All of this gets especially confusing once you try to use lambdas together with coroutines. AFAIK, C++ lambdas are basically just structs overloading operator(). In a coroutine, only the "this" pointer of the structure is captured and the caller needs to ensure that the object is not only alive, but also at the same memory address until the end of the coroutine. This is very easy to get wrong in my experience.


Eh, I use them and am quite productive with them. Some of the downsides I don't really buy, for example, the argument regarding allocations. In a typical task engine, you're allocating state per task anyways. Sure you could have custom arenas and such, but you can do that with C++ coroutines also by overriding operator new/delete on the promise object. The lifetime concerns are par for the course when it comes to async stuff (assuming that's how you're using C++ coroutines).


Generators, which already exist in the stdlib, are an example where heap elision would be useful but is currently unreliable in C++. There is a paper, "Explicit Coroutine Allocation", that will likely solve this in C++26. The Clang IR project will also improve HALO for the future of (Clang) C++ projects.


It’s in fashion to dislike fibers, but they’re a simpler solution that, IMHO, beats coroutines for the wide majority of cases. Even threads are a better solution for most cases. Coroutines are like the checked exceptions of C++.


There is nothing inherently asynchronous about coroutines. You can use them to model concurrency or even parallelism, but that's only a subset of their use cases.


What are some of the other use cases?


Many coroutine uses are not asynchronous but synchronous: they block when resumed and do not execute in parallel. This permits cooperative multitasking, versus preemptive (or preemptive with a bunch of locks to imitate cooperative, which is, of course, a waste).

Since they can, in principle, execute within the same thread (with C++'s implementation and some others you, the programmer, can send them off to other threads for execution, but that's an explicit choice), this can simplify concurrent system design and execution, in the "concurrency is not parallelism" sense. In the single-threaded case, it's also faster than multithreaded asynchronous code, since context switching (modulo cache misses) is greatly reduced. Especially useful when you want synchrony and not asynchrony.

They're also very useful if you've ever had to create a bare metal multitasking system. Much easier for state management than older style "while (true)" loops with a million state variables so functions can resume via a switch/case as pseudo-coroutines. (Well, easier if you don't have to implement the coroutine mechanism yourself.)


Generators come to my mind.


I do not yet understand (open to explanations!) the difference between stackless and stackful coroutines on this point: stackless frames are supposed to be cheap, and even "collapsible" into their parents' lifetimes when nested, but if that elision doesn't happen... stackful is cheaper.

Aren't stackless coroutines supposed to be more performant? In which cases? Yes, I know about their virality, potential heap allocations, etc.


Two differences:

First, stackful coroutines use the coroutine stack for everything they do. Stackless coroutines can use the normal thread stack for synchronous calls, and that stack can be shared across any number of coroutines. Per-coroutine allocation is only needed for asynchronous calls.

Second, for stackful coroutines you need to allocate the entire stack up front, and usually you have no way of knowing how much stack might be needed, so you need a conservative upper bound. Normal thread stacks have sizes in megabytes. (That doesn't necessarily correspond to actual memory consumption, since the OS will only reserve physical memory as needed, but the physical reservation for a given stack can only grow, not shrink. And even just allocating the virtual space has a cost.) Most of the time you can get away with stacks that are much smaller, only a few kilobytes, but at the cost of potentially crashing when you've consumed too much stack; it's hard to statically analyze maximum stack usage.

Stackless coroutines will, in general, only allocate memory as needed for each coroutine invocation, so not only are you wasting less memory, you don't have to worry about hitting an arbitrary limit. Allocation elision makes things more complicated since, as the blog post notes, you can end up wasting some memory, but compared to stackful coroutines it's peanuts. But they have the downside that heap allocations and deallocations are expensive; plus, splitting a "stack" of nested calls into separate heap allocations, usually far away from each other in memory, is worse for cache locality.


Technically you could manually grow coroutine stacks the same way the kernel does for thread stacks, by mapping on fault and periodically unmapping everything beyond the red zone. But the complexity would be significant, and it would be hard to make efficient without kernel support.


For a while there was an exciting patch for gcc called split stacks that provided a little thunk for every function -- one normally bypassed, but which stackless coroutines could opt in to call -- that would check if more stack had to be allocated, but I think the story was that Go was the primary potential customer for it and they decided to just give up on the dream :(.


You can use segmented stacks in c++ just fine I think. I believe boost.coroutine supports it. The problem is the additional overhead and the impossibility to link against any non-split stack code.


Still simpler than rust.


"C++" and "Coroutines". Who would have thought.

But, considering the accelerated releases post C++ 11, I guess I'm not surprised.


Anyone that has read "Design and Evolution of C++".


The best solution remains writing asynchronous code in a way that is explicitly asynchronous.

Who would have thought?


Besides the fact that they are yet another feature in the bloated hot mess that is C++?




