
How is this result surprising? The point of coroutines isn't to make your code execute faster, it's to prevent your process sitting idle while it waits for I/O.

When you're dealing with external REST APIs that take multiple seconds to respond, then the async version is substantially "faster" because your process can get some other useful work done while it's waiting. Obviously the async framework introduces some overhead, but that bit of overhead is probably a lot less than the 3 billion cpu cycles you'll waste waiting 1000ms for an external service.
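
To make the overlap concrete, here's a minimal sketch, with asyncio.sleep standing in for the slow external API call (no real HTTP client needed to show the effect):

    import asyncio
    import time

    async def call_slow_api(i):
        # Stand-in for an external service that takes ~1s to respond.
        await asyncio.sleep(1.0)
        return f"response {i}"

    async def main():
        start = time.perf_counter()
        # Ten 1-second "requests" overlap, so this takes ~1s, not ~10s.
        results = await asyncio.gather(*(call_slow_api(i) for i in range(10)))
        print(len(results), "responses in", round(time.perf_counter() - start, 2), "s")

    asyncio.run(main())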



I think it is surprising to a lot of people who do take it as read that async will be faster.

As I describe in the first line of my article, I don't think that people who think async is faster have unreasonable expectations. It seems very intuitive to assume that greater concurrency would mean greater performance - at least on some measure.

> When you're dealing with external REST APIs that take multiple seconds to respond, then the async version is substantially "faster" because your process can get some other useful work done while it's waiting.

I'm afraid I also don't think you have this right conceptually. An async implementation that does multiple ("embarrassingly parallel") tasks in the same process - whether that is DB IO waiting or microservice IO waiting - is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things. In fact in practice an async version is normally lower throughput, higher latency and more fragile. This is really what I'm getting at when I say async is not faster.

Fundamentally, you do not waste "3 billion cpu cycles" waiting 1000ms for an external service. Making alternative use of the otherwise idle CPU is the purpose (and IMO the proper domain of) operating systems.


> Fundamentally, you do not waste "3 billion cpu cycles" waiting 1000ms for an external service. Making alternative use of the otherwise idle CPU is the purpose (and IMO the proper domain of) operating systems.

Sure, the operating system can find other things to do with the CPU cycles when a program is IO-locked, but that doesn't help the program that you're in the situation of currently trying to run.

> An async implementation that does multiple ("embarrassingly parallel") tasks in the same process - whether that is DB IO waiting or microservice IO waiting - is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things. In fact in practice an async version is normally lower throughput, higher latency and more fragile. This is really what I'm getting at when I say async is not faster.

You're right. "Arbitrary programs will run faster" is not the promise of Python async.

Python async does help a program work faster in the situation that phodge just described (waiting for web requests, or waiting for a slow hardware device), since the program can do other things while waiting for the locked IO (unlike a Python program that does not use async and could only proceed linearly through its instructions). That's the problem that Python asyncio purports to solve. It is still subject to the Global Interpreter Lock, meaning it's still bound to one thread. (Python's multiprocessing library is needed to overcome the GIL and break a program out into multiple processes, at the cost that cross-process communication now becomes expensive).


> unlike a Python program that does not use async and could only proceed linearly through its instructions

This isn't how it works. While Python is blocked in I/O calls, it releases the GIL so other threads can proceed. (If the GIL were never released then I'm sure they wouldn't have put threading in the Python standard library.)
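
A minimal sketch of that behaviour, using time.sleep as a stand-in for a blocking network call (like socket reads, it releases the GIL while blocked):

    import time
    from concurrent.futures import ThreadPoolExecutor

    def blocking_io(i):
        time.sleep(1.0)  # the GIL is released while blocked here
        return i

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(blocking_io, range(10)))
    # Ten 1-second blocking "calls" finish in ~1s total, not ~10s.
    print(len(results), "calls in", round(time.perf_counter() - start, 2), "s")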

> Python's multiprocessing library is needed to overcome the GIL

This is technically true, in that if you are running up against the GIL then the only way to overcome it is to use multiprocessing. But blocking IO isn't one of those situations, so you can just use threads.

The comparison here is not async vs just doing one thing. It's async vs threads. I believe that's what the performance comparison in the article is about, and if threads were as broken as you say then obviously they wouldn't have performed better than asyncio.

--------

As an aside, many C-based extensions also release the GIL when performing CPU-bound computations, e.g. numpy and scipy. So the GIL doesn't even prevent you from using multithreading in CPU-heavy applications, so long as they are relatively large operations (e.g. a few calls to multiply huge matrices together would parallelise well, but many calls to multiply tiny matrices together would heavily contend the GIL).
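
A rough sketch of the "large operations" case (the exact speedup depends on your BLAS build, which may already parallelise internally):

    import time
    from concurrent.futures import ThreadPoolExecutor

    import numpy as np

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)

    def big_matmul(_):
        return a @ b  # numpy releases the GIL inside this C-level call

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(big_matmul, range(4)))
    print("4 large matmuls in", round(time.perf_counter() - start, 2), "s")

Shrink the matrices to 20x20 and bump the call count, and you'd expect GIL contention to eat most of the gain.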


> > Python's multiprocessing library is needed to overcome the GIL

> No it's not, just use threads.

I just wanted to expand on this a little to describe some of the downsides to threads in Python.

Multi-threaded logic can be (and often is) slower than single-threaded logic because threading introduces overhead of lock contention and context switching. David Beazley did a talk illustrating this in 2010:

https://www.youtube.com/watch?v=Obt-vMVdM8s

He also did a great talk about coroutines in 2015 where he explores threading and coroutines a bit more:

https://www.youtube.com/watch?v=MCs5OvhV9S4&t=525s

In workloads that are often "blocked", like network calls or other I/O bound workloads, threads can provide similar benefits to coroutines but with overhead. Coroutines seek to provide the same benefit without as much overhead (no lock contention, fewer context switches by the kernel).

It's probably not the right guidelines for everyone but I generally use these when thinking about concurrency (and pseudo-concurrency) in Python:

- Coroutines where I can.

- Multi-processing where I need real concurrency.

- Never threads.


Ah ha! Now we have finally reached the beginning of the conversation :-)

The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO. The point of the article is to dispel that belief. There seems to be some debate about whether the article is really representative, and I'm very curious about that. But then the parent comment to mine took us on an unproductive detour based on the misconception that Python threads don't work at all. Now your comment has brought up that original belief again, but you haven't referenced the article at all.


I didn't reference the article because I provided more detailed references which explore the difference between threads and coroutines in Python to a much greater depth.

The point of my comment is to say that neither threads nor coroutines will make Python _faster_ in and of themselves. Quite the opposite in fact: threading adds overhead, so unless the benefit is greater than the overhead (e.g. lock contention and context switching) your code will actually be net slower.

I can't recommend the videos I shared enough, David Beazley is a great presenter. One of the few people who can do talks centered around live coding that keep me engaged throughout.

> The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO. The point of the article is to dispel that belief.

The disconnect here is that this article isn't claiming that asyncio is not faster than threads. In fact the article only claims that asyncio is not a silver bullet guaranteed to increase the performance of any Python logic. The misconception it is trying to clear up, in its own words is:

> Sadly async is not go-faster-stripes for the Python interpreter.

What I, and many others are questioning is:

A) Is this actually as widespread a belief as the article claims it to be? None of the results are surprising to me (or apparently some others).

B) Is the article accurate in its analysis and conclusion?

As an example, take this paragraph:

> Why is this? In async Python, the multi-threading is co-operative, which simply means that threads are not interrupted by a central governor (such as the kernel) but instead have to voluntarily yield their execution time to others. In asyncio, the execution is yielded upon three language keywords: await, async for and async with.

This is a really confusing paragraph because it seems to mix terminology. A short list of problems in this quote alone:

- Async Python != multi-threading.

- Multi-threading is not co-operatively scheduled; threads are indeed interrupted by the kernel (context switches between threads in Python do actually happen).

- Asyncio is co-operatively scheduled and pieces of logic have to yield to allow other logic to proceed. This is a key difference between Asyncio (coroutines) and multi-threading (threads).

- Asynchronous Python can be implemented using coroutines, multi-threading, or multi-processing; it's a common noun but the quote uses it as a proper noun leaving us guessing what the author intended to refer to.

Additionally, there are concepts and interactions which are missing from the article such as the GIL's scheduling behavior. In the second video I shared, David Beazley actually shows how the GIL gives compute intensive tasks higher priority which is the opposite of typical scheduling priorities (e.g. kernel scheduling) which leads to adverse latency behavior.

So looking at the article as a whole, I don't think the underlying intent of the article is wrong, but the reasoning and analysis presented is at best misguided. Asyncio is not a performance silver bullet, it's not even real parallelism. Multi-processing and use of C extensions are the bigger bang for the buck when it comes to performance. But none of this is surprising and is expected if you really think about the underlying interactions.

To rephrase what you think I thought:

> The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO.

Is actually more like:

> Asyncio is more efficient than multi-threading in Python. It is also comparatively more variable than multi-processing, particularly when dealing with workloads that saturate a single event loop. Neither multi-threading nor asyncio is actually parallel in Python; for that you have to use multi-processing to escape the GIL (or some C extension which you trust to safely execute outside of GIL control).

---

Regarding your aside example, it's true some C extensions can escape the GIL, but oftentimes it's with caveats and careful consideration of where/when you can escape the GIL successfully. Take for example this scipy cookbook regarding parallelization:

https://scipy-cookbook.readthedocs.io/items/ParallelProgramm...

It's not often the case that using a C extension will give you truly concurrent multi-threading without significant and careful code refactoring.


For single processes you’re right, but this article (and a lot of the activity around asyncio in Python) is about backend webdev, where you’re already running multiple app servers. In this context, asyncio is almost always slower.


> But blocking IO isn't one of those situations, so you can just use threads.

Threads and async are not mutually exclusive. If your system resources aren't heavily loaded, it doesn't matter, just choose the library you find most appropriate. But threads require more system overhead, and eventually adding more threads will reduce performance. So if it's critical to thoroughly maximize system resources, and your system cannot handle more threads, you need async (and threads).


> But threads require more system overhead, and eventually adding more threads will reduce performance.

Absolutely false. OS threads are orders of magnitude lighter than any Python coroutine implementation.


> OS threads are orders of magnitude lighter than any Python coroutine implementation.

But python threads, which have extra weight on top of a cross-platform abstraction layer on top of the underlying OS threads, are not lighter than python coroutines.

You aren't choosing between Python threads and unadorned OS threads when writing Python code.


You're absolutely right.

I'm pointing out that this is a Python problem, not a threads problem, a fact which people don't understand.


Everyone has been discussing the relative performance of different techniques within Python; there is neither a basis to suggest that people don't understand that aspects of this are Python specific, nor a reason to think that it is even particularly relevant to the discussion.


Okay, then let's do a bakeoff! You outfit a Python webserver that only uses threads, and I'll outfit an identical webserver that also implements async. The server handling the most requests/sec wins. I get to pick the workload.


FWIW, I have a real world Python3 application that does the following:

- receives an HTTP POST multipart/form-data that contains three file parts. The first part is JSON.

- parses the form.

- parses the JSON.

- depending upon the JSON accepts/rejects the POST.

- for accepted POSTs, writes the three parts as three separate files to S3.

It runs behind nginx + uwsgi, using the Falcon framework. For parsing the form I use streaming-form-data which is cython accelerated. (Falcon is also cython accelerated.)

I tested various deployment options. cpython, pypy, threads, gevent. Concurrency was more important than latency (within reason). I ended up with the best performance (measured as highest RPS while remaining within tolerable latency) using cpython+gevent.

It's been a while since I benchmarked and I'm typing this up from memory, so I don't have any numbers to add to this comment.


Each Linux thread has at least an 8MB virtual memory overhead. I just tested it, and was able to create one million coroutines in a few seconds and with a few hundred megabytes of overhead in Python. If I created just one thousand threads, it would take possibly 8 gigs of memory.
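
For anyone who wants to reproduce the coroutine half of this, a sketch along those lines (exact numbers will vary by Python version):

    import tracemalloc

    async def noop():
        pass

    tracemalloc.start()
    # Creating coroutine objects is cheap; none of them actually run here.
    coros = [noop() for _ in range(1_000_000)]
    current, peak = tracemalloc.get_traced_memory()
    print(f"{len(coros)} coroutines, ~{peak / 1e6:.0f} MB allocated")

    # Close them to silence "coroutine was never awaited" warnings.
    for c in coros:
        c.close()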


Virtual memory is not memory. You're effectively just bumping an offset; there are no actual allocations involved.

> ...it would take possibly 8 gigs of memory.

No. Nothing is 'taken' when virtual memory is requested.


But have you tried creating one thousand OS threads and measuring the actual memory usage? If I recall correctly I read some article where it was explained that threads in Linux are not actually claiming their 8MB each quite so literally. I need to recheck that later.


You're right, I've read the same. Using Python 3.8, creating 12,000 threads with `time.sleep` as the target clocks in at 200MB of resident memory.
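
A sketch of that measurement, for anyone who wants to repeat it (the resource module is Unix-only, and ru_maxrss is kilobytes on Linux but bytes on macOS):

    import resource
    import threading
    import time

    def sleeper():
        time.sleep(60)

    threads = [threading.Thread(target=sleeper, daemon=True) for _ in range(1000)]
    for t in threads:
        t.start()

    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux
    print(f"{len(threads)} threads, ~{rss_kb / 1024:.0f} MB resident")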


People seem to keep misunderstanding the GIL. It's the Global Interpreter Lock, and it's effectively the lock around all Python objects and structures. This is necessary because Python objects have no thread ownership model, and the development team does not want per-object locks.

During any operation that does not need to modify Python objects, it is safe to unlock the GIL. Yielding control to the OS to wait on I/O is one such example, but doing heavy computation work in C (e.g. numpy) can be another.


To clarify that the CPython devs aren't being arbitrary here: There have been attempts at per-object or other fine-grained locking, and they appear to be less performant than a GIL, particularly for the single-threaded case.

Single-threaded performance is a major issue as that's most Python code.


Yes. I expect generic fine-grained locking, especially per-object locks, to be less performant for multi-threaded code too, as locks aren't cheap, and even with the GIL, lock overhead could still be worse than a good scheduler.

Any solution which wants to consider per-object locking has to consider removing refcounting, or locking the refcount bits separately, as locking/unlocking objects to twiddle their refcounts is going to be ridiculously expensive.

Ultimately, the Python ownership and object model is not conducive to proper threading, as most objects are global state and can be mutated by any thread.


Instead of disagreeing with some of your vague assertions I'll just make my own points for people that want to consider using async.

Workers (which usually live in a new process) are not efficient. Processes are extremely expensive and subjectively harder for exception handling. Threads are lighter weight... and even better are async implementations that use a much more scalable FSM to handle this.

Offloading work to things not subject to the GIL is the reason async Python got so much traction. It works really well.


This is often a point of confusion for people when looking at Erlang, Elixir or Go code. Concurrency beyond leveraging available CPUs doesn't really add any advantage.

On the web, when the bulk of your application code time is waiting on APIs, database queries, external caches or disk I/O, it creates a dramatic increase in the capacity of your server if you can do it with minimal RAM overhead.

It's one of the big reasons I've always wanted to see Techempower create a test version that continues to increase concurrency beyond 512 (as high as maybe 10k). I think it would be interesting.


> On the web, when the bulk of your application code time is waiting on APIs, database queries, external caches or disk I/O, it creates a dramatic increase in the capacity of your server if you can do it with minimal RAM overhead.

Python doesn't block on I/O.


Of course it does.


It releases the GIL.

Edit: sorry I can do better.

If you're using async/await to not block on I/O while handling a request, you still have to wait for that I/O to finish before you return a response. Async adds overhead because you schedule the coroutine and then resume execution.

The OS is better at scheduling these things because it can do it in kernel space in C. Async/await pushes that scheduling into user space, sometimes in interpreted code. Sometimes you need that, but very often you don't. This is in conflict with "async the world", which effectively bakes that overhead into everything. This explains the lower throughput, higher latency, and higher memory usage.

So effectively this means "run more processes/threads". If you can only have 1 process/thread and cannot afford to block, then yes async is your only option. But again that case is pretty rare.
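
A crude micro-benchmark sketch of that user-space scheduling cost: each await asyncio.sleep(0) forces one round trip through the event loop, with no I/O involved at all.

    import asyncio
    import time

    N = 100_000

    def sync_loop():
        for _ in range(N):
            pass  # no-op baseline

    async def async_loop():
        for _ in range(N):
            await asyncio.sleep(0)  # yields to the event loop each time

    start = time.perf_counter()
    sync_loop()
    print(f"sync baseline:    {time.perf_counter() - start:.3f}s")

    start = time.perf_counter()
    asyncio.run(async_loop())
    print(f"event-loop trips: {time.perf_counter() - start:.3f}s")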


From my understanding the primary use of concurrency in Erlang/Elixir is for isolation and operational consistency. Do you believe that not to be the case?


The primary use of concurrency in Erlang is modelling a world that is concurrent.

If you go back to the origins of Erlang, the intent was to build a language that would make it easier to write software for telecom (voice) switches; what comes out of that is one process for each line, waiting for someone to pick up the line and dial or for an incoming call to make the line ring (and then connecting the call if the line is answered). Having this run as an isolated process allows for better system stability --- if someone crashes the process attached to their line, the switch doesn't lose any of the state for the other lines.

It turns out that a 1980s design for operational excellence works really well for (some) applications today. Because the processes are isolated, it's not very tricky to run them in parallel. If you've got a lot of concurrent event streams (like users connected via XMPP or HTTP), assigning each a process makes it easy to write programs for them, and because Erlang processes are significantly lighter weight than OS processes or threads, you can have millions of connections to a machine, each with its own process.

You can absolutely manage millions of connections in other languages, but I think Erlang's approach to concurrency makes it simpler to write programs to address that case.


That's a big topic. The shortest way I can summarize it though:

Immutable data, heap isolated by concurrent process and lack of shared state, combined with supervision trees made possible because of extremely low overhead concurrency, and preemptive scheduling to prevent any one process from taking over the CPU...create that operational consistency.

It's a combination of factors that have gone into the language design that make it all possible though. Very big and interesting topic.

But it does create a significant capacity increase. Here's a simple example with websockets.

https://dockyard.com/blog/2016/08/09/phoenix-channels-vs-rai...


this is true for compiled languages such as the ones you mention, but generally does not apply to Python, which as an interpreted language tends to add CPU overhead to even the smallest tasks.


A CPU can do billions of operations every second. When you have 200ms for every request, that overhead is not that large; you're still blocked by I/O.


For local services like databases, real-world benchmarks disagree.


You should add that you mean just databases. I've just looked at your profile and, as I understand it, that's your focus.

I built a service that was making a lot of requests. So many that at some point we ran out of the 65k connection limit for basic Linux polling (we needed to switch to kpoll). Some time after that we ran out of other resources, and switching from threads to threads+greenlets really solved our problem.


>... is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things.

This is very true, especially when actual work is involved.

Remember, the kernel uses the exact same mechanism to have a process wait on a synchronous read/write as it does for a process issuing epoll_wait. Furthermore, isolating tasks into their own processes (or, sigh, threads), allows the kernel scheduler to make much better decisions, such as scheduling fairness and QoS to keep the system responsive under load surges.

Now, async might be more efficient if you serve extreme numbers of concurrent requests from a single thread if your request processing is so simple that the scheduling cost becomes a significant portion of the processing time.

... but if your request processing happens in Python, that's not the case. Your own scheduler implementation (your event loop) will likely also end up eating some resources (remember, you're not bypassing anything, just duplicating functionality), and is very unlikely to be as smart or as fair as that of the kernel. It's probably also entirely unable to do parallel processing.

And this is all before we get into the details of how you easily end up fighting against the scheduler...


Yeah except nodejs will beat flask in this same exact benchmark. Explain that.


CPython doesn't have a JIT, while node.js does. If you want to compare apples to apples, try looking at Flask running on PyPy.


Edit: after reading the article, I guess it's safe to say that everything below is false :)

---

I'd guess the C++ event loop is more important than the JIT?

Maybe a better comparison is quart (with eg uvicorn)

https://pgjones.gitlab.io/quart/

https://www.uvicorn.org/

Or Sanic / uvloop?

https://sanicframework.org/

https://github.com/MagicStack/uvloop


Plain sanic runs much faster than the uvicorn-ASGI-sanic stack used in the benchmark, and the ASGI API in the middle is probably degrading other async frameworks' performance too. But then this benchmark also has other major issues, like using HTTP/1.0 without keep-alive in its Nginx proxy_pass config (keep-alive again has a huge effect on performance, and would be enabled on real performance-critical servers). https://sanic.readthedocs.io/en/latest/sanic/nginx.html


Interesting, thank you. I wasn't aware nginx was so conservative by default.

https://nginx.org/en/docs/http/ngx_http_proxy_module.html#pr...


You're not completely off. There might be issues with async/await overhead that would be solved by a JIT, but also if you're using asyncio, the first _sensible_ choice to make would be to swap out the default event loop with one actually explicitly designed to be performant, such as uvloop's one, because asyncio.SelectorEventLoop is designed to be straightforward, not fast.
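
Swapping it in is small; a sketch (uvloop is a third-party package, pip install uvloop):

    import asyncio

    import uvloop

    # Make every subsequently created event loop a libuv-based one
    # instead of the default pure-Python SelectorEventLoop.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

    async def main():
        print(type(asyncio.get_running_loop()))  # a uvloop loop

    asyncio.run(main())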

There's also the major issue of backpressure handling, but that's a whole other story, and not unique to Python.

My major issue with the post I replied to is that there are a bunch of confounding issues that make the comparison given meaningless.


The database is the bottleneck. JIT or even C++ shouldn't even be a factor here. Something is wrong with the python implementation of async await.


If I/O-bound tasks are the problem, that would tend to indicate an issue with the I/O event loop, not with Python and its async/await implementation. If the default asyncio.SelectorEventLoop is too slow for you, you can subclass asyncio.AbstractEventLoop and implement your own, such as building one on top of uvloop. And somebody's already done that: https://github.com/MagicStack/uvloop

Moreover, even if there's _still_ a discrepancy, unless you're profiling things, the discussion is moot. This isn't to say that there aren't problems (there almost certainly are), but that you should get as close as possible to an apples-to-apples comparison first.


When I talk about async await I'm talking about everything that encompasses supporting that syntax. This includes the I/O event loop.

So really we're in agreement. You're talking about reimplementing python specific things to make it more performant, and that is exactly another way of saying that the problem is python specific.


No, we're not in agreement. You're confounding a bunch of independent things, and that is what I object to.

It's neither fair nor correct to mush together CPython's async/await implementation with the implementation of asyncio.SelectorEventLoop. They are two different things and entirely independent of one another.

Moreover, it's neither fair nor correct to compare asyncio.SelectorEventLoop with the event loop of node.js, because the former is written in pure Python (with performance only tangentially in mind) whereas the latter is written in C (libuv). That's why I pointed you to uvloop, which is an implementation of asyncio.AbstractEventLoop built on top of libuv. If you want to even start with a comparison, you need to eliminate that confounding variable.

Finally, the implementation matters. node.js uses a JIT, while CPython does not, giving them _much_ different performance characteristics. If you want to eliminate that confounding variable, you need to use a Python implementation with a JIT, such as PyPy.

Do those two things, and then you'll be able to do a fair comparison between Python and node.js.


Except the problem here is that those tests were bottlenecked by IO. Whether you're testing C++, pypy, libuv, or whatever, it doesn't matter.

All that matters is the concurrency model because that application he's running is barely doing anything else except IO and anything outside of IO becomes negligible because after enough requests, those sync worker processes will all be spending the majority of their time blocked by an IO request.

The basic essence of the original claim is that sync is not necessarily better than async for all cases of high IO tasks. I bring up node as a counter example because that async model IS Faster for THIS same case. And bringing up node is 100% relevant because IO is the bottleneck, so it doesn't really matter how much faster node is executing as IO should be taking most of the time.

Clearly and logically the async concurrency model is better for these types of tasks so IF tests indicate otherwise for PYTHON then there's something up with python specifically.

You're right, we are in disagreement. I didn't realize you completely failed to understand what's going on and felt the need to do an apples to apples comparison when such a comparison is not Needed at all.


No, I understand. I just think that your comparison with _node.js_ when there are a bunch of confounding variables is nonsense. Get rid of those and then we can look at why "nodejs will beat flask in this same exact benchmark".


> I just think that your comparison with _node.js_ when there are a bunch of confounding variables is nonsense

And I'm saying all those confounding variables you're talking about are negligible and irrelevant.

Why? Because the benchmark test in the article is a test where every single task is 99% bound by IO.

What each task does is make a database call AND NOTHING ELSE. Therefore you can safely say that for either python or Node request less than 1% of a single task will be spent on processing while 99% of the task is spent on IO.

You're talking about scales on the order of 0.01% vs. 0.0001%. Sure maybe node is 100x faster, but it's STILL NEGLIGIBLE compared to IO.

It is _NOT_ nonsense.

You Do not need an apples to apples comparison to come to the conclusion that the problem is Specific to the python implementation. There ARE NO confounding variables.


> And I'm saying all those confounding variables you're talking about are negligible and irrelevant.

No, you're asserting something without actual evidence, and the article itself doesn't actually state that either: it contains no breakdown of where the time is spent. You're assuming the issue lies in one place (Python's async/await implementation) when there are a bunch of possible contributing factors _which have not been ruled out_.

Unless you've actually profiled the thing and shown where the time is used, all your assertions are nonsense.

Show me actual numbers. Prove there are no confounding variables. You made an assertion that demands evidence and provided none.


>Unless you've actually profiled the thing and shown where the time is used, all your assertions are nonsense.

It's data science that is causing this data driven attitude to invade people's minds. Do you not realize that logic and assumptions take a big role in drawing conclusions WITHOUT data? In fact if you're a developer you know about a way to DERIVE performance WITHOUT a single data point or benchmark or profile. You know about this method, you just haven't been able to see the connections and your model about how this world works (data driven conclusions only) is highly flawed.

I can look at two algorithms and I can derive with logic alone which one is O(N) and which one is O(N^2). There is ZERO need to run a benchmark. The entire theory of complexity is a mathematical theory used to assist us at arriving AT PERFORMANCE conclusions WITHOUT EVIDENCE/BENCHMARKS.

Another thing you have to realize is the importance of assumptions. Things like 1 + 1 = 2 will always remain true, and a profile or benchmark run on a specific task is an accurate observation of THAT task. These are both reasonable assumptions to make about the universe. They are also the same assumptions YOU are making every time you ask for EVIDENCE and benchmarks.

What you aren't seeing is this: The assumptions I AM making ARE EXACTLY THE SAME: reasonable.

>you're asserting something without actual evidence, and the article itself doesn't actually state that either: it contains no breakdown of where the time is spent

Let's take it from the top, shall we?

I am making the assumption that tasks done in parallel ARE Faster than tasks done sequentially.

The author specifically stated he made a server where each request fetches a row from the database. And he is saying that his benchmark consisted of thousands of concurrent requests.

I am also making the assumption that for thousands of requests and thousands of database requests MOST of the time is spent on IO. It's similar to deriving O(N) from a for loop. I observe the type of test the author is running and I make a logical conclusion on WHAT SHOULD be happening. Now you may ask why is IO specifically taking up most of the time of a single request a reasonable assumption? Because all of web development is predicated on this assumption. It's the entire reason why we use inefficient languages like python, node or Java to run our web apps instead of C++, because the database is the bottleneck. It doesn't matter if you use python or ruby or C++, the server will always be waiting on the db. It's also a reasonable assumption given my experience working with python and node and databases. Databases are the bottleneck.

Given this highly reasonable assumption, and in the same vein as using complexity theory to derive performance speed, it is highly reasonable for me to say that the problem IS PYTHON SPECIFIC. No evidence NEEDED. 1 + 1 = 2. I don't need to put that into my calculator 100 times to get 100 data points for some type of data driven conclusion. It's assumed and it's a highly reasonable assumption. So reasonable that only an idiot would try to verify 1 + 1 = 2 using statistics and experiments.

Look, you want data and no assumptions? First you need to get rid of the assumption that a profiler and benchmark is accurate and truthful. Profile the profiler itself. But then you're making another assumption: The profiler that profiled the profiler is accurate. So you need to get me data on that as well. You see where this is going?

There is ZERO way to make any conclusion about anything without making an assumption. And Even with an assumption, the scientific method HAS NO way of proving anything to be true. Science functions on the assumption that probability theory is an accurate description of events that happen in the real world AND even under this assumption there is no way to sample all possible EVENTS for a given experiment so we can only verify causality and correlations to a certain degree.

The truth is blurry and humans navigate through the world using assumptions, logic and data. To intelligently navigate the world you need to know when to make assumptions and when to use logic and when data driven tests are most appropriate. Don't be an idiot and think that everything on the face of the earth needs to be verified with statistics, data and A/B tests. That type of thinking is pure garbage and it is the same misguided logic that is driving your argument with me.


Buddy, you can make all the "logical arguments" you want, but if you can't back them up with evidence, you're just making guesses.


Nodejs is faster than Python as a general rule, anyway. As I understand, Nodejs compiles Javascript, Python interprets Python code.

I do a lot of Django and Nodejs and Django is great to sketch an app out, but I've noticed rewriting endpoints in Nodejs directly accessing postgres gets much better performance.

Just my 2c


CPython, the reference implementation, interprets Python. PyPy interprets and JIT compiles Python, and more exotic things like Cython and Grumpy statically compiles Python (often through another, intermediate language like C or Go).

Node.js, using V8, interprets and JIT compiles JavaScript.

Although note that, while Node.js is fast relative to Python, it's still pretty slow. If you're writing web-stuff, I'd recommend Go instead for casually written, good performance.


The comparison between Django and no-ORM is a bit weird, given that rewriting your endpoint in Python without Django or an ORM would also have produced better results, I suppose.


Right but this test focused on concurrent IO. The bottleneck is not the interpreter but the concurrency model. It doesn't matter if you coded it in C++, the JIT shouldn't even be a factor here because the bottleneck is IO and therefore ONLY the concurrency model should be a factor here. You should only see differences in speed based off of which model is used. All else is negligible.

So you have two implementations of async that are both bottlenecked by IO. One is implemented in node. The other in python.

The node implementation behaves as expected in accordance with theory, meaning that for thousands of IO bound tasks it performs faster than a fixed number of sync worker threads (say 5 threads).

This makes sense right? Given thousands of IO bound tasks, eventually all 5 threads must be doing IO and therefore blocked on every task, while the single threaded async model is always context switching whenever it encounters an IO task so it is never blocked and it is always doing something...

Meanwhile the python async implementation doesn't perform in accordance with theory. 5 async workers is slower than 5 sync workers on IO bound tasks. 5 sync workers should eventually be entirely blocked by IO and the 5 async workers should never be blocked ever... Why is the python implementation slower? The answer is obvious:

It's python specific. It's python that is the problem.


JIT compiler.


Bottleneck is IO. Concurrency model should be the limiting factor here.

NodeJS is faster than flask because of the concurrency model and NOT because of the JIT.

The python async implementation being slower than the python sync implementation means one thing: Something is up with python.

The poster implies that with the concurrency model the outcome of these tests is expected.

The reality is, these results are NOT expected. Something is going on specifically with the python implementation.


You mean express.js ?


NodeJS primitives are enough to produce the same functionality as flask without the need for an extra framework.


Async IO was in large part a response to "how can my webserver handle xx thousand connections per second" (or in the case of Erlang "how do you handle millions of phone calls at once"). Starting 15 threads to do IO works great, but once you wait for hundreds of things at once the overhead from context switching becomes a problem, and at some point the OS scheduler itself becomes a problem


Not really. At least on Linux, the scheduler is O(1). There is no difference between one process waiting for 10k connections, or 10k processes waiting for 1 each. And there is hardly a context switch either, if all these 10k processes use the same memory map (as threads do).

I've tested this extensively on Linux. There is no more CPU used for threads vs epoll.

On the other hand, if you don't get the epoll implementation exactly right, you may end up with many spurious calls. E.g. simply reading slow data from a socket in golang on Linux incurs considerable overhead: a first read that is short, another read that returns EWOULDBLOCK, and then a syscall to re-arm the epoll. With OS threads, that is just a single call, where the next call blocks and eventually returns new data.

Edit: one thing I haven't considered when testing is garbage collection. I'm absolutely convinced that up to 10k connections, threads or async doesn't matter, in C or Rust. But it may be much harder to do GC over 10k stacks than over 8.


I recently read a blog post with benchmarks showing that, for well-written C in their use case, async IO only becomes faster than using threads at around 10k parallel connections (though the difference was negligible).

This also seems to be a major motivation behind io_uring.


I don't think this is true? At least, I've never seen the issue of OS threads be that context switching is slow.

The issue is memory usage, which OS threads take a lot of.

Would userland scheduling be more CPU efficient? Sure, probably in many cases. But I don't think that's the problem with handling many thousands of concurrent requests today.


> is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things

Co-routines are not necessarily faster than threads, but they yield a performance improvement if one has to spin up thousands of them: they have less creation overhead and consume less RAM.


> Co-routines are not necessarily faster than threads, but they yield a performance improvement if one has to spin up thousands of them: they have less creation overhead and consume less RAM.

This hardly matters when spinning up a few thousand threads. Only memory that's actually used is committed, one 4k page at a time. What is 10MB these days? And that is main memory, while it's much more interesting what fits in cache. At that point it doesn't matter if your data is in heap objects or on a stack.

Add to that the fact that Python stacks are mostly on the heap, the real stack growing only due to nested calls in extensions. It's rare for a stack in Python to exceed 4k.


Languages that do green threads don't do them for memory savings, but to save on context switches when a thread is blocked and cannot run. System threads are scheduled by the OS, green threads by the language runtime, which saves a context switch.


Green threads are scheduled by the language runtime and by the OS. If the OS switches from one thread to another in the same process, there is no context switch, really, apart from the syscall itself which was happening anyway (the recv that blocks and causes the switch). At least not on Linux, where I've measured the difference.


This is not what is happening with flask/uwsgi. There is a fixed number of threads and processes with flask. The threads are only parallel for io and the processes are parallel always.


Which is fine until you run out of uwsgi workers because a downstream gets really slow sometime. The point of async python isn't to speed things up, it's so you don't have to try to guess the right number of uwsgi workers you'll need in your worst case scenario and run with those all the time.


Yep, and this test being shown is actually saying that about 5 sync workers acting on thousands of requests is faster than python async workers.

Theoretically it makes no sense. A task manager executing tasks in parallel with IO instead of blocking on IO should be faster... So the problem must be in the implementation.


> I think it is surprising to a lot of people who do take it as read that async will be faster.

Literally the first thing any concurrency course starts with in the very first lesson is that scheduling and context overhead are not negligible. Is it so hard to expect our professionals to know basic principles of what they are dealing with?


> I think it is surprising to a lot of people who do take it as read that async will be faster.

This is because when they are first shown it, the examples are faster, effectively at least, because they get given jobs done in less wallclock time due to reduced blocking.

They learn that but often don't get told (or work out themselves) that in many cases the difference is so small as to be unmeasurable, or that in other circumstances there can be negative effects (overheads others have already mentioned in the framework, more things waiting in RAM with a part-processed working set which could lead to thrashing in a low-memory situation, greater concurrent load on other services such as a database and the IO system it depends upon, etc).

As a slightly off-the-topic-of-async example: back when multi-core processing was first becoming cheap enough that it was not just affordable but the default option, I had great trouble trying to explain to a colleague why two IO intensive database processes he was running were so much slower than when I'd shown him the same process (I'd run them sequentially). He was absolutely fixated on the idea that his four cores should make concurrency the faster option; I couldn't get through to him that in this case the flapping heads on the drives of the time were the bottleneck, and the CPU would be practically idle no matter how many cores it had while the bottleneck was elsewhere.

Some people learn the simple message (async can handle some loads much more efficiently) as an absolute (async is more efficient) and don't consider at all that the situation may be far more nuanced.


> An async implementation that does multiple ("embarrassingly parallel") tasks in the same process

You mean concurrent tasks in the same process?


> I don't think that people who think async is faster have unreasonable expectations

I do.

And I don't think I'm alone nor being unreasonable.


> The point of coroutines isn't to make your code execute faster, it's to prevent your process sitting idle while it waits for I/O.

This is a quintessential example of not seeing the forest for the trees.

The point of coroutines is absolutely to make my code execute faster. If a completely I/O-bound application sits idle while it waits for I/O, I don't care and I should not care because there's no business value in using those wasted cycles. The only case where coroutines are relevant is when the application isn't completely I/O bound; the only case where coroutines are relevant is when they make your code execute faster.

It's been well-known for a long time that the majority of processes in (for example) a webserver are I/O bound, but there are enough exceptions to that rule that we need a solution to situations where the process is bound by something else, i.e. CPU. The classic solution to this problem is to send off CPU-bound processes to a worker over a message queue, but that involves significant overhead. So if we assume that there's no downside to making everything asynchronous, then it makes sense to do that--it's not faster for the I/O bound cases, but it's not slower either, and in the minority-but-not-rare CPU-bound case, it gets us a big performance boost.

What this test is doing is challenging the assumption that there's no downside to making everything asynchronous.

In context, I tend to agree with the conclusion that there are downsides. However, those downsides certainly don't apply to every project, and when they do, there may be a way around them. The only lesson we can draw from this is that gaining benefit from coroutines isn't guaranteed or trivial, but there is much more compelling evidence for that out there.


> The point of coroutines is absolutely to make my code execute faster.

I think rather the point is to make your APPLICATION either finish in less time, or to not take MORE time when given more load.

The code runs as fast as it runs, coroutines notwithstanding.


> > The point of coroutines is absolutely to make my code execute faster.

> I think rather the point is to make your APPLICATION either finish in less time, or to not take MORE time when given more load.

Potato potato.


Well, sure, anything can mean anything if you're willing to redefine what words mean.


The meaning of words is determined by usage. Usage of words is determined by the meaning. This circular definition causes the inherent problem of language: words don't have inherent meaning. The best I can do is to attempt to use words in a way similar to the way that you use words, but I can only ever make an educated guess about how you use words, so it's never going to be perfect.

And from my perspective, I don't think it's unreasonable for me to expect you to try to understand what I'm trying to communicate, rather than attempting to force me to use different words. The burden of communication is shared by both speaker and listener.


“Faster” is not a well defined technical term. It is a piece of natural language that can easily refer to max time, mean time, P99, latency, throughput, price per watt, etc. depending on context.


This is not what this article is about.

The surprising conclusion of the article is that in a realistic scenario, the async web frameworks will output fewer requests/sec than the sync ones.

I'm very familiar with Python concurrency paradigms, and I wasn't expecting that at all.

Add to that zzzeek's article (the guy who wrote SQLA...) stating async is also slower for db access, and this makes async less and less appealing, given the additional complexity it adds.

Now apart from doing a crawler, or needing to support websockets, I find it hard to justify asyncio. In fact, with David Beazley hinting that you probably can get away with spawning 1000 threads, it raises more doubts.

The whole point of async was that, at least when dealing with a lot of concurrent I/O, it would be a win compared to threads+multiprocessing. If just by cranking the number of sync workers you get better results for less complexity, this is bad.


As far as I can tell, the main cost of threads is 2-4MB of memory usage for stack space, so async allows saving memory by allowing one thread to process more than one task. A big deal if you have a server with 1GB of memory and want to handle 100,000 simultaneous connections, like Erlang was designed for. But if the server has enough memory for as many threads that are needed to cover the number of simultaneous tasks, is there still a benefit?


Now the $1000 question would be: if you pay for the context switching of BOTH threads and asyncio, having 5 processes, each with 20 threads, and within each an event loop, what happens?

Is the price of the context switching too high, or are you compensating for the weakness of each system by handling I/O concurrently in async, but smoothing out the blocking code outside of the await thanks to threads?

Making a _clean_ benchmark for it would be really hard, though.
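
For what it's worth, here's a sketch of the threads+loop half inside a single process: blocking calls go to a thread pool while the event loop keeps servicing other coroutines (the 5-process part would sit outside this, e.g. in the app server config):

    import asyncio
    import time
    from concurrent.futures import ThreadPoolExecutor

    def blocking_call(i):
        time.sleep(0.5)  # stand-in for blocking driver/DB code
        return i

    async def main():
        loop = asyncio.get_running_loop()
        with ThreadPoolExecutor(max_workers=20) as pool:
            # The loop stays free while the blocking work runs on threads.
            results = await asyncio.gather(
                *(loop.run_in_executor(pool, blocking_call, i) for i in range(20))
            )
        print(results)

    asyncio.run(main())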


Answering my own comment because I can't edit it anymore, but this article has started a heated debate on Twitter.

The author of "black" suggested that the cause of the slow down may be that asyncio actually starved postgres for resources:

https://twitter.com/llanga/status/1271719783080366086


> When you're dealing with external REST APIs that take multiple seconds to respond, then the async version is substantially "faster" because your process can get some other useful work done while it's waiting. Obviously the async framework introduces some overhead, but that bit of overhead is probably a lot less than the 3 billion cpu cycles you'll waste waiting 1000ms for an external service.

but threads get you the same thing with much less overhead. This is what benchmarks like this one and my own continue to confirm.

People often are afraid of threads in Python because "the GIL!" But the GIL does not block on IO. I think programmers reflexively reaching for Tornado or whatever don't really understand the details of how this all works.


but threads get you the same thing with much less overhead.

That is not true, at least not in general; the whole point of using continuations for async I/O is to avoid the overheads of using threads: the scheduler overhead, the cost of saving and restoring the processor state when switching tasks, the per-thread stack space, and so on.


The scheduler overhead and the cost of context switches are vastly overstated compared to alternatives. The per-thread stack space in effect has virtually no run-time cost, and starting off at a single 4k page for a stack, thousands still only waste a minuscule amount of memory.


async implementations build a scheduler into the runtime, and that's generally slower than the OS' scheduler. 10-100x slower if it's not in C (or whatever).


The GIL might not block on I/O, but the implementation that uses PyObject does need the GIL, no?


I get enraged when articles like this get upvotes. The evidence given doesn't at all negate the reasoning behind using async, which, as you said, is about not having to be blocked by IO, not a freaking throughput test for an unrealistic scenario. Just goes to show the complete lack of understanding of the topic. I wouldn't dare write something up if I didn't 100% grasp it, but the bar is way lower for some others it seems.


I don't know the async Python specifics, but from what I understand, you don't necessarily need async to handle large number of IO requests, you can simply use non-blocking IO and check back on it synchronously either in some loop or at specific places in your program.

The use of async either as callbacks, or user threads, or coroutines, is a convenience layer for structuring your code. As I understand, that layer does add some overhead, because it captures an environment, and has to later restore it.
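
A minimal sketch of that no-async-keywords style, using the stdlib selectors module to poll non-blocking sockets in a plain synchronous loop (an echo server on an arbitrary port 8000; error handling omitted):

    import selectors
    import socket

    sel = selectors.DefaultSelector()

    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("localhost", 8000))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ)

    while True:
        # Check back on all registered sockets, synchronously.
        for key, _ in sel.select(timeout=1.0):
            if key.fileobj is server:
                conn, _ = server.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
            else:
                data = key.fileobj.recv(4096)
                if data:
                    key.fileobj.sendall(b"echo: " + data)
                else:
                    sel.unregister(key.fileobj)
                    key.fileobj.close()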


I'm starting to wonder what the origin story is for titles like this. Have CS programs dropped the ball? Did the author snooze through these fundamentals? Or are they a reaction to coworkers who have demonstrated such an educational gap?

Async and parallel always use more CPU cycles than sequential. There is no question. The real questions are: do you have cycles to burn, will doing so bring the wall clock time down, and is it worth the complexity of doing so?


I think it's because "async" has been overloaded. The post isn't about what I thought it would be upon seeing the title.

I was thinking this would be about using multiprocessing to fire off two or more background tasks, then handle the results together once they all completed. If the background tasks had a large enough duration, then yeah, doing them in parallel would overcome the overhead of creating the processes and the overall time would be reduced (it would be "faster"). I thought this post would be a "measure everything!" one, after they realized for their workload they didn't overcome that overhead and async wasn't faster.

Upon what the post was about, my response was more like "...duh".


Obviously the async framework introduces some overhead, but that bit of overhead is probably a lot less than the 3 billion cpu cycles you'll waste waiting 1000ms for an external service.

Waiting for I/O does not usually waste any CPU cycles; the thread is not spinning in a loop waiting for a response, the operating system will just not schedule the thread until the I/O request has completed.


Sigh. Async is somewhat orthogonal to parallel.

You are making dinner. You start to boil water for the potatoes. While that happens, you prepare the beef. Async.

You and your girlfriend are making dinner. You do the potatoes, she does the beef. Parallel.

You can perhaps see how you could have asynchronous and parallel execution at the same time.

In the context of a Web server, a request is handled by a single Python process (so don’t give me that “OS scheduler can do other things”). Async matters here because your request turnover can be higher, even if the requests/sec remains the same.

In the cooking example, each request gets a single cook. If that cook is able to do things asynchronously, he will finish a single meal faster.

If it were only parallel, you could have more cooks - because they would be less demanding - but they would each be slower.


> In the cooking example, each request gets a single cook. If that cook is able to do things asynchronously, he will finish a single meal faster.

There is a bit of nuance here, in that the async-chef would make any individual meal slower than a sync-chef, once the number of outstanding requests is large. The sync-chef would indeed have overall higher wait times, but each meal would process just as fast as normal (e.g. more like a checkout line at a grocery store).

I prefer the grocery store checkout line metaphor for this reason. If a single clerk was "async" and checking out multiple people at once, all the people in a line would have roughly the same average wait time for a small line size. A "sync" clerk would have a longer line with people overall waiting longer, but each individual checkout would take the same amount of time once the customer managed to reach the clerk.

This is pertinent when considering the resources utilized during the job. If a sync clerk only ever holds a single database connection, while an async clerk holds one for every customer they try to check out at the same time, the sync clerk will be far more friendly to the database (but less friendly to the customers, when there aren't too many customers at once).


I think you managed to miss the point: the async chef is doing other stuff necessary to fulfill a single order when he can, i.e., while the potatoes are boiling. The sync chef has to wait for the potatoes to boil, only when those are done can he start to fry the beef.

The sync chef doesn't occupy the frying pan when he's boiling potatoes, so in some sense he only really does as much as he can. Having hundreds of sync chefs would likely be more efficient in terms of order volume, _but not order latency._


I do not disagree with that; my point was just that you are not wasting clock cycles. You may, however, as you pointed out, be wasting time while waiting for I/O to complete, time you could potentially make better use of by spending some more clock cycles on work that is not dependent on the I/O result while the operation is in progress.


I didn’t mean to disagree with you; I just wanted to put my take on it out there.



> How is this result surprising? The point of coroutines isn't to make your code execute faster, it's to prevent your process sitting idle while it waits for I/O.

It depends on what you mean by "faster". HTTP requests are IO bound, thus it is to be expected that the throughput of an IO-bound service benefits from a technology that prevents your process from sitting idle while waiting for IO.

Thus it's surprising that Python's async code performs worse, not better, in both throughput and latency.

> When you're dealing with external REST APIs that take multiple seconds to respond, then the async version is substantially "faster"

The findings reported in the blog post you're commenting on are the exact opposite of your claim: Python's async performs worse than its sync counterpart.


We need to stop saying “faster” with regard to async. The point of async was always to fit more requests per compute resource, and/or to make systems more latency-consistent under load.

“Faster” is misleading because the speed improvements you get with async are very dependent on load. At low load there will typically be negligible or no speed gains, but at higher load the benefit will be incredibly obvious.

The one caveat to this is cases where async allows you to run two requests in parallel rather than sequentially. I would argue that this is less about async than it is about concurrency, and about how async can make some concurrent workloads more ergonomic to program.


you just contradicted yourself:

> “Faster” is misleading

and

> "At low levels there is going to typically be negligible or no speed gains, but at higher levels the benefit will be incredibly obvious."

There are no "speed" gains, period. The same amount of work will be accomplished in the same amount of time with threads or async. Async makes it more memory efficient to have a huge number of clients waiting concurrently for results from slow services, but the point at which all of those clients walk off with their data will not be reached any "faster" than with threads.

The reason asyncio advocates say that asyncio is "faster" is the notion that the OS thread scheduler is slow, and that async context switches are some combination of less frequent and more efficient, such that async comes out ahead. This may be the case for other languages, but for Python's async implementations it is not, and benchmarks continue to show this.


I did not contradict myself; saying that async is “faster” implies speed gains in all circumstances. In reality the benefits of async IO are extremely load-dependent, which is why I don’t want to call it “faster”.


The other thing about async is that, in some scenarios, it can make shared resource use clearer. E.g. in a program I've written, the design is such that one type on one thread (a producer) owns the data and passes it to consumers directly, rather than dealing with lock-free algorithms and mutexes for sharing the data and suchlike. A multi-threaded ring buffer is much less obviously correct than a single-threaded one.
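
As a rough sketch of that ownership pattern, assuming an asyncio.Queue in place of the ring buffer (the names are illustrative):

    import asyncio

    async def producer(queue: asyncio.Queue) -> None:
        # the producer alone owns the data and hands items over
        # explicitly; no locks, since everything runs on one thread
        for item in range(5):
            await queue.put(item)
        await queue.put(None)  # sentinel: no more items

    async def consumer(queue: asyncio.Queue) -> None:
        while (item := await queue.get()) is not None:
            print("consumed", item)

    async def main() -> None:
        queue = asyncio.Queue(maxsize=2)  # bounded, like a ring buffer
        await asyncio.gather(producer(queue), consumer(queue))

    asyncio.run(main())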


> but that bit of overhead is probably a lot less than the 3 billion cpu cycles you'll waste waiting 1000ms for an external service.

You are not waiting for that 1000ms, and you haven't been for the 35 years since operating systems started featuring preemptive multitasking.

When you wait on a socket, the OS will remove you from the CPU and schedule something that is not waiting. When data is ready, you are placed back. You aren't wasting CPU cycles waiting, only the ones the OS needs to save your state.

Actually standing there and waiting on the socket is not a thing people have done for a long time.


> You are not waiting for that 1000ms, and you haven't been for the 35 years since operating systems started featuring preemptive multitasking.

The point is that async IO allows your own process/thread to progress while waiting for IO. Preemptive multitasking just assigns the CPU to something else while waiting, which is good for the box as a whole, but not necessarily productive for that one process (unless it is multithreaded).


Sync I/O lets your process (not thread) do something else. In other languages, async I/O is faster because it avoids context switches and amortizes kernel crossings. Apparently this is not the case in practice for Python.

This doesn’t surprise me at all, as I’ve had to deal with async Python in production, and it was a performance and reliability nightmare compared to the async Java and C++ it interacted with.


>it's to prevent your process sitting idle while it waits for I/O.

...with the goal of making your application faster.


... no. With the goal of allowing concurrency without parallelism.

In doing that, you're removing natural parallelism, and end up competing with the kernel scheduler, both in performance and in scheduling decisions.


This is a lazy argument. We get it, you know what coroutines are and how the kernel scheduler works (as does everyone else in this thread).

That doesn't matter though. If you think the average Python user is looking for "concurrency without parallelism" with no speed/performance goal in mind, you totally have the wrong demographic.

The fact that the language chose to implement asyncio on a single thread (again, the end user doesn't care that this is the case; it could have been a thread/core abstraction like goroutines), with little gain, and that this led to a huge fragmentation of its library ecosystem, is bad. Even worse that it was done in 2018. Doesn't matter how smart you are about the internals.


How in the world did you come to the conclusion that I thought Python users wanted that? I simply concluded that it's the only thing it provides. I wasn't saying it was a good thing, which is how you might have read it.

Python implements things on a single thread due to language restrictions (or rather, reference implementation restrictions): the GIL, as always, disallows parallel interpreter access, so multiple Python threads serve little purpose other than waiting for sync I/O. It's been many years since I followed Python development, but back then all GIL removal work had unfortunately come to a halt...


> ...with the goal

> ... no. With the goal

I assumed those meant the end user of the language (it is fair to assume the person you responded to meant that). The goal of the language itself was probably to stay trendy; e.g. JS/Golang/Nim/Rust/etc. had decent async stories, whereas Python didn't. Python needed async syntax support, as the threading and multiprocessing interfaces were clunky compared to others in the space. What they ended up with arguably isn't good.

I'm pretty familiar with those restrictions, which is why I expected this thread to be more of a "yeah, it sucks that it's slower" instead of pulling the "coroutines don't technically make anything faster per se" argument, which is distracting.


I see this elitist attitude all over the internet. First it was people saying “Guys, why are you overreacting to corona? The flu is worse.”

Then it was people saying “Guys, stop buying surgical masks. The science says they don’t work; it’s like putting a rag over your mouth.”

All of these so-called expert know-it-alls were wrong, and now we have another expert on asynchronous Python telling us he knows better and he’s not surprised. No dude, you’re just another guy on the internet pretending he’s a know-it-all.

If you are any good, you’ll realize that nodejs will beat the Flask implementation any day of the week, and the nodejs model is exactly identical to the Python async model. Nodejs blew everything out of the water, and it showed that asynchronous single-threaded code was better for exactly the test this benchmark is running.

It’s not obvious at all. Why is the node framework faster than Python async? Why can’t Python async beat Python sync when node can do it easily? What is the specific flaw within Python itself that is causing this? Don’t answer that question, because you don’t actually know, man. Just do what you always do: wait for a well-intentioned, humble person to run a benchmark, then comment on it with your elitist know-it-all attitude, claiming you’re not surprised.

Is there a word for these types of people? They are all over the internet. If we invent a label, maybe they’ll start becoming self-aware and acting more down to earth.


> Nodejs blew everything out of the water

Node's JIT comes from a web browser's javascript implementation used by billions of people. It's also had async baked in from day one.

Python started single process, added threading, and then bolted async on top of that. And CPython is a pretty straight interpreter.

A comparison between Node and PyPy would be more informative, but PyPy has a far less mature JIT and still has to deal with Python's dynamism.

> If we invent a label maybe they’ll start becoming self aware and start acting more down to earth.

You can't lecture people into self-awareness, any more than experts can lecture everyone into wearing masks.


Except IO is the bottleneck here. The concurrency model for IO should determine overall speed. If Python async is slower for IO tasks than sync, then that IS an unexpected result and an indication of a Python-specific problem.


> Except IO is the bottleneck here.

If you say IO is the bottleneck, then you're claiming there is no significant difference between python and node. That's what a bottleneck means.

> The concurrency model for IO should determine overall speed.

"Speed" is meaningless, it's either latency or throughput. Yeah, yeah, sob in your pillow about how mean elites are, clean up your mascara, and learn the correct terminology.

We've already claimed the concurrency model is asynchronous IO for both python and node. Since they are both doing the same basic thing, setting up an event loop and polling the OS for responses, it's not an issue of which has a superior model.

> If Python async is slower for IO tasks than sync, then that IS an unexpected result and an indication of a Python-specific problem.

Both sync and async IO have their own implementations. If you read from a file synchronously, you're calling out to the OS and getting a result back with no interpreter involvement. This[2] is a simple single-threaded server in C. All it does is tell the kernel, "here's my IO, wake me up when it's done."

When you do async work, you have to schedule IO and then poll for it. This[1] is an example of doing that in epoll in straight C. Polling involves more calls into the kernel to tell it what events to look for, and then the application has to branch through different possible events.

And you can't avoid this if you want to manage IO asynchronously. If you use synchronous IO in threading or processes, you're still constructing threads or processes. (Which makes sense if you needed them anyway.)

So unless an interpreter builds its synchronous calls on top of async, sync necessarily has less involvement with both the kernel and interpreter.
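
The same contrast can be sketched in Python with the stdlib selectors module (which wraps epoll on Linux). This is only an illustration of the extra register/poll/dispatch steps, not the article's benchmark code, and the port number is arbitrary:

    import selectors
    import socket

    # sync: one blocking call; the kernel parks the thread until a
    # client connects:
    #
    #     conn, _ = listening_sock.accept()

    # async: register interest, then poll and dispatch in a loop;
    # every iteration is extra kernel and interpreter work
    sel = selectors.DefaultSelector()   # epoll on Linux

    listening_sock = socket.create_server(("localhost", 8080))
    listening_sock.setblocking(False)
    sel.register(listening_sock, selectors.EVENT_READ)

    while True:
        for key, _ in sel.select():         # kernel: epoll_wait
            conn, _ = key.fileobj.accept()  # kernel: accept
            conn.sendall(b"hello\n")
            conn.close()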

The reason the interpreter matters is because the latency picture of async is very linear:

* event loop wakes up task

* interpreter processes application code

* application wants to open / read / write / etc.

* interpreter processes stdlib adding a new task

* event loop wakes up IO task

* interpreter processes stdlib checking on task

* kernel actually checks on task

Since an event loop is a single-threaded operation, each one of these operations is sequential. Your maximum throughput, then, is limited by the interpreter being able to complete IO operations as fast as it is asked to initiate them.

I'm not familiar enough with it to be certain, but Node may do much of that work in entirely native code. Python is likely slow because it implements the event loop in python[3].

So, not only is Python's interpreter slower than Node's, but it's having to shuffle tasks in the interpreter. If Node is managing a single event loop all in low level code, that's less work it's doing, and even if it's not, Node can JIT-compile some or all of that interpreter work.

[1]: https://github.com/o0myself0o/epoll/blob/master/epoll.c

[2]: https://www.programminglogic.com/example-of-client-server-pr...

[3]: https://github.com/python/cpython/blob/3.8/Lib/asyncio/unix_...


>If you say IO is the bottleneck, then you're claiming there is no significant difference between python and node. That's what a bottleneck means.

This is my claim: this SHOULD be what's happening, under the obvious logic that tasks handled in parallel with IO should be faster than tasks handled sequentially, and under the assumption that IO takes up way more time than local processing.

Like I said, the fact that this is NOT happening within the Python ecosystem, assuming the axioms above are true, indicates a flaw that is Python-specific.

>The reason the interpreter matters is because the latency picture of async is very linear:

I would say it shouldn't matter if done properly, because the local latency picture should be a fraction of the time compared to round-trip travel time and database processing.

>Python is likely slow because it implements the event loop in python

Yeah, we're in agreement. I said it was a python specific problem.

If you take a single task in this benchmark for Python, and the interpreter spends more time processing it locally than the total round-trip travel time plus database processing time, then the database is faster than Python. And if database calls are faster than Python, then this is a Python-specific issue.


You're making the classic mistake of assuming a common thread connects the people who've annoyed you in various unrelated contexts.


I mean, no one even mentioned node. Maybe it is faster, idk. But we're talking about Python?



