
The problem with scaling vertically all the way to the top, so to speak, is that one day, if you're lucky, you'll have enough traffic to hit the limit. And it will be like hitting a wall.

And then you have to rearchitect your data layer under extreme duress as your databases are constantly on fire.

So you really need to find the balance point and start rearchitecting before your databases are on fire all the time.



Assuming you have a relatively stable growth curve, you should have some ability to predict how long your hardware upgrades will last.

With that, you can start planning your rearchitecture if you're running out of upgrades, and start implementing when your servers aren't yet on fire, but are likely to be.
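That kind of forecast can be a back-of-envelope calculation. A minimal sketch, assuming steady compound growth and a hypothetical "act before 80% utilization" threshold (all numbers illustrative):

```python
import math

def months_until_ceiling(current_util: float, monthly_growth: float,
                         ceiling: float = 0.8) -> float:
    """Solve current_util * (1 + monthly_growth)**n = ceiling for n months."""
    if current_util >= ceiling:
        return 0.0  # already past the planning threshold
    return math.log(ceiling / current_util) / math.log(1 + monthly_growth)

# e.g. at 40% utilization, growing 5% a month, acting at 80%:
print(round(months_until_ceiling(0.40, 0.05), 1))  # prints 14.2
```

If that number is shorter than the time a rearchitecture would take, you should already be implementing it.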

Today's server hardware ecosystem isn't advancing as reliably as it was eight years ago, but we're still seeing significant capacity upgrades every couple of years. If you're CPU bound, the new Zen 2 Epyc processors are pretty exciting; I think they also increased the amount of addressable RAM, which is another potential scaling bottleneck.


> Assuming you have a relatively stable growth curve, you should have some ability to predict how long your hardware upgrades will last.

But that's not how the real world works. The databases don't just slowly get bad. They hit a wall, and when they do it is pretty unpredictable. Unless you have your scaling story set ahead of time, you're gonna have a bad day (or week).


If you're lucky, the wall is at 95-100% CPU. Oftentimes we're not that lucky: everything gets clogged up when you approach 60%, and I've even worked on systems where the wall was closer to 30%.

Usually, though, databases are pretty good at running up to 100%. And if you started with small hardware and have upgraded a few times already, you should have a pretty good idea of where your wall is going to be.

Some systems won't work much better on a two-socket machine than a one-socket one, because the work isn't open to concurrency. But we're talking about scaling databases, and database authors spend a lot of time working on scaling and do a pretty good job. Going vertical to a two-socket system makes a lot of sense for a database; four- and eight-socket systems can work too, but they get a lot more expensive pretty fast.

Sometimes the wall on a database is from bad queries or bad tuning. Sharding can help with that, because maybe you isolate the bad queries so they don't affect everyone at once, but fixing those queries would let you stay on a single-database design.


The minute your RDBMS's hot dataset doesn't fit into memory, it's going to shit itself. I've seen it happen anywhere from 90% CPU down to around 10%. Queries that were instant can start to take 50ms.

It can be an easy fix (buy more memory), but the first time it happens it can be pretty mysterious.
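The cliff falls out of simple arithmetic: once the hot set spills past RAM, every miss pays disk latency instead of memory latency. A toy model (the latency numbers here are illustrative, not measurements):

```python
def avg_latency_us(working_set_gb: float, ram_gb: float,
                   mem_hit_us: float = 100.0,
                   disk_miss_us: float = 50_000.0) -> float:
    """Blend in-memory and on-disk latency by cache hit ratio."""
    hit_ratio = min(1.0, ram_gb / working_set_gb)
    return hit_ratio * mem_hit_us + (1 - hit_ratio) * disk_miss_us

for ws_gb in (60, 64, 70, 80):  # hot set size on a 64 GB box
    print(ws_gb, round(avg_latency_us(ws_gb, 64)))
```

At 60 or 64 GB every query is a ~100µs memory hit; at 70 GB the average is already ~4.4ms, a 40x regression from a 10% spill. That's why it looks mysterious the first time: utilization barely moved.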


There's no single CPU wall. You have resources: CPU, memory, disk bandwidth/latency, and then the structural decisions in your schema. The key is knowing your performance characteristics and where you will start queuing, which is hard to figure out. Practically speaking, you're right, and you're 100% right that unreplicated/unsharded data stores eventually hit a wall and need a strategy for how and when to scale. I just noticed your username and feel silly for telling you stuff you already know far better than me, but I'm posting it anyway in case it benefits others.


That's exactly how the real world works. Databases get slow, then slower. Resources get used up. Unpredictable? Not really. Maybe you've run out of disk space or RAM, or processes are hanging. The database will never just start rendering HTML, formatting your disk, or emailing someone. It's pretty predictable.


The failure I've seen multiple times is that the database is returning data within normal latencies, and then there is a traffic tipping point and the latencies go up 1000x for all requests.
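That tipping point is what basic queueing theory predicts: in an M/M/1 model, mean response time is service time divided by (1 - utilization), so latency stays flat for a long while and then explodes near saturation. A minimal sketch:

```python
def mm1_response_ms(service_ms: float, utilization: float) -> float:
    """Mean response time in an M/M/1 queue: W = S / (1 - rho)."""
    assert 0.0 <= utilization < 1.0, "model only valid below saturation"
    return service_ms / (1 - utilization)

for rho in (0.5, 0.9, 0.99, 0.999):
    print(rho, round(mm1_response_ms(1.0, rho), 1))
```

A 1ms query becomes 2ms at 50% load, 10ms at 90%, and 1000ms at 99.9% — the 1000x jump arrives over the last sliver of headroom, which is why it looks like a wall rather than a slope.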


Capacity planning can't be linked solely to the growth curve. That assumes the number and complexity of your SQL queries never evolve, which isn't true in most cases.

You will implement new features and add new tables, columns, indexes, etc., all of which will affect your data layer.


I actually implemented domain-driven design with an API layer (so core, application, infrastructure + API). The domains are also split into basket, catalog, checkout, shipping, and pricing, with separate DBs.

So just splitting the heaviest part (e.g. catalog) into "a microservice" would be easy once I add nginx as a load balancer. I've already separated domain events from integration events.

Both currently use in-memory events within the application; I only need a message broker like NATS for the integration events.
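A minimal sketch of that kind of in-memory event bus (event and handler names here are hypothetical, not from the original system); swapping publish() for a broker client later changes the transport, not the call sites:

```python
from collections import defaultdict
from typing import Any, Callable

class EventBus:
    """Synchronous in-process pub/sub; a broker replaces only publish()."""

    def __init__(self) -> None:
        self._handlers: defaultdict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Callable) -> None:
        self._handlers[event_name].append(handler)

    def publish(self, event_name: str, payload: dict[str, Any]) -> None:
        for handler in self._handlers[event_name]:
            handler(payload)

bus = EventBus()
seen: list[int] = []
bus.subscribe("OrderCheckedOut", lambda e: seen.append(e["order_id"]))
bus.publish("OrderCheckedOut", {"order_id": 42})
print(seen)  # [42]
```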

It would be an easy wall ;). I have multiple options: heavier hardware, splitting the DB off from the application server, or splitting a domain-bound API onto a separate server.

As long as I don't need multimedia streaming, kubernetes or implement Kafka the future is clear.

PS: Load balancing based on tenant and cookie would be an easy fix in extreme circumstances.

The thing I'm most afraid of is hitting the identity server for authentication/token verification. Not sure if that's justified though.

Side note: one application has an insane amount of complex joins and will not scale :)


DID is an extremely important concept that is alien to a lot of developers: Deploy for 1.5X, Implement for 3X, Design for 10X (your numbers may vary slightly).




