It’s amazing that this isn’t a solved problem when we have all of this crazy language model stuff.
Unfortunately Spanner isn’t open source. Yugabyte and Citus are close but have annoying issues. Cockroach isn’t 100% compatible (and has its own issues), and things like FoundationDB, which are truly HA and comparable to Spanner in terms of consistency and fault tolerance, are not easily plugged into Postgres as the underlying storage engine, since sadly it’s only a key-value store.
edit: when I say close, I'm talking strictly about HA, not general functionality.
lately I've been thinking of using FoundationDB (which is closest to Spanner in terms of ACID and serializability) together with mvsqlite.
Then I was thinking, since SQLite doesn't have online schema changes (nor does mvsqlite), of using a schema such as:
[UUID, Data, Version, CreatedAt, UpdatedAt]
Where Data is a JSON or Proto and Version is an integer. You could then mimic an online schema change by supporting two adjacent "versions" in your application code, and then, in an eventually consistent manner, running [small] transactions to update the Data to the new Version as necessary. You would index Version and UpdatedAt as necessary to find the rows in the table that are not yet "migrated."
In SQLite you can also create indexes on expressions so technically all of your JSON or Proto could also have indexes.
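A minimal sketch of that idea in SQLite (the table name, JSON fields, and the v1-to-v2 shape change are all hypothetical, just to show the mechanics of the version column and an expression index into the blob):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical versioned-row table: all user data lives in a JSON blob.
con.execute("""
    CREATE TABLE entities (
        uuid TEXT PRIMARY KEY,
        data TEXT NOT NULL,           -- JSON payload
        version INTEGER NOT NULL,     -- schema version of the payload
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        updated_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# Index on version so un-migrated rows are cheap to find.
con.execute("CREATE INDEX idx_version ON entities(version)")

# Expression index into the JSON, so queries on a field inside the blob
# can still use an index.
con.execute("CREATE INDEX idx_email ON entities(json_extract(data, '$.email'))")

con.execute("INSERT INTO entities (uuid, data, version) VALUES (?, ?, ?)",
            ("u1", json.dumps({"email": "a@example.com"}), 1))

# Background "migration": rewrite v1 rows into the v2 shape in small batches.
for uuid, data in con.execute(
        "SELECT uuid, data FROM entities WHERE version = 1 LIMIT 100").fetchall():
    doc = json.loads(data)
    doc["contact"] = {"email": doc.pop("email")}   # hypothetical v2 shape
    con.execute(
        "UPDATE entities SET data = ?, version = 2, "
        "updated_at = CURRENT_TIMESTAMP WHERE uuid = ? AND version = 1",
        (json.dumps(doc), uuid))

row = con.execute("SELECT version FROM entities WHERE uuid = 'u1'").fetchone()
```

The application would read both shapes until the background pass drains all version=1 rows, at which point the old read path can be deleted.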
Citus is not close to Spanner. No global secondary indexes, no cross-shard ACID (the commit status is eventually consistent). No auto-resharding.
Yugabyte and Cockroach have a Spanner-like architecture, but a different open-source model and Postgres compatibility. What are those annoying issues?
Honestly, just use Spanner. It's the perfect database*. If you need to trade freshness for savings, put a cache in front of it.
If I were building a new startup in 2023, I would need a mountain of evidence against using Spanner. It's ugly that it locks you into GCP but hey an iPhone locks you into Apple's ecosystem, that's just the price you pay to get good things.
* unless you need timeseries, columnar, FTS, geospatial, graph or something special like that
Spanner is ridiculously expensive. Though I get your point, it’s not like an iPhone. An iPhone is not actually that expensive compared to top end android phones.
Spanner is very expensive even compared to hosted cockroach and RDS (Aurora). That being said I’m sure if you're enterprise it's substantially cheaper.
Spanner is generally about twice the cost of Cockroach, which seems cheap, but the gap in nominal cost only grows with usage.
The other issue is that since there’s no source to self-host, you always have to pay, unlike Cockroach.
Generally with Cockroach I’d expect one to self-host all instances except a staging/pre-prod and prod. With Spanner all your instances will necessarily be hosted, which means $$.
That all being said, Spanner is worth it if money isn’t an issue.
The thing is, there is nothing else like Spanner, unless you want to trust a startup with your database, which opens up Cockroach, Yuga, and TiDB. Databases like RDS don't compare because they don't offer the same guarantees. Nothing even in AWS' and Azure's portfolios is like Spanner.
Back of napkin math I did a couple of months ago: Spanner costs about $1/month for 2 writes/second and 10 reads/second. The minimum is $88 which is 176wps and 880rps. If your startup gets more traffic than that, you have good problems. The simplified guarantee model (fully linearizable and serializable everything; automagic sharding, replication and failover) could save you costs elsewhere.
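The napkin figures above (which are the commenter's estimates, not official pricing) are internally consistent:

```python
# Commenter's estimates: ~$1/month buys ~2 writes/s and ~10 reads/s.
WRITES_PER_DOLLAR = 2
READS_PER_DOLLAR = 10
MINIMUM_MONTHLY_USD = 88

writes_per_sec = MINIMUM_MONTHLY_USD * WRITES_PER_DOLLAR   # 176 wps
reads_per_sec = MINIMUM_MONTHLY_USD * READS_PER_DOLLAR     # 880 rps
```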
Low-value high-volume data like logs and search queries can be sent to a cheaper self-hosted timeseries/fts database.
I've built my own little pile of evidence against Spanner using it for several years, after previously being on Postgres. It has good potential but isn't there yet. And if you're making a startup, you're probably a long way from the scale where sharded DBs become necessary.
I hate this tired argument. If I'm building a startup I'm building it with the intention to hit scale. If I can use cockroach or yugabyte and get 90% of the benefits of postgres without significant additional cost I'm going to use it. I've been at multiple startups that have had trouble scaling their database and the cost to switch database technologies when you hit scale is massive. And worst of all, the limitations of the technology always impact product development. I don't care what cool features postgres has if I can't do online schema changes.
So yeah, I'd rather spend a few more hours picking an actually scalable technology now than spend what could be 7 figures (it can be a lot more; I've heard of a migration Uber had to do that was costing them half a billion a year) and massively impact product development moving to a scalable technology.
> If I can use cockroach or yugabyte and get 90% of the benefits of postgres without significant additional cost I'm going to use it
Here's the problem: it's not 90%. Idk about those two, but you sacrifice a lot using Spanner, maybe enough to delay your launch. Any slightly advanced query becomes hard to optimize, for one. Not to diss Spanner, it's just a very different beast.
CEO won't want to hear "launch is delayed, but it'll scale better later," and it's also not really true. In the initial phases, you can't tell what your actual needs will be at scale, so you'll probably have to rework things anyway. Just using Spanner doesn't solve your scaling problems. Maybe you'll need to totally redo your Spanner schema to actually scale, maybe you find you're better off sharding at the application layer than at the DB.
Different teams have different abilities. You can't really make general rules around any of this. Should you use spanner as a default selection for your company's database needs? Maybe.
Some general rules make sense for common use cases. Like, I'll say that a relational DBMS is generally what you want rather than a graph one. For a single-node DB, I'll always use Postgres if I have the choice. For a webserver, can't really go wrong with NodeJS, though you might have reasons to use something else (e.g. cost to run).
But sharding is too use-case specific to prescribe something broadly like Spanner.
both good points, including the devs having different abilities and using a ready-made scalable solution. We're living in the world of having something working that you can test in a few days shrug
Doesn't spanner introduce new (very unlikely) failure modes that other databases are not impacted by? The reliance on an external consistency model feels to me like a complete outsourcing of liability that warrants thorough investigation.
Hypothetically, if GPS went down for a prolonged period and/or a bug was found in the TrueTime system, what would happen to the consistency model around Spanner?
I feel like some applications and customers would much rather wait for a synchronous acknowledgement from the actual, live system. An extra 150ms when you are confirming a 6-figure wire transfer could easily be framed as a good thing in most circles.
Spanner offers both linearizability and serializability. They use the non-standard term "external consistency" to describe the union of those two properties.
> Doesn't spanner introduce new (very unlikely) failure modes that other databases are not impacted by?
Doesn't Spanner depend on access to globally-synchronized custom atomic clock hardware? i.e. even if it were open source, it wouldn't really be valuable unless you're a data center operator.
As I understand it, it's not a solved problem because there is no silver bullet, but rather trade-offs in every direction, and which solution works for you (including Spanner) is heavily dependent on your use case.
the clocks help but are not strictly required. FoundationDB also achieves external consistency without the clocks, but has limitations, such as a 5 second window for a transaction, due to optimistic concurrency.
There are some fundamentally really hard problems with multi-node DBs, and the usable solutions depend a lot on your use case, so I'm not surprised at the current state.
Great summary. Multi-node relational DBs are very much double-edged swords. This article mentions the potential data loss during failover events. I've complained in the past that Heroku Postgres advertises HA like a strict improvement and leaves the actual failover mechanism in the very fine print; the standby master is async, so you can lose data. There's no free lunch here, you either tolerate some data loss or tolerate extra latency before commits.
When you get into sharded DBs, a lot of limitations aren't obvious at first. I've used Citus a long time ago and Spanner much more recently. Spanner feels almost like NoSQL: Each table is like a distributed KV store. Indexes are just tables where the PK is the indexed col(s) and the val is the main table's PK, though you can also store additional cols (aka denormalize) in the index to avoid an extra join. You can't combine two indexes to filter on two cols; you need one composite index. More advanced things like order-limit, WHERE (NOT) IN, and subqueries tend to be slow. The query planner is pretty limited, and often I just force things. You also have to really know what you're doing with the pkeys. Which is all understandable, given its requirements.
I forget the limitations of Citus ACID, but they're significant. Spanner has full ACID, but even basic operations are quite slow. Single-node Postgres actually doesn't have full ACID unless you put it in the slower SERIALIZABLE mode, but you don't really need it.
I agree with those who say it's better to focus on sharding at the application layer if possible. If you can't do that, it's probably due to the underlying nature of your problem, in which case sharding at the DB layer in an efficient way tends to be even harder. Sometimes it makes sense, but it's not magic.
CREATE TABLE foo (id bigserial, bar int, baz int);
CREATE INDEX foobar ON foo(bar);
CREATE INDEX foobaz ON foo(baz);
SELECT * FROM foo WHERE bar = 2 AND baz = 4;
Postgres (and I think MySQL) will use both indexes in the above query via a bitmap scan. Spanner can only use one index, which will be slow if there are many non-matching bar=2 or baz=4 rows.
So Spanner needs CREATE INDEX foobarbaz ON foo(bar, baz), which Postgres could also use, and it'd be a bit faster, but the index-combining is decently fast too and much nicer when you consider a table with like 10 cols and many ways to filter/join.
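For what it's worth, SQLite also picks a single index per table access (no bitmap-AND), so the composite-index point is easy to demonstrate locally; a quick sketch using the schema above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, bar INT, baz INT);
    CREATE INDEX foobar ON foo(bar);
    CREATE INDEX foobaz ON foo(baz);
    CREATE INDEX foobarbaz ON foo(bar, baz);
""")

# SQLite, like Spanner, uses one index here; with the composite index
# available, the planner prefers it over either single-column index.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM foo WHERE bar = 2 AND baz = 4"
).fetchall()
plan_text = " ".join(str(row) for row in plan)
```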
Bitmap Scan relies on shared buffers. Distributed SQL cannot share the buffers between nodes. BitmapAnd is nice but expensive.
I prefer a compound index on foo(bar, baz), but PostgreSQL needs an additional index if there are queries with a condition on baz only. YugabyteDB, thanks to hybrid scan, can use this single index also for SELECT * FROM foo WHERE baz = 4;
Forgot to add, any Spanner query that I expect to be reasonably fast for realtime usage has to be very simple. In Postgres I often get away with much more complex ones.
The default read-committed is good enough for most use cases, explained in https://www.postgresql.org/docs/current/transaction-iso.html (search "it is just right for simpler cases"). If somehow you've designed a schema that causes race conditions in the default mode, you're probably better off changing your design than you are switching modes.
Serializable mode uses some kind of optimistic concurrency, where your transaction might get halfway through and then fail because of a conflict with another one, which also fails. Then you have to do retry + random backoff on the client side. Spanner does something similar. The problem with Postgres serializable mode is that it's slow and won't scale well if many readers/writers are touching the same data.
I get it if this answer isn't satisfying. Partial ACID is worrisome, and full ACID is expensive.
One of the solutions that made it pretty simple for us to run PostgreSQL in an HA environment (mostly in k8s, but it works standalone as well) is Zalando's Patroni: https://github.com/zalando/patroni It's really solid and has worked for us for a few years already. (It also comes with an HAProxy config to have a single leader connection.)
Manual intervention is not required. Patroni will just promote a standby as the new master. If the old master comes back alive, it will be restarted as a replica. Of course there might be data loss if it was used with async replication.
I just implemented a master with a read replica with the bitnami postgres-repmgr image. It's not perfect, but it works, and running my own instances in AWS instead of RDS is going to save me close to 80% once I also add in purchasing a savings plan. By setting this up I've learned more about Postgres than I ever wanted to. lol
Just a heads up, we were using the bitnami charts in our first attempt at an HA postgres instance in our k8s cluster, and we're currently figuring out how to move away from them because they had so many issues.
Granted, we might've messed something up and there are lots of factors, but you should manually check your individual nodes every so often to make sure none of them are going out of sync. Need to connect to each one directly, can't rely on what the bouncer/pgpool/loadbalancer is showing you to check the data is the same. It happened repeatedly to us, and wasn't obvious from any of our monitoring. In the end we had to scale down to 1 node while we sort out moving to a different operator.
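One way to do that per-node check is to ask each node for its WAL position directly and compare. How you fetch the LSNs is driver-specific (`pg_current_wal_lsn()` on the primary, `pg_last_wal_replay_lsn()` on each replica, connecting to each node rather than through pgpool); the comparison itself is simple. A sketch with hypothetical LSN values:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN like '0/3000148' to a 64-bit byte offset.
    The part before the slash is the high 32 bits, both in hex."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """How many bytes of WAL the replica is behind the primary."""
    return parse_lsn(primary_lsn) - parse_lsn(replica_lsn)


# Hypothetical values polled from each node directly:
lag = replication_lag_bytes("0/5000000", "0/3000148")
```

Alerting when the lag stays above a threshold (or when a replica's LSN stops advancing at all) would have caught the out-of-sync nodes described above.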
I wish AWS, Google and others would also add a Citus solution. I fear that once migrated to Citus, we'd face the full price pressure of the Azure monopoly.
I personally don't consider the AGPL to be a free software or open source license. The burden it places on operating and using the software sufficiently violates freedom 0 in my view.
> denoting software for which the original source code is made freely available and may be redistributed and modified.
> agpl limitations
The AGPL License does not permit sublicensing of the code; that is, you cannot rework or add to the code and then close those changes off to the public.
considering these facts, your opinion is honestly ... pretty dumb.
There is nothing hard about forking the repository and then creating a readonly mirror of your working copy on a public github repo.
> denoting software for which the original source code is made freely available and may be redistributed and modified.
Being honest, I didn't examine the OSD closely when writing the comment, but I still feel AGPL might violate rule #10 for the open source criterion.[0]
However, it's much more clear cut going over to the free software side.[1] Specifically, the explanatory text in the free software philosophy states: "The freedom to run the program means the freedom for any kind of person or organization to use it on any kind of computer system, for any kind of overall job and purpose, without being required to communicate about it with the developer or any other specific entity."
The AGPL requires that you communicate your use of the software, and further forces you to provide access to the source code. For unmodified copies, it might be sufficient to link back to an upstream project. As soon as you modify even a single line, you must now provide your own fork, which will involve setting up some sort of infrastructure. That's assuming that the source code still exists (you might be surprised how often it's lost).
> considering these facts, your opinion is honestly ... pretty dumb.
Google has Spanner. AWS is working on something similar. And both have YugabyteDB in their marketplace. Those are Distributed SQL (global ACID), not Citus. For data warehousing, which doesn't need ACID, there are other services.
Good product. Used it for a multi tenant implementation. Pretty easy to get started with, can get hairy before reaching operational excellence but when running, it’s pretty solid. I like it because it’s Postgres under the hood. Even extensions work. Happy self-hosting “customer”.
Very interesting. Skimming the article, it states HA within a region. Curious if anyone has been successful in running any database with multi-region HA? How are you doing it?
Fauna can do multi-region HA, across clouds, serverless, with transparent sharding and replication. We get plenty of people struggling with not being able to do low-latency, performant, strict serializability distributed writes. It is designed to be a turnkey hosted solution. https://dev.to/luiseduardocolon/fauna-deep-dive-architecting... - Luis
Somewhat related, but I'm looking at hosting solutions for a hobby SaaS I'm building, and I would like to use Google Cloud Run, but the Cloud SQL pricing seems extremely expensive if you want HA and are just starting up. They also don't seem to offer any Cloud SQL as part of their free tier like AWS does with RDS; it just seems very odd.
I looked at Planetscale thinking I might be willing to just use MySQL, but even if I were to take a risk with their confusing pricing it turns out there are no foreign keys.
And there is no equivalent of Cloud run on AWS.
It just seems like there is a gap here when it comes to managed databases: not just GCP lacking any kind of "free tier" option, but a general lack of players in the space between free and established-business pricing for HA.
If it's a hobby SaaS you are best off just using docker-compose on some VM and pgBackRest. If you have multiple hobby projects, use different databases in the same instance. You can always pull out a single one when issues arise.
Make your life easy, tons of applications are running for years even without (good) backups
> It just seems like there is a gap here when it comes to managed databases, not just GCP lacking any kind of "free tier" option, but just a lack of players in the space between free and established business pricing for HA.
Use serverless CockroachDB. Yugabyte also has a free tier, as do Fauna and PlanetScale.
Another option for open source aficionados might be https://www.pgedge.com - with their spock extension they also achieve multi-master replication. Looks very interesting!
This is a completely different story. YugabyteDB is about transparent sharding and shard horizontal scalability. Every table is split into tablets. A tablet can be thought of as a shard but one still thinks of a table as a whole. Each tablet is replicated. Replicas can be placed in different regions, even in different computing environments. Want to run it across clouds or span the cloud and bare metal? Ignoring the cost and minding latency, why not. It’s still logical single database with whole tables. Want to put some rows in selected regions? No problem, still one logical table. Very different from master-replica architecture.
How are major version upgrades handled on PGAF these days? Last time I used it (it's amazing), there wasn't a great path forward for minimal-downtime major version upgrades.
I hosted Postgres with the CNPG operator in Kubernetes. It does everything automagically: backups, WAL backups, and high availability. Can't say anything bad about it; it just worked for me. My only issue is that it releases versions too fast. I'd prefer a more relaxed schedule, but that's probably just me.
On self-hosted you can use the same things the cloud vendors are doing: Patroni or pg_auto_failover to manage single-primary + replicas, maybe Neon to run something Aurora-like. CitusData... Azure did a lot of work around it to get elasticity; not easy to do the same yourself. YugabyteDB on Kubernetes can be a cloud-native self-hosted solution.
pretty interesting ... covers replication strategies and tools, read replicas, coordinators, sharding, and the new tech stack approach... job well done