Why PostgreSQL High Availability Matters and How to Achieve It (yugabyte.com)
150 points by ddorian43 on June 14, 2023 | hide | past | favorite | 79 comments


It’s amazing that this isn’t a solved problem, but we have all of this crazy language model stuff.

Unfortunately Spanner isn’t open source. Yugabyte and Citus are close but have annoying issues. Cockroach isn’t 100% compatible (and has its own issues) and things like FoundationDB which are truly HA and comparable to Spanner in terms of consistency and fault tolerance are not easily plugged into Postgres as the underlying storage engine since sadly it’s only a key value store.

edit: when I say close, I'm talking strictly about HA, not general functionality.

lately I've been thinking of using FoundationDB, which is the closest to Spanner in terms of ACID and serializability, together with mvsqlite.

Then, since SQLite doesn't have online schema changes (nor does mvsqlite), I was thinking of using a schema such as:

   [UUID, Data, Version, CreatedAt, UpdatedAt]
Where Data is a JSON or Proto and Version is an integer. You then could mimic an online schema change by in your application code supporting two adjacent "versions", and then in an eventually consistent manner run [small] transactions to update the Data to the new Version as necessary. You would index Version, and UpdatedAt as necessary to find the rows in the table that are not "migrated."

In SQLite you can also create indexes on expressions so technically all of your JSON or Proto could also have indexes.
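A minimal sketch of that idea, using plain SQLite in Python as a stand-in for mvsqlite (table, column, and field names here are made up for illustration):

```python
# Versioned-row "online schema change" sketch: the app supports two
# adjacent versions and migrates rows in small batches, plus an
# expression index into the JSON payload.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        uuid       TEXT PRIMARY KEY,
        data       TEXT NOT NULL,       -- JSON blob
        version    INTEGER NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        updated_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
# Index on version to cheaply find rows that still need migrating,
# and an expression index on a field inside the JSON.
conn.execute("CREATE INDEX idx_version ON records(version)")
conn.execute("CREATE INDEX idx_email ON records(json_extract(data, '$.email'))")

conn.execute(
    "INSERT INTO records (uuid, data, version) VALUES (?, ?, ?)",
    ("u1", json.dumps({"email": "a@example.com", "name": "A"}), 1),
)

def migrate_batch(limit=100):
    """Rewrite up to `limit` v1 rows into v2 (here: rename a field)."""
    rows = conn.execute(
        "SELECT uuid, data FROM records WHERE version = 1 LIMIT ?", (limit,)
    ).fetchall()
    for uid, blob in rows:
        doc = json.loads(blob)
        doc["display_name"] = doc.pop("name", "")  # v2 renames the field
        conn.execute(
            "UPDATE records SET data = ?, version = 2, "
            "updated_at = CURRENT_TIMESTAMP WHERE uuid = ?",
            (json.dumps(doc), uid),
        )
    return len(rows)

migrate_batch()
row = conn.execute("SELECT version, data FROM records WHERE uuid = 'u1'").fetchone()
print(row[0])  # 2
```

In a real mvsqlite deployment the batches would be the small FoundationDB-backed transactions described above, with the migration loop keyed off the version index.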


Citus is not close to Spanner. No global secondary indexes, no cross-shard ACID (the commit status is eventually consistent), no auto-resharding. Yugabyte and Cockroach have a Spanner-like architecture, but different open-source models and degrees of Postgres compatibility. What are those annoying issues?


This is true, but the article is talking about HA and Citus is HA, no?


by itself, Citus just shards across multiple databases. Each of those databases can then be protected like any other database, with a standby


Honestly, just use Spanner. It's the perfect database*. If you need to trade freshness for savings, put a cache in front of it.

If I were building a new startup in 2023, I would need a mountain of evidence against using Spanner. It's ugly that it locks you into GCP but hey an iPhone locks you into Apple's ecosystem, that's just the price you pay to get good things.

* unless you need timeseries, columnar, FTS, geospatial, graph or something special like that


Spanner is ridiculously expensive. Though I get your point, it’s not like an iPhone. An iPhone is not actually that expensive compared to top end android phones.

Spanner is very expensive even compared to hosted cockroach and RDS (Aurora). That being said I’m sure if you're enterprise it's substantially cheaper.

Spanner is generally about twice the cost of Cockroach, which seems cheap, but the gap in nominal cost only grows with usage.

The other issue is that since there's no source available to self-host, you always have to pay, unlike with Cockroach.

Generally with Cockroach I'd expect one to self-host all instances except staging/pre-prod and prod. With Spanner all your instances will necessarily be hosted, which means $$.

That being all said, Spanner is worth it if money isn’t an issue


The thing is, there is nothing else like Spanner, unless you want to trust a startup with your database, which opens up Cockroach, Yuga, and TiDB. Databases like RDS don't compare because they don't offer the same guarantees. Nothing even in AWS' and Azure's portfolios is like Spanner.

Back of napkin math I did a couple of months ago: Spanner costs about $1/month for 2 writes/second and 10 reads/second. The minimum is $88/month, which buys 176 writes/s and 880 reads/s. If your startup gets more traffic than that, you have good problems. The simplified guarantee model (fully linearizable and serializable everything; automagic sharding, replication and failover) could save you costs elsewhere.
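The back-of-napkin math above, taking the $1-per-unit figure at face value:

```python
# If $1/month buys roughly 2 writes/s and 10 reads/s, then the
# $88/month minimum buys (these are the comment's figures, not
# official Spanner pricing):
dollars_min = 88
writes_per_dollar = 2
reads_per_dollar = 10

min_wps = dollars_min * writes_per_dollar  # sustained writes/s at the minimum
min_rps = dollars_min * reads_per_dollar   # sustained reads/s at the minimum
print(min_wps, min_rps)  # 176 880
```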

Low-value high-volume data like logs and search queries can be sent to a cheaper self-hosted timeseries/fts database.


You forgot storage costs, and especially bandwidth costs, lock-in costs, and probably range-query costs.


I've built my own little pile of evidence against Spanner using it for several years, after previously being on Postgres. It has good potential but isn't there yet. And if you're making a startup, you're probably a long way from the scale where sharded DBs become necessary.


I hate this tired argument. If I'm building a startup I'm building it with the intention to hit scale. If I can use cockroach or yugabyte and get 90% of the benefits of postgres without significant additional cost I'm going to use it. I've been at multiple startups that have had trouble scaling their database and the cost to switch database technologies when you hit scale is massive. And worst of all, the limitations of the technology always impact product development. I don't care what cool features postgres has if I can't do online schema changes.

So yeah, I'd rather spend a few more hours picking an actually scalable technology now than spend what could be 7 figures (it can be a lot more; I've heard of a migration Uber had to do that was costing them half a billion a year) and massively impact product development moving to scalable technology.


> If I can use cockroach or yugabyte and get 90% of the benefits of postgres without significant additional cost I'm going to use it

Here's the problem, it's not 90%. Idk about those two, but you sacrifice a lot using Spanner, maybe enough to delay your launch. Any slightly advanced query becomes hard to optimize, for one. Not to dis Spanner, it's just a very different beast.

CEO won't want to hear "launch is delayed, but it'll scale better later," and it's also not really true. In the initial phases, you can't tell what your actual needs will be at scale, so you'll probably have to rework things anyway. Just using Spanner doesn't solve your scaling problems. Maybe you'll need to totally redo your Spanner schema to actually scale, maybe you find you're better off sharding at the application layer than at the DB.


Different teams have different abilities. You can't really make general rules around any of this. Should you use spanner as a default selection for your company's database needs? Maybe.


Some general rules make sense for common use cases. Like, I'll say that a relational DBMS is generally what you want rather than a graph one. For a single-node DB, I'll always use Postgres if I have the choice. For a webserver, can't really go wrong with NodeJS, though you might have reasons to use something else (e.g. cost to run).

But sharding is too use-case specific to prescribe something broadly like Spanner.


both good points, including the devs having different abilities and using a ready-made scalable solution. We're living in the world of having something working that you can test in a few days shrug


> perfect

Doesn't spanner introduce new (very unlikely) failure modes that other databases are not impacted by? The reliance on an external consistency model feels to me like a complete outsourcing of liability that warrants thorough investigation.

Hypothetically, if GPS went down for a prolonged period and/or a bug was found in the TrueTime system, what would happen to the consistency model around Spanner?

I feel like some applications and customers would much rather wait for a synchronous acknowledgement from the actual, live system. An extra 150ms when you are confirming a 6-figure wire transfer could easily be framed as a good thing in most circles.


Spanner offers both linearizability and serializability. They use the non-standard term "external consistency" to describe the union of those two properties.

>Doesn't spanner introduce new (very unlikely) failure modes that other databases are not impacted by?

Curious if you could go into that.


Spanner is great if you have unlimited money


Doesn't Spanner depend on access to globally-synchronized custom atomic clock hardware? i.e. even if it were open source, it wouldn't really be valuable unless you're a data center operator.

As I understand it, it's not a solved problem because there is no silver bullet, but rather trade-offs in every direction, and which solution works for you (including Spanner) is heavily dependent on your use case.


the clocks help but are not strictly required. FoundationDB also achieves external consistency without the clocks, but has limitations, such as a 5 second window for a transaction, due to optimistic concurrency.

the clocks remove such limitations.


There are some fundamentally really hard problems with multi-node DBs, and the usable solutions depend a lot on your use case, so I'm not surprised at the current state.


Out of curiosity, what's wrong with cockroach?


Great summary. Multi-node relational DBs are very much double-edged swords. This article mentions the potential data loss during failover events. I've complained in the past that Heroku Postgres advertises HA like a strict improvement and leaves the actual failover mechanism in the very fine print; the standby master is async, so you can lose data. There's no free lunch here, you either tolerate some data loss or tolerate extra latency before commits.

When you get into sharded DBs, a lot of limitations aren't obvious at first. I've used Citus a long time ago and Spanner much more recently. Spanner feels almost like NoSQL: Each table is like a distributed KV store. Indexes are just tables where the PK is the indexed col(s) and the val is the main table's PK, though you can also store additional cols (aka denormalize) in the index to avoid an extra join. You can't combine two indexes to filter on two cols; you need one composite index. More advanced things like order-limit, WHERE (NOT) IN, and subqueries tend to be slow. The query planner is pretty limited, and often I just force things. You also have to really know what you're doing with the pkeys. Which is all understandable, given its requirements.
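For example, the index denormalization described above is done with Spanner's STORING clause; a sketch with invented table and column names:

```sql
CREATE TABLE Albums (
  AlbumId INT64 NOT NULL,
  Title STRING(MAX),
  ReleaseYear INT64,
) PRIMARY KEY (AlbumId);

-- The secondary index is effectively a table keyed on ReleaseYear
-- whose value is AlbumId; STORING denormalizes Title into it, so
-- "SELECT Title FROM Albums WHERE ReleaseYear = ..." can be answered
-- from the index alone, without a join back to the base table.
CREATE INDEX AlbumsByYear ON Albums(ReleaseYear) STORING (Title);
```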

I forget the limitations of Citus ACID, but they're significant. Spanner has full ACID, but even basic operations are quite slow. Single-node Postgres actually doesn't have full ACID unless you put it in the slower SERIALIZABLE mode, but you don't really need it.

I agree with those who say it's better to focus on sharding at the application layer if possible. If you can't do that, it's probably due to the underlying nature of your problem, in which case sharding at the DB layer in an efficient way tends to be even harder. Sometimes it makes sense, but it's not magic.


> You can't combine two indexes to filter on two cols; you need one composite index

Can you expand on how you perceive traditional RDBMSs to be different for this case?


    CREATE TABLE foo (id bigserial, bar int, baz int);
    CREATE INDEX foobar ON foo(bar);
    CREATE INDEX foobaz ON foo(baz);
    SELECT * FROM foo WHERE bar = 2 AND baz = 4;
Postgres (and I think MySQL) will use both indexes in the above query*. Spanner can only use one index, which will be slow if there are many non-matching bar=2 or baz=4 rows.

So Spanner needs CREATE INDEX foobarbaz ON foo(bar, baz); Which Postgres could also use, and it'd be a bit faster, but the index-combining is decently fast too and much nicer when you consider a table with like 10 cols and many ways to filter/join.

* https://www.postgresql.org/docs/current/indexes-bitmap-scans...


Bitmap Scan relies on shared buffers. Distributed SQL cannot share the buffers between nodes. BitmapAnd is nice but expensive. I prefer a compound index on foo(bar, baz), but PostgreSQL needs an additional index if there are queries with a condition on baz only. YugabyteDB, thanks to hybrid scan, can use this single index also for SELECT * FROM foo WHERE baz = 4;


I used your example to show how the 2 indexes solution is better on PostgreSQL, but the one composite index is the best one for YugabyteDB (PostgreSQL on a Spanner-like architecture): https://dev.to/yugabyte/one-fat-index-or-two-indexes-on-each...


Yeah, I rarely use composite indexes in Postgres. Only if I want to squeeze a little more performance out of a frequent combo query.


Forgot to add, any Spanner query that I expect to be reasonably fast for realtime usage has to be very simple. In Postgres I often get away with much more complex ones.


I’m not really familiar with the trade offs of modes with Postgres but why isn’t Serializable necessary?


The default read-committed is good enough for most use cases, explained in https://www.postgresql.org/docs/current/transaction-iso.html (search "it is just right for simpler cases"). If somehow you've designed a schema that causes race conditions in the default mode, you're probably better off changing your design than you are switching modes.

Serializable mode uses some kind of optimistic concurrency, where your transaction might get halfway through then fail because of a conflict with another one, which also fails. Then you have to do retry + random backoff on the client side. Spanner does something similar. Problem with Postgres serializable mode is it's slow and won't scale well if many readers/writers are touching the same data.
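The client-side retry-plus-backoff loop described above can be sketched generically; here a stand-in exception simulates the conflict, whereas with real Postgres in serializable mode you'd catch the driver's serialization-failure error (SQLSTATE 40001, e.g. psycopg2's SerializationFailure):

```python
# Generic retry-with-exponential-backoff for serialization conflicts.
import random
import time

class SerializationFailure(Exception):
    """Stand-in for the DB driver's serialization-conflict error."""

def run_with_retries(txn_fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return txn_fn()
        except SerializationFailure:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter before retrying.
            time.sleep((2 ** attempt) * 0.01 * random.random())

# Simulate a transaction that conflicts twice, then succeeds.
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure
    return "committed"

result = run_with_retries(flaky_txn)
print(result)  # committed
```

With a real database, txn_fn would open a fresh transaction each attempt, since the failed one is rolled back.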

I get it if this answer isn't satisfying. Partial ACID is worrisome, and full ACID is expensive.


one of the solutions that made it pretty simple for us to run PostgreSQL in an HA environment (mostly in k8s, but it works standalone as well) is Zalando's Patroni: https://github.com/zalando/patroni it's really solid and has worked for us for a few years already. (it also comes with an HAProxy config to have a single leader connection)

or for k8s, their operator: https://github.com/zalando/postgres-operator (docker image: https://github.com/zalando/spilo) we've also tried other operators that were easier to get started with, but they failed miserably (Crunchy Data's operator is basically based on the Zalando one)
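For reference, a minimal sketch of what a per-node patroni.yml can look like, assuming an etcd DCS; hostnames and passwords are placeholders:

```yaml
scope: my-ha-cluster            # cluster name, shared by all members
name: node1                     # unique per node
restapi:
  listen: 0.0.0.0:8008
  connect_address: node1.internal:8008
etcd3:
  hosts: etcd1.internal:2379
postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1.internal:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: change-me
    replication:
      username: replicator
      password: change-me
```

Patroni uses the DCS (etcd here) for leader election; the REST API endpoint is what an HAProxy health check queries to route connections to the current leader.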


How does Patroni do self-healing after the primary DB goes down and then comes back up? Or is manual intervention required? What does that look like?


Manual intervention is not required. Patroni will just promote a standby as the new master. If the old master comes back up, it will be restarted as a replica. Of course there might be data loss if it was used with async replication.

Patroni has great docs for this here: https://patroni.readthedocs.io/en/latest/replication_modes.h... (actually this is more or less a property of Postgres replication itself)


I just implemented a master with a read replica using the Bitnami postgres-repmgr image. It's not perfect, but it works, and running my own instances in AWS instead of RDS is going to save me close to 80% when I also add in purchasing a savings plan. By setting this up I've learned more about Postgres than I ever wanted to, lol.


Just a heads up, we were using the bitnami charts in our first attempt at an HA postgres instance in our k8s cluster, and we're currently figuring out how to move away from them because they had so many issues.

Granted, we might've messed something up and there are lots of factors, but you should manually check your individual nodes every so often to make sure none of them are going out of sync. Need to connect to each one directly, can't rely on what the bouncer/pgpool/loadbalancer is showing you to check the data is the same. It happened repeatedly to us, and wasn't obvious from any of our monitoring. In the end we had to scale down to 1 node while we sort out moving to a different operator.


interesting, thanks for the heads up, I really appreciate it! I do need to find some sort of monitoring to ensure the DBs are staying in sync.


I wish AWS, Google, and others would also offer a Citus solution. I fear that once migrated to Citus, we'd face the full price pressure of an Azure monopoly.


Google will never touch Citus because it is AGPL licensed[0], I wouldn't be surprised if the other cloud providers have similar policies.

[0] https://opensource.google/documentation/reference/using/agpl...



Yes, the company, but the extension itself is open source, of a sort.


Open source stuff is still subject to ownership via copyright.

The license allows you to do many things, but not everything.

The copyright owner, on the other hand, can do everything.

That’s why many projects nowadays require you to assign copyright to the team before accepting a contribution.


I personally don't consider the AGPL to be a free software nor open source license. The burden it places on operating and using the software sufficiently violates freedom 0 in my view.


Yes infringing on the inherent freedom of corporations to maximize their profit margins is pretty awful


> define open source

denoting software for which the original source code is made freely available and may be redistributed and modified.

> agpl limitations

The AGPL License does not permit sublicensing of the code; that is, you cannot rework or add to the code and then close those changes off to the public.

considering these facts, your opinion is honestly ... pretty dumb.

There is nothing hard about forking the repository and then creating a readonly mirror of your working copy on a public github repo.


> > define open source

> denoting software for which the original source code is made freely available and may be redistributed and modified.

Being honest, I didn't examine the OSD closely when writing the comment, but I still feel AGPL might violate rule #10 for the open source criterion.[0]

However, it's much more clear cut going over to the free software side.[1] Specifically, the explanatory text in the free software philosophy states: "The freedom to run the program means the freedom for any kind of person or organization to use it on any kind of computer system, for any kind of overall job and purpose, without being required to communicate about it with the developer or any other specific entity."

The AGPL requires that you communicate your use of the software, and further forces you to provide access to the source code. For unmodified copies, it might be sufficient to link back to an upstream project. As soon as you modify even a single line, you must now provide your own fork, which will involve setting up some sort of infrastructure. That's assuming that the source code still exists (you might be surprised how often it's lost).

> considering these facts, your opinion is honestly ... pretty dumb.

100% disagree :)

[0] https://opensource.org/osd/ [1] https://www.gnu.org/philosophy/free-sw.html


Google has Spanner. AWS is working on something similar. And both have YugabyteDB in their marketplaces. Those are Distributed SQL (global ACID), not Citus. For data warehousing, which doesn't need ACID, there are other services.


CockroachDB is available in the Azure/AWS/GCP marketplaces, in all 3 as a managed service or a self-hosted option.


Good product. Used it for a multi tenant implementation. Pretty easy to get started with, can get hairy before reaching operational excellence but when running, it’s pretty solid. I like it because it’s Postgres under the hood. Even extensions work. Happy self-hosting “customer”.


Very interesting. Skimming the article, it covers HA within a region. Curious if anyone is successfully running any database with multi-region HA? How are you doing it?


There's another article from them that covers it https://www.yugabyte.com/blog/exploring-multi-region-databas...


In SQL Server you can use a multi-subnet failover cluster.

https://learn.microsoft.com/en-us/sql/sql-server/failover-cl...


We are going down the path of using Azure SQL Hyperscale to address availability concerns:

https://learn.microsoft.com/en-us/azure/azure-sql/database/s...


Fauna can do multi-region HA, across clouds, serverless, with transparent sharding and replication. We get plenty of people struggling with not being able to do low-latency, performant, strict serializability distributed writes. It is designed to be a turnkey hosted solution. https://dev.to/luiseduardocolon/fauna-deep-dive-architecting... - Luis


Somewhat related, but I'm looking at hosting solutions for a hobby SaaS I'm building, and I would like to use Google Cloud Run, but the Cloud SQL pricing seems extremely expensive if you want HA and are just starting up. They also don't seem to offer any Cloud SQL as part of their free tier like AWS does with RDS; it just seems very odd.

I looked at Planetscale thinking I might be willing to just use MySQL, but even if I were to take a risk with their confusing pricing it turns out there are no foreign keys.

And there is no equivalent of Cloud Run on AWS.

It just seems like there is a gap here when it comes to managed databases, not just GCP lacking any kind of "free tier" option, but just a lack of players in the space between free and established business pricing for HA.


If it's a hobby SaaS you are best off just using docker-compose on some VM and pgBackRest. If you have multiple hobby projects, use different databases in the same instance. You can always pull a single one out when issues arise.
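A minimal sketch of that setup; image tag, volume name, and password are placeholders, and pgBackRest would run alongside for backups:

```yaml
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: change-me
    volumes:
      - pgdata:/var/lib/postgresql/data   # one instance, many hobby databases
    ports:
      - "5432:5432"
volumes:
  pgdata:
```

Each hobby project then gets its own database (CREATE DATABASE projectname) inside this single instance, so any one of them can later be dumped and moved out independently.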

Make your life easy, tons of applications are running for years even without (good) backups


> It just seems like there is a gap here when it comes to managed databases, not just GCP lacking any kind of "free tier" option, but just a lack of players in the space between free and established business pricing for HA.

Use serverless cockroachdb. Yugabyte also has a free tier. As does Fauna and Planetscale.


We have FKs and new pricing coming soon.


I've been reading this on Reddit for quite a while; what quarter might "soon" be?


Another option for open source aficionados might be https://www.pgedge.com - with their spock extension they also achieve multi-master replication. Looks very interesting!


Or, uh, just use AWS RDS Aurora.


“Multi-AZ” in Amazon RDS and how it may differ from High Availability: https://dev.to/aws-heroes/multi-az-in-amazon-rds-and-how-it-...


This is a completely different story. YugabyteDB is about transparent sharding and horizontal shard scalability. Every table is split into tablets. A tablet can be thought of as a shard, but one still thinks of a table as a whole. Each tablet is replicated. Replicas can be placed in different regions, even in different computing environments. Want to run it across clouds, or span the cloud and bare metal? Ignoring the cost and minding latency, why not. It's still logically a single database with whole tables. Want to put some rows in selected regions? No problem, still one logical table. Very different from a master-replica architecture.


How did that work out for people in the east region yesterday?


Perfectly for me.

(Maybe if I had used IAM auth for the database it wouldn't have?)


not good lol


yeah... no burgers or tacos https://www.theverge.com/2023/6/13/23759857/amazon-aws-down-...

that's WHY you need at least a multi-region HA database for your apps


No Roomba either. No multi-tasking housework while you work.


Or, uh, just self host.


What's the self-hosted Postgres HA story these days?


pg_auto_failover makes it an absolute breeze. I cannot understand how it's not mentioned in the article.

I've been running it for over 3 years with great success, and 0 downtime. Not to mention the extreme price difference compared to managed solutions - https://github.com/hapostgres/pg_auto_failover/discussions/6...


How are major version upgrades handled on PGAF these days? Last time I used it (it's amazing), there wasn't a great path forward for minimal-downtime major version upgrades.


Same.

But major upgrades are only available once a year.


What about backup/DR? Is wal-g still the best?


zero downtime for three years - congrats !


I host Postgres with the CloudNativePG (cnpg) operator in Kubernetes. It does everything automagically: backups, WAL backups, and high availability. Can't say anything bad about it; it just worked for me. My only issue is that it releases versions too fast, I'd prefer a more relaxed schedule, but that's probably just me.


On self-hosted you can use the same things the cloud vendors are using: Patroni or pg_auto_failover to manage a single primary + replicas. Maybe Neon to run something Aurora-like. Citus: Azure did a lot around it to get elasticity, and it's not easy to do the same yourself. YugabyteDB on Kubernetes can be a cloud-native self-hosted solution.


Which cloud vendors are using PGAF?


YugabyteDB can be self-hosted.


pretty interesting ... covers replication strategies and tools, read replicas, coordinators, sharding, and the new tech stack approach... job well done



