
The mistake I've seen here is having the "health check" do little or nothing. In many web services, I've seen a /ping or /health API that the load balancer calls to ask "are you healthy?". But people get rushed or lazy, or requirements change, and they wind up with that API just doing "return true".

Now you've got a host that can't access a database, or can't do much of anything, but it can return true!

Health-check APIs should always run some level of checks on dependencies and critical internal logic (maybe a few targeted unit/integration tests?) to ensure things are truly healthy.
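A minimal sketch of what that could look like, assuming hypothetical check functions for each dependency (the real probes would be a "SELECT 1" against the DB, a cache round-trip, and so on):

```python
# A deep health check: run a small suite of dependency checks and
# only report healthy if every one passes. The check functions are
# hypothetical stand-ins for real probes.

def run_checks(checks):
    """checks: mapping of name -> callable that raises on failure."""
    results = {}
    for name, fn in checks.items():
        try:
            fn()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"fail: {exc}"
    healthy = all(v == "ok" for v in results.values())
    # The load balancer only needs the status code; the body helps humans.
    return (200 if healthy else 503), results

def check_database():
    pass  # in a real service: cursor.execute("SELECT 1")

status, body = run_checks({"database": check_database})
```

Returning the per-check results in the body makes it obvious *which* dependency is broken when the check goes red.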



Outlier detection also fixes this, without having to implement health checks that replicate the whole functionality (complexity) of the rest of the service.

10 500s in a row? Go to the timeout chair until you get better.


At work we have two endpoints on every service, `/status` and `/health-check`.

The former is basically a "return true" endpoint, which can tell you if the service is alive and reachable. The latter will usually do something like "select 1;" from any attached databases and only succeed if everything is OK.
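A toy version of the split, with sqlite3 standing in for whatever database the service is actually attached to:

```python
import sqlite3

# In-memory DB as a stand-in for the service's real data store.
conn = sqlite3.connect(":memory:")

def status():
    # Shallow: proves the process is alive and reachable, nothing more.
    return 200, "ok"

def health_check():
    # Deep: only succeeds if the attached database answers "select 1;".
    try:
        row = conn.execute("select 1;").fetchone()
        assert row == (1,)
        return 200, "ok"
    except Exception as exc:
        return 503, f"db check failed: {exc}"
```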


And the former is the one you want your Load Balancers to be checking. With a deep health check, even a brief database outage will cause every web server to be taken out of rotation, and then you're completely down for at least as many health check intervals as it takes for the LB to consider a host healthy again. Same goes for any other shared resource that is likely to affect all web servers if it becomes unavailable.


Presumably your service would not be in a very useful state anyway if the main data store it needs to function is out, so which kind of health-check you use will depend on the failure mode you want to expose to your users.


Or you just have smarter load balancers, that realize when all of their servers are acting funky and stop taking servers out of rotation.


Or to riff on this idea further: You make the concept of a node's "relative stability" part of the load-balancer logic the same way that "relative idleness" is a factor.

Then if all 100/100 nodes get taken down by some shared problem, the system simply degenerates into picking the idle-est of the 100.
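A sketch of that fail-open selection, using hypothetical "healthy" and "inflight" fields per node:

```python
def pick_node(nodes):
    """Pick the least-loaded healthy node; if *every* node looks
    unhealthy (a shared dependency is probably down), fail open and
    pick the least-loaded node overall."""
    healthy = [n for n in nodes if n["healthy"]]
    pool = healthy if healthy else nodes  # fail open when all are down
    return min(pool, key=lambda n: n["inflight"])

nodes = [
    {"name": "a", "healthy": False, "inflight": 3},
    {"name": "b", "healthy": False, "inflight": 1},
]
# All nodes unhealthy: degenerate into picking the idlest one, "b".
```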


That's not necessarily true. In Kubernetes, for example, you have liveness and readiness probes, which can each have a period, timeout, initial delay and, importantly, a number of failures tolerated before the service is killed and a new one spawned.

This allows you to have less frequent checks that are more in-depth, and basic ones that are just the `return true` type.

I guess you're correct that a widespread db outage can make many of them fail, but then new ones should be coming up very fast, since you can set a minimum number of replicas that must be available to take requests. I think you can get very close to stability even in this circumstance.

https://kubernetes.io/docs/tasks/configure-pod-container/con...
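For concreteness, a probe pair along those lines might look like this in a pod spec (the endpoint names echo the `/status` / `/health-check` split above; the numbers are examples, not recommendations):

```yaml
livenessProbe:
  httpGet:
    path: /status        # shallow "return true" style endpoint
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health-check  # deep check, e.g. runs "select 1;"
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 2
```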


Databases are 10-100x as reliable as the application tier, in my experience.


Although the OP didn't describe it as such, health checks that do a simple query are also testing connectivity to the db. Someone might have hardcoded a db address in an environment variable, or there may be connection pooling issues.


The best way I can think of is to aggregate errors over time, categorize them and build a health check around those metrics.
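One way to sketch that idea, with an arbitrary sliding window and an arbitrary error-rate threshold (both illustrative choices):

```python
from collections import deque

class ErrorWindow:
    """Aggregate recent errors by category and derive a health signal
    from the rates, rather than probing dependencies directly."""

    def __init__(self, window=100, max_error_rate=0.1):
        self.max_error_rate = max_error_rate
        self.events = deque(maxlen=window)  # (category, is_error)

    def record(self, category, is_error):
        self.events.append((category, is_error))

    def error_rates(self):
        totals, errors = {}, {}
        for category, is_error in self.events:
            totals[category] = totals.get(category, 0) + 1
            if is_error:
                errors[category] = errors.get(category, 0) + 1
        return {c: errors.get(c, 0) / totals[c] for c in totals}

    def healthy(self):
        # Healthy only while every category is under its error budget.
        return all(r <= self.max_error_rate
                   for r in self.error_rates().values())
```

Because it's driven by errors the service actually emitted, the health check stays honest even when nobody remembered to add a probe for a new dependency.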


There's a great gem for Rails that you can use for comprehensive health checks: https://github.com/ianheggie/health_check

It can check the database connection, redis, cache, email, up-to-date migrations, and S3 credentials.



