
The mistake I've seen here is having the "health check" do little or nothing. In many web services, I've seen a /ping or /health API that the load balancer calls to ask "are you healthy?". But people get rushed or lazy, or requirements change, and they wind up with that API just doing "return true".

Now you've got a host that can't access a database, or can't do much of anything, but it can return true!

Health-check APIs should always run some level of checks on dependencies and critical internal logic (maybe a few targeted unit/integration tests?) to ensure things are truly healthy.
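A minimal sketch of what that could look like, assuming hypothetical check functions for each dependency (the real probes would be a "SELECT 1" against the DB, a cache round-trip, and so on):

```python
# A deep health check: run a small suite of dependency checks and
# only report healthy if every one passes. The check functions are
# hypothetical stand-ins for real probes.

def run_checks(checks):
    """checks: mapping of name -> callable that raises on failure."""
    results = {}
    for name, fn in checks.items():
        try:
            fn()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"fail: {exc}"
    healthy = all(v == "ok" for v in results.values())
    # The load balancer only needs the status code; the body helps humans.
    return (200 if healthy else 503), results

def check_database():
    pass  # in a real service: cursor.execute("SELECT 1")

status, body = run_checks({"database": check_database})
```

Returning the per-check results in the body makes it obvious *which* dependency is broken when the check goes red.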



Outlier detection also fixes this, without having to implement health checks that replicate the whole functionality (complexity) of the rest of the service.

10 500s in a row? Go to the timeout chair until you get better.


At work we have two endpoints on every service, `/status` and `/health-check`.

The former is basically a "return true" endpoint, which can tell you if the service is alive and reachable. The latter will usually do something like "select 1;" from any attached databases and only succeed if everything is OK.
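A toy version of the split, with sqlite3 standing in for whatever database the service is actually attached to:

```python
import sqlite3

# In-memory DB as a stand-in for the service's real data store.
conn = sqlite3.connect(":memory:")

def status():
    # Shallow: proves the process is alive and reachable, nothing more.
    return 200, "ok"

def health_check():
    # Deep: only succeeds if the attached database answers "select 1;".
    try:
        row = conn.execute("select 1;").fetchone()
        assert row == (1,)
        return 200, "ok"
    except Exception as exc:
        return 503, f"db check failed: {exc}"
```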


And the former is the one you want your Load Balancers to be checking. With a deep health check, even a brief database outage will cause every web server to be taken out of rotation, and then you're completely down for at least as many health check intervals as it takes for the LB to consider a host healthy again. Same goes for any other shared resource that is likely to affect all web servers if it becomes unavailable.


Presumably your service would not be in a very useful state anyway if the main data store it needs to function is out, so which kind of health-check you use will depend on the failure mode you want to expose to your users.


Or you just have smarter load balancers, that realize when all of their servers are acting funky and stop taking servers out of rotation.


Or to riff on this idea further: You make the concept of a node's "relative stability" part of the load-balancer logic the same way that "relative idleness" is a factor.

Then if all 100/100 nodes get taken down by some shared problem, the system simply degenerates into picking the idle-est of the 100.
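A sketch of that fail-open selection, using hypothetical "healthy" and "inflight" fields per node:

```python
def pick_node(nodes):
    """Pick the least-loaded healthy node; if *every* node looks
    unhealthy (a shared dependency is probably down), fail open and
    pick the least-loaded node overall."""
    healthy = [n for n in nodes if n["healthy"]]
    pool = healthy if healthy else nodes  # fail open when all are down
    return min(pool, key=lambda n: n["inflight"])

nodes = [
    {"name": "a", "healthy": False, "inflight": 3},
    {"name": "b", "healthy": False, "inflight": 1},
]
# All nodes unhealthy: degenerate into picking the idlest one, "b".
```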


That's not necessarily true. In Kubernetes, for example, you have liveness and readiness probes, which can each have a period, timeout, initial delay and, importantly, a number of failures tolerated before the service is killed and a new one spawned.

This allows you to have less frequent checks that are more in-depth, and basic ones that are just the `return true` type.

I guess you're correct that a widespread db outage can make many of them fail, but then new ones should be coming up very fast, since you can set a minimum number of replicas that must be available to take requests. I think you can get very close to stability even in this circumstance.

https://kubernetes.io/docs/tasks/configure-pod-container/con...
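For concreteness, a probe pair along those lines might look like this in a pod spec (the endpoint names echo the `/status` / `/health-check` split above; the numbers are examples, not recommendations):

```yaml
livenessProbe:
  httpGet:
    path: /status        # shallow "return true" style endpoint
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health-check  # deep check, e.g. runs "select 1;"
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 2
```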


Databases are 10-100x as reliable as the application tier, in my experience.


Although the OP didn't describe it as such, health checks that do a simple query are also testing connectivity to the db. Someone might have hardcoded a db address in an environment variable, or there may be connection pooling issues.


The best way I can think of is to aggregate errors over time, categorize them and build a health check around those metrics.
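One way to sketch that idea, with an arbitrary sliding window and an arbitrary error-rate threshold (both illustrative choices):

```python
from collections import deque

class ErrorWindow:
    """Aggregate recent errors by category and derive a health signal
    from the rates, rather than probing dependencies directly."""

    def __init__(self, window=100, max_error_rate=0.1):
        self.max_error_rate = max_error_rate
        self.events = deque(maxlen=window)  # (category, is_error)

    def record(self, category, is_error):
        self.events.append((category, is_error))

    def error_rates(self):
        totals, errors = {}, {}
        for category, is_error in self.events:
            totals[category] = totals.get(category, 0) + 1
            if is_error:
                errors[category] = errors.get(category, 0) + 1
        return {c: errors.get(c, 0) / totals[c] for c in totals}

    def healthy(self):
        # Healthy only while every category is under its error budget.
        return all(r <= self.max_error_rate
                   for r in self.error_rates().values())
```

Because it's driven by errors the service actually emitted, the health check stays honest even when nobody remembered to add a probe for a new dependency.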


There's a great gem for Rails that you can use for comprehensive health checks: https://github.com/ianheggie/health_check

It can check the database connection, redis, cache, email, up-to-date migrations, and S3 credentials.



