Personally, if DO doesn’t have anything new in a status post, I’d prefer seeing an update that says something like “We are continuing to work on the issue. Nothing new to report. Next update in X minutes.” That is a lot easier for me to parse than the text that someone seems to be copy/pasting into each update.
That would be a bad sign that there’s something wrong with the culture. I would hope for a postmortem that identified flaws that genuinely needed to be fixed.
Those are not mutually exclusive and actually a good idea. You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs. That's scaling and redundancy 101 - not sure why it would be something wrong.
> Those are not mutually exclusive and actually a good idea.
The goals “assuring folks that there is isolation” and “identifying flaws that need to be fixed” are somewhat contrary to each other.
The post-mortem should identify flaws in systems, processes, and thinking. It should not try to assure people that there is isolation when there is evidence to the contrary.
> You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs.
This was a multi-regional failure. So, this specific issue is also an isolation problem, among other things. You will want to ensure that this problem doesn’t happen again but you shouldn’t assure that it won’t.
Reading between the lines, it looks like their maintenance system needed to take down several Borg clusters within a single AZ, and their BGP route reflectors all ran from the same set of logical clusters. They'd tried to set up geo-redundancy by having different BGP speakers across different AZs, but they were all parented by the same set of logical clusters, and the maintenance engine descheduled all of them together. Then the network ran okay ("designed to fail static for short periods of time") until the routes expired, after which routes got withdrawn and traffic blackholed.
They realized the issue within an hour. Unfortunately, since they took down multiple replicas of their network control plane, they lost the Paxos primary and had to rebuild the configuration.
(Disclaimer: I work in Azure, I just find it fascinating to look at Google's RCAs because failure provides an insight into their architecture and risk engineering.)
A post mortem should always be a place to highlight deficiencies in processes and communicate necessary improvements put into place, not to blame. Blame should only occur if the cadence of outages becomes excessive. Complex systems are tricky, and to err is human.
Disclaimer: Ops/infra engineer in a previous life.
It could be DNS. Azure has had an all-region failure due to a single DNS provider outage. It's possible that the same DNS provider's outage was also causing problems for GCE and AWS at the same time.
That wouldn't be at the top of my list. We have "Volumes" for databases and they were inaccessible for like 6 hours. I don't think any DNS is involved in mounting these. But hey, there's always a lot of crap hidden behind the scenes :)
This is OT, but I have a droplet on DO and I'm amazed at the amount of malicious traffic it gets. Is it normal for a very private VPS to receive thousands of SSH attempts per hour? I have fail2ban installed, and the jail is so busy it's quite astounding. Can anyone with more web hosting experience weigh in?
I work for a web hosting company in Texas, and this is ridiculously common. Any public IP with any public service at all will be poked, prodded, and generally made uncomfortable by every bot and crawler you can think of, trying common password combinations and scanning for common vulnerabilities in popular software. This catches so many of our customers by surprise, who tend to mistakenly believe they're being targeted in some kind of attack. Generally they're not, unless they're running something vulnerable and one of the bots noticed.
Fail2ban is great to at least stem the tide. It's good at slowing down SSH brute forcing, and can be set up to throttle poorly behaved scrapers so your site isn't getting hammered constantly. If you can deal with the inconvenience, it's even better to put services that don't need to be truly public behind an IP whitelist. That stops the vast majority of malicious traffic, most of which is going after the low hanging fruit anyway.
Otherwise, it's kinda just a fact of life. With the good traffic also comes the bad.
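For reference, the fail2ban side of that setup is only a few lines; a minimal sketch (jail thresholds here are illustrative, not anyone's specific config):

```ini
# /etc/fail2ban/jail.local — ban an IP for an hour after 3 failed
# SSH logins within a 10-minute window (values are illustrative)
[sshd]
enabled  = true
port     = ssh
maxretry = 3
findtime = 600
bantime  = 3600
```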
Presumably iptables then handles the reject/drop, rather than SSH sessions being created that fail at login? (Disclaimer: I don't know exactly how any sshd handles dropping clients that attempt to connect with a password when you have set it to cert-only, but it seems like dropping at the firewall would be more efficient.)
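Right; combined with the whitelist idea above, a sketch of the firewall rules might look like this (the trusted address is a placeholder from the documentation range, and the rules need root and won't survive a reboot unless saved):

```shell
# Allow SSH only from one trusted address; silently drop everyone else.
# Dropped packets never reach sshd, so no auth attempt is logged at all.
iptables -A INPUT -p tcp --dport 22 -s 203.0.113.5 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j DROP
```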
For my DO droplet I also changed the SSH port to a silly-high random port, and the last time I checked it had reduced the number of nosy bots knocking at the door to zero.
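That change is a one-liner in the daemon config; a sketch (the port number is just an arbitrary example):

```
# /etc/ssh/sshd_config
Port 23231                  # move sshd off 22 to shed scanner noise
PasswordAuthentication no   # keys only, in case a scanner finds the port anyway
```

Remember to restart sshd and open the new port in the firewall before closing your current session, or you can lock yourself out.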
I used to do so too, but sometimes had problems with very restrictive firewalls killing connections to high/unknown ports when traveling. They would only allow vpns or ssh to connect.
Cheers for weighing in. A whitelist is a good solution, since the sheer amount of attempts is making me uncomfortable. It seems to be accelerating over time as well which is even more disturbing.
> Is it normal for a very private vps to receive thousands of ssh attempts per hour?
Yes. The thing about the IPv4 space is that it’s really not that big (3,706,452,992 public addresses), so it’s pretty trivial to poke every single one, especially if you fine-tune your port list.
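A quick back-of-the-envelope check on how trivial that is, assuming a scanner probing at a modest one million packets per second (tools like masscan and zmap can go considerably faster):

```python
# How long would it take to send one probe to every public IPv4 address?
PUBLIC_IPV4 = 3_706_452_992  # figure quoted above
RATE = 1_000_000             # probes per second (assumed)

hours = PUBLIC_IPV4 / RATE / 3600
print(f"about {hours:.1f} hours to sweep the whole space")
```

So a single machine can knock on every door on the internet in roughly an hour, which is why any public IP starts seeing probes almost immediately.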
The most common advice is to hide your private services. Instead of using port 22 for SSH, use 23231 instead. It’s a little more annoying, but you can also use port knocking: to open port 22 (or whatever port you like), first you poke port 23123, then 7654, then 39212 within a short period of time, and the port-knocking software will open up port 22. (Or use a combo of both: change the default port and add port knocking.)
It won’t stop people “ringing the doorbell” to see if anyone is home, but it will help with them then trying to brute-force past the login prompt.
Another really good option is single packet authorization. Which, again, a little more complex than port knocking but also another step up in security.
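For the port-knocking piece, a minimal knockd sketch using the example sequence above (the log path and timeout are illustrative):

```
[options]
    logfile = /var/log/knockd.log

[openSSH]
    sequence    = 23123,7654,39212
    seq_timeout = 15
    command     = /sbin/iptables -I INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
    tcpflags    = syn
```

knockd watches for the SYN sequence and runs the command with %IP% replaced by the knocker's address; in practice you'd pair it with a matching close command or timeout so the hole doesn't stay open forever.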
I work for a hosting company and this is totally normal. Digital Ocean and other VPS providers' IP ranges are specifically targeted, since many amateurs run servers there.
If you've disabled password logins, then just don't worry about it. fail2ban is overkill; you can rate limit with firewalld or iptables without needing extra tools.
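The plain-iptables version is only two rules, using the `recent` match (the threshold numbers are illustrative, and the rules need root):

```shell
# Track new SSH connections per source IP, and drop any IP that opens
# more than 4 new connections within 60 seconds.
iptables -A INPUT -p tcp --dport 22 -m state --state NEW -m recent --set --name SSH
iptables -A INPUT -p tcp --dport 22 -m state --state NEW -m recent \
         --update --seconds 60 --hitcount 4 --name SSH -j DROP
```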
> Is it normal for a very private vps to receive thousands of ssh attempts per hour?
Well, I haven't bothered looking in a long time. But back when I first got a cable modem in the late '90s, logs of the malicious access attempts filled up my hard drive in just a couple of weeks. I don't remember the size of the HD, but I can only imagine the situation has gotten much, much worse since then.
Yup, this is normal. When I can't change the SSH port (e.g. for compatibility), I switch f2b to ban permanently, which should reduce the incurred load by black-holing connections instead of attempting authentication, as the list grows more comprehensive over time. (This won't affect other services, e.g. Apache, in case a user is unwittingly part of a botnet.)
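For anyone wanting the same behavior, the permanent-ban switch is one line in fail2ban; a sketch, not a full config (a negative bantime means never unban):

```ini
# /etc/fail2ban/jail.local
[sshd]
enabled = true
bantime = -1   # ban forever instead of the default 10 minutes
```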
Hah, I also tend to up the attempts... If you have more than one server, you can always tunnel through one of the others if you lock yourself out. Worst case, of course, there's the VPS console.
Off-topic reply: you should monitor the amount of malicious traffic coming from DO networks too. (I did for a few customers at different ISPs, and it's insane.)
I took a look at the flags on these stories and am pretty sure they're from users who are tired of "X is down" submissions, which tend to get posted a lot and often to be a little on the trivial side.
However, since several HN users are expressing that this issue is genuinely affecting them, I've turned off flags on the OP about this and merged the comments here.
The "across all regions" part makes this one different for me, and interesting even though I'm not a customer of their block storage. I'm curious about the sequence of events, or design choices, that would cause that.
That would be pretty cool, but to have that you need a high-network-latency solution, i.e., pretty much a cold backup. For a while I thought it was a last-century option, but having experimented with it for some time now, it's the option with the lowest impact on system performance. More importantly, it's reasonably resilient.
I've read your comment now about 4 times and all I have come up with is "huh?"
Literally thousands if not millions of organisations operate multi-DC infrastructure across the planet.
Is it harder than setting up a single box in one DC? Yes.
Is it harder than setting up a mini-cluster of boxes in one DC? Yes.
Is it rocket science? No.
Their block storage is such a failure. I've been back and forth with support for over 2 months now trying to get lifecycle rules to automatically delete files, and it's still not resolved.
Since you're trying to "delete files with lifecycles", I'm quite sure your problem is with their object storage (called Spaces), not their block storage.
I was always wondering how I could proactively find out when something like this breaks or a service has an outage. As a result, I built this tool: http://incidentok.com
I'm curious about the Slack integration. Can you provide some more info on what that looks like? E.g., just a real-time message when it goes down? A daily message of statuses? Any sort of customization with it?
I currently use a soup of Zapier zaps to take care of this problem.
Hey. Thanks
IncidentOK will send a message to Slack using a webhook as soon as an incident is reported by any product. I hadn't thought of sending a status every day, but I'm open to suggestions.
Last night I was testing a DO managed Kubernetes cluster with a persistent volume claim, and the volume took 15 minutes to reattach after the pod was rescheduled to another host. I thought it was just some weird hiccup and went to bed.
The incident report indicated the problem started 4 hours ago (around 9pm GMT), but I was having problems around 4pm. It's definitely not a 2-hour incident.
Our disks in London went down at about 8:45pm UTC (the 10-minute 100%-disk-utilization alert triggered at 8:35), and DO's recovery message was sent out at about 2am UTC. We switched our service (keychest.net) back on at 3:15am.