DigitalOcean block storage is down

CaliforniaKarl · on Sept 30, 2019

Personally, if DO don’t have anything new in a status post, I’d prefer seeing an update that says something like “We are continuing to work on the issue. Nothing new to report. Next update in X minutes.” That is a lot easier for me to parse than the text that someone seems to be copy/pasting in each update.

iamsb · on Oct 1, 2019

Would be great if statuspage.io has a button when pushed publishes message similar to your suggestion.

kyledrake · on Sept 30, 2019

What unholy thing did they do that broke it across 12 different datacenters, good lord.

alexeldeib · on Oct 1, 2019

This does seem to indicate a notable lack of isolation for the blast radius between DO datacenters. Would be interesting to see the post mortem.

protomyth · on Oct 1, 2019

I get the feeling that whoever writes the post-mortem is going to have a bit of pressure to assure folks that there is isolation going forward.

klodolph · on Oct 1, 2019

That would be a bad sign that there’s something wrong with the culture. I would hope for a postmortem that identified flaws that genuinely needed to be fixed.

viraptor · on Oct 1, 2019

Those are not mutually exclusive and actually a good idea. You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs. That's scaling and redundancy 101 - not sure why it would be something wrong.

klodolph · on Oct 1, 2019

> Those are not mutually exclusive and actually a good idea.

The goals “assuring folks that there is isolation” and “identifying flaws that need to be fixed” are somewhat contrary to each other.

The post-mortem should identify flaws in systems, processes, and thinking. It should not try to assure people that there is isolation when there is evidence to the contrary.

> You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs.

This was a multi-regional failure. So, this specific issue is also an isolation problem, among other things. You will want to ensure that this problem doesn’t happen again but you shouldn’t assure that it won’t.

protomyth · on Oct 1, 2019

I would think having all zones go down is a flaw that genuinely needed to be fixed.

klodolph · on Oct 1, 2019

That’s not the flaw, that’s the outcome. The purpose of a post-mortem is to identify the flaws that caused that outcome, and ways to fix those flaws.

swsieber · on Oct 1, 2019

Whoever broke it is going to feel significant pressure to actually isolate things too.

notyourday · on Oct 1, 2019

Propagating mistakes across all the things is devops

jbarham · on Oct 1, 2019

DevOps Borat: "To make error is human. To propagate error to all server in automatic way is #devops."

https://twitter.com/DEVOPS_BORAT/status/41587168870797312

nodesocket · on Oct 1, 2019

Google Cloud recently had a global outdate. DevOps tools that interact with all resources across data centers are primarily the culprit.

sterlind · on Oct 1, 2019

Google's RCA is here, I believe: https://status.cloud.google.com/incident/cloud-networking/19...

Reading between the lines, it looks like their maintenance system needed to take down several Borg clusters within a single AZ, and their BGP route reflectors all ran from the same set of logical clusters. They'd tried to set up geo-redundancy by having different BGP speakers across different AZs, but they were all parented by the same set of logical clusters, and the maintenance engine descheduled all of them together. Then the network ran okay ("designed to fail static for short periods of time") until the routes expired, after which routes got withdrawn and traffic blackholed.

They realized the issue within an hour.. unfortunately, since they took down multiple replicas of their network control plane, they lost Paxos primary and had to rebuild configuration.

(Disclaimer: I work in Azure, I just find it fascinating to look at Google's RCAs because failure provides an insight into their architecture and risk engineering.)

temikus · on Oct 3, 2019

Or just bad rollout procedures.

alexeldeib · on Oct 3, 2019

potato, potato ;)

bluedino · on Sept 30, 2019

Probably the old "one command ran on everything"

astrodust · on Oct 1, 2019

tmux is a dangerous tool in the wrong hands.

rubbingalcohol · on Oct 1, 2019

to be fair, it's dangerous even in the best hands. mistakes happen but business processes need to be in place to prevent catastrophes...

every time i see something like this, my inclination is to blame the CTO, not the engineer who pulled the trigger.

toomuchtodo · on Oct 1, 2019

A post mortem should always be a place to highlight deficiencies in processes and communicate necessary improvements put into place, not to blame. Blame should only occur if the cadence of outages becomes excessive. Complex systems are tricky, and to err is human.

Disclaimer: Ops/infra engineer in a previous life.

astrodust · on Oct 1, 2019

I wonder how many outages these days start with something like "kubectl apply" and then things go horribly awry.

solotronics · on Oct 1, 2019

We can blame whoever we want but you better believe shit rolls downhill at most places.

GhettoMaestro · on Oct 1, 2019

Until it is a big enough F-up that an executive's head must roll.

mdaniel · on Oct 1, 2019

There's a famous corollary to that approach: "Fire him? Why, I just spent 10 million dollars _educating_ him"

(regrettably I can't find any evidence it's a true (quote|story), but I enjoy the sentiment)

nonbirithm · on Oct 1, 2019

It could be DNS. Azure has had an all-region failure due to a single DNS provider outage. It was possible that same DNS provider's outage was also causing problems for GCE and AWS at the same time.

https://news.ycombinator.com/item?id=19812919

dc352 · on Oct 1, 2019

That wouldn't be at the top of my list. We have "Volumes" for databases and they were inaccessible for like 6 hours. I don't think any DNS is involved in mounting these. But hey, there's always a lot of crap hidden behind the scenes :)

markonen · on Oct 1, 2019

I would be absolutely amazed if DNS was not involved in mounting a block storage volume.

pmlnr · on Oct 1, 2019

Bad puppet/ansible/etc commit is the most probably explanation.

mdellavo · on Oct 1, 2019

dee ennn esss

hinkley · on Oct 1, 2019

A bug that has no obvious side effects that only became visible once all data centers were upgraded?

Happens. Statistics are hard.

pastrami_panda · on Oct 1, 2019

This is OT, but I have a droplet on DO and I'm amazed at the amount of malicious traffic it gets. Is it normal for a very private vps to receive thousands of ssh attempts per hour? I have fail2ban installed and the jail is so busy it's quite astounding. Anyone with more web hosting experience that can weigh in?

zeta0134 · on Oct 1, 2019

I work for a web hosting company in Texas, and this is ridiculously common. Any public IP with any public service at all will be poked, prodded, and generally made uncomfortable by every bot and crawler you can think of, trying common password combinations and scanning for common vulnerabilities in popular software. This catches so many of our customers by surprise, who tend to mistakenly believe they're being targeted in some kind of attack. Generally they're not, unless they're running something vulnerable and one of the bots noticed.

Fail2ban is great to at least stem the tide. It's good at slowing down SSH brute forcing, and can be set up to throttle poorly behaved scrapers so your site isn't getting hammered constantly. If you can deal with the inconvenience, it's even better to put services that don't need to be truly public behind an IP whitelist. That stops the vast majority of malicious traffic, most of which is going after the low hanging fruit anyway.

Otherwise, it's kinda just a fact of life. With the good traffic also comes the bad.

davrosthedalek · on Oct 1, 2019

I always switch my outward-facing ssh servers to key-only. Is there any advantage for running fail2ban additionally?

pbhjpbhj · on Oct 1, 2019

Presumably then iptables handles the reject/drop rather than SSH sessions being created that fail at login? (Disclaimer, I don't know exactly how any sshd handles dropping clients who attempt to connect by password when you have set it to cert only; but it seems like dropping at the firewall would be more optimal).

pnutjam · on Oct 1, 2019

It makes your logs easier to read because they aren't full of ssh failures.

ac2u · on Oct 1, 2019

for my DO droplet I also changed the ssh port to a silly-high random port and the last time I checked it reduced the amount of nosy bots knocking at the door to zero.

davrosthedalek · on Oct 1, 2019

I used to do so too, but sometimes had problems with very restrictive firewalls killing connections to high/unknown ports when traveling. They would only allow vpns or ssh to connect.

pastrami_panda · on Oct 1, 2019

Cheers for weighing in. A whitelist is a good solution, since the sheer amount of attempts is making me uncomfortable. It seems to be accelerating over time as well which is even more disturbing.

Crosseye_Jack · on Oct 1, 2019

> Is it normal for a very private vps to receive thousands of ssh attempts per hour?

Yes. The thing is about the IPv4 space is that it’s really not that big (3,706,452,992 public addresses) so it’s pretty trivial to poke every single one esp if you fine tune your port list.

The most common advice is to hide your private services. Instead of using port 22 for ssh use 23231 instead. It’s a little more annoying but you can also use port knocking. So to open port 22 (or what ever port you like) first you got to poke port 23123 then 7654 then port 39212 within a short period of time then the port knocking software will open up port 22. (Or a combo of both change default port and port knocking)

It won’t stop people “ringing the door bell” to see if anyone is home, but it will help with the then trying to brute force pass the login prompt.

lunchables · on Oct 1, 2019

Another really good option is single packet authorization. Which, again, a little more complex than port knocking but also another step up in security.

https://www.cipherdyne.org/fwknop/docs/SPA.html

ollybee · on Oct 1, 2019

I work for a hosting company and this is totally normal. Digital Ocean and other VPS providers IP ranges are specifically targeted as many amateurs running servers there.

If you've disabled password logins then just don't worry about it. fail2ban is overkill you can rate limit with firewalld or iptables withou needing extra tools.

pmlnr · on Oct 1, 2019

It is "normal". Even my home fix IP gets it without any service running on it other than ssh.

mercer · on Oct 1, 2019

I had the same experience on DO as well as a few other providers.

Changing the ssh port to something in the 50_000 range drastically reduced the number of attempts and left my logs much cleaner :).

gmiller123456 · on Oct 1, 2019

   Is it normal for a very private vps to receive thousands of ssh attempts per hour?

Well, I haven't bothered looking in a long time. But, back when I first got a cable modem back in the late 90's the malicious access attempts filled up my hard drive in just a couple of weeks. I don't remember the size of the HD, but I can only imagine the situation has gotten much, much worse since then.

tomxor · on Oct 1, 2019

Yup this is normal, when I can't change the SSH port (e.g for compatibility) then I switch f2b to permanently ban, which should reduce any incurred load by black-holing instead of attempting authentication as the list grows more comprehensive over time (this wont affect other services e.g apache, in-case a user is unwittingly part of a botnet).

pnutjam · on Oct 1, 2019

Fatfingered typist beware...

jaster · on Oct 1, 2019

Better use a password manager in this case (or even better, use public key auth!)

tomxor · on Oct 1, 2019

hah, i also tend to up the attempts... If you have more than one server you can always tunnel through one of the others if you lock yourself out. Worst case of course VPS console.

mobilemidget · on Oct 1, 2019

Off topic reply, you should monitor the amount of malicious traffic coming from DO networks too. (I did for a few customers at different isps and its insane)

hartator · on Sept 30, 2019

Not sure why the previous incident page got flagged. This is the new one.

It's affecting us for real. Making almost our whole service - serpapi.com - down. As we are storing database files on block storage volumes.

dang · on Sept 30, 2019

I took a look at the flags on these stories and am pretty sure they're from users who are tired of "X is down" submissions, which tend to get posted a lot and often to be a little on the trivial side.

However, since several HN users are expressing that this issue is genuinely affecting them, I've turned off flags on the OP about this and merged the comments here.

tyingq · on Sept 30, 2019

The "across all regions" part makes this one different for me, and interesting even though I'm not a customer of their block storage. I'm curious about the sequence of events, or design choices, that would cause that.

astrodust · on Oct 1, 2019

I reported it and they were like "what? oh..."

Then the status page changed and as things got worse, the dashboard page got an announcement as well.

louwrentius · on Oct 1, 2019

Isn't Digital Ocean running Ceph for their block storage?

I would wonder - as others suggested - that they may have stretched the cluster across datacenters ?!

Would be interested in the post-mortem.

ngrilly · on Oct 1, 2019

Yes, DO uses Ceph: https://blog.digitalocean.com/why-we-chose-ceph-to-build-blo...

jacquesm · on Oct 1, 2019

Thank you Digital Ocean for once again proving that 'The Cloud' is not a backup.

vinw · on Oct 1, 2019

'The Cloud' is _a_ backup. Just don't let it be your only backup!

pnutjam · on Oct 1, 2019

It's not backed up if you don't have 3 copies.

lunchables · on Oct 1, 2019

The saying we always use is "If it is not in 3 places, it doesn't exist."

And another: "3 copies, at least one offsite"

stephenr · on Oct 1, 2019

This is your weekly reminder that anything you want to be reasonably “HA” should span multiple vendors in multiple DCs.

dc352 · on Oct 1, 2019

that would be pretty cool but to have that, you need a high-network-latency solution, i.e., pretty much cold back-up. For some time I thought it's pretty last century option but having been experimenting for some time now, it's the option with lowest impact on system performance. More importantly, it's reasonably resilient.

stephenr · on Oct 1, 2019

I've read your comment now about 4 times and all I have come up with is "huh?"

Literally thousands if not millions of organisations operate multi-DC infrastructure across the planet.

Is it harder than setting up a single box in one DC? Yes. Is it harder than setting up a mini-cluster of boxes in one DC? Yes. Is it rocket science? No.

Bombthecat · on Oct 1, 2019

Yeah, the myth that "just use aws" to have 99, 9999999 percent uptime is coming to an end...

stephenr · on Oct 1, 2019

Oh I'm sure the myth will persist for many years.

simplehuman · on Oct 1, 2019

Anyone have a review of using DO k8s or DO managed DB in production?

unilynx · on Oct 4, 2019

DigitalOcean just posted a post-mortem on http://status.digitalocean.com/incidents/g76kgjxqrzxs

(the same url)

privateSFacct · on Oct 1, 2019

Higher latency (per status) is not end of world especially if it’s just “may experience” higher latency.

erikrothoff · on Oct 1, 2019

That wording kinda ticked me off because our volume was completely inaccessible. Rebooting did not mount it at all.

imglorp · on Oct 1, 2019

Hrm, Atlassian BitBucket is also down. Just a coincidence? Does BB use DO?

https://bitbucket.status.atlassian.com/incidents/4t1pkwrdtl8...

iamaelephant · on Oct 1, 2019

BitBucket definitely doesn't use DO.

seaghost · on Oct 1, 2019

Their block storage is such a failure. I’m back and forth with support to automatically delete files with lifecycles for over 2 months now and it’s still not resolved.

ngrilly · on Oct 1, 2019

Since you're trying to "delete file with lifecycles", I'm quite sure your problem is with their object storage (called Spaces), and not their block storage.

sb8244 · on Oct 1, 2019

It looks like they have just updated it as resolved and monitoring.

sunasra · on Oct 1, 2019

I was always wondering how I can get know proactively if something like this break or some service has an outage. As a result, I have built this tool( http://incidentok.com )

sunsetMurk · on Oct 1, 2019

great idea - going to give it a whirl this week.

i'm curious about the slack integration. can you provide some more info on what that looks like? eg. just a message in real-time when it goes down? a daily message of statuses? etc. Any sort of customization w/ it?

I currently use a soup of zapier zaps to take care of this problem.

sunasra · on Oct 3, 2019

Hey. Thanks IncidentOK will send message to slack using webhook as soon as incident reported by any product. Didn't thought to send status everyday. But I am open for suggestions

Message looks like this https://imgur.com/jjbMKj8

sodosopa · on Oct 1, 2019

So that’s why bot attacks and spam traffic was lower.

golanggeek · on Sept 30, 2019

This is really down for more than 2 hours!!!

sondh · on Oct 1, 2019

Last night I was testing DO managed Kubernetes cluster with persistent volume claim and the volume took 15 minutes to reattach after the pod is rescheduled to another host. I thought it was just some weird hiccup and went to bed.

The incident report indicated the problem started 4 hours ago (around 9pm GMT) but I was having problem around 4pm. It's definitely not a 2-hour incident.

dc352 · on Oct 1, 2019

our disks in London went down at about 8:45pm UTC (10 mins 100% disk utilization alert triggered at 5 to) and DO recovery message was sent out at about 2am UTC. We switched our service (keychest.net) on at 3:15am

irfanbaigse · on Oct 1, 2019

DigitalOCean bad experience

jbverschoor · on Oct 1, 2019

Their ad was “you’ve been developing like a beast and your app is ready to go live”

DO is a nice thing to play around with and maybe launch something, but I wouldn’t run full production on it.