
  > Root Cause Analysis
  > [...]
  > [List of technical problems]
No, the root cause is you have no senior engineers who have been through this before. A collection of distributed remote employees, none of whom has enough experience to know any of the list of "Basic Knowledge Needed to Run a Website at Scale" that you list as the root causes. $30 million in funding and still running the company like a hobby project among college roommates.

Mark my words, the board members from the VC firms will be removed by the VC partners due to letting the kids run the show. Then the VC firms will put an experienced CEO and CTO in place to clean up the mess and get the company on track. Unfortunately, they will probably have wasted a couple of years and be down to their last million dollars before they take action.



I am a senior engineer who has seen shit go down. I am quite literally the graybeard.

I am not a GitLab customer, I am not a startup junkie, and I'm usually considered one of the more conservative (in action, not politics) engineers in my peer group in technology adoption.

The cloud is just someone else's computer.

However, I've also seen graybeards who should have known better fuck something up. I've seen a team of smart people who in a moment of crisis made the wrong decision. I am currently in an organization that is full of careful people and have still seen data loss.

I went trawling through LinkedIn for GitLab employees, and they certainly have their fair share of senior engineers. If you want to fault them for being a remote company, that's fine, but is it that different from a Fortune 500 company that has developers in the Bay Area, Austin, India, China, Budapest, and remote workers in other locations?

Or is a company only legitimate if it's in an open space in the Valley?


I couldn't agree with you more.

As your beard greys you realize absolutely everywhere is a mess. Everybody is an imposter and nothing matches the ideals you think should exist. The most capable people are just as prone to fat fingering critical commands as the greenhorns.

People have the wrong attitude towards failure, and it's actually quite harmful. Unless you actively study failure and make avoiding it the #1 priority of your company, you're absolutely doomed to commit a serious error at some point. And usually, that's fine.

We're talking about a distributed version control system. Half the point is resilience to data loss. Compound that with the final result, which was a site down for a day and the loss of 6 hours of data. I've lost a day of work before by doing nothing, and I've worked hard for a few hours and accidentally deleted the result. If you haven't, you're probably lying to yourself. I didn't fall on my sword. It's just not that big of a deal. If it happens frequently? Sure. But it's going to happen once to a lot of people.

One of the most important aspects of avoiding failure is being amiable when it does happen. Fear of failure causes quite a bit of failure, and quite a bit of stupid behavior aimed at hiding and avoiding it.

I also simply don't understand the vitriol towards remote work.


The issue is not that data loss occurred per se, nor is it that destructive accidents and oversights don't happen to senior people. The issue is that GitLab's surprisingly amateur and sloppy practices, many of which are blatantly obvious to people with a medium amount of ops experience, have bled through every aspect of this incident since it first occurred.

They didn't just lose data. They lost data and all of their actual backups were invalid. They had to restore from a system image that was taken for non-backup purposes and that, as luck would have it, was able to function as a backup in this instance. Not having working backups for months-long stretches rises to the level of negligence or incompetence on the part of whoever is supposed to be supervising their infrastructure.

We all know that backups in the general sense are crucial and that they don't get done nearly often enough, but being lazy about backing up the home directory on your laptop is a lot different than allowing the company to sit without working backups for months.

I'm not saying that this doesn't happen to senior engineers who are victims of bad management, but qualified leadership doesn't allow it.

On top of that, it emerges that this condition occurred because they don't have good practices around when to log in to the master database server. They remove binary data directories before they pull down new copies. They don't know how to configure PgSQL: they have to do a full standby resync after a couple of hours of high DB load because they have no WAL archiving, no replication slots, not even a semi-sane wal_keep_segments/min_wal_size. They have no automated backup sanity check (let alone a schedule of human-verified backup restores), and their other monitoring and alerting practices are just as inadequate. Do I really need to go on? I could, because in this thread alone several other major faux pas are mentioned.
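
For reference, the kind of configuration the parent says was missing might look roughly like this in postgresql.conf. This is a hedged sketch only: the archive path is a placeholder, the sizes are illustrative, and none of it is GitLab's actual config.

```ini
# postgresql.conf fragment (illustrative only, PostgreSQL 9.6-era setting names)
wal_level = replica                       # required for WAL archiving and streaming replication
archive_mode = on
archive_command = 'cp %p /mnt/archive/%f' # placeholder; use a real archiver in production
wal_keep_segments = 256                   # keep enough WAL around for a lagging standby
min_wal_size = 1GB                        # don't recycle WAL too aggressively
max_replication_slots = 4                 # slots stop the primary discarding WAL a standby still needs
```

Replication slots also have to be created (e.g. `SELECT pg_create_physical_replication_slot('standby1');`) and referenced from the standby's recovery configuration; a slot alone changes nothing.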

I'm not sure how many of these sloppy, amateur errors you want to allow to stack on top of each other before you start thinking that GitLab is semi-responsible for this and that it's not within the typical senior-person margin of error, but it passed that threshold a long time ago for me.

GitLab severely underpays any candidate not based in a top-10 real estate market (we're talking 50-60% under market), because they punish candidates based on how much cheaper the real estate in their home market is than in New York City. The consensus is that this impedes their ability to attract good talent, and I would say the events of the last couple of weeks have demonstrated that with spectacular clarity.

At least in my case, the impression has nothing to do with their operation as a remote company -- I'm a full-time remote worker and I learned about GitLab's atrocious salary formulae when I was checking them out as a potential employer because I wanted to move to an all-remote company (instead of the partially-remote company I work in now).

I'm sure that most of GitLab's engineers are good engineers relative to their experience levels. I'm also sure a small handful who accidentally align with their salary formula are senior in their particular fields. And I'm thirdly sure that no one with any inkling of experience in running a stable, reliable, production-level service and infrastructure has been allowed any fractional amount of influence in their infrastructure and deployment procedures.


> Mark my words, the board members from the VC firms will be removed by the VC partners

I have never, ever, seen this. Every firm has its own internal politics [1] but rarely if ever would they do this. They are more likely to just ignore it.

There is a belief that it takes a decade to tell if a VC is any good or not, and that includes "learning experiences" (all on the LP's dime of course).

> Then VC firms will put an experienced CEO and CTO in place

Now this I have seen. It even works sometimes (e.g. Eric Schmidt/Google).

[1] I have had a firm invest in which all decisions were made by a single partner. I also had a firm invest (a sizable sum!) in which other senior partners never met me and only learned what my company even did when I gave a presentation at one of their LP meetings. Also some very large funds allow senior partners to make small seed investments ("science projects") without formal approval from the partnership.


At a startup I worked at, the VP of Engineering was brilliant. He was probably the smartest person I've ever worked for, and the most hard working. He was online almost all hours of the day, working. He also insisted on a 9-to-5 schedule for all the engineers, because he believed that killing your engineers with work was not a scalable way to build a team. He was great.

But the first month I was there, I kept pressing him on what our disaster recovery plan was. His answers were weak at best. It was never tested, and he had only broad ideas of how much time a full recovery would take. I never understood his reluctance to test full disaster recovery, but as everyone knows: unless you have tested DR, you don't have DR.

It was very scary, but in the several years I was there, we never had the database go down hard and lose data. That was more blind luck than anything else, though. If we had had a data outage, it would probably have been far worse than GitLab's.


Maybe in his own cocky way, his disaster recovery plan was to be smart enough to never have a disaster. The problem with this approach is that it only works for a small company, and many people figure out the hard way that smarts don't scale.


Or maybe it was a small startup where he and the poster were the only tech guys, struggling to perform all the tasks that needed doing, backup and DR included.


I know somebody who knows the guy who deleted the data that caused the outage, and in his words (that is, the words of the guy who knows the GitLab employee), the GitLab engineer is one of the smartest people he's ever known; in fact, he's actually brilliant. So you can rest assured that the data loss wasn't caused by an inexperienced kid.


You're aware he's in this thread, right? And the hearsay matches up with his comments, from what I've read.


Yes, he's one of our smartest people.


They also openly pay well below market rate, so they should expect to get what they pay for.


Can you elaborate on that?

For instance, I just looked at their job listings: they advertise a range for annual compensation, and the high end of that range looks about right for the locations I checked (SF, and London in the UK) for PE positions.

https://about.gitlab.com/jobs/production-engineer/


Massive thread on this yesterday (with multiple appearances from GitLab staffers) at https://news.ycombinator.com/item?id=13608463 .

The tl;dr is that they start out with a kind of OK but not very interesting rate for New York City. Then they punish residents of other cities based on how much cheaper the rent in their city is than the rent in NYC.

For example, if the cost of rent in your city is 35% the cost of rent in NYC (as determined by the third-party rent index they reference), your salary multiplier will be 0.35, meaning GitLab will offer someone in NYC 130k for that job, but they'll only offer you 45.5k. The experience modifiers range from -20% to +20% so they're not going to help much.

As NYC is literally one of the top five most expensive real estate markets on the planet, most non-NYC cities get totally pummeled by GitLab's salary calculator, and the result is what we see here: an enterprise with $30M in funding that can't figure out how to make backups.
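
To make the arithmetic concrete, here is a hypothetical sketch of the formula as described in these comments; the function name is made up, and the real calculator has more inputs than this.

```python
def offer(nyc_benchmark: float, rent_index: float, experience_mod: float = 0.0) -> float:
    """Scale an NYC benchmark salary by a city's rent index, then apply an
    experience modifier (described above as ranging from -0.20 to +0.20)."""
    return nyc_benchmark * rent_index * (1.0 + experience_mod)

# The worked example from the comment above: a 130k NYC benchmark
# in a city whose rent index is 0.35.
print(round(offer(130_000, 1.00)))        # 130000
print(round(offer(130_000, 0.35)))        # 45500
print(round(offer(130_000, 0.35, 0.20)))  # 54600, even with the maximum experience bump
```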


Using that same job calculator (for the developer position) near the bottom of this page: https://about.gitlab.com/jobs/developer/, the rates are way, way off for my area.

They are out of their minds if they think the top rate for a Senior engineer in Salt Lake City with above average experience is $78k. Most other areas also seem pretty low from what I've looked into but I suppose a few areas could be outliers in their calculator.

Their docs on the calculator are here: https://about.gitlab.com/handbook/people-operations/global-c...

It's a bit strange that they cut or raise your pay by that calculator when you move cities.


> It's a bit strange that they cut or raise your pay by that calculator when you move cities.

That is odd. Usually, if you make more based on your previous location, a company won't actually claw anything back; you just likely won't be getting any raises.


It's a bit odd that living in Tokyo you get 90k while living in SF you get 220k. I'm fairly certain the rent difference isn't 10k per month.


Supply and demand


Of what? A developer in that location? What is that benefit?

Having worked on both coasts, I can say that the quality is the same. Culture and quantity appears to be the biggest difference.


Yep, a developer in that location. The market puts a high price on that, regardless of same-quality people being cheaper elsewhere. If you think the market is wrong, then go try to arbitrage it. Maybe you can...


Wow, their offer for a junior dev in Madison is like a third of Epic's starting salary. That's brutal...


I maxed out everything for my city (Cleveland) before the calculator reached a salary I'd consider, and I'm hardly a senior engineer.

I wouldn't call their salary range a spit in the face, but it's probably around $20k below market rates where I live. Benefits are competitive, in my opinion.


Well, it's hard to judge the algorithm. Some numbers are decent, if you are given "lead with lots of experience".

1) It's hard to tell how they assign rank. If they accept any programmer that shows up, the rates are fine. If they have Google level interviews to filter for only Google level candidates, who will join at "junior" level, it's terrible.

2) The numbers are in dollars, so for anyone paid outside the US they are utterly meaningless. It's not a job; it's gambling on the exchange rate and the exchange fees.


Compensation is more than just salary


Read through their benefits: they are competitive, but nowhere near enough to make up for the difference in salary.


Didn't Samsung have some phones catch on fire? Didn't Delta go down recently?


Nah, let's just all board the anti-GitLab train. I honestly don't understand why the HN crowd has their panties in a knot. HN wasn't this bad even during the VW fiasco.


>> the root cause is you have no senior engineers who have been through this before

They openly publish their database hostnames in this postmortem (db1.cluster.gitlab.com and db2.cluster.gitlab.com). These actually have public DNS that resolves. The last straw: port 22 on each server runs an internet-facing sshd (the fact that password auth is disabled is of little consolation).

A production database server should NEVER HAVE a public IP address to start with. This is simply unacceptable and proves they don't have a single person qualified to handle infrastructure. Their only concern is that their developers can ssh into every production server without having to deal with vpns or firewalls.

Huge red flag that your data cannot be trusted.


I can't tell if this is some kind of joke or not.

There's absolutely nothing wrong, whatsoever, with having a public IP address on a production database server.


I wouldn't say there's "absolutely nothing wrong with it". Someone may have a valid use case to leave a database server exposed to the public internet, but they probably don't.

The only things that should be public facing are things that clients need to access directly. In most cases, that's just an HTTP server. In GitLab's case, it's an HTTP server and a git server.

One of the most important principles of infrastructure security is to minimize the attack surface. No matter how locked down you have it, there are always zero days and other exploits out there. This is a concern even if you block the database port at the firewall but leave some other services (like SSH) open; if any of those services get compromised, it has the potential to allow for the compromise of the rest of the box.

If there's no need for the public to connect to the server, there's no need to take the risk of leaving any of its services open. And if there's no need to have its services open, there's no need for the box to even be addressable from the public internet (i.e., no reason to have a public IP address).

Put the server on the internal network and connect over a secure mechanism like a VPN. Not only do you not have to worry about strangers connecting to your servers, you don't have to worry about whitelisting or blacklisting individual IPs in your firewall (instead, you whitelist the applicable internal subnets, which should themselves be restricted by resource access level). You don't have to worry about your firewall's rules getting wiped for whatever reason and accidentally letting the whole planet in (common if you use iptables; most distros require the admin to manually configure iptables-restore to run on boot). You don't have to worry about someone zero-daying your SSH or FTP daemon, or about the box being affected by network-level attacks like DDoS, which can sometimes target whole subnets. Just much tidier and safer all around.
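
To make that posture concrete, here is a minimal host-firewall sketch in iptables-restore format. The subnets are placeholders (10.0.1.0/24 for the app tier, 10.0.2.0/24 for the admin/VPN network), not anyone's actual topology.

```
# /etc/iptables/rules.v4 -- illustrative only; load with iptables-restore
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
# keep established connections and loopback traffic working
-A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
-A INPUT -i lo -j ACCEPT
# PostgreSQL only from the internal app subnet
-A INPUT -p tcp --dport 5432 -s 10.0.1.0/24 -j ACCEPT
# SSH only from the internal admin/VPN subnet
-A INPUT -p tcp --dport 22 -s 10.0.2.0/24 -j ACCEPT
COMMIT
```

Restoring this file at boot (netfilter-persistent does this on Debian-family systems) also covers the "rules wiped on reboot" failure mode mentioned above.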


Little correction: HTTP servers should not be publicly accessible. Only a couple of load balancers need to be accessible.


Firstly, what do you think HTTP load balancers are, if not HTTP servers?

Secondly, why would anyone use a load balancer? Most websites in existence don't need one.


A web server runs applications, a load balancer distributes incoming traffic. They have different purposes.

Almost every public website in existence uses multiple web servers and load balancers to balance between them. That's the only way to do failover and to handle more traffic than a single box can take.
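
As a sketch of that split, an nginx fragment where only the balancer faces the public and the web servers sit on internal names; hostnames and ports here are hypothetical.

```nginx
# goes inside the http {} block; app*.internal are made-up backend names
upstream app_backend {
    server app1.internal:8080 max_fails=3 fail_timeout=30s;
    server app2.internal:8080 max_fails=3 fail_timeout=30s;
    server app3.internal:8080 backup;   # only used if the others are down
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backend;
    }
}
```

If app1 stops answering, requests simply go to the remaining servers; that failover is the point the parent is making.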


  > There's absolutely nothing wrong, whatsoever, with having a public IP address on a production database server.
Umm, yes there is. Explain why you would ever need your database server to be publicly reachable. The only situations I can think of are if you're running everything off a single server, which of course is not relevant to this thread, or if you don't have a suitable gateway, which is rare and also not relevant to this thread.


Exactly. The technical problems are not the root cause. How did the existing processes (or lack thereof) fail? How did the organization fail?


Like we haven't seen similar screwups at other, more 'professional' companies. They may sweep it under the rug better, but that's all.


Exactly. I feel like it was nobody's job to make sure everything was resilient to failure.



