A bit of context missing from the article: we are a small shop with a few hundred servers. At its core, we're running a financial system moving multiple millions of dollars per day (billions per year).
It's fair to say that we have higher expectations than average and we take production issues rather (too?) seriously.
I updated and fixed a few points mentioned in the comments. (Notably: CoreOS).
Overall, it's "normal" that you didn't experience all of these issues if you're not using Docker at scale in production and/or haven't used it for long.
I'd like to point out that these are issues and workarounds that happened over a period of [more than] a year, summarized together into a 10-minute read. That does amplify the dramatic and painful aspect.
Anyway, the issues from the past are already in the past. The most important section is the Roadmap. That's what you need to know to run Docker (or use auto scaling groups instead).
Overlayfs, like aufs, is a layered filesystem, but overlayfs got merged into the mainline kernel while aufs, in spite of a lot of effort, never was. Before the merge the module was called overlayfs; after the merge it was renamed to overlay.
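To make concrete what the kernel already provides here: what Docker's driver orchestrates is, at bottom, a plain overlay mount. A minimal sketch (the paths are made up for illustration, and the mount line itself needs root, so it is shown but not executed):

```shell
# An overlay mount combines a read-only lower layer with a writable
# upper layer; the kernel module was named overlayfs before the merge
# and overlay after. Paths here are illustrative.
mkdir -p /tmp/ovl/lower /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/merged
echo "read-only base layer" > /tmp/ovl/lower/base.txt

# The mount itself requires root, so it is shown but not run:
# mount -t overlay overlay \
#   -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
#   /tmp/ovl/merged
```

Writes then land in the upper dir while the lower layer stays untouched, which is exactly the copy-on-write behavior Docker images rely on.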
I think it's the responsibility of Docker users to understand the technologies they are using.
Docker has got into the bad habit of wrapping open source Linux technologies and promoting them in a way that makes it feel like Docker invented it. They did it to LXC and they are doing it to aufs and overlayfs. The HN community is far too vested in Docker to offer any real scrutiny and is very much a part of this hijack.
What is a Docker overlayfs driver? How is simply mounting a ready-made overlayfs filesystem already present in the kernel, or aufs, a "driver"? These terms not only mislead but prevent recognition of the work of the authors of overlayfs and aufs.
They also require scrutiny, as layered filesystems have tons of issues, and the only way these can be resolved is by engaging with the developers, whom most Docker users don't even know about. Docker can't solve these issues, only work around them.
You need to research Red Hat's offerings in this space (disclaimer .. 20 year RH vet). OpenShift is how the banks I work with are consuming kubernetes/container innovation. It's likely they have similar requirements, and they're loving it.
You got me beat (Biltmore's a year short of 20 years, IIRC, oh yesteryear.) I tried searching for RH and banks and other than a landing page re: RBS and a few quotes it was particularly light on details. RH crushed it with the RHEL and JBoss model[1] (pretty much the paragon of "ethical open-source commercialization" IMO). Are these banks replacing pSeries AIX boxes at branches, WebSphere instances for the consumer to hit, or z13 mainframes with CICS or other more-than-five-nines-accessibility with guaranteed zero data loss? (Pardon my ignorance - not a snarky quip but rather genuinely curious as to what, if any, out of the box solutions RH is pushing in that sector.)
To provide some experience I have as someone working for IBM implementing a solution which leverages docker at a large bank - no one I work with is naive enough to think that they can get away with containerising their current systems of record. Where we are significantly leveraging docker is on in-house bare metal clouds that we are using to build out middleware services.
We build all our images on top of RHEL containers that have undergone hardening to meet internal + regulatory compliance. The end result is that to some extent we can say to developers "don't worry about getting your applications production ready right away, just write reasonably stable code quickly and we'll strip out all the stuff we don't trust you to do and handle it at the infrastructure level".
End result is that we can start to eliminate some of the unsuitable uses of traditional middleware systems like our DataPower infrastructure which, while great for specific use cases, is usually too difficult to work with for your average front-end developer who doesn't give a damn about SOAP headers and broker clusters.
As far as our architecture leads are concerned, and I'm inclined to agree, Docker is a great packaging format for developer outputs because it puts everything you need to run an application naked (i.e. no complex logging, HA, networking, etc.) into a single versionable, reviewable, very much disposable asset. But it is not, and should not be, a replacement for proper systems of record that require any measure of stability.
We have been running Docker in production for around six months now, with a moderate IO load. We are also using persistent storage, but our persistent storage is NFS-mounted inside the container, so 99.5% of the filesystem IO in the container does not touch AUFS. (We are stuck on an old kernel at the moment, but evaluating upgrading to either 3.x or 4.x.)
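As a rough illustration of that setup (the service name, image, and paths below are hypothetical), the bind mount looks something like this in docker-compose terms:

```yaml
# Hypothetical compose fragment: bind-mount an NFS-backed host directory
# into the container so the write-heavy IO bypasses the AUFS layers.
services:
  app:
    image: myorg/app:1.4                        # made-up image name
    volumes:
      - /mnt/nfs/appdata:/var/lib/app/data      # host NFS mount -> container path
```

Since the bind mount sits outside the container's layered filesystem, writes to it go straight to NFS and never exercise AUFS's copy-on-write machinery.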
So far, this has been stable for us, and based on your post it sounds like it will continue to be a viable strategy as we move to kernel 3.x or 4.x.
Netflix uses Docker in production. While they don't use it for everything yet, they are transitioning to it. Netflix reportedly accounts for roughly a third of peak downstream internet traffic in North America.
Maybe you had trouble, but I think that those that aren't using containers need to be aware that many are using them successfully. Use of containers vs. VMs/dedicated servers can also result in reduced energy usage and can be less expensive.
I think people get hung up on what containers aren't and where they fail, rather than where they exceed.
But also be aware that the traffic Netflix serves out of those containers is way less than 0.01% of the internet traffic in North America. The container cloud is their management engine.
The CDN that actually pushes the traffic runs FreeBSD and would be using, if anything, jails.
Can I ask (given the summary / roadmap at the end and your blog post about GCE's pricing): Are you moving to Google Cloud? If not, why not?
For what it's worth, I agree with you that a lot of the value of Container Engine comes from having a team ensure the bits all work together. That should be (one of) the positives of any managed service, but I think the Container Engine team has done a particularly good job in a fast moving and often bumpy space.
We have stuff that must stay on AWS, too dangerous to make a move now :(
We'll ignore the "switch to auto scaling" bit. Teammates have already started testing CoreOS & Kubernetes (on AWS) while I was writing the article. We'll figure it out and get it in production soon, hopefully.
We have a subsidiary which has no locking on AWS and could use an infra refreshing. They already have google accounts and use one or two products. They'll be moved to GCE in the coming months, hopefully.
Thanks for the reply; we'll see you as you get out of the lockin (feel free to ping me via email if you've got anything specific).
Why drop autoscaling though? Note that you can use Autoscaling Groups with Kubernetes on EC2: https://github.com/kubernetes/contrib/blob/master/cluster-au... and the team would happily take bug reports (either as GH Issues or just yelling at them on slack/IRC).
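For reference, pointing the contrib cluster-autoscaler at an existing EC2 Auto Scaling Group is mostly a matter of flags; a hypothetical pod-spec fragment (all names and sizes are made up):

```yaml
# Illustrative args for the cluster-autoscaler on AWS.
# --nodes takes min:max:ASG-name; values here are made up.
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=2:10:my-k8s-worker-asg
```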
It seems that doing auto scaling or doing Kubernetes would take a similar amount of work (in our current state of affairs). We want Kubernetes in the long run, so we might as well go straight to it.
I suppose auto scaling will be back on the table later, to auto scale the Kubernetes instances themselves. Maybe.
Based on past discussions on HN, I feel Google has terrible support for its offerings. Maybe it will improve with time, but currently I am really concerned about running GCP in production.
As others (including downthread) have pointed out, Google definitely has a bad overall support reputation, but that's because nearly every (traditional) Google service is free and lacks this structured support model. Just like Cloud, if you're even a moderately large Ads customer, you get pretty good support!
Disclosure: I work on Google Cloud and want your business ;).
I wonder if the problem is more that Google has terrible support for its free offerings, maybe for justifiable business reasons. However, the damage to their brand when they want to sell business-critical services is an unintended consequence. (Similarly, I think killing Reader was justifiable from their perspective, but the damage to their reputation among a vociferous and possibly influential demographic has probably outweighed the cost of keeping it going... They can't launch a service without someone popping up to remind us about Reader.)
I miss Reader (though I'm amused that it let Feedly take off, and they're a Cloud customer!).
One thing often not considered ("Why not just have someone keep it running?") is that you really have to keep a team of people on it (in case there's a CVE or something) or at least familiar enough with the code to fix it. There's also the double standard of "you haven't added new features in forever!" (Maybe this wouldn't have applied to Reader though).
But, I agree if we could have kept it on life support somehow, we wouldn't have (as many) people asking "What if they shut down Cloud?!?". Conveniently, as people get a sense of how serious Google Cloud is about this business, even on HN I'm seeing this less.